After the semantic analysis and symbol generation, the AST is converted to a tree of logical operators. This conversion is called planning and the tree of logical operators is called a plan. The whole planning process is done in the following steps.
The first step is to preprocess the AST by collecting information on filters, dividing the query into parts, normalizing patterns in `MATCH` clauses, etc.
After the preprocess step, the planning can be done via two planners: `VariableStartPlanner` and `RuleBasedPlanner`. The first planner generates multiple plans, where each plan has different starting points for searching the patterns in `MATCH` clauses. The second planner produces a single plan by mapping the query parts as they are to logical operators.
In the next stage, we perform various transformations on the generated logical plan. Here we want to optimize the operations in order to improve performance during execution. Naturally, transformations need to preserve the semantic behaviour of the original plan.
Finally, the execution cost of each generated plan is estimated. This estimate is used to select the best plan, which will then be executed.
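To illustrate this last step, here is a minimal, self-contained sketch of cost-based plan selection. The `Plan` struct, its `estimated_cost` field and `SelectBestPlan` are hypothetical placeholders, not the actual types used in the implementation; a real estimator would walk the operator tree and combine cardinality estimates.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for a logical plan and its estimated cost.
struct Plan {
  std::string description;
  double estimated_cost;  // assumed to be filled in by a cost estimator
};

// Pick the candidate with the lowest estimated execution cost. The
// RuleBasedPlanner yields a single candidate, while the VariableStartPlanner
// yields one candidate per starting point in the MATCH patterns.
const Plan &SelectBestPlan(const std::vector<Plan> &candidates) {
  assert(!candidates.empty());
  const Plan *best = &candidates.front();
  for (const auto &plan : candidates) {
    if (plan.estimated_cost < best->estimated_cost) best = &plan;
  }
  return *best;
}

int main() {
  std::vector<Plan> candidates = {{"start matching from (n)", 42.0},
                                  {"start matching from (m)", 17.5}};
  const Plan &best = SelectBestPlan(candidates);  // picks the (m) variant
  return best.estimated_cost == 17.5 ? 0 : 1;
}
```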
The implementation can be found in the `query/plan` directory, with the public entry point being `query/plan/planner.hpp`.
Each openCypher query consists of at least one single query. Multiple single queries are chained together using a query combinator. Currently, there is only one combinator, `UNION`. The preprocessing step starts in the `CollectQueryParts` function. This function examines each single query and divides it into parts. Each part is separated by `RETURN` and `WITH` clauses. For example:
```
MATCH (n) CREATE (m) WITH m MATCH (l)-[]-(m) RETURN l
|                          |                        |
|------- part 1 -----------+-------- part 2 --------|
|                                                   |
|------------------ single query -------------------|
```
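A rough sketch of this division, assuming the clauses of a single query are already available as a flat list; the `Clause` and `QueryPart` types and the `CollectParts` function below are illustrative only, while the real `CollectQueryParts` works on AST clause nodes.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Illustrative types; the real implementation works on AST clause nodes.
struct Clause {
  std::string kind;  // e.g. "MATCH", "CREATE", "WITH", "RETURN"
};

struct QueryPart {
  std::vector<Clause> clauses;  // ends with WITH or RETURN, if present
};

// Split a single query into parts: every WITH or RETURN closes the current
// part, mirroring the boundaries shown in the diagram above.
std::vector<QueryPart> CollectParts(const std::vector<Clause> &clauses) {
  std::vector<QueryPart> parts(1);
  for (const auto &clause : clauses) {
    parts.back().clauses.push_back(clause);
    if (clause.kind == "WITH" || clause.kind == "RETURN")
      parts.emplace_back();
  }
  if (parts.back().clauses.empty()) parts.pop_back();
  return parts;
}

int main() {
  // MATCH (n) CREATE (m) WITH m MATCH (l)-[]-(m) RETURN l
  std::vector<Clause> clauses = {{"MATCH"}, {"CREATE"}, {"WITH"}, {"MATCH"}, {"RETURN"}};
  auto parts = CollectParts(clauses);
  std::cout << "parts: " << parts.size() << "\n";  // prints 2
  return 0;
}
```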
Each part is created by collecting all `MATCH` clauses and normalizing their patterns. Pattern normalization is the process of converting an arbitrarily long pattern chain of nodes and edges into a list of triplets `(start node, edge, end node)`. The triplets should preserve the semantics of the match. For example:
```
MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)
```

is equivalent to:

```
MATCH (a)-[p]-(b), (b)-[q]-(c), (c)-[r]-(d)
```
With this representation, it becomes easier to reorder the triplets and choose different strategies for pattern matching.
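To make this concrete, here is a small sketch of chain-to-triplet normalization over plain string names. The real code operates on AST pattern atoms and symbols, so `NormalizeChain` and its input format are purely illustrative.

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <vector>

// (start node, edge, end node) triplet over illustrative string names.
using Triplet = std::tuple<std::string, std::string, std::string>;

// Normalize a pattern chain (a)-[p]-(b)-[q]-(c)-... given as an alternating
// sequence of node and edge names: n0, e0, n1, e1, n2, ...
std::vector<Triplet> NormalizeChain(const std::vector<std::string> &chain) {
  assert(chain.size() % 2 == 1 && "chain must start and end with a node");
  std::vector<Triplet> triplets;
  for (size_t i = 0; i + 2 < chain.size(); i += 2)
    triplets.emplace_back(chain[i], chain[i + 1], chain[i + 2]);
  return triplets;
}

int main() {
  // MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)
  auto triplets = NormalizeChain({"a", "p", "b", "q", "c", "r", "d"});
  // Produces: (a, p, b), (b, q, c), (c, r, d) -- the same match semantics,
  // but the planner is now free to reorder the triplets.
  assert(triplets.size() == 3);
  return 0;
}
```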
In addition to normalizing patterns, all of the filter expressions in patterns and inside the `WHERE` clause (of the accompanying `MATCH`) are extracted and stored separately. During the extraction, the symbols used in each filter expression are collected. This allows filters to be planned in a valid order while the triplets are being matched. Another important benefit of having this extra information on filters is recognizing when a database index could be used.
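The idea can be sketched with made-up types as follows: each extracted filter keeps the set of symbols it references, and it may be planned as soon as all of those symbols have been bound by the matching done so far.

```cpp
#include <algorithm>
#include <set>
#include <string>

// Illustrative: a filter expression paired with the symbols it references.
struct FilterInfo {
  std::string expression;              // e.g. "n.age > 21 AND n.age < m.age"
  std::set<std::string> used_symbols;  // e.g. {"n", "m"}
};

// A filter may be planned once every symbol it uses is already bound by the
// pattern matching performed so far.
bool CanApply(const FilterInfo &filter, const std::set<std::string> &bound) {
  return std::includes(bound.begin(), bound.end(),
                       filter.used_symbols.begin(), filter.used_symbols.end());
}

int main() {
  FilterInfo filter{"n.age > 21 AND n.age < m.age", {"n", "m"}};
  std::set<std::string> bound{"n"};
  bool early = CanApply(filter, bound);  // false: `m` is not matched yet
  bound.insert("m");
  bool later = CanApply(filter, bound);  // true: both symbols are bound
  return (!early && later) ? 0 : 1;
}
```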
After each `MATCH` is processed, they are all grouped together, so that even whole `MATCH` clauses may be reordered. The important thing is to remember which symbols were used to name edges in each `MATCH`. With those symbols we can plan for cyphermorphism, i.e. ensure that different edges in the search pattern of a single `MATCH` map to different edges in the graph. This preserves the semantics of the query, even though we may have reordered the matching. The same steps are done for `OPTIONAL MATCH`.
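As an illustration of the cyphermorphism constraint, the sketch below checks that every edge symbol in a single `MATCH` is bound to a different graph edge. The types and function are hypothetical stand-ins for the check that the generated plan has to enforce.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Illustrative edge identifier; the real check compares matched graph edges.
using EdgeId = std::int64_t;

// Cyphermorphism: within one MATCH, every edge symbol must be bound to a
// different graph edge, no matter how the triplets were reordered.
bool EdgesAreUnique(const std::vector<EdgeId> &matched_edges) {
  std::set<EdgeId> seen;
  for (EdgeId edge : matched_edges) {
    if (!seen.insert(edge).second) return false;  // same edge bound twice
  }
  return true;
}

int main() {
  bool ok = EdgesAreUnique({1, 2, 3});          // accepted
  bool rejected = !EdgesAreUnique({1, 2, 1});   // row discarded: edge 1 reused
  return (ok && rejected) ? 0 : 1;
}
```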
Another clause which needs processing is `MERGE`. Here we also normalize the pattern, since `MERGE` is a bit like `MATCH` and `CREATE` in one.
All the other clauses are left as is.
In the end, each query part consists of: