Logical Plan Execution

We implement classical iterator style operators. Logical operators define operations on database. They encapsulate the following info: what the input is (another LogicalOperator), what to do with the data, and how to do it.

Currently logical operators can have zero or more input operations, and thus a LogicalOperator tree is formed. Most LogicalOperator types have only one input, so we are mostly working with chains instead of full fledged trees. You can find information on each operator in src/query/plan/operator.lcp.

Cursor

Logical operators do not perform database work themselves. Instead they create Cursor objects that do the actual work, based on the info in the operator. Cursors expose a Pull method that gets called by the cursor’s consumer. The consumer keeps pulling as long as the Pull returns true (indicating it successfully performed some work and might be eligible for another Pull). Most cursors will call the Pull function of their input provided cursor, so typically a cursor chain is created that is analogue to the logical operator chain it’s created from.

Frame

The Frame object contains all the data of the current Pull chain. It serves for communicating data between cursors.

For example, in a MATCH (n) RETURN n query the ScanAllCursor places a vertex on the Frame for each Pull. It places it on the place reserved for the n symbol. Then the ProduceCursor can take that same value from the Frame because it knows the appropriate symbol. Frame positions are indexed by Symbol objects.

ExpressionEvaluator

Expressions results are not placed on the Frame since they do not need to be communicated between different Cursors. Instead, expressions are evaluated using an instance of ExpressionEvaluator. Since generally speaking an expression can be defined by a tree of subexpressions, the ExpressionEvaluator is implemented as a tree visitor. There is a performance sub-optimality here because a stack is used to communicate intermediary expression results between elements of the tree. This is one of the reasons why it’s planned to use Frame for intermediary expression results as well. The other reason is that it might facilitate compilation later on.

Cypher Execution Semantics

Cypher query execution has mostly well-defined semantics. Some are explicitly defined by openCypher and its TCK, while others are implicitly defined by Neo4j’s implementation of Cypher that we want to be generally compatible with.

These semantics can in short be described as follows: a Cypher query consists of multiple clauses some of which modify it. Generally, every clause in the query, when reading it left to right, operates on a consistent state of the property graph, untouched by subsequent clauses. This means that a MATCH clause in the beginning operates on a graph-state in which modifications by the subsequent SET are not visible.

The stated semantics feel very natural to the end-user, and Neo seems to implement them well. For Memgraph the situation is complex because LogicalOperator execution (through a Cursor) happens one Pull at a time (generally meaning all the query clauses get executed for every top-level Pull). This is not inherently consistent with Cypher semantics because a SET clause can modify data, and the MATCH clause that precedes it might see the modification in a subsequent Pull. Also, the RETURN clause might want to stream results to the user before all SET clauses have been executed, so the user might see some intermediate graph state. There are many edge-cases that Memgraph does its best to avoid to stay true to Cypher semantics, while at the same time using a high-performance streaming approach. The edge-cases are enumerated in this document along with the implementation details they imply.

Implementation Peculiarities

Once

An operator that does nothing but whose Cursor::Pull returns true on the first Pull and false on subsequent ones. This operator is used when another operator has an optional input, because in Cypher a clause will typically execute once for every input from the preceding clauses, or just once if there was no preceding input. For example, consider the CREATE clause. In the query CREATE (n) only one node is created, while in the query MATCH (n) CREATE (m) a node is created for each existing node. Thus in our CreateNode logical operator the input is either a ScanAll operator, or a Once operator.

storage::View

In the previous section, Cypher Execution Semantics, we mentioned how the preceding clauses should not see changes made in subsequent ones. For that reason, some operators take a storage::View enum value. This value determines which state of the graph an operator sees.

Consider the query MATCH (n)--(m) WHERE n.x = 0 SET m.x = 1. Naive streaming could match a vertex n on the given criteria, expand to m, update it’s property, and in the next iteration consider the vertex previously matched to m and skip it because it’s newly set property value does not qualify. This is not how Cypher works. To handle this issue properly, Memgraph designed the VertexAccessor class that tracks two versions of data: one that was visible before the current transaction+command, and the optional other that was created in the current transaction+command. The MATCH clause will be planned as ScanAll and Expand operations using storage::View::OLD value. This will ensure modifications performed in the same query do not affect it. The same applies to edges and the EdgeAccessor class.

Existing Record Detection

It’s possible that a pattern element has already been declared in the same pattern, or a preceding pattern. For example MATCH (n)--(m), (n)--(l) or a cycle-detection match MATCH (n)-->(n) RETURN n. Implementation-wise, existing record detection just checks that the expanded record is equal to the one already on the frame.