We implement classical iterator style operators. Logical operators define operations on database. They encapsulate the following info: what the input is (another LogicalOperator
), what to do with the data, and how to do it.
Currently logical operators can have zero or more input operations, and thus a LogicalOperator
tree is formed. Most LogicalOperator
types have only one input, so we are mostly working with chains instead of full fledged trees. You can find information on each operator in src/query/plan/operator.lcp
.
Logical operators do not perform database work themselves. Instead they create Cursor
objects that do the actual work, based on the info in the operator. Cursors expose a Pull
method that gets called by the cursor’s consumer. The consumer keeps pulling as long as the Pull
returns true
(indicating it successfully performed some work and might be eligible for another Pull
). Most cursors will call the Pull
function of their input provided cursor, so typically a cursor chain is created that is analogue to the logical operator chain it’s created from.
The Frame
object contains all the data of the current Pull
chain. It serves for communicating data between cursors.
For example, in a MATCH (n) RETURN n
query the ScanAllCursor
places a vertex on the Frame
for each Pull
. It places it on the place reserved for the n
symbol. Then the ProduceCursor
can take that same value from the Frame
because it knows the appropriate symbol. Frame
positions are indexed by Symbol
objects.
Expressions results are not placed on the Frame
since they do not need to be communicated between different Cursors
. Instead, expressions are evaluated using an instance of ExpressionEvaluator
. Since generally speaking an expression can be defined by a tree of subexpressions, the ExpressionEvaluator
is implemented as a tree visitor. There is a performance sub-optimality here because a stack is used to communicate intermediary expression results between elements of the tree. This is one of the reasons why it’s planned to use Frame
for intermediary expression results as well. The other reason is that it might facilitate compilation later on.
Cypher query execution has mostly well-defined semantics. Some are explicitly defined by openCypher and its TCK, while others are implicitly defined by Neo4j’s implementation of Cypher that we want to be generally compatible with.
These semantics can in short be described as follows: a Cypher query consists of multiple clauses some of which modify it. Generally, every clause in the query, when reading it left to right, operates on a consistent state of the property graph, untouched by subsequent clauses. This means that a MATCH
clause in the beginning operates on a graph-state in which modifications by the subsequent SET
are not visible.
The stated semantics feel very natural to the end-user, and Neo seems to implement them well. For Memgraph the situation is complex because LogicalOperator
execution (through a Cursor
) happens one Pull
at a time (generally meaning all the query clauses get executed for every top-level Pull
). This is not inherently consistent with Cypher semantics because a SET
clause can modify data, and the MATCH
clause that precedes it might see the modification in a subsequent Pull
. Also, the RETURN
clause might want to stream results to the user before all SET
clauses have been executed, so the user might see some intermediate graph state. There are many edge-cases that Memgraph does its best to avoid to stay true to Cypher semantics, while at the same time using a high-performance streaming approach. The edge-cases are enumerated in this document along with the implementation details they imply.
An operator that does nothing but whose Cursor::Pull
returns true
on the first Pull
and false
on subsequent ones. This operator is used when another operator has an optional input, because in Cypher a clause will typically execute once for every input from the preceding clauses, or just once if there was no preceding input. For example, consider the CREATE
clause. In the query CREATE (n)
only one node is created, while in the query MATCH (n) CREATE (m)
a node is created for each existing node. Thus in our CreateNode
logical operator the input is either a ScanAll
operator, or a Once
operator.
In the previous section, Cypher Execution Semantics, we mentioned how the preceding clauses should not see changes made in subsequent ones. For that reason, some operators take a storage::View
enum value. This value determines which state of the graph an operator sees.
Consider the query MATCH (n)--(m) WHERE n.x = 0 SET m.x = 1
. Naive streaming could match a vertex n
on the given criteria, expand to m
, update it’s property, and in the next iteration consider the vertex previously matched to m
and skip it because it’s newly set property value does not qualify. This is not how Cypher works. To handle this issue properly, Memgraph designed the VertexAccessor
class that tracks two versions of data: one that was visible before the current transaction+command, and the optional other that was created in the current transaction+command. The MATCH
clause will be planned as ScanAll
and Expand
operations using storage::View::OLD
value. This will ensure modifications performed in the same query do not affect it. The same applies to edges and the EdgeAccessor
class.
It’s possible that a pattern element has already been declared in the same pattern, or a preceding pattern. For example MATCH (n)--(m), (n)--(l)
or a cycle-detection match MATCH (n)-->(n) RETURN n
. Implementation-wise, existing record detection just checks that the expanded record is equal to the one already on the frame.