Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
We will use the LDBC Social Network Benchmark (LDBC SNB) data set. This data set models a typical social forum, where communities of persons can post messages on a forum to discuss a topic. It comes with a data generator, which allows you to generate data at different scale factors. Scale factor 1 generates roughly 1GB of raw data, scale factor 10 generates roughly 10GB of raw data, etc.
Figure 1 shows the schema (from the LDBC SNB specification). It models the activities and relationships of social forum participants. For example, a forum Member can publish Posts on a Forum, and other Members of the Forum can make a Comment on the Post or on someone else's Comment. A Person's home location is a hierarchy (Continent>Country>City), and a person can be affiliated with a University or a Company. A Tag can be used to classify a Forum, a Message, or a Person's interests. Tags can further be classified by TagClass. The relationships between entities are modeled as directed edges, except person KNOWS person is modeled as an undirected edge. For example, Person connects to Tag by the hasInterest edge. Forum connects to Person by two different edges, hasMember and hasModerator.
LDBC SNB schema uses inheritance to model certain relationships:
Message is the superclass of Post and Comment.
Place is the superclass of City, Country, and Continent.
Organization is the superclass of University and Company.
We do not use the superclasses in our graph model. When there is an edge type connecting an entity to a superclass, we instead create an edge type from the entity to each of the subclasses of the superclass. For example, Message has an isLocatedIn relationship to Country. Since Message has two subclasses, Post and Comment, both connected with Country by the isLocatedIn relationship, we create an edge type IS_LOCATED_IN, connecting both vertex type pairs using the compound edge type available in TigerGraph 3.0.
CREATE DIRECTED EDGE IS_LOCATED_IN ( FROM Comment, TO Country | FROM Post, TO Country )
This new DDL syntax allows a general edge type to be defined over multiple vertex pairs. For example, there are many relationships in the LDBC schema all called isLocatedIn which connect something to a geographical entity. We can model them all as a single edge type IS_LOCATED_IN. The result is a more succinct graph model and less GSQL code when expressing pattern matching queries.
CREATE DIRECTED EDGE IS_LOCATED_IN (FROM Comment, TO Country | FROM Post, TO Country | FROM Company, TO Country | FROM Person, TO City | FROM University, TO City) WITH REVERSE_EDGE="IS_LOCATED_IN_REVERSE"
Vertex Type
For each entity in Figure 1 (the rectangular boxes), we create a vertex type with the entity's name in UpperCamelCase.
Person is a person who participates in a forum.
Forum is a place where persons discuss topics.
City, Country, and Continent are geographic locations of other entities.
Company and University are organizations with which a person can be affiliated.
Comment and Post are the interaction messages created by persons in a forum.
Tag is a topic or a concept.
TagClass is a class or a category. TagClass can form a hierarchy of tags.
Edge Type
For each relationship type, we create an edge type with the relationship name (all capitalized and words are separated by an underscore). For example,
CONTAINER_OF: Forum is the container of posts.
HAS_INTERESTS: Person has interest in tag(s).
IS_SUBCLASS_OF: Tag is a subclass of another Tag.
When multiple relationships share the same semantics in Figure 1, we merge them into a single compound edge type. For example:
IS_LOCATED_IN: Comment or Post is located in Country, Company is located in Country, Person is located in City, and University is located in City.
Two GSQL scripts for defining the LDBC-SNB schema are shown below. Choose the one that serve your needs. They are not equivalent if you have different organizations using the same graph database instance.
In this method, global vertex/edge type containers are created first. Next, graphs are created to group them. In other words, the global vertex/edge type containers can be shared across graphs.
In this method, an empty graph is created first. Next, local vertex/edge type containers are added to the empty graph via a schema change job. The vertex/egde type containers added this way will be private to the graph, no other graph can see them.
Repeating the same hop is useful sometimes, but the real power of pattern matching comes from expressing multi-hop patterns, with specific characteristics for each hop. For example, the well-known product recommendation phrase "People who bought this product also bought this other product", is expressed by the following 2-hop pattern:
As you see, a 2-hop pattern is a simple concatenation and merging of two 1-hop patterns where the two patterns share a common endpoint. Below, Y:y is the connecting end point.
Similarly, a 3-hop pattern concatenates three 1-hop patterns in sequence, each pair of adjacent hops sharing one end point. Below, Y:y and Z:z are the connecting end points.
In general, we can connect N 1-hop patterns into a N-hop pattern. The database will search the graph topology to find subgraphs that match this N-hop pattern.
A multi-hop pattern has two endpoint vertex sets and one or more intermediate vertex sets. If the query does not need to express any conditions for an intermediate vertex set, then the vertex set can be omitted and the two surrounding edge sets can be joined with a simple ".". For example, in the 2-hop pattern example above, if we do not need to specify the type of the intermediate vertex Y, nor need to refer to it in any of the query's other clauses (such as WHERE or ACCUM), then the pattern can be reduced as follows:
Note that when we abbreviate that path in this way, we do not support aliases for the edges or intermediate vertices in the abbreviated section.
If a pattern has a Kleene star to repeat an edge, GSQL pattern matching selects only the shortest paths which match the pattern. If we did not apply this restriction, computer science theory tells us that the computation time could be unbounded or extreme (NP = non-polynomial, to be technical). If we instead matched ALL paths regardless of length when a Kleene star is used without an upper bound, there could be an infinite number of matches, if there are loops in the graph. Even without loops or with an upper bound, the number of paths to check grows exponentially with the number of hops.
For the pattern 1 - (_*) - 5
in Figure 3 above, you can see the following:
There are TWO shortest paths: 1-2-3-4-5 and 1-2-6-4-5
These have 4 hops, so we can stop searching after 4 hops. This makes the task tractable.
If we search for ALL paths which do not repeat any vertices:
There are THREE non-repeated-vertex paths: 1-2-3-4-5, 1-2-6-4-5, and 1-2-9-10-11-12-4-5
The actual number of matches is small, but the number of paths is theoretically very large.
If we search for ALL paths which do not repeat any edges:
There are FOUR non-repeated-edge paths: 1-2-3-4-5, 1-2-6-4-5, 1-2-9-10-11-12-4-5, and 1-2-3-7-8-3-4-5
The actual number of matches is small, but number of paths to consider is NP.
If we search for ALL paths with no restrictions:
There are an infinite number of matches, because we can go around the 3-7-8-3 cycle any number of times.
In the early version of Pattern Matching (TigerGraph v2.4 to v2.6), there were a number of restrictions on the WHERE, ACCUM and POST-ACCUM clauses In TigerGraph 3.0, most of these restrictions are lifted.
Each vertex set or edge set in a pattern (except edges with Kleene stars) can have an alias variable associated with it. When the query runs and finds matches, it associates, or binds, each alias to the matching vertices or edges in the graph.
The SELECT clause specifies the output vertex set of a SELECT statement. For a multiple-hop pattern, we can select any vertex alias in the pattern. The example below shows the 4 possible choices for the given pattern:
For a multiple-hop pattern, if you don't need to refer to the intermediate vertex points, you can just use "." to connect the edge patterns, giving a more succinct representation. For example, below we remove y and z, and connect E2>, <E3 and E4 using the period symbol. Note that you cannot have an alias for a multi-hop sequence like E2>.<E3.E4.
Beginning with TigerGraph v3.0, each predicate (simple true/false condition) can refer to any of the aliases in the path. As with any database query, more complex conditions may not be as performant as simpler queries with simpler, more local predicate conditions. Consider the pattern and query below:
GSQL's pattern matching syntax provides the essentials for a regular expression language for paths in graphs. Consider the three basic requirements for a regular expression language:
The empty set --> A path of length zero (no match)
Concatenation --> Form a path by adding one on two another. You can write an N-hop pattern, and M-hop pattern, and then combine them to have a (N+M)-hop pattern.
Alternation (either-or) --> You can use alternation for both vertex sets and edge sets, e.g.
FROM (Source1 | Source2) -(Edge1> | <Edge 2)- (Target1 | Target2)
Note: This is not the same as
FROM (Source1 -(Edge1>)- Target 1) | (Source2 -(<Edge2)- Target 2)
The latter can be achieved by writing two SELECT query blocks and getting the UNION of their results.
The point of pattern matching is to identity sets of graph entities that match your input pattern. Once you've done that, GSQL enables you to do advanced and efficient computation on that data, from simply counting the matches to advanced algorithms and analytics. This section compares accumulation in the current Pattern Matching syntax to earlier versions, but it does not attempt to explain accumulators in full. You may want to consult the Accumulators Tutorial and and the GSQL Language Reference's section on the ACCUM and POST-ACCUM clauses.
TigerGraph 3.0 removes the Pattern Matching (SYNTAX v2)-related restrictions on the ACCUM and POST-ACCUM clause.
Just as in classic GSQL syntax, the ACCUM clause it executed once (in parallel) for each set of vertices and edges in the graph which match the pattern and constraints given in the FROM and WHERE clauses. You can think of FROM-WHERE as producing a virtual table. The columns of this matching table are the alias variables from the FROM clause pattern, and the rows are each possible set of vertex and edge aliases (e.g. a path) which fit the pattern.
A simple pattern 1-hop pattern, which could be syntax v1 or v2, like this:
produces a match table with 3 columns: A, B, and C. Each row is a tuple (A,B,C) where there is a has_lived_in
edge B from a Person
vertex A to a City
vertex C. We say that the match table provides a binding between the pattern aliases and graph's vertices and edges. A multi-hop pattern simply has more columns than a 1-hop pattern.
The ACCUM clause iterates through ALL matches. If you do not have an alias on every vertex in the pattern, then the number of distinct matches may be less than that number of matches.
For, example, consider
This asks who are the friends of friends of Andy@www.com. Suppose Andy knows 3 persons (Larry, Moe, and Curly) who know Wendy. The accumulator C.@patternCount
will be incremented 3 times for C = Wendy. This is similar to a SQL SELECT C, COUNT(*) ... GROUP BY C
query. There is no alias for the vertex in the middle of KNOWS.KNOWS
so the identities of Larry, Moe, and Curly cannot be reported.
As of TigerGraph 3.0, Pattern Matching (V2) syntax supports multiple POST-ACCUM clauses.
At the end of the ACCUM clause, all the requested accumulation (+=) operators are processed in bulk, and the updated values are now visible. You can now use POST-ACCUM clauses to perform a second, different round of computation on the results of your pattern matching.
The ACCUM clause executes for each full path that matches the pattern in the FROM clause. In contrast, the POST-ACCUM clause executes for each vertex in one vertex set (e.g. one vertex column in the matching table); its statements can access the aggregated accumulator result computed in the ACCUM clause. New for v3.0, if you want to perform per-vertex updates for more than one vertex alias, you should use a separate POST-ACCUM clause for each vertex alias. The multiple POST-ACCUM clauses are processed in parallel; it doesn't matter in what order you write them. (For each binding, the statements within a clause are executed in order.)
For example, below we have two POST-ACCUM clauses. The first one iterates through s, and for each s, we do s.@cnt2 += s.@cnt1
. The second POST-ACCUM iterations through t.
which produces the result
However, the following is not allowed, since it involves two aliases (t and s) in one POST-ACCUM clause.
Also, you may not use more than one alias in a single assignment. The following is not allowed:
Example 1. Find the 3rd superclass of the Tag class whose name is "TennisPlayer".
You can copy the above GSQL script to a file named example1.gsql, and invoke this script file in a Linux shell.
Example 2. Find in which continents were the 3 most recent messages in Jan 2011 created.
You can copy the above GSQL script to a file named example2.gsql, and invoke this script file in a Linux shell.
Example 3. Find Viktor Akhiezer's favorite author of 2012 whose last name begins with character 'S'. Also find how many LIKES Viktor has given to the author's post or comment.
You can copy the above GSQL script to a file named example3.gsql, and invoke this script file in a Linux shell.
We have shown how complex multi-hop patterns, containing even a conjunctive of patterns, can be expressed in a single FROM clause of a single SELECT query. There are times, however, when it is better or necessary to write query as more than one SELECT block. This could be because of the need to do computation and decision matching in stages, to make the query easier to read, or to optimize performance.
Regardless of the reason, GSQL has always supported writing procedural queries containing multiple SELECT query blocks. Moreover, each SELECT statement outputs a vertex set. This vertex set can be used in the FROM clause of an subsequence SELECT block.
For example, if Set1, Set2, and Set3 were the outputs of three previous SELECT blocks in this query, then each of these FROM clauses can take place later in the query:
FROM Set1:x1 -(mh1)- :x2 -(mh2)- Set3:x3
FROM :x1 -(mh1)- :x2 -(mh2)- Set3:x3
FROM Set2:x1 -(mh1)- :x2 -(mh2)- Set2:x3
Example 1. Find Viktor Akhiezer's liked messages' authors, whose last name starts with letter S. Find these authors alumni count.
Example 2. Find Viktor Akhiezer's liked posts' authors A, and his liked comments' authors B. Count the common universities that both A and B have members studied at.
Example 3. Find Viktor Akhiezer's liked posts' authors A. See how many pair of persons in A that one person likes a message authored by another person.
Example 4. Find how many messages are created and liked by the same person whose first name begins with letter T.
We have demonstrated the basic pattern match syntax. You should have mastered the basics by this point. In this section, we show two end-to-end solutions using the pattern match syntax.
In this example, we want to recommend some messages (comments or posts) to the person Viktor Akhiezer.
How do we do this?
One way is to find Others who likes the same messages Viktor likes. And then recommend the messages that Others like but Viktor have not seen. The pattern is roughly like below
Viktor - (Likes>) - Message - (<Likes) - Others
Others - (Likes>) - NewMessage
Recommend NewMessage to Viktor
However, this is too fine granularity, and we are overfitting the message level data with a collaborative filtering algorithm. The intuition is that two persons are similar to each other when their "liked" messages fall into the same category (tag). This makes more sense and common than finding two persons that "likes" the same set of messages. As a result, one way to avoid this overfitting is to go one level above. That is, instead of finding common messages as a similarity base, we find common messages' tags as a similarity base. Person A and Person B are similar if they like messages that belong to the same tag. This scheme fixes the overfitting problem. In pattern match vocabulary, we have
Viktor - (Likes>) - Message - (Has>) - Tag - (<Has) - Message - (<Likes) - Others
Others - (Likes>) - NewMessage
Recommend NewMessage to Viktor
GSQL. RecommendMessage Application.
This time, we create the query first, and interpret the query by calling the query name with parameters. If we are satisfied with this query, we can use "install query queryName" to get the compiled query installed which has the best performance.
You can copy the above GSQL script to a file named app1.gsql, and invoke this script file in linux command line.
When you are satisfied with your query in the GSQL interpret mode, you can now install it as a generic service which has a much faster speed. Since we have been using "CREATE QUERY .." syntax, the query is added into the catalog, we can set the syntax version and install it.
The above use log-cosine as a similarity measurement. We can also use cosine similarity by using two persons liked messages.
This section includes some advanced features related to pattern match. It includes using the PER clause to fine control the ACCUM execution, DML support in pattern match, and the conjunctive pattern match syntax which allows multiple patterns in one FROM clause. We dedicate a subsection for each topic.
In this tutorial, we will show you how to write and run Pattern Matching queries. Pattern Matching is available in TigerGraph v2.4+.
We assume you have finished GSQL 101. If not, please complete GSQL 101 first.
This tutorial was updated for TigerGraph 3.0. If you are using an older version, please change to the documentation for that version.
A graph pattern is a traversal trace on the graph schema. A pattern can contain repeated steps. A pattern can be a linear trace, or a non-linear trace (tree, circle etc.). For example, imagine a simple schema consisting of a Person vertex type and a Friendship edge type. A pattern could be a trace on this simple schema,
or, use *2 to denote the two consecutive Friendship edges,
Pattern matching is the process of finding subgraphs in a data graph that conform to a given query pattern.
We assume you are running your own TigerGraph instance as the sole user with full privileges. If you are on a multiuser Enterprise Edition, consult with your DB administrator. You need to have Designer or Admin privilege on an empty graph. At various points in this tutorial, there are links to download files. Most are small, but the graph data file is 1GB when uncompressed.
First, let's check that you can access GSQL, and that your version is 3.0 or higher.
Open a Linux shell.
Type gsql
as below. A GSQL shell prompt should appear as below.
Type version in GSQL shell. It should show 2.4 or higher as below. If not, please download and install the latest version from https://www.tigergraph.com/download/
If the GSQL shell does not launch, try resetting the system with "gadmin start all". This will launch each service if they have not been started yet. If you need further help, please see the manage TigerGraph with gadmin, and TigerGraph Knowledge Base and FAQs.
You need to start from an empty data catalog. If necessary, run drop all
to clear the catalog first.
The following general use commands were introduced in GSQL 101.
The %
prefix indicates Linux shell commands. You need TigerGraph admin privilege to run most gadmin commands.
The GSQL>
prefix indicates GSQL shell commands.
Command
Description
% gsql
Enter the GSQL shell in interactive mode
% gsql '<GSQL command string>'
Run one GSQL command
% gadmin status
Check the status of TigerGraph services (If your graph store is empty, it is normal for some statuses to be flagged in red.)
% gadmin restart
Restart TigerGraph services
GSQL> ls
List the graph schema, loading jobs, and queries
GSQL> show user
Show your user name and roles
GSQL> drop all
Delete the entire schema, all data, all
jobs, and all queries
GSQL> exit
Exit GSQL interactive shell
Below, we use the GSQL loading language to define a loading job script, which encodes all the mappings from the source CSV file (generated by the LDBC SNB benchmark data generator) to our schema.
The script to load the LDBC-SNB data is below.
We have generated a data set with scale factor 1 (approximate 1GB). You can download it from https://s3-us-west-1.amazonaws.com/tigergraph-benchmark-dataset/LDBC/SF-1/ldbc_snb_data-sf1.tar.gz
After downloading the raw file, run the tar command below to decompress the downloaded file.
After decompressing the file, you will see a folder named "ldbc_snb_data". Within it, you will see two subfolders
social_network
substitution_parameters
The raw data is in the social_network folder.
Download setup_schema.gsql which combines the schema script and loading job script shown before.
Specify the environment variable LDBC_SNB_DATA_DIR to point to your raw file folder un-tarred in the previous section. In our example below, the raw data is in /home/tigergraph/ldbc_snb_data/social_network, so we use the export shell command to specify its location. Then, start your TigerGraph services if needed. Finally, run the setup_schema.gsql script to create your LDBC Social Network graph.
Download the loading job script and invoke it on the command line. #
After loading, you can check the graph's size using built-in REST endpoint calls.
Below we call two functions, stat_vertex_number and stat_edge_number to return the cardinality of each vertex and edge type.
A common pattern is the two-step "Friend of a Friend". Or, how many entities might receive a message if it is passed up to three times? Do you have any known change of connections to a celebrity?
GSQL pattern matching makes it easy to express such variable-length patterns which repeat a single-hop. Everything else stays the same as introduced in the previous section, except we append an asterisk (or Kleene star for you regular expressionists) and an optional min..max range to an edge pattern.
(E*) means edge type E repeats any number of times (including zero!)
(E*1..3) means edge type E occurs one to three times.
Below are more illustrative examples:
1-hop star pattern — repetition of an edge pattern 0 or more times
FROM X:x - (E*) - Y:y
FROM X:x - (F>*) - Y:y
FROM X:x - (<G*) - Y:y
FROM X:x - (_*) - Y:y
Any undirected edge can be chosen at each repetition.
FROM X:x - (_>*) - Y:y
Any right-directed edge can be chosen at each repetition.
FROM X:x - (<_*) - Y:y
Any left-directed edge can be chosen at each repetition.
FROM X:x - ((E|F>|<G)*) - Y:y
Either E, F> or <G can be chosen at each repetition.
1-hop star pattern with bounds
FROM X:x - (E*2..) - Y:y
Lower bounds only. There is a chain of at least 2 E edges.
FROM X:x - (F>*..3) - Y:y
Upper bounds only. There is a chain of between 0 and 3 F edges.
FROM X:x - (<G*3..5) - Y:y
Both Lower and Upper bounds. There is a chain of 3 to 5 G edges.
FROM X:x - ((E|F>|<G)*3) - Y:y
Exact bound. There is a chain of exactly 3 edges, where each edge is either E, F>, or <G.
No alias allowed for edge with Kleene star An edge alias may not be used when a Kleene star is used. The reason is that when there are a variable number of edges, we cannot associate or bind the alias to a specific edge in the pattern.
Shortest path semantics When an edge is repeated with a Kleene star, only the shortest matching occurrences are selected. See the example below:
In Figure 2, or Pattern 1 - (E>*) - 4
, any of the following paths reach 4 from 1.
1->2->3->4
1->2->3->5->6->2->3->4
any path that goes through the cycle 2->3->5->6->2 two or more times and jumps out at 3.
The first path is shorter than the rest; it is considered the only match.
In this tutorial, we will use the Interpreted Mode for GSQL, introduced in TigerGraph 2.4. Interpreted mode lets us skip the INSTALL step, and even run a query as soon as we create it, to offer a more interactive experience. These one-step interpreted queries are unnamed (anonymous) and parameterless, just like SQL.
Example 1. Find the direct or indirect superclass (including the self class) of the TagClass whose name is "TennisPlayer".
You can copy the above GSQL script to a file named example1.gsql, and invoke this script file in a Linux shell.
Note below that the starting vertex s, whose name is TennisPlayer, is also a match, using a path with zero hops.
Example 2. Find the immediate superclass of the TagClass whose name is "TennisPlayer". (This is equivalent to a 1-hop non-repeating pattern.)
You can copy the above GSQL script to a file named example2.gsql, and invoke this script file in a Linux shell.
Example 3. Find the 1 to 2 hops direct and indirect superclasses of the TagClass whose name is "TennisPlayer".
You can copy the above GSQL script to a file named example3.gsql, and invoke this script file in a Linux shell.
Example 4. Find the superclasses within 2 hops of the TagClass whose name is "TennisPlayer".
You can copy the above GSQL script to a file named example4.gsql, and invoke this script file in a Linux shell.
Example 5. Find the superclasses at least one hop from the TagClass whose name is "TennisPlayer".
You can copy the above GSQL script to a file named example5.gsql, and invoke this script file in a Linux shell.
Example 6. Find the 3 most recent comments that are liked or created by Viktor Akhiezer, and the total number of comments related to (created or liked by) Viktor Akhiezer.
You can copy the above GSQL script to a file named example6.gsql, and invoke this script file in a Linux shell.
‌Pattern matching by nature is declarative. It enables users to focus on specifying what they want from a query without worrying about the underlying query processing.‌
A pattern usually appears in the FROM clause, the most fundamental part of the query structure. The pattern specifies sets of vertex types and how they are connected by edge types. A pattern can be refined further with conditions in the WHERE clause. In this tutorial, we'll start with simple one-hop path patterns, and then extend it multi-hop patterns and finally multiple-path patterns.
Currently, pattern matching may only be used in read-only queries. DML support will be added in the near future release.
Pattern matching queries support nested queries
The easiest way to understand patterns is to start with a simple 1-Hop pattern. Even a single hop has several options. After we've tackled single hops, then we'll see how to add repetition to make variable length patterns and how to connect single hops to form bigger patterns.
In pattern matching, we use the punctuation -( )-
to denote a 1-hop pattern, where the edge type(s) is enclosed in the parentheses ()
and the hyphens -
symbolize connection without specifying direction. Instead, directionality is explicitly stated for each edge type.
For an undirected edge E, no added decoration: E
For a directed edge E from left to right, use a suffix: E>
For a directed edge E from right to left, use a prefix: <E
‌For example, in the LDBC SNB schema, there are two directed relationships between Person and Message: person LIKES message, and message HAS_CREATOR person. Despite the fact that these relationships are in opposite directions, we can include both of them in the same pattern very concisely using an alternation separator |
:
The underscore _
is a wildcard meaning any edge type. Arrowheads are still used to indicate direction, e.g., _>
or <_
or _
The empty parentheses ()
means any edge, directed or undirected.
Prior to TigerGraph 3.0, the source (leftmost) vertex set needed to be defined as an explicit set, prior to the SELECT statement. A typical approach is shown here.
Beginning in TigerGraph 3.0, SYNTAX V2 treats the source vertex set the same as the target vertex set. That is, the source or the target vertex set may be:
a vertex type
SELECT t FROM Person:s -(IS_LOCATED_IN>:e) - City:t
an alternation of vertex types
SELECT t FROM (Post|Comment):s -(IS_LOCATED_IN>:e) - Country:t
omitted, with only an alias, meaning any vertex type
SELECT s FROM :s -(IS_LOCATED_IN>:e) - Country:t
omitted, without an alias, meaning any vertex type
SELECT t FROM -(IS_LOCATED_IN>:e) - Country:t
Performance may be better when types are explicitly provided.
FROM X:x - (E1:e1) - Y:y
E1 is an undirected edge, x and y bind to the end points of E1, and e1 is the alias of E1.
FROM X:x - (E2>:e2) - Y:y
Right directed edge x binds to the source of E2; y binds to the target of E2.
FROM X:x - (<E3:e3) - Y:y
Left directed edge; y binds to the source of E3; x binds to the target of E3.
FROM X:x - (_:e) - Y:y
Any undirected edge between a member of X and a member of Y.
FROM X:x - (_>:e) - Y:y
Any right directed edge with source in X and target in Y.
FROM X:x - (<_:e) - Y:y
Any left directed edge with source in Y and target in X.
FROM X:x - ((<_|_):e) - Y:y
Any left directed or any undirected; "|" means OR, and parentheses enclose the group of edge descriptors; e is the alias for the edge pattern (<_|_).
FROM X:x - ((E1|E2>|<E3):e) - Y:y
Any one of the three edge patterns.
FROM X:x - () - Y:y
any edge (directed or undirected)
Same as (<_|_>|_)
To use pattern matching, you need to either set a session parameter or specify it in the query. There are currently two syntax versions for queries:
"v1" is the classic syntax, traversing one hop per SELECT statement. This is the default mode.
"v2" enhances the v1 syntax with pattern matching.
You can use the SET command to assign a value to the syntax_version session parameter: v1 for classic syntax; v2 for pattern matching. If the parameter is never set, the classic v1 syntax is enabled. Once set, the selection remains valid for the duration of the GSQL client session, or until it is changed with another SET command.
You can also select the syntax by using the SYNTAX clause in the CREATE QUERY statement: v1 for classic syntax (default); v2 for pattern matching. The query-level SYNTAX option overrides the syntax_version session parameter.
In this tutorial, we will use Interpreted Mode for GSQL, introduced in TigerGraph 2.4. Interpreted mode lets us skip the INSTALL step, and even run a query as soon as we create it, to offer a more interactive experience. These one-step interpreted queries are unnamed (anonymous) and parameterless, just like SQL.
To run an anonymous query, replace the keyword CREATE with INTERPRET. Remember, no parameters:
Recommendation: Increase the query timeout threshold.
Interpreted queries may run slower than installed queries, so we recommend increasing the query timeout threshold:
Example 1. Find persons who know the person named "Viktor Akhiezer" and return the top 3 oldest such persons.
Syntax Enhancement in TigerGraph 3.0+
In Example 1, "FOR GRAPH ldbc_snb" is not used after () in the query signature. It's an optional component in 3.0+ when "USE GRAPH graphName" is used; Or from the command line, "gsql -g graphName " precedes any query invocation.
In the FROM clause, we directly use vertex type Person as the starting vertex set. This syntax enhancement is available in syntax V2 only.
You can copy the above GSQL script to a file named example1.gsql and invoke this script file in Linux.
Example 2. Find the total number of comments and total number of posts liked by Viktor. A Person can reach Comments or Posts via a directed edge LIKES.
You can copy the above GSQL script to a file named example2.gsql, and invoke this script file in Linux.
Example 3. Solve the same problem as in Example 2, but use a left-directed edge pattern.
Note below (line 8) that the source vertex set are now Comment and Post, and the target is Person.
You can copy the above GSQL script to a file named example3.gsql, and invoke this script file in linux command line. The output should be the same as in Example 2.
Example 4. Find Viktor Akhiezer's total number of related comments and total number of related posts. That is, a comment or post is either created by Viktor or is liked by Viktor. Note that the HAS_CREATOR edge type starts from Comment|Post, and the LIKES edge type starts from Person.
You can copy the above GSQL script to a file named example4.gsql, and invoke this script file in Linux:
Example 5. Find the total number of comments or posts related to "Viktor Akhiezer". This time, we count them together and, we use wildcard "_" to represent the two types of edges: HAS_CREATOR and LIKES_REVERSE. Both are following the same direction.
You can copy the above GSQL script to a file named example5.gsql, and invoke this script file in Linux:
In classic GSQL queries, described in , we used the punctuation -( )->
in the FROM clause to indicate a 1-hop query, where the arrow specifies the vertex flow from left to right, and ( )
encloses the edge types.
We have covered a lot of territory in GSQL 102:
Showed how to invoke GSQL Pattern Matching syntax
Explained how Pattern Matching extends the classic FROM clause grammar:
Each hop can be a choice of multiple, individually directed edge types
The Kleene star and a min...max range enable each hop to be repeated.
GSQL automatically finds the shortest paths that satisfy a variable length path.
A virtual match table has a column for each vertex or edge alias in a multi-hop path, and a row for each graph path that satisfies the pattern.
The ACCUM clause iterates on each row in the match table.
A POST-ACCUM clause iterates on one vertex alias; a query can have multiple POST-ACCUM clauses.
Described the improvements to Pattern Matching in TigerGraph 3.0:
The source (leftmost) vertex set can be specified with the same flexibility as the other vertex sets: a vertex type, an alteration of types, or omitted. Explicit seed sets are no longer needed
Restrictions on which vertex aliases may be used in the ACCUM clause have been lifted.
Described three major advanced options:
The PER <vertex alias> clause enables users to fine tune the ACCUM iteration.
Data modification (insert, update, delete) are now supported.
Conjunctive Pattern Matching let users express a complex pattern as a set of path patterns which must all be satisfied.
Provided best practices for writing queries, especially pattern matching queries:
Put the smaller vertex set on the left end.
Specify all vertex and edge types explicitly.
Use the PER clause to reduce the match table size
Provided numerous examples and the full set of LDBC Social Network benchmark queries.
With a little practice, you will be writing GSQL pattern matching queries to efficiently solve real-world problems. You can post your feedback and questions on the GSQL community forum. Our community members and developers love to hear any feedback from your graph journey of using GSQL and are ready to help clarifying any doubts.
Pattern Matching GSQL supports Insert, Update, and Delete operations. The syntax is identical to that in classic GSQL (v1), though the full range of data modification operations are not yet support.
In general, data modification can be at two levels in GSQL:
Top level. The statement does not need to within any other statement.
Within a SELECT query statement. The FROM-WHERE clauses define a match table, and the data modification is performed based on the vertex and edge information in the match table. The GSQL specifications calls these within-SELECT statements DML-sub statements.
Insert, Update, and Delete currently work in compiled mode only (e.g., you must run INSTALL QUERY before RUN QUERY.) Data Modification in interpreted mode is not yet available.
SELECT queries with data modification may only have one POST-ACCUM clause.
Pattern matching Insert is supported at both the top-level and within-SELECT levels, using the same syntax as in classic GSQL. You can insert vertices and edges.
For a top-level statement, use INSERT INTO,
Inside an ACCUM or POST-ACCUM clause, use the DML-sub INSERT statement.
Example 1. Create a Person vertex, whose name is Tiger Woods. Next, find Viktor's favorite 2012 posts' authors, whose last name is prefixed with S. Finally, insert KNOWS edges connecting Tiger Woods with Viktor's favorite authors.
You can verify the result by running a simple built-in REST endpoint.
Check the inserted vertex.
Check the inserted edges.
Top-level UPDATE statements are not yet supported in syntax v2.
Vertex attributes can only be updated in POST-ACCUM clause, and edge attributes can only be updated in the ACCUM clause.
To perform within-SELECT updates, the FROM pattern can only be a single hop, fixed length pattern.
Example 2. For all KNOWS edges that connect Viktor Akhiezer and his friends whose lastName begins with "S", update the edge creationDate to "2020-10-01". Also, for the Person vertex (Tiger Woods) update the vertex's creationDate and language he speaks.
To verify the update, we can use REST calls.
Check Tiger Woods' creationDate and language he speaks.
Check KNOWS edges whose source is tiger woods.
You can use delete () function to delete edges and vertices in ACCUM and POST-ACCUM clauses.
Top-levels DELETE statements are not yet supported in SYNTAX v2.
Edges can only be deleted in the ACCUM clause.
For best performance, vertices should be deleted in the POST-ACCUM clause.
To perform within-SELECT deletes, the FROM pattern can only be a single hop, fixed length pattern.
Example 3. Delete vertex Tiger Woods and its KNOWS edges.
To verify the result, you can use built-in REST calls.
So far, we have described pattern matching as one path pattern in a FROM clause. In this section, we introduce GSQL's capability to match multiple patterns in one FROM clause. This extension is called Conjunctive Pattern Matching (CPM), because the query asks for the conjunction (logical AND) of the patterns. To get a match, all of the patterns must be satisfied, and the patterns can interrelate. Visually, you can think of patterns formed by a set of intersecting line segments. This feature, introduced as a Beta feature in TigerGraph 3.0, enables you to express complex patterns concisely in a single query block.
In general, a CPM query block consists of multiple patterns in the FROM clause. It has a structure illustrated below.
We elaborate on each of the clause.
The SELECT clause selects only one vertex alias from all the patterns in the FROM clause.
This is where the conjunctive matching is expressed. The FROM clause consists of a list of path patterns, which are separated by commas. Evaluating each pattern against the underlying graph data produces a match table. If two patterns share a vertex alias, then we form the natural join of the two match tables.
For example, consider this CPM:
The first pattern's variables are x, e1, y, e2, and z; the second pattern's variables are z, e3, u, e4, and v. Considering the two patterns independently would yield the follwing match table schemas:
Natural joining two match tables compares all the shared vertex aliases between the two tables, and the resulting joined table contains all non-shared variables plus one copy of each of the shared vertex variables. Here is the match table for the CPM above:
The match table of the conjunctive pattern match is the natural join of all the patterns' match tables. By design, a row in the CPM match table must simultaneously satisfy all the match tables.
If the match tables of the patterns in a FROM clause can be naturally joined into one match table, then the FROM clause has a valid CPM input. Otherwise, the FROM clause has an invalid pattern input list.
For example, below we show two valid CPM inputs and one invalid CPM input.
The predicates in the WHERE clause can use any of the vertex or edge aliases in any of the patterns, including predicates which combine variables from different constituent paths. CPM queries do not have any special restrictions on the WHERE predicate. Distance matters, however, for performance. Conditions that are local, measured both cross-path and within-path, can be resolved earlier and therefore are faster.
In the example below, x2.age > x4.age
is a cross-pattern predicate, e1.timestamp < e3.timestamp
is a cross-pattern predicate, and x1.gender == x4.gender
is a local predicate of the second pattern.
You can ACCUM to any vertex variable in a CPM block.
The ACCUM clause by default will execute as many times as the row (match) count of the CPM match table; each execution uses one row from the match table.
POST-ACCUM for CPM behaves the same as POST-ACCUM for single path patterns. That is, each POST-ACCUM clause can refer to one vertex alias, and the clause executes iteratively over that vertex set (e.g. one vertex column in the matching table). Its statements can access the aggregated accumulator result computed in the ACCUM clause. The query can have multiple POST-ACCUM clauses, one for each vertex alias you wish to work on. The multiple POST-ACCUM clauses are processed in parallel; it doesn't matter in what order you write them. (For each binding, the statements within a clause are executed in order.)
For example, below we have three POST-ACCUM clauses. The first one iterates throughx1
, and for each x1, we do @@cnt += x1.@cnt
. The second and third POST-ACCUMs iterate through x2
and x3
respectively, and accumulates their @cnt
accumulator value into @@cnt
.
Example 1. Find Viktor Akhiezer's liked messages (100+ days after their creation) whose author's last name begin with letter S. Output the message's forum.
Example 2. Find any authors who wrote posts that Viktor Akhiezer's liked and whose last name begins with S. Find the country for each of these authors and report on the countries.
Example 3. Given a TagClass and a Country, find all the Forums created in the given Country, containing at least one Post with Tags belonging directly to the given TagClass. The location of a Forum is identified by the location of the Forum’s moderator.
Example 4. For a given country, count all the distinct triples of Persons such that:
a is a friend of b.
b is a friend of c
c is a friend of a.
Distinct means that if a certain 3 vertices appear once in the results, it will not be repeated: it will appear only once. KNOWS is an undirected relationship, so it doesn't matter in what order we list the 3 vertices.
More Examples. We translated LDBC-SNB BI and IC queries using CPM, and shared the translation in github. Please refer to the query translation here. Most of the queries are installed as functions, you can find sample parameter(s) of the functions from here.
As mentioned when we first described pattern matching, in One-hop patterns, the source (leftmost) vertex set can be a vertex type, an alternation of types, or even omitted.
Example 1. Find Viktor Akhiezer's favorite messages' creators whose last name begins with letter S. Count them.
Example 2. Same query as example 1, but without beginning with vertex types. GSQL compiler can infer the types of :s.
Example 3. Count the LIKES edge.
Pattern matching produces a virtual match table, and the ACCUM clause acts like a FOREACH loop, executing the clause's statement once for each row of the match table.
Patterns are paths in the graphs, and each row in the match table is a distinct path. However, paths may share some vertices or edges. Some applications do not want to do aggregations per path. Instead, they want to execute the ACCUM clause per distinct group of vertex aliases.
For example, consider the following query which counts the number of paths in a simple 2-hop pattern:
Suppose the query produces the following match table.
By default, the ACCUM clause will execute the @@cnt += 1
statement 4 times, for each row in the match table. The result will be @@cnt = 4
.
For the same query, what if the user wants to
count the number of distinct path endings in the match table? For this case, we would want to iterate on the alias t
.
count the number of distinct (start, end) pairs in the match table? For that case, we would want to iterate on distinct pairs of the aliases (s, t)
.
To provide users with this added flexibility and finer control over ACCUM iteration, TigerGraph 3.0 adds the PER clause to pattern matching (V2) syntax.
The PER Clause is an optional clause that comes at the start of the ACCUM clause in a SELECT statement. As illustrated below, it starts with the keyword PER, and followed by a pair of parenthesis, in which user can put one or more distinct vertex aliases found in the FROM pattern.
Examples. Below are multiple examples of the PER Clause using the same FROM clause.
The PER Clause specifies a list of vertex aliases, which are used to group the rows in the match table, one group per distinct value of the alias or of the alias list. If there are N distinct groups, we will execute the ACCUM clause N times, once per distinct vertex aliases' binding. Note that the PER clause has no effect on POST-ACCUM clauses semantic, except confining the POST-ACCUM vertex alias.
Suppose s, m, and t are vertex aliases in a pattern. Below are some interpretations of the PER Clause based on the graph element bindings found in the match table.
PER (s) ACCUM
means that per each distinct s vertex, execute the ACCUM clause once.
PER (s,t) ACCUM
means that per each distinct (s, t) pair, execute the ACCUM clause once.
PER (s,m,t) ACCUM
means that per each distinct (s, m, t) tuple, execute the ACCUM clause once.
Examples to show PER clause semantics.
If the PER Clause is used in a SELECT query block, then the vertex aliases used in the SELECT, ACCUM , and POST-ACCUM clauses must be confined to the aliases that appear in the PER clause.
Below are some illegal cases.
Example 1. Count the number of Countries that has a City which has a resident that likes a post.
Example 2. Count the number of posts liked by a person who is located in a city that belongs to a country. (All cities are in a country, but humor us. We are reusing the same FROM pattern in several examples.)
Example 3. Find for each country in ("Dominican_Republic","Angola", "Cambodia") the number of posts that is liked by a person living in that country.
Example 4. Find for each country in ("Dominican_Republic","Angola", "Cambodia") the number of posts that is liked by a person living in that country. Use local accumulators this time.
The PER Clause not only helps users to control the semantics of the ACCUM clause, it also boosts the performance of the pattern match query, as it uses the PER clause to optimize the query execution.
To get the best performance, we recommend three guidelines for writing efficient queries.
Per target is in general faster than Per source. In the example below, query q2 is faster than q1. The only difference between these two queries is q2's FROM pattern is the flip of q1's FROM pattern.
The match table is built by traversing the pattern from left to right. Follow the basic principle of pruning early rather than late by orienting the query the smaller cardinality set on the left. This practice will result in producing the least number of candidate matches during the query computation. For example, if there are fewer distinct tags than persons, then query q4 is faster than q3.
Specifying complete type information improves performance. For example, query q6 is faster than q5 even though they are known to be logically identical. Forum
is the CONTAINER_OF
Post
, so it does not need to be specified in q5, but explicitly saying Forum
in q6 speeds up performance.
Using the PER clause and linear regular path pattern, we have translated all of the LDBC-SNB queries. You can find them on github at https://github.com/tigergraph/ecosys/tree/ldbc/ldbc_benchmark/tigergraph/queries_linear/queries. Most of the queries are installed as functions. You can find sample parameter(s) of the functions from https://github.com/tigergraph/ecosys/tree/ldbc/ldbc_benchmark/tigergraph/queries/seeds.