Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
In this tutorial, we will show you how to write and run Pattern Matching queries. Pattern Matching is available in TigerGraph 2.4+.
We assume you have finished GSQL 101. If not, please complete GSQL 101 first.
Pattern is a traversal trace on the graph schema. For repetitive traversal on the schema, we can use some regular expression to represent the repeating step(s). A pattern can be a linear trace, or a non-linear trace (tree, circle etc.). For example, imagine a simple schema consisting of a Person vertex type and a Friendship edge type. A pattern could be a trace on this simple schema,
or, use *2 to denote the two consecutive Friendship edges,
Pattern matching is the process of finding subgraphs in a data graph that conforms to a given query pattern.
We're assuming you are running Developer Edition as the sole user with full privileges. If you are on a multiuser Enterprise Edition, consult with your DB administrator. You need to have Designer or Admin privilege on an empty graph. There are also links to download files at various points in the tutorial. Most are small, but the graph data file is 1GB when uncompressed.
First, let's check that you can access GSQL, and that your version is 2.4 or higher.
Open a Linux shell.
Type gsql as below. A GSQL shell prompt should appear as below.
Type version in GSQL shell. It should show 2.4 or higher as below. If not, please download and install the latest developer version from https://www.tigergraph.com/download/
If the GSQL shell does not launch, try resetting the system with "gadmin start". This will take some time to launch each service if they have not been started yet. If you need further help, please see the TigerGraph Knowledge Base and FAQs.
You need to start from an empty data catalog. If necessary, run "drop all" to clear the catalog first.
The following general use commands were introduced in GSQL 101.
The % prefix indicates Linux shell commands. You need TigerGraph admin privilege to run most gadmin commands.
The GSQL> prefix indicates GSQL shell commands.
Command
Description
% gsql
Enter the GSQL shell in interactive mode
% gsql '<GSQL command string>'
Run one GSQL command
% gadmin status
Check the status of TigerGraph services (If your graph store is empty, it is normal for some statuses to be flagged in red.)
% gadmin restart -fy
Force all TigerGraph services to restart
GSQL> ls
List the graph schema, loading jobs, and queries
GSQL> show user
Show your user name and roles
GSQL> drop all
Delete the entire schema, all data, all
jobs, and all queries
GSQL> exit
Exit GSQL interactive shell
Below, we use GSQL loading language to define a loading job script, which encodes all the mappings from the source csv file from the LDBC SNB benchmark data generator to our schema.
You can download the below loading script from here.
We have generated scale-factor 1 data set (approximate 1GB). You can download it from https://s3-us-west-1.amazonaws.com/tigergraph-benchmark-dataset/LDBC/SF-1/ldbc_snb_data-sf1.tar.gz
After downloading the raw file, you can run tar command below to decompress the downloaded file.
After decompressing the file, you will see a folder named "ldbc_snb_data". Enter it, you will see two subfolders
social_network
substitution_parameters
The raw data is under the social_network folder.
Download the setup_schema.gsql file, and run the script in the shell command line to setup the schema and the loading job.
Setup the environment variable LDBC_SNB_DATA_DIR pointing to your raw file folder un-tarred in the previous section. In the example below, the raw data is in /home/tigergraph/ldbc_snb_data/social_network. Note, the folder should have the name social_network.
Download the loading job script and invoke it on the command line.
After loading, you can check the graph's size using one of the options of the administrator tool, gadmin
. From a Linux shell, enter the command
gadmin status graph -v
You should see VertexCount: 3,181,724 and EdgeCount 34,512,076.
Pattern matching by nature is declarative. It enables users to focus on specifying what they want from a query without worrying about the underlying query processing.
A pattern usually appears in the FROM clause, the most fundamental part of the query structure. The pattern specifies sets of vertex types and how they are connected by edge types. A pattern can be refined further with conditions in the WHERE clause. In this tutorial, we'll focus on the linear pattern.
Currently, pattern matching may only be used in read-only queries.
The easiest way to understand patterns is to start with a simple 1-Hop pattern. Even a single hop has several options. After we've tackled single hops, then we'll see how to add repetition to make variable length patterns and how to connect single hops to form bigger patterns.
In classic GSQL queries, described in GSQL 101, we used the punctuation -( )->
in the FROM clause to indicate a 1-hop query, where the arrow specifies the vertex flow from left to right, and ( )
encloses the edge types.
In pattern matching, we use the punctuation -( )-
to denote a 1-hop pattern, where the edge type(s) is enclosed in the parentheses ()
and the hyphens -
symbolize connection without specifying direction. Instead, directionality is explicitly stated for each edge type.
For an undirected edge E, no added decoration: E
For a directed edge E from left to right, use a suffix: E>
For a directed edge E from right to left, use a prefix: <E
For example, in the LDBC SNB schema, there are two directed relationships between Person and Message: person LIKES message, and message HAS_CREATOR person. Despite the fact that these relationships are in opposite directions, we can include both of them in the same pattern very concisely:
The underscore _
is a wildcard meaning any edge type. Arrowheads are still used to indicate direction, e.g., _>
or <_
or _
The empty parentheses ()
means any edge, directed or undirected.
FROM X:x - (E1:e1) - Y:y
E1 is an undirected edge. x and y bind to the end points of E1. e1 is the alias of E1.
FROM x - (E2>:e2) - Y:y
Right directed edge, x binds to the source of E2, y binds to the target of E2.
FROM X:x - (<E3:e3) - Y:y
Left directed edge, y binds to the source of E3, x binds to the target of E3.
FROM X:x - (_:e) - Y:y
Any undirected edge between a member of X and a member of Y.
FROM X:x - (_>:e) - Y:y
Any right directed edge with source in X and target in Y.
FROM X:x - (<_:e) - Y:y
Any left directed edge with source in Y and target in X.
FROM X:x - ((<_|_):e) - Y:y
Any left directed or any undirected. "|" means OR, and parentheses enclose the group of edge descriptors. e is the alias for the edge pattern (<_|_).
FROM X:x - ((E1|E2>|<E3):e) - Y:y
Any one of the three edge patterns.
FROM X:x - () - Y:y
any edge (directed or undirected)
Same as (<_|_>|_)
To use the pattern match syntax, you need to either set a session parameter or specify it in the query. There are currently two syntax versions for queries:
"v1" is the classic syntax, traversing one hop per SELECT statement. This is the default mode.
"v2" enhances the v1 syntax with pattern matching.
You can use the SET command to assign a value to the syntax_version session parameter: v1 for classic syntax; v2 for pattern matching. If the parameter is never set, the classic v1 syntax is enabled. Once set, the selection remains valid for the duration of the GSQL client session, or until it is changed with another SET command.
You can also select the syntax by using the new SYNTAX option in the CREATE QUERY statement: v1 for classic syntax (default); v2 for pattern matching. The Query-Level SYNTAX option overrides the syntax_version session parameter.
CHANGE ADVISORY
The punctuation used with the SYNTAX keyword was streamlined, from
CREATE QUERY <query_name><parameters> FOR GRAPH <graph_name>
SYNTAX ("v2")
# original version, TigerGraph 2.4.0
to
CREATE QUERY <query_name><parameters> FOR GRAPH <graph_name>
SYNTAX v2
# final version, since TigerGraph 2.4.1
In this tutorial, we will use the new Interpreted Mode for GSQL, also introduced in TigerGraph 2.4. Interpreted mode lets us skip the INSTALL step, and even to run a query as soon as we create it, to offer a more interactive experience. These one-step interpreted queries are unnamed (anonymous) and parameterless, just like SQL.
To send an anonymous query to the interpret engine, replace the keyword CREATE with INTERPRET. Remember, no parameters:
Recommendation: Increase the query timeout threshold.
Interpreted queries may run slower than installed queries, so we recommend increasing the query timeout threshold:
Example 1. Find persons who know the person named "Viktor Akhiezer" and return the top 3 oldest such persons.
You can copy the above GSQL script to a file named example1.gsql and invoke this script file in Linux.
Example 2. Do the same as Example 1, but use a right-directed edge pattern.
You can copy the above GSQL script to a file named example2.gsql, and invoke this script file in Linux.
The output should be the same as example1's output.
Example 3. Find Viktor Akhiezer's total number of comments, total number of posts, and total number of persons he knows. A Person can reach Comments, Posts and other Persons via a directed edge.
You can copy the above GSQL script to a file named example3.gsql, and invoke this script file in Linux.
Example 4. Do the same as Example 3, but use a left-directed edge pattern.
Note below (line 10) that the Seed is now {Person.*, Comment.*, Post.* }, the three types of entities that are targets of edges from a Person.
In the current version, the vertex set on the left side of the pattern must be defined in a previous statement (e.g., a seed statement), the same requirement as in v1 syntax FROM clauses. In the example below, the current version of pattern matching would not permit FROM _:s -(<:e) - Person:tgt
You can copy the above GSQL script to a file named example4.gsql, and invoke this script file in linux command line. The output should be the same as in Example 3.
Example 5. Find the two oldest persons who either know "Viktor Akhiezer" or are known by "Vicktor Akhiezer". KNOWS is a directed relationship, so we need to include both directions in the pattern.
You can copy the above GSQL script to a file named example5.gsql, and invoke this script file in Linux:
Example 6. Find the total comments or posts created by "Viktor Akhiezer". Again, we include two types of edges, but in this case, we count them together.
You can copy the above GSQL script to a file named example6.gsql, and invoke this script file in Linux:
We have demonstrated the basic pattern match syntax. You should have mastered the basics by this point. The next step is to see more examples and practice more.
In this example, we want to recommend some messages (comments or posts) to the person Viktor Akhiezer.
How do we do this?
One way is to find Others who likes the same messages Viktor likes. And then recommend the messages that Others like but Viktor have not seen. The pattern is roughly like below
Viktor - (Likes>) - Message - (<Likes) - Others
Others - (Likes>) - NewMessage
Recommend NewMessage to Viktor
However, this is too fine granularity, and we are overfitting the message level data with a collaborative filtering algorithm. The intuition is that two persons are similar to each other when their "liked" messages fall into the same category (tag). This makes more sense and common than finding two persons that "likes" the same set of messages. As a result, one way to avoid this overfitting is to go one level above. That is, instead of finding common messages as a similarity base, we find common messages' tags as a similarity base. Person A and Person B are similar if they like messages that belong to the same tag. This scheme fixes the overfitting problem. In pattern match vocabulary, we have
Viktor - (Likes>) - Message - (Has>) - Tag - (<Has) - Message - (<Likes) - Others
Others - (Likes>) - NewMessage
Recommend NewMessage to Viktor
GSQL. RecommendMessage Application.
This time, we create the query first, and interpret the query by calling the query name with parameters. If we are satisfied with this query, we can use "install query queryName" to get the compiled query installed which has the best performance.
You can copy the above GSQL script to a file named app1.gsql, and invoke this script file in linux command line.
When you are satisfied with your query in the GSQL interpret mode, you can now install it as a generic service which has a much faster speed. Since we have been using "CREATE QUERY .." syntax, the query is added into the catalog, we can set the syntax version and install it.
The above use log-cosine as a similarity measurement. We can also use cosine similarity by using two persons liked messages.
We have made available all LDBC SNB benchmark queries translated into GSQL pattern matching syntax. The benchmark queries are described in Sections 4 and 5 of the LDBC Social Network Benchmark Reference.
You can find our GSQL pattern match translations of these queries on Github: https://github.com/tigergraph/ecosys/tree/master/tools/ldbc_benchmark/tigergraph/queries_pattern_match
Note we use CREATE/INSTALL/RUN QUERY instead of INTERPRET QUERY so that these queries can be optimized and installed as REST services. There are three sets of queries:
Also, you may want to use the GraphStudio UI to help you visualize and explore the graph and to try your hand at writing your own queries.
Last but not least, join GSQL community forum, discuss and get help from fellow GSQL users and GSQL developers.
Repeating the same hop is useful sometimes, but the real power of pattern matching comes from expressing multi-hop patterns, with specific characteristics for each hop. For example, the well-known product recommendation phrase "People who bought this product also bought this other product", is expressed by the following 2-hop pattern:
As you see, a 2-hop pattern is a simple concatenation and merging of two 1-hop patterns where the two patterns share a common endpoint. Below, Y:y is the connecting end point.
Similarly, a 3-hop pattern concatenates three 1-hop patterns in sequence, each pair of adjacent hops sharing one end point. Below, Y:y and Z:z are the connecting end points.
In general, we can connect n 1-hop patterns into a n-hop pattern. The database will search the graph topology to find subgraphs that match this n-hop pattern.
A multi-hop pattern has two endpoint vertex sets and one or more intermediate vertex sets. If the query does not need to express any conditions for an intermediate vertex set, then the vertex set can be omitted and the two surrounding edge sets can be joined with a simple "." For example, in the 2-hop pattern example above, if we did not need to specify that the intermediate vertex type is Y, nor need to refer to that vertex set in any of the query's other clauses (such as WHERE or ACCUM), then the pattern can be reduced as follows:
If a pattern has a Kleene star to repeat an edge, GSQL pattern matching selects only the shortest paths which match the pattern. If we did not apply this restriction, computer science theory tells us that the computation time could be unbounded or extreme (NP, to be technical). If we instead matched ALL paths regardless of length when a Kleene star is used without an upper bound, there could be an infinite number of matches, if there are loops in the graph. Even without loops or with an upper bound, the number of paths to check grows exponentially with the number of hops.
For the pattern 1 - (_*) - 5
in Figure 3 above, you can see the following:
There are TWO shortest paths: 1-2-3-4-5 and 1-2-6-4-5
These have 4 hops, so we can stop searching after 4 hops. This makes the task tractable.
If we search for ALL paths which do not repeat any vertices:
There are THREE non-repeated-vertex paths: 1-2-3-4-5, 1-2-6-4-5, and 1-2-9-10-11-12-4-5
The actual number of matches is small, but number of paths to consider is NP.
If we search for ALL paths which do not repeat any edges:
There are FOUR non-repeated-edge paths: 1-2-3-4-5, 1-2-6-4-5, 1-2-9-10-11-12-4-5, and 1-2-3-7-8-3-4-5
The actual number of matches is small, but number of paths to consider is NP.
If we search for ALL paths with no restrictions:
There are infinite matches, because we can go around the 3-7-8-3 cycle any number of times.
Each vertex set or edge set in a pattern (except edges with Kleene stars) can have an alias variable associated with it. When the query runs and finds matches, it associates, or binds, each alias to the matching vertices or edges in the graph.
TigerGraph 2.4 has certain restrictions on how accumulators and aliases can be used. Some or all of these restrictions will be lifted in future releases. We elaborate on them below.
The SELECT clause specifies the output vertex set of a SELECT statement. For a multiple-hop pattern, we can only select one of the two endpoints of the pattern. None of the intermediate aliases can be selected. The example below shows the two possible choices for the given pattern:
For a multiple-hop pattern, if you don't need to refer to the intermediate vertex points, you can just use "." to connect the edge patterns, giving a more succinct representation. For example, below we remove y and z, and connect E2, <E3 and E4 using the period symbol. Note that you cannot have an alias for a multi-hop sequence like E2>.<E3.E4.
In a multiple-hop pattern, the WHERE clause is a conjunction of local hop predicates (conditions), with the exception that the starting end point can appear in the last hop local predicate.
Consider a pattern
Local hop predicate means a predicate can refer to a single hop’s alias (x_i, e_i, x_(i+1)) only.
Last hop local predicate can refer to the pattern's starting end point. I.e. (x1, x3, e3, x4).
A WHERE clause is a conjunction (AND) of local hop predicates.
Kleene Star breaks local hop predicate. When a local hop's edge has kleene star, we cannot compose a local predicate using the local alias'.
In the current version of GSQL, only certain parts of a pattern (and their corresponding aliases) are available in ACCUM and POST-ACCUM clauses. Refer to the example pattern and its highlighted parts in Figure 4 below.
Accumulators can only be attached to the pattern's endpoints (x and u in the figure). The accumulation statements may only access data from either the left endpoint (x) or the rightmost hop (z, e4, and u). Below is an example of a valid ACCUM clause.
In the POST-ACCUM clause, only the pattern's endpoints can be accessed.
For queries in Distributed mode, accumulators may only be on the right endpoint (x in the figure).
The example below shows a valid ACCUM clause:
Example 1. Find the 3rd superclass of the Tag class whose name is "TennisPlayer".
You can copy the above GSQL script to a file named example1.gsql, and invoke this script file in a Linux shell.
Example 2. Find in which continents were the 3 most recent messages in Jan 2011 created.
You can copy the above GSQL script to a file named example2.gsql, and invoke this script file in a Linux shell.
Example 3. Find Viktor Akhiezer's favorite author of 2012 whose last name begins with character 'S', and how many LIKES Viktor give to the author's post or comment.
You can copy the above GSQL script to a file named example3.gsql, and invoke this script file in a Linux shell.
A common pattern is the two-step "Friend of a Friend". Or, how many entities might receive a message if it is passed up to three times? Do you have any known change of connections to a celebrity?
GSQL pattern matching makes it easy to express such variable-length patterns which repeat a single-hop. Everything else stays the same as introduced in the previous section, except we append an asterisk (or Kleene star for you regular expressionists) and an optional min..max range to an edge pattern.
(E*) means edge type E repeats any number of times (including zero!)
(E*1..3) means edge type E occurs one to three times.
Below are more illustrative examples:
1-hop star pattern — repetition of an edge pattern 0 or more times
FROM X:x - (E1*) - Y:y
FROM X:x - (E2>*) - Y:y
FROM X:x - (<E3*) - Y:y
FROM X:x - (_*) - Y:y
Any undirected edge can be chosen at each repetition.
FROM X:x - (_>*) - Y:y
Any right-directed edge can be chosen at each repetition.
FROM X:x - (<_*) - Y:y
Any left-directed edge can be chosen at each repetition.
FROM X:x - ((E1|E2>|<E3)*) - Y:y
Either E1, E2> or <E3 can be chosen at each repetition.
1-hop star pattern with bounds
FROM X:x - (E1*2..) - Y:y
Lower bounds only. There is a chain of at least 2 E1 edges.
FROM X:x - (E2>*..3) - Y:y
Upper bounds only. There is a chain of between 0 and 3 E2 edges.
FROM X:x - (<E3*3..5) - Y:y
Both Lower and Upper bounds. There is a chain of 3 to 5 E3 edges.
FROM X:x - ((E1|E2>|<E3)*3) - Y:y
Exact bound. There is a chain of exactly 3 edges, where each edge is either E1, E2>, or <E3.
No alias allowed for edge with Kleene star An edge alias may not be used when a Kleene star is used. The reason is that when there are a variable number of edges, we cannot associate or bind the alias to a specific edge in the pattern.
Shortest path semantics When an edge is repeated with a Kleene star, only the shortest matching occurrences are selected. See the example below:
In Figure 2, for Pattern 1 - (E>*) - 4
, any of the following paths reach 4 from 1.
1->2->3->4
1->2->3->5->6->2->3->4
any path that goes through the cycle 2->3->5->6->2 two or more times and jumps out at 3.
The first path is shorter than the rest; it is considered the only match.
In this tutorial, we will use the new Interpreted Mode for GSQL, introduced in TigerGraph 2.4. Interpreted mode lets us skip the INSTALL step, and even to run a query as soon as we create it, to offer a more interactive experience. These one-step interpreted queries are unnamed (anonymous) and parameterless, just like SQL.
Example 1. Find the direct or indirect superclass (including the self class) of the TagClass whose name is "TennisPlayer".
You can copy the above GSQL script to a file named example1.gsql, and invoke this script file in a Linux shell.
Note below that the starting vertex s, whose name is TennisPlayer, is also a match, using a path with zero hops.
Example 2. Find the immediate superclass of the TagClass whose name is "TennisPlayer". (This is equivalent to a 1-hop non-repeating pattern.)
You can copy the above GSQL script to a file named example2.gsql, and invoke this script file in a Linux shell.
Example 3. Find the 1 to 2 hops direct and indirect superclasses of the TagClass whose name is "TennisPlayer".
You can copy the above GSQL script to a file named example3.gsql, and invoke this script file in a Linux shell.
Example 4. Find the superclasses within 2 hops of the TagClass whose name is "TennisPlayer".
You can copy the above GSQL script to a file named example4.gsql, and invoke this script file in a Linux shell.
Example 5. Find the superclasses at least one hop from the TagClass whose name is "TennisPlayer".
You can copy the above GSQL script to a file named example5.gsql, and invoke this script file in a Linux shell.
Example 6. Find the 3 most recent comments that are liked or created by Viktor Akhiezer, and the total number of comments related to (created or liked by) Viktor Akhiezer.
You can copy the above GSQL script to a file named example6.gsql, and invoke this script file in a Linux shell.
We will use the LDBC Social Network Benchmark (LDBC SNB) data set. This data set models a twitter-like social forum. It comes with a data generator, which allows you to generate data at different scale factors. Scale factor 1 generates roughly 1GB raw data, scale factor 10 generates roughly 10GB raw data, etc.
Figure 1 shows the schema (from the LDBC SNB specification). It models the activities and relationships of social forum participants. For example, a forum Member can publish Posts on a Forum, and other Members of the Forum can make a Comment on the Post or on someone else's Comment. A Person's home location is a hierarchy (Continent>Country>City), and a person can be affiliated with a University or a Company. Tags can be used to classify a Forum and a Person's interests. Tags can belong to a TagClass. The relationships between entities are modeled as directed edges. For example, Person connects to Tag by the hasInterest edge. Forum connects to Person by two different edges, hasMember and hasModerator.
LDBC SNB schema uses inheritance to model certain entity type relationships:
Message is the superclass of Post and Comment.
Place is the superclass of City, Country, and Continent.
Organization is the superclass of University and Company.
We do not use the superclasses in our graph model. When there is an edge type connecting an entity to a superclass, we instead create an edge type from the entity to each of the subclasses of the superclass. For example, Message has an isLocatedIn relationship to Country. Since Message has two subclasses, Post and Comment, we create two edge types to Country:
Post_IS_LOCATED_IN_Country
Comment_IS_LOCATED_IN_Country
Vertex Type
For each entity in Figure 1 (the rectangular boxes), we create a vertex type with the entity's name.
Person is a person who participates in a forum.
Forum is a place where persons discuss topics.
City, Country, and Continent are geographic locations of other entities.
Company and University are organizations related to a person's affiliation.
Comment and Post are the interaction messages created by persons in a forum.
Tag is a topic or a concept.
TagClass is a class or a category. TagClass can form a hierarchy of tags.
Edge Type
For each relationship in Figure 1, we create an edge type whose name consists of the source entity name, the edge name (all capitalized), and the target entity name. The three parts are connected by underscores.
SourceEntityName_EDGENAME_TargetEntityName
For example,
Person_KNOWS_Person: Person is the source and target entity names, and Knows is the edge name.
Person_LIKES_Comment: Person is the source entity name, Comment is the target entity name, and Likes is the edge name.
When the edge name has two or more words, we separate words by an underscore as well. For example:
Tag_HAS_TYPE_TagClass: Tag is the source entity name, TagClass is the target entity name, and hasType is the edge name (which is written as HAS_TYPE).
Forum_HAS_MODERATOR_Person: Forum is the source entity name, Person is the target entity name, and hasModerator is the edge name (which is written as HAS_MODERATOR).
The GSQL script below can be downloaded from this link.