Develop, install, and run parameterized GSQL queries
We just saw how easy and quick it is to run simple built-in queries. However, you'll undoubtedly want to create more customized or complex queries. GSQL puts maximum power in your hands through parameterized vertex set queries. Parameterized queries let you traverse the graph from one vertex set to an adjacent set of vertices, again and again, performing computations along the way, with built-in parallel execution and handy aggregation operations. You can even have one query call another query. But we'll start simple.
A GSQL parameterized query has three steps.
Define your query in GSQL. This query will be added to the GSQL catalog.
Install one or more queries in the catalog, generating a REST endpoint for each query.
Run an installed query, supplying appropriate parameters, either as a GSQL command or by sending an HTTP request to the REST endpoint.
Now, let's write our first GSQL query. We'll display all the direct (1-hop) neighbors of a person, given as an input parameter.
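Based on the discussion that follows, the query can be sketched as below (the query name hello and the exact formatting are illustrative):

```gsql
CREATE QUERY hello (VERTEX<person> p) FOR GRAPH social {
  // Seed a vertex set with the input person
  Start = {p};
  // 1-hop traversal along friendship edges to person vertices
  Result = SELECT tgt
           FROM Start:s -(friendship:e)-> person:tgt;
  PRINT Result;
}
```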
This query features one SELECT statement. SELECT statements in parameterized queries are much more powerful than the ones in built-in queries. The query starts by seeding a vertex set "Start" with the person vertex identified by the parameter p passed in from the query call. The curly braces tell GSQL to construct a set containing the enclosed items.
Next, the SELECT statement describes a 1-hop traversal according to the pattern described in the FROM clause:
Start:s -(friendship:e)-> person:tgt
This is basically the same syntax we used for the built-in select edges query. Namely, we select all edges beginning from the given source set (Start), which have the given edge type (friendship) and which end at the given vertex type (person). A feature we haven't seen before is the use of vertex and edge set aliases defined by ":alias": "s" is the alias for the source vertex set, "e" is the edge set alias, and "tgt" is the target vertex set alias.
Refer back to the initial clause and the assignment ("Result = SELECT tgt"). Here we see the target set's alias tgt. This means that the SELECT statement should return the target vertex set (as filtered and processed by the full set of clauses in the SELECT query block) and assign that output set to the variable called Result.
Last, we print out the Result vertex set, in JSON format.
Rather than defining our query in interactive mode, we can store the query in a file and invoke the file from within the GSQL shell, using the @filename syntax. Copy and paste the above query into a file /home/tigergraph/hello.gsql
. Then, enter the GSQL shell and invoke the file using @hello.gsql. (Note that if you are not in the /home/tigergraph folder when you start gsql, you can use the absolute path to invoke a gsql file, e.g., @/home/tigergraph/hello.gsql.) Then run the "ls" command to see that the query is now in the catalog.
However, the query is not installed yet; it is not ready to run. In the GSQL shell, type the following command to install the newly added query "hello".
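A minimal sketch of the command:

```gsql
INSTALL QUERY hello
```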
It takes about 1 minute for the database to install this new query. Be patient! For queries on large datasets, this small investment pays off many times over in faster query execution, particularly if you will run the query many times, with different parameters. The installation will generate machine instructions and a REST endpoint. After the progress bar reaches 100%, we are ready to run this query.
To run a query in GSQL, use "RUN QUERY" followed by the query name and a set of parameter values.
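For example, using the person Tom from the sample data:

```gsql
RUN QUERY hello("Tom")
```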
The result is presented in JSON format. Tom has two 1-hop neighbors, namely Dan and Jenny.
Under the hood, installing a query will also generate a REST endpoint, so that the parameterized query can be invoked by an HTTP call. In Linux, the curl command is the most popular way to submit an HTTP request. In the example below, the portion that is standard for all queries is shown in bold; the portion in normal weight pertains to this particular query and parameter value. The JSON result will be returned to the Linux shell's standard output. So, our parameterized query becomes an HTTP service!
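A sketch of such a call, assuming the default RESTPP port 9000 and the graph name social:

```bash
curl -X GET 'http://localhost:9000/query/social/hello?p=Tom'
```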
Finally, to see the GSQL text of a query in the catalog, you can use
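For example:

```gsql
SHOW QUERY hello
```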
Congratulations! At this point, you have gone through the whole process of defining, installing, and running a query.
Installing a query gives the fastest query speed, but the user must wait through the installation overhead.
The Interpreted Mode for GSQL, introduced in TigerGraph 2.4, lets us skip the INSTALL step and even run a query as soon as we create it, offering a more interactive experience. These one-step interpreted queries are unnamed (anonymous) and parameterless, just like SQL. Please refer to GSQL 102 for this mode.
Now, let's do a more advanced query. This time, we are going to learn to use the powerful built-in accumulators, which serve as runtime attributes (properties) attachable to each vertex visited during our traversal of the graph. Runtime means they exist only while the query is running; they are called accumulators because they are specially designed to gather (accumulate) data during the implicitly parallel processing of the query.
In this query we will find all the persons who are exactly 2 hops away from the parameterized input person. Just for fun, let's also compute the average age of those 2-hop neighbors.
In the standard approach for this kind of graph traversal algorithm, you use a boolean variable to mark the first time that the algorithm "visits" a vertex, so that it knows not to count it again. To fit this need, we'll define a local accumulator of the type OrAccum. To declare a local accumulator, we prefix an identifier name with a single "@" symbol. Each accumulator type has a default initial value; the default value for boolean accumulators is false. Optionally, you can specify an initial value.
We also want to compute one average, so we will define a global AvgAccum. The identifier for a global accumulator begins with two "@"s.
After defining the Start set, we then have our first 1-hop traversal. The SELECT and FROM clauses are the same as in our first example, but there is an additional ACCUM clause. The += operator within an ACCUM clause means that for each edge matching the FROM clause pattern, we accumulate the right-hand-side expression (true) into the left-hand-side accumulator (tgt.@visited as well as s.@visited). Note that a source vertex or target vertex may be visited multiple times. Referring to Figure 1, if we start at vertex Tom, there are two edges incident to Tom, so the ACCUM clause in the first SELECT statement will visit Tom two times. Since the accumulator type is OrAccum, the cumulative effect of the two traversals is the following:
Tom.@visited <== (initial value: false) OR (true) OR (true)
Note that it does not matter which of the two edges was processed first, so this operation is suitable for multithreaded parallel processing. The net effect is that as long as a vertex is visited at least once, it will end up with @visited = true. The result of this first SELECT statement is assigned to the variable FirstNeighbors.
The second SELECT block will go one hop further, starting from the FirstNeighbors vertex set variable and reaching the 2-hop neighbors. Note that this time, we have omitted the edge type friendship and the target vertex type person from the FROM clause, but we retained the aliases. If no type is mentioned for an alias, then it is interpreted as ALL types. Since our graph has only one vertex type and one edge type, it is logically the same as if we had specified the types. The WHERE clause filters out the vertices which have been marked as visited before (the 1-hop neighbors and the starting vertex p). This SELECT statement uses POST-ACCUM instead of ACCUM. The reason is that POST-ACCUM iterates over the vertex sets instead of the edge sets, guaranteeing that we do not double-count any vertices. Here, we accumulate the ages of the 2-hop neighbors to get their average.
Finally, the SecondNeighbors of p are printed out.
This time, we put all of the following GSQL commands into one file hello2.gsql:
USE GRAPH social
The query definition
Installing the query
Running the query
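Putting the pieces together, hello2.gsql can be sketched as follows (the accumulator names @visited and @@aveAge are illustrative):

```gsql
USE GRAPH social

CREATE QUERY hello2 (VERTEX<person> p) FOR GRAPH social {
  OrAccum  @visited = false;   // local: marks each vertex the traversal reaches
  AvgAccum @@aveAge;           // global: average age of the 2-hop neighbors
  Start = {p};

  // 1st hop: mark the start vertex and its direct neighbors as visited
  FirstNeighbors = SELECT tgt
                   FROM Start:s -(friendship:e)-> person:tgt
                   ACCUM tgt.@visited += true, s.@visited += true;

  // 2nd hop: keep only vertices not seen in hop 1, then average their ages
  SecondNeighbors = SELECT tgt
                    FROM FirstNeighbors -(:e)-> :tgt
                    WHERE tgt.@visited == false
                    POST-ACCUM @@aveAge += tgt.age;

  PRINT SecondNeighbors;
  PRINT @@aveAge;
}

INSTALL QUERY hello2

RUN QUERY hello2("Tom")
```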
We can execute this full set of commands without entering the GSQL shell. Please copy and paste the above GSQL commands into a Linux file named /home/tigergraph/hello2.gsql.
In a Linux shell, under /home/tigergraph, type the following:
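Assuming the file was saved as /home/tigergraph/hello2.gsql, the invocation is simply:

```bash
gsql hello2.gsql
```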
Queries are installed in the catalog and can have one or more input parameters, enabling reuse of queries.
A GSQL query consists of a series of SELECT query blocks, each generating a named vertex set.
Each SELECT query block can start traversing the graph from any of the previously defined vertex sets (that is, the sequence does not have to form a linear chain).
Accumulators are runtime variables with built-in accumulation operations, for efficient multithreaded computation.
A query can call another query.
Output is in JSON format.
For this tutorial, we will work mostly in the GSQL shell, in interactive mode. A few commands will be from a Linux shell. The first step in creating a GSQL graph is to define its schema. GSQL provides a set of DDL (Data Definition Language) commands, similar to SQL DDL commands, to model vertex types, edge types and a graph.
Use CREATE VERTEX command to define a vertex type named person. Here, PRIMARY_ID is required: each person must have a unique identifier. The rest is the optional list of attributes which characterize each person vertex, in the format attribute_name data_type, attribute_name data_type, ...
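For example, a person vertex type keyed on name, with a few attributes matching this tutorial's sample data (the attribute list is illustrative):

```gsql
CREATE VERTEX person (PRIMARY_ID name STRING, name STRING,
                      age INT, gender STRING, state STRING)
```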
We show GSQL keywords in ALL CAPS to highlight them, but they are case-insensitive.
GSQL will confirm the creation of the vertex type.
You can create as many vertex types as you need.
Next, use the CREATE ... EDGE command to create an edge type named friendship. The keyword UNDIRECTED indicates this edge is a bidirectional edge, meaning that information can flow starting from either vertex. If you'd rather have a unidirectional connection where information flows only from the FROM vertex, use the DIRECTED keyword in place of UNDIRECTED. Here, FROM and TO are required to specify which two vertex types the edge type connects. An individual edge is specified by giving the primary_ids of its source (FROM) vertex and target (TO) vertex. These are followed by an optional list of attributes, just as in the vertex definition.
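For example (the connect_day attribute is illustrative):

```gsql
CREATE UNDIRECTED EDGE friendship (FROM person, TO person, connect_day DATETIME)
```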
GSQL will confirm the creation of the edge type.
You can create as many edge types as you need.
Next, use the CREATE GRAPH command to create a graph named social. Here, we just list the vertex types and edge types that we want to include in this graph.
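A minimal sketch:

```gsql
CREATE GRAPH social (person, friendship)
```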
GSQL will confirm the creation of the first graph after several seconds, during which it pushes the catalog information to all services, such as the GSE, GPE and RESTPP.
At this point, we have created a person vertex type, a friendship edge type, and a social graph that includes them. You've now built your first graph schema! Let's take a look at what's in the catalog by typing the ls command in the GSQL shell.
You now have a graph with data! You can run some queries using the built-in REST endpoint calls.
Below we call two functions, stat_vertex_number and stat_edge_number to return the cardinality of each vertex and edge type.
REST endpoints return results in JSON format. JSON is convenient for programs, but raw JSON is hard to read in a bash terminal. The jq tool pretty-prints and filters JSON on the command line. We recommend you install jq and pipe the REST call result through jq before it is output.
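A sketch of the two statistics calls, piped through jq (assuming the default RESTPP port 9000):

```bash
# Count vertices and edges of every type ("*") in the social graph
curl -X POST 'http://localhost:9000/builtins/social' \
     -d '{"function":"stat_vertex_number","type":"*"}' | jq .
curl -X POST 'http://localhost:9000/builtins/social' \
     -d '{"function":"stat_edge_number","type":"*"}' | jq .
```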
If you want to lookup the details about a vertex with its primary_id, you can use the following REST call.
Example. Find a person vertex whose primary_id is "Tom".
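A sketch of the lookup, assuming the default RESTPP port 9000:

```bash
# URL shape: /graph/{graph_name}/vertices/{vertex_type}/{primary_id}
curl -X GET 'http://localhost:9000/graph/social/vertices/person/Tom' | jq .
```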
In similar fashion, we can see details about edges. To describe an edge, you name the vertex and edge types in the two or three path segments of the URL.
Example. Find all friendship edges whose source vertex's primary_id is "Tom".
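A sketch of the edge lookup, again assuming the default RESTPP port 9000:

```bash
# URL shape: /graph/{graph_name}/edges/{source_vertex_type}/{source_id}/{edge_type}
curl -X GET 'http://localhost:9000/graph/social/edges/person/Tom/friendship' | jq .
```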
For more built-in REST endpoints, you can refer to the Built-in Endpoints page.
After creating a graph schema, the next step is to load data into it. The task here is to instruct the GSQL loader how to associate ("map") the fields in a set of data files to the attributes in your vertex types and edge types of the graph schema we just defined.
You should have the two data files person.csv and friendship.csv on your local disk. They do not need to be in the same folder where you run GSQL.
If you need to exit the GSQL shell for any reason, you can do so by typing "quit" without the quotes. Type gsql to enter again.
The loading job below assumes that your data files are in the folder /home/tigergraph. If they are elsewhere, then in the loading job script below replace /home/tigergraph/person.csv and /home/tigergraph/friendship.csv with the corresponding file paths. Assuming you're (back) in the GSQL shell, enter the following set of commands.
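A sketch of the loading job consistent with the walkthrough that follows (the job name load_social and the exact attribute order are assumptions based on this tutorial's schema):

```gsql
USE GRAPH social
BEGIN
CREATE LOADING JOB load_social FOR GRAPH social {
  DEFINE FILENAME file1="/home/tigergraph/person.csv";
  DEFINE FILENAME file2="/home/tigergraph/friendship.csv";

  // By-name column references (requires a header line in the file)
  LOAD file1 TO VERTEX person VALUES ($"name", $"name", $"age", $"gender", $"state")
    USING header="true", separator=",";

  // By-position column references: columns 0, 1, 2, ...
  LOAD file2 TO EDGE friendship VALUES ($0, $1, $2)
    USING header="true", separator=",";
}
END
```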
Let's walk through the commands:
USE GRAPH social
:
Tells GSQL which graph you want to work with.
BEGIN ... END
:
Indicates multiple-line mode. The GSQL shell will treat everything between these markers as a single statement. These markers are only needed in interactive mode. If you run GSQL statements that are stored in a command file, the command interpreter reads your whole file, so it doesn't need the BEGIN and END hints.
CREATE LOADING JOB
:
One loading job can describe the mappings from multiple files to multiple graph objects. Each file must be assigned to a filename variable. The field labels can be either by name or by position. By-name labelling requires a header line in the source file. By-position labelling uses integers to indicate source column position 0, 1,... In the example above, the first LOAD statement refers to the source file columns by name, whereas the second LOAD statement refers to the source file columns by position. Note the following details:
The column "name" in file1 gets mapped to two fields, both the PRIMARY_ID and the "name" attribute of the person vertex.
In file1, gender comes before age. In the person vertex, gender comes after age. When loading, state your attributes in the order needed by the target object (in this case, the person vertex).
Each LOAD statement has a USING clause. Here it tells GSQL that both files contain a header (whether we choose to use the names or not, GSQL still needs to know whether to consider the first line as data or not). It also says the column separator is comma. GSQL can handle any single-character separator, not just commas.
When you run the CREATE LOADING JOB statement, GSQL checks for syntax errors and checks that you have data files in the locations specified. If it detects no errors, it compiles and saves your job.
You can now run your loading job to load data into your graph:
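Assuming the job was named load_social:

```gsql
RUN LOADING JOB load_social
```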
The result is shown below.
Notice the location of the loading log file. The example assumes that you installed TigerGraph in the default location, /home/tigergraph/. In your installation folder is the main product folder, tigergraph. Within the tigergraph folder are several subfolders, such as logs, document, config, bin, and gstore. If you installed in a different location, say /usr/local/, then you would find the product folder at /usr/local/tigergraph.
In this tutorial, we will show you how to create a graph schema, load data in your graph, write simple parameterized queries, and run your queries. Before you start, you need to have installed the TigerGraph system, verified that it is working, and cleared out any previous data. It'll also help to become familiar with our graph terminology.
A graph is a collection of data entities and the connections between them. That is, it's a network of data entities.
Many people call a data entity a node; at TigerGraph we call it a vertex. The plural is vertices. We call a connection an edge. Both vertices and edges can have properties or attributes. The figure below is a visual representation of a graph containing 7 vertices (shown as circles) and 7 edges (the lines).
A graph schema is the model which describes the types of vertices (nodes) and edges (connections) which can appear in your graph. The graph above has one type of vertex (person) and one type of edge (friendship).
A schema diagram looks like a small graph, except each node represents one type of vertex, and each link represents one type of edge.
The friendship loop shows that a friendship is between a person and another person.
For this tutorial, we will create and query the simple friendship social graph shown in Figure Friendship Social Graph. The data for this graph consists of two files in csv (comma-separated values) format. To follow along with this tutorial, please save these two files, person.csv and friendship.csv, to your TigerGraph local disk. In our running example, we use the /home/tigergraph/ folder to store the two csv files.
First, let's check that you can access GSQL.
Open a Linux shell.
Type gsql as below. A GSQL shell prompt should appear as below.
If the GSQL shell does not launch, try resetting the system with "gadmin start all". If you need further help, please see manage TigerGraph with gadmin and TigerGraph Knowledge Base and FAQs .
If this is your first time using GSQL, the TigerGraph data store is probably empty. However, if you or someone else has already been working on the system, there may already be a database. You can check by listing out the database catalog with the "ls" command. This is what it should look like if it is empty:
If the data catalog is not empty, you will need to empty it to start this tutorial. We'll assume you have your coworkers' permission. Use the command DROP ALL to delete all the database data, its schema, and all related definitions. This command takes about a minute to run.
Restarting TigerGraph
If you need to restart TigerGraph for any reason, use the following command sequence:
Running GSQL commands from Linux
You can also run GSQL commands from a Linux shell. To run a single command, just use "gsql" followed by the command line enclosed in single quotes. (The quotes aren't necessary if there is no parsing ambiguity; it's safer to just use them.) For example,
You can also execute a series of commands which you have stored in a file, by simply invoking "gsql" followed by the name of the file.
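For example:

```bash
# Run a single GSQL command from the Linux shell
gsql 'ls'
# Run all the commands stored in a file
gsql hello.gsql
```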
When you are done, you can exit the GSQL shell with the command "quit" (without the quotes).
You have learned a lot in GSQL 101!
With just the knowledge from GSQL 101 and a little practice, you should be able to do the following:
Create a graph schema containing multiple vertex types and edge types.
Define a loading job that takes one or more CSV files and maps the data directly to the vertices and edges of your graph.
Write and run simple parameterized queries which start at one vertex and then traverse one or more hops to generate a final vertex set. Make a simple additive computation and return the results.
Want to learn more?
To learn to do the same types of operations using the GraphStudio Visual SDK and UI, see the TigerGraph GraphStudio UI Guide.
Continue learning Accumulators and Pattern Matching.
To get answers to common questions, see the TigerGraph Knowledge Base and FAQs .
To see the full GSQL specification (whose table of contents will give you an idea of what is available), see the GSQL Language Reference.
To discuss and get help with fellow GSQL users and GSQL developers, join GSQL community forum.
In this exercise, we will go through the 3-step process of writing GSQL: define a schema, load data, and write a query.
This tutorial is written so that you can follow along and perform the steps on your TigerGraph system as your read.