1 of 33

Tutorials and Start Guides

This section provides tutorials and getting-started guides for users who want to get working right away.

GSQL

GET STARTED Tutorial
GSQL 101
Accumulators Tutorial
GSQL102 Pattern Matching

GraphStudio

GraphStudio UI Guide

Getting Started

Welcome to the TigerGraph™ Platform - the first real-time, Native Parallel Graph data analytics platform. This document covers the various options to get started with TigerGraph.

If you are a laptop user (macOS or Windows), we recommend you use Docker to start up TigerGraph on your computer:

If you have a Linux machine that meets our Software and Hardware Requirements, you can install TigerGraph on your machine directly:

If you would like to run TigerGraph on a Virtual Machine with Virtual Box, follow this guide:

You can also start up TigerGraph instances from Cloud Images on AWS, Microsoft Azure, or Google Cloud Platform:

Get Started with Docker

This document provides step-by-step instructions on how to pull the latest TigerGraph Enterprise Edition docker image to your host machine. You can follow the sections in sequence to set up the TigerGraph docker environment.

The latest TigerGraph docker image includes the following content:

The latest version of TigerGraph
Linux packages:
- openssh-server
- git
- wget
- curl
- emac
- vim
- jq
- tar
Tutorial material
- GSQL 101
- GSQL 102
The latest GSQL open-source graph algorithm library

1. Install Docker Desktop

Follow the steps below to install Docker Desktop on your machine and configure it with sufficient resources for TigerGraph:

Install Docker on your OS:
- To install Docker for Linux, follow these instructions:
Configure Docker Desktop with sufficient resources:
- Recommended: 4 cores and 16GB memory
- Minimum: 2 cores and 10GB memory
- Click the Docker Desktop icon, click Preferences >> Resources, drag the CPU and Memory sliders to the desired configuration, save and restart Docker Desktop

2. Prepare a Shared Folder on Host OS to be shared with Docker Container

Open a shell on your host machine and create or select a directory for sharing data between your host machine and docker container. Grant read+write+execute permission to the folder. For example, to create a folder called data in Linux:

You can mount (map) the data folder to a folder under the docker container, which allows you to share files between your host OS and Docker OS.

For example, if you mount the host OS folder ~/data to the docker folder /home/tigergraph/mydata, then anything you put on ~/data will be visible in the docker container under /home/tigergraph/mydata, and vice versa.

3. Run TigerGraph Docker image as a daemon

Run the following command to pull the TigerGraph docker image, bind ports, map a shared data folder, and start a container from the image. This command is very long; click the copy button at the end of the code block to copy the whole command:

Here is a breakdown of the options and arguments in the command:
- -d: make the container run in the background.
- -p: map docker 22 port to host OS 14022 port, 9000 port to host OS 9000 port, 14240 port to host OS 14240 port.
- --name: name the container tigergraph.
- --ulimit: set the ulimit (the number of open file descriptors per process) to 1 million.
- -v: mount the host OS ~/data folder to the docker /home/tigergraph/mydata folder using the -v option. If you are using Windows, change the above ~/data to something using windows file system convention, e.g. c:\data
- -t: allocate a pseudo-TTY
- docker.tigergraph.com/tigergraph:latest: download the latest docker image from the TigerGraph docker registry URL docker.tigergraph.com/tigergraph.

Replace "latest" with a specific version number if a dedicated version of TigerGraph is to be used. E.g., if you want to get the 3.0.5 version, the URL should be:

docker.tigergraph.com/tigergraph:3.0.5

To use the legacy developer editions, use:

docker.tigergraph.com/tigergraph-dev

If you use Windows and have disk drive permission issues with the above command, try the following command instead (this command does not map the shared folder on your host machine to your container) :

4. Use SSH to connect to your container

After launching the container, you can use SSH to connect to your container:

Verify that the container is running. You should see a row that describes the running container after running the command below:
Use ssh to open a shell to the container. At the prompt, enter tigergraph as the password. Note that we have mapped the host 14022 port to the container's 22 port (the ssh default port), so on the host we use ssh to connect to port 14022.

5. Start TigerGraph

After connecting to the container via ssh, inside the container, start all TigerGraph services with the following command (which may take up to one minute):
Start GraphStudio, TigerGraph's visual IDE, by visiting http://localhost:14240
in a browser on your laptop (host OS).

Operation Commands Cheat Sheet

After you start Docker Desktop, use the commands below to stop and restart the container:
Start the TigerGraph service within the container:
ssh to the container. Note: if localhost is not recognized, remove the localhost entry from ~/.ssh/known_hosts
Linux users can access the container through its ip address directly:
Default user: tigergraph
Default password: tigergraph
After running gadmin start, you can go to GraphStudio. Open a browser on your laptop (host OS) and access GraphStudio at the following URL:
Check the version of GSQL:

Get Started with Virtual Box

Install TigerGraph on Linux

You can install TigerGraph on a Linux machine that meets the Hardware and Software Requirements. For a step-by-step guide on installing TigerGraph on your Linux machine, please visit the following page:

Quickstart guide for new users

Installation Checklist

CHECK Hardware and Software Requirements
DOWNLOAD the TigerGraph platform: https://info.tigergraph.com/enterprise-free

INSTALL the Platform

For simple single-server installation: Assuming your downloaded file is called <your_tigergraph_package>:

tar xzf <your_tigergraph_package>.tar.gz 
cd tigergraph*/ 

# to install license in interactive mode
sudo ./install.sh

# to install license in non-interactive mode
sudo ./install.sh -n

For additional options, see TigerGraph Platform Installation Guide

Get Started from Cloud Images

This chapter covers different options for getting started with TigerGraph in the cloud. To continue, choose the cloud platform you wish to get started on:

For all editions on cloud marketplaces, please contact TigerGraph support and we'll assist you in upgrading from older images to the latest image.

Get Started on AWS

This tutorial will show you how to start TigerGraph Enterprise Edition from an image on AWS.

Deploying Your Instance

1. Go to AWS Marketplace and search for TigerGraph. 2. Click "Continue to Subscribe".

3. Click on "Continue to Configuration".

4. Select the Software Version and Region. We recommend selecting the latest version for the most up-to-date features. After making your selections, click on "Continue to Launch".

5. Select the instance type, security group settings, and other settings. The default settings are fine for most users, but feel free to modify them. Click "Launch" when finished. Notes: The instance type needs to have at least 4 CPUs and 16GB RAM for TigerGraph to work properly. The security group must allow inbound TCP traffic to port 14240 if you want to access GraphStudio (TigerGraph's visualization platform). For more about GraphStudio, see the GraphStudio UI Guide. The security group must allow inbound TCP traffic to port 9000 if you want to send RESTful requests to TigerGraph from outside the instance (this includes configuring the GSQL client on a remote machine). For more about the REST API, see the TigerGraph RESTful API User Guide.

For more about the TigerGraph Platform, see the TigerGraph Platform Overview.

6. That's it! The TigerGraph instance has been successfully deployed on AWS.

Starting TigerGraph on Your Instance

Log on to the instance and switch to user tigergraph using the following command:

sudo su - tigergraph

2. Run the following command to check the current status of TigerGraph. The services "ADMIN", "CTRL", "ETCD", "IFM", "KAFKA", and "ZK" are started automatically and should be up at this point. If any of them are not or you get the following error message, please wait for 30 to 60 seconds and check the status again before reporting it to TigerGraph support.

gadmin status

3. Run the following command to start TigerGraph:

gadmin start

4. Check the status again. All services should be up at this point:

gadmin status

5. TigerGraph has been successfully started on your cloud instance.

TigerGraph License on AWS Images

The TigerGraph Enterprise edition image comes with a perpetual license that will only work on the AWS instance it's installed on. Please run the following command to see it:

gadmin license status

Get Started on Microsoft Azure

This tutorial will show you how to start TigerGraph Enterprise Edition from an image on Microsoft Azure.

Deploying Your Instance

1. Go to Microsoft Azure Marketplace and search for "TigerGraph". 2. Select your software plan and Click "Create". Select the latest plan (with the highest version number) to access the latest features.

3. Fill out the "Resource group", "Virtual machine name", "Username" and "SSH Public key" fields. The default values should work for the rest of the fields. Then click "Next: Disks >".

4. Keep the default values for all other settings and click "Next" until you see the "Review + Create" page below. Check all your settings and click "Create" when you are satisfied. Notes: The instance type needs to have at least 4 CPUs and 16GB RAM for TigerGraph to work properly. The "NIC network security group" must allow inbound TCP traffic to port 14240 if you want to access GraphStudio (TigerGraph's visualization platform). For more about GraphStudio, see the GraphStudio UI Guide. The "NIC network security group" must allow inbound TCP traffic to port 9000 if you want to send RESTful requests to TigerGraph from outside the instance (this includes configuring the GSQL client on a remote machine). For more about the REST API, see the TigerGraph RESTful API User Guide.

For more about the TigerGraph Platform, see the TigerGraph Platform Overview.

5. That's it! The TigerGraph instance has been successfully deployed on Microsoft Azure.

Starting TigerGraph on Your Instance

Log on to the instance and switch to user tigergraph using the following command:

sudo su - tigergraph

gadmin status

3. Run the following command to start TigerGraph:

gadmin start

4. Check the status again. All services should be up at this point:

gadmin status

5. TigerGraph has been successfully started on your cloud instance.

TigerGraph License on Microsoft Azure Images

The TigerGraph Enterprise edition image comes with a perpetual license that will only work on the Microsoft Azure instance it's installed on. Please run the following command to see it:

gadmin license status

Get Started on Google Cloud

This tutorial will show you how to start TigerGraph from an image on Google Cloud. Please select your edition from below:

Deploying Your Instance‌

1. Go to Google Cloud Marketplace https://console.cloud.google.com/marketplace and search for "TigerGraph Enterprise Edition". Choose the latest edition to access the most up-to-date features.

2. Click on "Launch"‌.

3. The default settings are fine for most users, but feel free to modify them. When ready, click on "Deploy".

Notes: The instance type needs to have at least 4 CPUs and 16GB RAM for TigerGraph to work properly.

The "Allow TCP port 14240 traffic from the Internet" checkbox must be checked if you want to access GraphStudio (TigerGraph's visualization platform). For more about GraphStudio, see the GraphStudio UI Guide.

For information on how to set up authentication please see User access management.

The "Allow TCP port 9000 traffic from the Internet" checkbox must be checked if you want to send RESTful requests to TigerGraph from outside the instance (this includes configuring the GSQL client on a remote machine). For more about the REST API, see the TigerGraph RESTful API User Guide.‌

For more about the TigerGraph Platform, see the TigerGraph Platform Overview.‌

4. That's it! The TigerGraph instance has been successfully deployed on Google Cloud.‌

Starting TigerGraph on Your Instance

‌1. Log on to the instance and switch to user tigergraph using the following command:

sudo su - tigergraph

gadmin status

3. Run the following command to start TigerGraph:

gadmin start

4. Check the status again. All services should be up at this point:

gadmin status

5. TigerGraph has been successfully started on your cloud instance.‌

TigerGraph License on Google Cloud Images

The TigerGraph Enterprise edition image comes with a perpetual license that will only work on the Google Cloud instance it's installed on. Please run the following command to see it:

gadmin license status

GSQL 101

In this exercise, we will go through the 3-step process of writing GSQL-- define a schema, load data, and write a query.

This tutorial is written so that you can follow along and perform the steps on your TigerGraph system as your read.

Get Set

Introduction

In this tutorial, we will show you how to create a graph schema, load data in your graph, write simple parameterized queries, and run your queries. Before you start, you need to have installed the TigerGraph system, verified that it is working, and cleared out any previous data. It'll also help to become familiar with our graph terminology.

What is a Graph?

A graph is a collection of data entities and the connections between them. That is, it's a network of data entities.

Many people call a data entity a node ; at TigerGraph we called it a vertex. The plural is vertices. We call a connection an edge. Both vertices and edges can have properties or attributes. The figure below is a visual representation of a graph containing 7 vertices (shown as circles) and 7 edges (the lines).

A graph schema is the model which describes the types of vertices (nodes) and edge (connections) which can appear in your graph. The graph above has one type of vertex (person) and one type of edge (friendship).

A schema diagram looks like a small graph, except each node represents one type of vertex, and each link represents one type of edge.

The friendship loop shows that a friendship is between a person and another person.

Data Set

For this tutorial, we will create and query the simple friendship social graph shown in Figure Friendship Social Graph. The data for this graph consists of two files in csv (comma-separated values) format. To follow along with this tutorial, please save these two files, person.csv and friendship.csv, to your TigerGraph local disk. In our running example, we use the /home/tigergraph/ folder to store the two csv files.

person.csv

name,gender,age,state
Tom,male,40,ca
Dan,male,34,ny
Jenny,female,25,tx
Kevin,male,28,az
Amily,female,22,ca
Nancy,female,20,ky
Jack,male,26,fl

friendship.csv

person1,person2,date
Tom,Dan,2017-06-03
Tom,Jenny,2015-01-01
Dan,Jenny,2016-08-03
Jenny,Amily,2015-06-08
Dan,Nancy,2016-01-03
Nancy,Jack,2017-03-02
Dan,Kevin,2015-12-30

Prepare Your TigerGraph Environment

First, let's check that you can access GSQL.

Open a Linux shell.
Type gsql as below. A GSQL shell prompt should appear as below.
Linux Shell
```
$ gsql 
GSQL >
```
If the GSQL shell does not launch, try resetting the system with "gadmin start all". If you need further help, please see manage TigerGraph with gadmin and TigerGraph Knowledge Base and FAQs .

If this is your first time using GSQL, the TigerGraph data store is probably empty. However, if you or someone else has already been working on the system, there may already be a database. You can check by listing out the database catalog with the "ls" command. This is what should look like if it is empty:

GSQL shell - an empty database catalog

GSQL > ls
---- Global vertices, edges, and all graphs
Vertex Types:
Edge Types:

Graphs:
Jobs:

Json API version: v2

If the data catalog is not empty, you will need to empty it to start this tutorial. We'll assume you have your coworkers' permission. Use the command DROP ALL to delete all the database data, its schema, and all related definitions. This command takes about a minute to run.

GSQL shell - DROP ALL

GSQL > drop all

Dropping all, about 1 minute ...
Abort all active loading jobs 
[ABORT_SUCCESS] No active Loading Job to abort.

Shutdown restpp gse gpe ...
Graph store /usr/local/tigergraph/gstore/0/ has been cleared!
Everything is dropped.

Restarting TigerGraph

If you need to restart TigerGraph for any reason, use the following command sequence:

Linux Shell - Restarting TigerGraph services

# Switch to the user account set up during installation
# The default is user=tigergraph, password=tigergraph
$ su tigergraph
Password:tigergraph

# Start all services
$ gadmin restart -y

Running GSQL commands from Linux

You can also run GSQL commands from a Linux shell. To run a single command, just use "gsql" followed by the command line enclosed in single quotes. (The quotes aren't necessary if there is no parsing ambiguity; it's safer to just use them.) For example,

Linux shell - GSQL commands from a Linux shell

# "-g graphname" is need for a given graph
gsql -g social 'ls'
gsql 'drop all'
gsql 'ls'

You can also execute a series of commands which you have stored in a file, by simply invoking "gsql" following by the name of the file.

When you are done, you can exit the GSQL shell with the command "quit" (without the quotes).

Define a Schema

Introduction

For this tutorial, we will work mostly in the GSQL shell, in interactive mode. A few commands will be from a Linux shell. The first step in creating a GSQL graph is to define its schema. GSQL provides a set of DDL (Data Definition Language) commands, similar to SQL DDL commands, to model vertex types, edge types and a graph.

Create a Vertex Type

Use CREATE VERTEX command to define a vertex type named person. Here, PRIMARY_ID is required: each person must have a unique identifier. The rest is the optional list of attributes which characterize each person vertex, in the format attribute_name data_type, attribute_name data_type, ...

We show GSQL keywords in ALL CAPS to highlight them, but they are case-insensitive.

GSQL will confirm the creation of the vertex type.

You can create as many vertex types as you need.

Create an Edge Type

Next, use the CREATE ... EDGE command to create an edge type named friendship. The keyword UNDIRECTED indicates this edge is a bidirectional edge, meaning that information can flow starting from either vertex. If you'd rather have a unidirectional connection where information flows only from the FROM vertex, use the DIRECTED keyword in place of UNDIRECTED. Here, FROM and TO are required to specify which two vertex types the edge type connects. An individual edge is specifying by giving the primary_ids of its source (FROM) vertex and target (TO) vertex. These are followed by an optional list of attributes, just as in the vertex definition.

GSQL will confirm the creation of the edge type.

You can create as many edge types as you need.

Create a Graph

Next, use the CREATE GRAPH command to create a graph named social. Here, we just list the vertex types and edge types that we want to include in this graph.

GSQL will confirm the creation of the first graph after several seconds, during which it pushes the catalog information to all services, such as the GSE, GPE and RESTPP.

At this point, we have created a person vertex type, a friendship edge type, and a social graph that includes them. You've now built your first graph schema! Let's take a look what's in the catalog by typing the ls command in the GSQL shell.

Load Data

After creating a graph schema, the next step is to load data into it. The task here is to instruct the GSQL loader how to associate ("map") the fields in a set of data files to the attributes in your vertex types and edge types of the graph schema we just defined.

You should have the two data files person.csv and friendship.csv on your local disk. It's not necessary that they are in the same folder with you.

If you need to exit the GSQL shell for any reason, you can do so by typing "quit" without the quotes. Type gsql to enter again.

Define a Loading Job

The loading job below assumes that your data files are in the folder /home/tigergraph. If they are elsewhere, then in the loading job script below replace /home/tigergraph/person.csv and /home/tigergraph/friendship.csv with their corresponding file path respectively. Assuming you're (back) in the GSQL shell, enter the following set of commands.

Let's walk through the commands:

USE GRAPH social : Tells GSQL which graph you want to work with.
BEGIN ... END: Indicates multiple-line mode. The GSQL shell will treat everything between these markers as a single statement. These is only needed for interactive mode. If you run GSQL statements that are stored in a command file, the command interpreter will study your whole file, so it doesn't need the BEGIN and END hints.
CREATE LOADING JOB: One loading job can describe the mappings from multiple files to multiple graph objects. Each file must be assigned to a filename variable. The field labels can be either by name or by position. By-name labelling requires a header line in the source file. By-position labelling uses integers to indicate source column position 0, 1,... In the example above, the first LOAD statement refers to the source file columns by name, whereas the second LOAD statement refers to the source file columns by position. Note the following details:
- The column "name" in file1 gets mapped to two fields, both the PRIMARY_ID and the "name" attribute of the person vertex.
- In file1, gender comes before age. In the person vertex, gender comes after age. When loading, state your attributes in the order needed by the target object (in this case, the person vertex).
- Each LOAD statement has a USING clause. Here it tells GSQL that both files contain a header (whether we choose to use the names or not, GSQL still needs to know whether to consider the first line as data or not). It also says the column separator is comma. GSQL can handle any single-character separator, not just commas.

When you run the CREATE LOADING JOB statement, GSQL checks for syntax errors and checks that you have data files in the locations specified. If it detects no errors, it compiles and saves your job.

Run a Loading Job

You can now run your loading job to load data into your graph:

The result is shown below.

Notice the location of the loading log file. The example assumes that you installed TigerGraph in the default location, /home/tigergraph/. In your installation folder is the main product folder, tigergraph. Within the tigergraph folder are several subfolders, such as logs, document, config, bin, and gstore. If you installed in a different location, say /usr/local/, then you would find the product folder at /usr/local/tigergraph.

Run Built-in Queries

You now have a graph with data! You can run some queries using the built-in REST endpoint calls.

Get Vertex/Edge Statistics

Below we call two functions, stat_vertex_number and stat_edge_number to return the cardinality of each vertex and edge type.

REST endpoints return results in JSON format. JSON data are used for various purposes. But JSON data can’t be read easily from JSON file by using bash script like other normal files. jq tool is used to solve this problem

We recommend you install jq and redirect the REST call result to jq before it is output.

Linux Shell

#get vertex cardinality
curl -X POST 'http://localhost:9000/builtins/social' -d  '{"function":"stat_vertex_number","type":"*"}'  | jq .

#results
{
  "version": {
    "edition": "enterprise",
    "api": "v2",
    "schema": 0
  },
  "error": false,
  "message": "",
  "results": [
    {
      "v_type": "person",
      "count": 7
    }
  ]
}

Linux Shell

#get edge cardinality
curl -X POST 'http://localhost:9000/builtins/social' -d  '{"function":"stat_edge_number","type":"*"}' | jq .

#results
{
  "version": {
    "edition": "enterprise",
    "api": "v2",
    "schema": 0
  },
  "error": false,
  "message": "",
  "results": [
    {
      "e_type": "friendship",
      "count": 7
    }
  ]
}

Select Vertices

If you want to lookup the details about a vertex with its primary_id, you can use the following REST call.

Linux Shell

curl -X GET "http://localhost:9000/graph/{graph_name}/vertices/{vertex_type}/{vertex_id}"

Example. Find a person vertex whose primary_id is "Tom".

Linux Shell

 curl -X GET "http://localhost:9000/graph/social/vertices/person/Tom" | jq .

 #result
 {
  "version": {
    "edition": "enterprise",
    "api": "v2",
    "schema": 0
  },
  "error": false,
  "message": "",
  "results": [
    {
      "v_id": "Tom",
      "v_type": "person",
      "attributes": {
        "name": "Tom",
        "age": 40,
        "gender": "male",
        "state": "ca"
      }
    }
  ]
}

Select Edges

In similar fashion, we can see details about edges. To describe an edge, you name the types of vertices and edges in the two parts or three parts of a URL.

Linux Shell

#two parts
curl -X GET "http://localhost:9000/graph/edges/{source_vertex_type}/{source_vertex_id}/{edge_type}/"

#three parts
curl -X GET "http://localhost:9000/graph/edges/{source_vertex_type}/{source_vertex_id}/{edge_type}/{target_vertex_type}/{target_vertex_id}"

Example. Find all friendship edges whose source vertex's primary_id is "Tom".

Linux Shell

curl -X GET "http://localhost:9000/graph/social/edges/person/Tom/friendship/" | jq .

#result
{
  "version": {
    "edition": "enterprise",
    "api": "v2",
    "schema": 0
  },
  "error": false,
  "message": "",
  "results": [
    {
      "e_type": "friendship",
      "directed": false,
      "from_id": "Tom",
      "from_type": "person",
      "to_id": "Dan",
      "to_type": "person",
      "attributes": {
        "connect_day": "2017-06-03 00:00:00"
      }
    },
    {
      "e_type": "friendship",
      "directed": false,
      "from_id": "Tom",
      "from_type": "person",
      "to_id": "Jenny",
      "to_type": "person",
      "attributes": {
        "connect_day": "2015-01-01 00:00:00"
      }
    }
  ]
}

For more built-in REST endpoints, you can refer to the Built-in Endpoints page.

Develop Parameterized Queries

Develop, install, and run parameterized GSQL queries

We just saw how easy and quick it is to run simple built-in queries. However you'll undoubtedly want to create more customized or complex queries. GSQL puts maximum power in your hands through parameterized vertex set queries. Parameterized queries let you traverse the graph from one vertex set to an adjacent set of vertices, again and again, performing computations along the way, with built-in parallel execution and handy aggregation operations. You can even have one query call another query. But we'll start simple.

A GSQL parameterized query has three steps.

Define your query in GSQL. This query will be added to the GSQL catalog.
Install one or more queries in the catalog, generating a REST endpoint for each query.
Run an installed query, supplying appropriate parameters, either as a GSQL command or by sending an HTTP request to the REST endpoint.

A Simple 1-Hop Query

Now, let's write our first GSQL query. We'll display all the direct (1-hop) neighbors of a person, given as an input parameter.

GSQL command

USE GRAPH social
CREATE QUERY hello(VERTEX<person> p) {
  Start = {p};
  Result = SELECT tgt
           FROM Start:s-(friendship:e) ->person:tgt;
  PRINT Result;
}

This query features one SELECT statement. The SELECT statements here are much more powerful than the ones in built-in queries. Here you can do the following:The query starts by seeding a vertex set "Start" with the person vertex identified by parameter p passed in from the query call. The curly braces tell GSQL to construct a set containing the enclosed items.

Next, the SELECT statement describes a 1-hop traversal according to the pattern described in the FROM clause:

Start:s -(friendship:e)-> person:tgt

This is basically the same syntax we used for the built-in select edges query. Namely, we select all edges beginning from the given source set (Start), which have the given edge type (friendship) and which end at the given vertex type (person). A feature we haven't seen before is the use of vertex and edge set aliases defined by ":alias": "s" is the alias for the source vertex set, "e" is the edge set alias, and "tgt" is the target vertex set alias.

Refer back to the initial clause and the assignment ("Result = SELECT tgt"). Here we see the target set's alias tgt. This means that the SELECT statement should return the target vertex set (as filtered and processed by the full set of clauses in the SELECT query block) and assign that output set to the variable called Result.

Last, we print out the Result vertex set, in JSON format.

Create A Query

Rather than defining our query in interactive mode, we can store the query in a file and invoke the file from within the GSQL shell, using the @filename syntax. Copy and paste the above query into a file /home/tigergraph/hello.gsql. Then, enter the GSQL shell and invoke the file using @hello.qsql ( Note that if you are not in the /home/tigergraph folder when you start gsql, then you can use the absolute path to invoke a gsql file. e.g., @/home/tigergraph/hello.gsql ) Then run the "ls" command to see that the query is now in the catalog.

GSQL shell

GSQL > @hello.gsql
Using graph 'social'
The query hello has been added!
GSQL > ls
---- Graph social
Vertex Types: 
  - VERTEX person(PRIMARY_ID name STRING, name STRING, age INT, gender STRING, state STRING) WITH STATS="OUTDEGREE_BY_EDGETYPE"
Edge Types: 
  - UNDIRECTED EDGE friendship(from person, to person, connect_day DATETIME)

Graphs: 
  - Graph social(person:v, friendship:e)
Jobs: 
  - CREATE LOADING JOB load_social FOR GRAPH social {
      DEFINE FILENAME file2 = "/home/tigergraph/friendship.csv";
      DEFINE FILENAME file1 = "/home/tigergraph/person.csv";

      LOAD file1 TO VERTEX person VALUES($"name", $"name", $"age", $"gender", $"state") USING SEPARATOR=",", HEADER="true", EOL="\n";
      LOAD file2 TO EDGE friendship VALUES($0, $1, $2) USING SEPARATOR=",", HEADER="true", EOL="\n";
    }

Queries: 
  - hello(vertex<person> p)

Install a Query

However, the query is not installed yet; it is not ready to run. In the GSQL shell, type the following command to installed the just added query "hello".

GSQL command

INSTALL QUERY hello

GSQL shell

GSQL > INSTALL QUERY hello
Start installing queries, about 1 minute ...
hello query: curl -X GET 'http://127.0.0.1:9000/query/social/hello?p=VALUE'. Add -H "Authorization: Bearer TOKEN" if authentication is enabled.

[========================================================================================================] 100% (1/1)

It takes about 1 minute for the database to install this new query. Be patient! For queries on large datasets, this small investment pays off many times over in faster query execution, particularly if you will run the query many times, with different parameters. The installation will generate machine instructions and a REST endpoint. After the progress bar reaches 100%, we are ready to run this query.

Run a Query in GSQL

To run a query in GSQL, use "RUN QUERY" followed by the query name and a set of parameter values.

GSQL command - run query examples

RUN QUERY hello("Tom")

The result is presented in JSON format. Tom has two 1-hop neighbors, namely Dan and Jenny.

GSQL shell

GSQL > RUN QUERY hello("Tom")
{
  "error": false,
  "message": "",
  "version": {
    "edition": "developer",
    "schema": 0,
    "api": "v2"
  },
  "results": [{"Result": [
    {
      "v_id": "Dan",
      "attributes": {
        "gender": "male",
        "name": "Dan",
        "state": "ny",
        "age": 34
      },
      "v_type": "person"
    },
    {
      "v_id": "Jenny",
      "attributes": {
        "gender": "female",
        "name": "Jenny",
        "state": "tx",
        "age": 25
      },
      "v_type": "person"
    }
  ]}]
}

Run a Query as a REST Endpoint

Under the hood, installing a query will also generate a REST endpoint, so that the parameterized query can be invoked by an http call. In Linux, the curl command is the most popular way to submit an http request. In the example below, the portion that is standard for all queries is shown in bold ; the portion in normal weight pertains to this particular query and parameter value. The JSON result will be returned to the Linux shell's standard output. So, our parameterized query becomes a http service!

Linux shell

curl -X GET 'http://localhost:9000/query/social/hello?p=Tom'

Finally, to see the GSQL text of a query in the catalog, you can use

GSQL command - show query example

#SHOW QUERY query_name. E.g.
SHOW QUERY hello

Congratulations! At this point, you have gone through the whole process of defining, installing, and running a query.

Running Anonymous Queries Without Installing

Installing query will give the fastest query speed, but user needs to wait for the installation overhead.

The Interpreted Mode for GSQL, introduced in TigerGraph 2.4, let us skip the INSTALL step, and even run a query as soon as we create it, to offer a more interactive experience. These one-step interpreted queries are unnamed (anonymous) and parameterless, just like SQL. Please refer to GSQL 102 for this mode.

A More Advanced Query

Now, let's do a more advanced query. This time, we are going to learn to use the powerful built-in accumulators, which serves as the runtime attributes (properties) attachable to each vertex visited during our traversal on the graph. Runtime means they exist only while the query is running; they are called accumulators because they are specially designed to gather (accumulate) data during an implicitly parallel processing of the query.

GSQL command file - hello2.gsql

USE GRAPH social
CREATE QUERY hello2 (VERTEX<person> p) {
  OrAccum  @visited = false;
  AvgAccum @@avgAge;
  Start = {p};

  FirstNeighbors = SELECT tgt
                   FROM Start:s -(friendship:e)-> person:tgt
                   ACCUM tgt.@visited += true, s.@visited += true;

  SecondNeighbors = SELECT tgt
                    FROM FirstNeighbors -(:e)-> :tgt
                    WHERE tgt.@visited == false
                    POST_ACCUM @@avgAge += tgt.age;

  PRINT SecondNeighbors;
  PRINT @@avgAge;
}
INSTALL QUERY hello2
RUN QUERY hello2("Tom")

In this query we will find all the persons which are exactly 2 hops away from the parameterized input person. Just for fun, let's also compute the average age of those 2-hop neighbors.

In the standard approach for this kind of graph traversal algorithm, you use a boolean variable to mark the first time that the algorithm "visits" a vertex, so that it knows not to count it again. To fit this need, we'll define a local accumulator of the type OrAccum. To declare a local accumulator, we prefix an identifier name with a single "@" symbol. Each accumulator type has a default initial value; the default value for boolean accumulators is false. Optionally, you can specify an initial value.

We also want to compute one average, so we will define a global AvgAccum. The identifier for a global accumulator begins with two "@"s.

After defining the Start set, we then have our first one 1-hop traversal. The SELECT and FROM clauses are the same as in our first example, but there is an additional ACCUM clause. The += operator within an ACCUM clause means that for each edge matching the FROM clause pattern, we accumulate the right-hand-side expression (true) to the left-hand-accumulator (tgt.@visited as well as s.@visited). Note that a source vertex or target vertex may be visited multiple times. Referring to Figure 1, if we start at vertex Tom, there are two edges incidents to Tom, so the ACCUM clause in the first SELECT statement will visit Tom two times. Since the accumulator type is OrAccum, the cumulative effect of the two traversals is the following:

Tom.@visited <== (initial value: false) OR (true) OR (true)

Note that it does not matter which of the two edges was processed first, so this operation is suitable for multithreaded parallel processing. The net effect is that as long as a vertex is visited at least once, it will end up with @visited = true. The result of this first SELECT statement is assigned to the variable FirstNeighbors.

The second SELECT block will do one hop further, starting from the FirstNeighbors vertex set variable, and reaching the 2-hop neighbors. Note that this time, we have omitted the edge type friendship and the target vertex type person from the FROM clause, but we retained the aliases. If no type is mentioned for an alias, then it is interpreted as ALL types. Since our graph has only one vertex type and one edge type, it is logically the same as if we had specified the types. The WHERE clause filters out the vertices which have been marked as visited before (the 1-hop neighbors and the starting vertex p ). This SELECT statement uses POST_ACCUM instead of ACCUM. The reason is that POST_ACCUM traverses the vertex sets instead of the edge sets, guaranteeing that we do not double-count any vertices. Here, we accumulate the ages of the 2-hop neighbors to get their average.

Finally, the SecondNeighbors of p are printed out.

This time, we put all of the following GSQL commands into one file hello2.gsql:

USE GRAPH social
The query definition
Installing the query
Running the query

We can execute this full set of commands without entering the GSQL shell. Please copy and paste the above GSQL commands into a Linux file named /home/tigergraph/hello2.gsql.

In a Linux shell, under /home/tigergraph, type the following:

Linux shell

gsql hello2.gsql

GSQL Query Summary:

Queries are installed in the catalog and can have one or more input parameters, enabling reuse of queries.
A GSQL query consists of a series of SELECT query blocks, each generating a named vertex set.
Each SELECT query block can start traversing the graph from any of the previously defined vertex sets (that is, the sequence does not have to form a linear chain).
Accumulators are runtime variables with built-in accumulation operations, for efficient multithreaded computation.
Query can call another query.
Output is in JSON format.

Review

You have learned a lot in GSQL 101!

With just the knowledge from GSQL 101 and a little practice, you should be able to do the following:

Create a graph schema containing multiple vertex types and edge types.
Define a loading job that takes one or more CSV files and maps the data directly to the vertices and edges of your graph.
Write and run simple parameterized queries which start at one vertex and then traverse one or more hops to generate a final vertex set. Make a simple additive computation and return the results.

Want to learn more?

To learn to do the same types of operations using the GraphStudio Visual SDK and UI, see the TigerGraph GraphStudio UI Guide.
Continue learning Accumulators and Pattern Matching.
To see more GSQL examples, see IS, IC, and BI workloads based on LDBC benchmark schema, and GSQL Demo Examples .
To get answers to common questions, see the TigerGraph Knowledge Base and FAQs .
To see the full GSQL specification (whose table of contents with give you and idea of what is available) see
- GSQL Language Reference Part 1 - Defining Graphs and Loading Data
- GSQL Language Reference Part 2 - Querying
To discuss and get help with fellow GSQL users and GSQL developers, join GSQL community forum.

Accumulators Tutorial

Introduction

GSQL is a Turing complete Graph Database query language. Comparing to other graph query languages, the biggest advantages is its support of accumulators -- global or vertex local.

In addition to provide the classic pattern match syntax, which is easy to master, GSQL supports powerful run-time vertex attributes (a.k.a local accumulators) and global state variables (a.k.a global accumulators).

This short tutorial aims to shorten the learning curve of accumulator. Supposedly, after reading this article, everyone can master the essence of accumulator by heart, and start solving real-life graph problems with this handy language feature.

What is an Accumulator?

An accumulator is a state variable in GSQL. Its state is mutable throughout the life cycle of a query. It has an initial value, and users can keep accumulating (using its "+=" built-in operator) new values into it. Each accumulator variable has a type. The type decides what semantics the declared accumulator will use to interpret the "+=" operation.

In Figure 1's left box, from line 3 to line 8, six different accumulator variables (those with prefix @@) are declared, each with a unique type. Below we explain the semantic and usage of them.

SumAccum<INT> allows user to keep adding INT values into its internal state variable. As the line 10 and 11 have shown, we added 1 and 2 to the accumulator, and end up with the value 3 (shown on line 3 in the right box).
MinAccum<INT> keeps the smallest INT number it has seen. As the line 14 and 15 have shown, we accumulated 1 and 2 to the MinAccum accumulator, and end up with the value 1 (shown on line 6 in the right box).
MaxAccum<INT> is symmetric to MinAccum. It returns the MAX INT value it has seen. Lines 18 and 19 show that we send 1 and 2 into it, and end up with the value 2 (shown on line 9 in the right box).
OrAccum keeps OR-ing the internal boolean state variable with new boolean variables that accumulate to it. The initial default value is FALSE. Lines 22 and 23 show that we send TRUE and FALSE into it, and end up with the TRUE value (shown on line 12 in the right box).
AndAccum is symmetric to OrAccum. Instead of using OR, it uses the AND accumulation semantics. Line 26 and 27 show that we accumulate TRUE and FALSE into it, and end up with the FALSE value (shown on line 15 in the right box).
ListAccum<INT> keeps appending new integer(s) into its internal list variable. Line 30 - 32 show that we append 1, 2, and [3,4] to the accumulator, and end up with [1,2,3,4] (shown on lines 19-22 in the right box).

Global vs. Vertex-attached Accumulator

At this point, we have seen that accumulators are special typed variable in GSQL language. We are ready to explain their global and local scopes.

Global accumulator belongs to the entire query. Anywhere in a query, a statement can update its value. Local accumulator belongs to each vertex. It can only be updated when its owning vertex is accessible. To differentiate them, we use special prefixes in the identifier when we declare them.

@@ prefix is used for declaring global accumulator variable. It is always used stand-alone. E.g @@cnt +=1
@ prefix is used for declaring local accumulator variable. It must be used with a vertex alias in a query block. E.g. v.@cnt += 1, where v is a vertex alias specified in a FROM clause of a SELECT-FROM-WHERE query block.

Consider a toy social graph modeled by a person vertex type and a person-to-person friendship edge type shown in Figure 2. Below we write a query, which accepts a person, and does a 1-hop traversal from the input person to its neighbors. We use the @@global_edge_cnt accumulator to accumulate the total number of edges we traverse. And we use @vertex_cnt to write to the input person's each friend vertex an integer 1.

As Figure 2 shows, Dan has 4 direct friends -- Tom, Kevin, Jenny, and Nancy, each of which holds a local accumulator @vertex_cnt. And the @@global_edge_cnt has value 4, reflecting the fact that for each edge, we have accumulated 1 into it.

ACCUM vs. POST-ACCUM

ACCUM and POST-ACCUM clauses are computed in stages, where-in a SELECT-FROM-WHERE query block, ACCUM is executed first, followed by the POST-ACCUM clause.

ACCUM executes its statement(s) once for each matched edge (or path) of the FROM clause pattern. Further, ACCUM parallelly executes its statements for all the matches.
POST-ACCUM executes its statement(s) once for each involved vertex. Note that each statement within the POST-ACCUM clause can refer to either source vertices or target vertices but not both. Its statements can access the aggregated accumulator result computed in the ACCUM clause.

Conclusion

We have explained the mechanism of accumulators, their types, and the two different scopes--global and local. We also elaborate the ACCUM and POST-ACCUM clause semantics. Once you master the basics, the rest is to practice more. We have made available 46 queries based on the LDBC schema. These 46 queries are divided into three groups.

You can follow GSQL 102 to setup the environment. You can also post your feedback and questions on the GSQL community forum. Our community members and developers love to hear any feedback from your graph journey of using GSQL and are ready to help clarifying any doubts.

GSQL102 Pattern Matching

A guide to GSQL's multi-hop SELECT statements which make it convenient to express pattern-matching queries.

Get Set

Introduction

In this tutorial, we will show you how to write and run Pattern Matching queries. Pattern Matching is available in TigerGraph v2.4+.

This tutorial was updated for TigerGraph 3.0. If you are using an older version, please change to the documentation for that version.

What is a Graph Pattern?

A graph pattern is a traversal trace on the graph schema. A pattern can contain repeated steps. A pattern can be a linear trace, or a non-linear trace (tree, circle etc.). For example, imagine a simple schema consisting of a Person vertex type and a Friendship edge type. A pattern could be a trace on this simple schema,

or, use *2 to denote the two consecutive Friendship edges,

What is Pattern Matching?

Pattern matching is the process of finding subgraphs in a data graph that conform to a given query pattern.

Prepare Your TigerGraph Environment

We assume you are running your own TigerGraph instance as the sole user with full privileges. If you are on a multiuser Enterprise Edition, consult with your DB administrator. You need to have Designer or Admin privilege on an empty graph. At various points in this tutorial, there are links to download files. Most are small, but the graph data file is 1GB when uncompressed.

First, let's check that you can access GSQL, and that your version is 3.0 or higher.

Open a Linux shell.
Type gsql as below. A GSQL shell prompt should appear as below.
You need to start from an empty data catalog. If necessary, run drop all to clear the catalog first.

Cheatsheet

The % prefix indicates Linux shell commands. You need TigerGraph admin privilege to run most gadmin commands.
The GSQL> prefix indicates GSQL shell commands.

Define the Schema

Data Set

We will use the LDBC Social Network Benchmark (LDBC SNB) data set. This data set models a typical social forum, where communities of persons can post messages on a forum to discuss a topic. It comes with a data generator, which allows you to generate data at different scale factors. Scale factor 1 generates roughly 1GB of raw data, scale factor 10 generates roughly 10GB of raw data, etc.

Figure 1 shows the schema (from the LDBC SNB specification). It models the activities and relationships of social forum participants. For example, a forum Member can publish Posts on a Forum, and other Members of the Forum can make a Comment on the Post or on someone else's Comment. A Person's home location is a hierarchy (Continent>Country>City), and a person can be affiliated with a University or a Company. A Tag can be used to classify a Forum, a Message, or a Person's interests. Tags can further be classified by TagClass. The relationships between entities are modeled as directed edges, except person KNOWS person is modeled as an undirected edge. For example, Person connects to Tag by the hasInterest edge. Forum connects to Person by two different edges, hasMember and hasModerator.

LDBC SNB schema uses inheritance to model certain relationships:

Message is the superclass of Post and Comment.
Place is the superclass of City, Country, and Continent.
Organization is the superclass of University and Company.

We do not use the superclasses in our graph model. When there is an edge type connecting an entity to a superclass, we instead create an edge type from the entity to each of the subclasses of the superclass. For example, Message has an isLocatedIn relationship to Country. Since Message has two subclasses, Post and Comment, both connected with Country by the isLocatedIn relationship, we create an edge type IS_LOCATED_IN, connecting both vertex type pairs using the compound edge type available in TigerGraph 3.0.

CREATE DIRECTED EDGE IS_LOCATED_IN ( FROM Comment, TO Country | FROM Post, TO Country )

This new DDL syntax allows a general edge type to be defined over multiple vertex pairs. For example, there are many relationships in the LDBC schema all called isLocatedIn which connect something to a geographical entity. We can model them all as a single edge type IS_LOCATED_IN. The result is a more succinct graph model and less GSQL code when expressing pattern matching queries.

CREATE DIRECTED EDGE IS_LOCATED_IN (FROM Comment, TO Country | FROM Post, TO Country | FROM Company, TO Country | FROM Person, TO City | FROM University, TO City) WITH REVERSE_EDGE="IS_LOCATED_IN_REVERSE"

Schema Naming Conventions

The folowing nameing conventions are followed in the DDL language:

Vertex types
- For each entity in Figure 1 (the rectangular boxes), we create a vertex type with the entity's name in UpperCamelCase.
- Examples
  - Person is a person who participates in a forum.
  - Forum is a place where persons discuss topics.
  - City, Country, and Continent are geographic locations of other entities.
  - Company and University are organizations with which a person can be affiliated.
  - Comment and Post are the interaction messages created by persons in a forum.
  - Tag is a topic or a concept.
  - TagClass is a class or a category. TagClass can form a hierarchy of tags.
Edge types
- For each relationship type, we create an edge type with the relationship name (all capitalized and words are separated by an underscore).
- When multiple relationships share the same semantics in Figure 1, we merge them into a single compound edge type. For example:
- Examples
  - CONTAINER_OF: Forum is the container of posts.
  - HAS_INTERESTS: Person has interest in tag(s).
  - IS_SUBCLASS_OF: Tag is a subclass of another Tag.
  - IS_LOCATED_IN: Comment or Post is located in Country, Company is located in Country, Person is located in City, and University is located in City.

GSQL Schema DDL

Two GSQL scripts for defining the LDBC-SNB schema are shown below. Choose the one that serve your needs. They are not equivalent if you have different organizations using the same graph database instance.

Method 1. Bottom-up DDL

In this method, global vertex/edge type containers are created first. Next, graphs are created to group them. In other words, the global vertex/edge type containers can be shared across graphs.

#LDBC-SNB schema

//clear the current catalog. 
// It may take a while since it restarts the subsystem services. 
DROP ALL

//create vertex types

## Post and Comment
CREATE VERTEX Comment (PRIMARY_ID id UINT, creationDate DATETIME, locationIP STRING,
    browserUsed STRING, content STRING, length UINT) WITH primary_id_as_attribute="TRUE"
CREATE VERTEX Post (PRIMARY_ID id UINT, imageFile STRING, creationDate DATETIME,
    locationIP STRING, browserUsed STRING, lang STRING, content STRING,
    length UINT) WITH primary_id_as_attribute="TRUE"
## organisation
CREATE VERTEX Company (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE"
CREATE VERTEX University (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE"
## place
CREATE VERTEX City (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE"
CREATE VERTEX Country (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE"
CREATE VERTEX Continent (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE"
## etc
CREATE VERTEX Forum (PRIMARY_ID id UINT, title STRING, creationDate DATETIME) WITH primary_id_as_attribute="TRUE"
CREATE VERTEX Person (PRIMARY_ID id UINT, firstName STRING, lastName STRING, gender STRING, birthday DATETIME, 
   creationDate DATETIME, locationIP STRING, browserUsed STRING, speaks set<STRING>, email set<STRING>) 
   WITH primary_id_as_attribute="TRUE"
CREATE VERTEX Tag (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE"
CREATE VERTEX TagClass (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE"

// create edge types
CREATE DIRECTED EDGE CONTAINER_OF (FROM Forum, TO Post) WITH REVERSE_EDGE="CONTAINER_OF_REVERSE"
CREATE DIRECTED EDGE HAS_CREATOR (FROM Comment|Post, TO Person) WITH REVERSE_EDGE="HAS_CREATOR_REVERSE"
CREATE DIRECTED EDGE HAS_INTEREST (FROM Person, TO Tag) WITH REVERSE_EDGE="HAS_INTEREST_REVERSE"
CREATE DIRECTED EDGE HAS_MEMBER (FROM Forum, TO Person, joinDate DATETIME) WITH REVERSE_EDGE="HAS_MEMBER_REVERSE"
CREATE DIRECTED EDGE HAS_MODERATOR (FROM Forum, TO Person) WITH REVERSE_EDGE="HAS_MODERATOR_REVERSE"
CREATE DIRECTED EDGE HAS_TAG (FROM Comment|Post|Forum, TO Tag) WITH REVERSE_EDGE="HAS_TAG_REVERSE"
CREATE DIRECTED EDGE HAS_TYPE (FROM Tag, TO TagClass) WITH REVERSE_EDGE="HAS_TYPE_REVERSE"
CREATE DIRECTED EDGE IS_LOCATED_IN (FROM Comment, TO Country
                                  | FROM Post, TO Country
                                  | FROM Company, TO Country
                                  | FROM Person, TO City
                                  | FROM University, TO City) WITH REVERSE_EDGE="IS_LOCATED_IN_REVERSE"
CREATE DIRECTED EDGE IS_PART_OF (FROM City, TO Country
                               | FROM Country, TO Continent) WITH REVERSE_EDGE="IS_PART_OF_REVERSE"
CREATE DIRECTED EDGE IS_SUBCLASS_OF (FROM TagClass, TO TagClass) WITH REVERSE_EDGE="IS_SUBCLASS_OF_REVERSE"
CREATE UNDIRECTED EDGE KNOWS (FROM Person, TO Person, creationDate DATETIME)
CREATE DIRECTED EDGE LIKES (FROM Person, TO Comment|Post, creationDate DATETIME) WITH REVERSE_EDGE="LIKES_REVERSE"
CREATE DIRECTED EDGE REPLY_OF (FROM Comment, TO Comment|Post) WITH REVERSE_EDGE="REPLY_OF_REVERSE"
CREATE DIRECTED EDGE STUDY_AT (FROM Person, TO University, classYear INT) WITH REVERSE_EDGE="STUDY_AT_REVERSE"
CREATE DIRECTED EDGE WORK_AT (FROM Person, TO Company, workFrom INT) WITH REVERSE_EDGE="WORK_AT_REVERSE"

//create graph type
CREATE GRAPH ldbc_snb (*)

Method 2. Top-down DDL

In this method, an empty graph is created first. Next, local vertex/edge type containers are added to the empty graph via a schema change job. The vertex/egde type containers added this way will be private to the graph, no other graph can see them.

#LDBC-SNB schema

//clear the current catalog. 
// It may take a while since it restarts the subsystem services. 
DROP ALL

# 1. Create graph
CREATE GRAPH ldbc_snb ()

# 2. Create schema_change job to include all vertex/edge types
CREATE SCHEMA_CHANGE JOB change_schema_of_ldbc  FOR GRAPH ldbc_snb {

  ## Post and Comment
  ADD VERTEX Comment (PRIMARY_ID id UINT, creationDate DATETIME, locationIP STRING,
    browserUsed STRING, content STRING, length UINT) WITH primary_id_as_attribute="TRUE";
    
  ADD VERTEX Post (PRIMARY_ID id UINT, imageFile STRING, creationDate DATETIME,
    locationIP STRING, browserUsed STRING, lang STRING, content STRING,
    length UINT) WITH primary_id_as_attribute="TRUE";
  ## organisation
  ADD VERTEX Company (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE";
  ADD VERTEX University (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE";
  ## place
  ADD VERTEX City (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE";
  ADD VERTEX Country (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE";
  ADD VERTEX Continent (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE";
  ## etc
  ADD  VERTEX Forum (PRIMARY_ID id UINT, title STRING, creationDate DATETIME) WITH primary_id_as_attribute="TRUE";
  ADD  VERTEX Person (PRIMARY_ID id UINT, firstName STRING, lastName STRING, gender STRING, birthday DATETIME, 
   creationDate DATETIME, locationIP STRING, browserUsed STRING, speaks set<STRING>, email set<STRING>) 
   WITH primary_id_as_attribute="TRUE";
  ADD VERTEX Tag (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE";
  ADD VERTEX TagClass (PRIMARY_ID id UINT, name STRING, url STRING) WITH primary_id_as_attribute="TRUE";

  // create edge types
  ADD DIRECTED EDGE CONTAINER_OF (FROM Forum, TO Post) WITH REVERSE_EDGE="CONTAINER_OF_REVERSE";
  ADD  DIRECTED EDGE HAS_CREATOR (FROM Comment|Post, TO Person) WITH REVERSE_EDGE="HAS_CREATOR_REVERSE";
  ADD  DIRECTED EDGE HAS_INTEREST (FROM Person, TO Tag) WITH REVERSE_EDGE="HAS_INTEREST_REVERSE";
  ADD DIRECTED EDGE HAS_MEMBER (FROM Forum, TO Person, joinDate DATETIME) WITH REVERSE_EDGE="HAS_MEMBER_REVERSE";
  ADD DIRECTED EDGE HAS_MODERATOR (FROM Forum, TO Person) WITH REVERSE_EDGE="HAS_MODERATOR_REVERSE";
  ADD DIRECTED EDGE HAS_TAG (FROM Comment|Post|Forum, TO Tag) WITH REVERSE_EDGE="HAS_TAG_REVERSE";
  ADD DIRECTED EDGE HAS_TYPE (FROM Tag, TO TagClass) WITH REVERSE_EDGE="HAS_TYPE_REVERSE";
  ADD  DIRECTED EDGE IS_LOCATED_IN (FROM Comment, TO Country
                                  | FROM Post, TO Country
                                  | FROM Company, TO Country
                                  | FROM Person, TO City
                                  | FROM University, TO City) WITH REVERSE_EDGE="IS_LOCATED_IN_REVERSE";
  ADD DIRECTED EDGE IS_PART_OF (FROM City, TO Country
                               | FROM Country, TO Continent) WITH REVERSE_EDGE="IS_PART_OF_REVERSE";
  ADD DIRECTED EDGE IS_SUBCLASS_OF (FROM TagClass, TO TagClass) WITH REVERSE_EDGE="IS_SUBCLASS_OF_REVERSE";
  ADD UNDIRECTED EDGE KNOWS (FROM Person, TO Person, creationDate DATETIME);
  ADD DIRECTED EDGE LIKES (FROM Person, TO Comment|Post, creationDate DATETIME) WITH REVERSE_EDGE="LIKES_REVERSE";
  ADD DIRECTED EDGE REPLY_OF (FROM Comment, TO Comment|Post) WITH REVERSE_EDGE="REPLY_OF_REVERSE";
  ADD DIRECTED EDGE STUDY_AT (FROM Person, TO University, classYear INT) WITH REVERSE_EDGE="STUDY_AT_REVERSE";
  ADD DIRECTED EDGE WORK_AT (FROM Person, TO Company, workFrom INT) WITH REVERSE_EDGE="WORK_AT_REVERSE";
}

# 3. Run schema_change job
RUN SCHEMA_CHANGE JOB change_schema_of_ldbc

# 4. Drop schema_change job
DROP JOB change_schema_of_ldbc

Load Data

Define the Loading Job

Below, we use the GSQL loading language to define a loading job script, which encodes all the mappings from the source CSV file (generated by the LDBC SNB benchmark data generator) to our schema.

The script to load the LDBC-SNB data is below.

# GSQL script for loading LDBC-SNB data

USE GRAPH ldbc_snb
CREATE LOADING JOB load_ldbc_snb FOR GRAPH ldbc_snb {
  // define vertex
  DEFINE FILENAME v_comment_file;
  DEFINE FILENAME v_post_file;
  DEFINE FILENAME v_organisation_file;
  DEFINE FILENAME v_place_file;
  DEFINE FILENAME v_forum_file;
  DEFINE FILENAME v_person_file;
  DEFINE FILENAME v_tag_file;
  DEFINE FILENAME v_tagclass_file;

  // define edge
  DEFINE FILENAME forum_containerOf_post_file;
  DEFINE FILENAME comment_hasCreator_person_file;
  DEFINE FILENAME post_hasCreator_person_file;
  DEFINE FILENAME person_hasInterest_tag_file;
  DEFINE FILENAME forum_hasMember_person_file;
  DEFINE FILENAME forum_hasModerator_person_file;
  DEFINE FILENAME comment_hasTag_tag_file;
  DEFINE FILENAME post_hasTag_tag_file;
  DEFINE FILENAME forum_hasTag_tag_file;
  DEFINE FILENAME tag_hasType_tagclass_file;
  DEFINE FILENAME organisation_isLocatedIn_place_file;
  DEFINE FILENAME comment_isLocatedIn_place_file;
  DEFINE FILENAME post_isLocatedIn_place_file;
  DEFINE FILENAME person_isLocatedIn_place_file;
  DEFINE FILENAME place_isPartOf_place_file;
  DEFINE FILENAME tagclass_isSubclassOf_tagclass_file;
  DEFINE FILENAME person_knows_person_file;
  DEFINE FILENAME person_likes_comment_file;
  DEFINE FILENAME person_likes_post_file;
  DEFINE FILENAME comment_replyOf_comment_file;
  DEFINE FILENAME comment_replyOf_post_file;
  DEFINE FILENAME person_studyAt_organisation_file;
  DEFINE FILENAME person_workAt_organisation_file;
  
  // load vertex
  LOAD v_comment_file
    TO VERTEX Comment VALUES ($0, $1, $2, $3, $4, $5) USING header="true", separator="|";
  LOAD v_post_file
    TO VERTEX Post VALUES ($0, $1, $2, $3, $4, $5, $6, $7) USING header="true", separator="|";
  LOAD v_organisation_file
    TO VERTEX Company VALUES ($0, $2, $3) WHERE $1=="company",
    TO VERTEX University VALUES ($0, $2, $3) WHERE $1=="university" USING header="true", separator="|";
  LOAD v_place_file
    TO VERTEX City VALUES ($0, $1, $2) WHERE $3=="city",
    TO VERTEX Country VALUES ($0, $1, $2) WHERE $3=="country",
    TO VERTEX Continent VALUES ($0, $1, $2) WHERE $3=="continent" USING header="true", separator="|";
  LOAD v_forum_file
    TO VERTEX Forum VALUES ($0, $1, $2) USING header="true", separator="|";
  LOAD v_person_file
    TO VERTEX Person VALUES ($0, $1, $2, $3, $4, $5, $6, $7, SPLIT($8,";"), SPLIT($9,";")) USING header="true", separator="|";
  LOAD v_tag_file
    TO VERTEX Tag VALUES ($0, $1, $2) USING header="true", separator="|";
  LOAD v_tagclass_file
    TO VERTEX TagClass VALUES ($0, $1, $2) USING header="true", separator="|";
    
  // load edge
  LOAD forum_containerOf_post_file
    TO EDGE CONTAINER_OF VALUES ($0, $1) USING header="true", separator="|";
  LOAD comment_hasCreator_person_file
    TO EDGE HAS_CREATOR VALUES ($0 Comment, $1) USING header="true", separator="|";
  LOAD post_hasCreator_person_file
    TO EDGE HAS_CREATOR VALUES ($0 Post, $1) USING header="true", separator="|";
  LOAD person_hasInterest_tag_file
    TO EDGE HAS_INTEREST VALUES ($0, $1) USING header="true", separator="|";
  LOAD forum_hasMember_person_file
    TO EDGE HAS_MEMBER VALUES ($0, $1, $2) USING header="true", separator="|";
  LOAD forum_hasModerator_person_file
    TO EDGE HAS_MODERATOR VALUES ($0, $1) USING header="true", separator="|";
  LOAD comment_hasTag_tag_file
    TO EDGE HAS_TAG VALUES ($0 Comment, $1) USING header="true", separator="|";
  LOAD post_hasTag_tag_file
    TO EDGE HAS_TAG VALUES ($0 Post, $1) USING header="true", separator="|";
  LOAD forum_hasTag_tag_file
    TO EDGE HAS_TAG VALUES ($0 Forum, $1) USING header="true", separator="|";
  LOAD tag_hasType_tagclass_file
    TO EDGE HAS_TYPE VALUES ($0, $1) USING header="true", separator="|";
  LOAD organisation_isLocatedIn_place_file
    TO EDGE IS_LOCATED_IN VALUES ($0 Company, $1 Country) WHERE to_int($1) < 111,
    TO EDGE IS_LOCATED_IN VALUES ($0 University, $1 City) WHERE to_int($1) > 110 USING header="true", separator="|";
  LOAD comment_isLocatedIn_place_file
    TO EDGE IS_LOCATED_IN VALUES ($0 Comment, $1 Country) USING header="true", separator="|";
  LOAD post_isLocatedIn_place_file
    TO EDGE IS_LOCATED_IN VALUES ($0 Post, $1 Country) USING header="true", separator="|";
  LOAD person_isLocatedIn_place_file
    TO EDGE IS_LOCATED_IN VALUES ($0 Person, $1 City) USING header="true", separator="|";
  LOAD place_isPartOf_place_file
    TO EDGE IS_PART_OF VALUES ($0 Country, $1 Continent) WHERE to_int($0) < 111,
    TO EDGE IS_PART_OF VALUES ($0 City, $1 Country) WHERE to_int($0) > 110 USING header="true", separator="|";
  LOAD tagclass_isSubclassOf_tagclass_file
    TO EDGE IS_SUBCLASS_OF VALUES ($0, $1) USING header="true", separator="|";
  LOAD person_knows_person_file
    TO EDGE KNOWS VALUES ($0, $1, $2) USING header="true", separator="|";
  LOAD person_likes_comment_file
    TO EDGE LIKES VALUES ($0, $1 Comment, $2) USING header="true", separator="|";
  LOAD person_likes_post_file
    TO EDGE LIKES VALUES ($0, $1 Post, $2) USING header="true", separator="|";
  LOAD comment_replyOf_comment_file
    TO EDGE REPLY_OF VALUES ($0, $1 Comment) USING header="true", separator="|";
  LOAD comment_replyOf_post_file
    TO EDGE REPLY_OF VALUES ($0, $1 Post) USING header="true", separator="|";
  LOAD person_studyAt_organisation_file
    TO EDGE STUDY_AT VALUES ($0, $1, $2) USING header="true", separator="|";
  LOAD person_workAt_organisation_file
    TO EDGE WORK_AT VALUES ($0, $1, $2) USING header="true", separator="|";
}

Prepare The Raw Data

We have generated a data set with scale factor 1 (approximate 1GB). You can download it from https://s3-us-west-1.amazonaws.com/tigergraph-benchmark-dataset/LDBC/SF-1/ldbc_snb_data-sf1.tar.gz

# Linux Bash

wget https://s3-us-west-1.amazonaws.com/tigergraph-benchmark-dataset/LDBC/SF-1/ldbc_snb_data-sf1.tar.gz

After downloading the raw file, run the tar command below to decompress the downloaded file.

# Linux Bash

tar -xzf  ldbc_snb_data-sf1.tar.gz

After decompressing the file, you will see a folder named "ldbc_snb_data". Within it, you will see two subfolders

social_network
substitution_parameters

The raw data is in the social_network folder.

Run The Loading Job

Download setup_schema.gsql which combines the schema script and loading job script shown before.

Specify the environment variable LDBC_SNB_DATA_DIR to point to your raw file folder un-tarred in the previous section. In our example below, the raw data is in /home/tigergraph/ldbc_snb_data/social_network, so we use the export shell command to specify its location. Then, start your TigerGraph services if needed. Finally, run the setup_schema.gsql script to create your LDBC Social Network graph.

# Linux Bash

#change the directory to your raw file directory
export LDBC_SNB_DATA_DIR=/home/tigergraph/ldbc_snb_data/social_network/

#start all TigerGraph services
gadmin start all

#setup schema and loading job
gsql setup_schema.gsql

Download the loading job script and invoke it on the command line. #

# Linux Bash

./load_data.sh

# Sample Loading Progress Output

tigergraph/gsql_102$ ./load_data.sh
[Tip: Use "CTRL + C" to stop displaying the loading status update, then use "SHOW LOADING STATUS jobid" to track the loading progress again]
[Tip: Manage loading jobs with "ABORT/RESUME LOADING JOB jobid"]
Starting the following job, i.e.
  JobName: load_ldbc_snb, jobid: ldbc_snb.load_ldbc_snb.file.m1.1558053156447
  Loading log: '/mnt/data/tigergraph/logs/restpp/restpp_loader_logs/ldbc_snb/ldbc_snb.load_ldbc_snb.file.m1.1558053156447.log'

Job "ldbc_snb.load_ldbc_snb.file.m1.1558053156447" loading status
[FINISHED] m1 ( Finished: 31 / Total: 31 )
  [LOADED]
  +----------------------------------------------------------------------------------------------------------------------------------+
  |                                                                              FILENAME |   LOADED LINES |   AVG SPEED |   DURATION|
  |                       /mnt/data/download/ldbc_snb_data/social_network/comment_0_0.csv |        2052170 |    281 kl/s |     7.28 s|
  |     /mnt/data/download/ldbc_snb_data/social_network/comment_hasCreator_person_0_0.csv |        2052170 |    251 kl/s |     8.17 s|
  |            /mnt/data/download/ldbc_snb_data/social_network/comment_hasTag_tag_0_0.csv |        2698394 |    422 kl/s |     6.38 s|
  |     /mnt/data/download/ldbc_snb_data/social_network/comment_isLocatedIn_place_0_0.csv |        2052170 |    291 kl/s |     7.04 s|
  |       /mnt/data/download/ldbc_snb_data/social_network/comment_replyOf_comment_0_0.csv |        1040750 |    253 kl/s |     4.11 s|
  |          /mnt/data/download/ldbc_snb_data/social_network/comment_replyOf_post_0_0.csv |        1011421 |    248 kl/s |     4.07 s|
  |                         /mnt/data/download/ldbc_snb_data/social_network/forum_0_0.csv |          90493 |     87 kl/s |     1.03 s|
  |        /mnt/data/download/ldbc_snb_data/social_network/forum_containerOf_post_0_0.csv |        1003606 |    240 kl/s |     4.18 s|
  |        /mnt/data/download/ldbc_snb_data/social_network/forum_hasMember_person_0_0.csv |        1611870 |    431 kl/s |     3.74 s|
  |     /mnt/data/download/ldbc_snb_data/social_network/forum_hasModerator_person_0_0.csv |          90493 |     89 kl/s |     1.01 s|
  |              /mnt/data/download/ldbc_snb_data/social_network/forum_hasTag_tag_0_0.csv |         309767 |    297 kl/s |     1.04 s|
  |                  /mnt/data/download/ldbc_snb_data/social_network/organisation_0_0.csv |           7956 |      7 kl/s |     1.00 s|
  |/mnt/data/download/ldbc_snb_data/social_network/organisation_isLocatedIn_place_0_0.csv |           7956 |      7 kl/s |     1.00 s|
  |                        /mnt/data/download/ldbc_snb_data/social_network/person_0_0.csv |           9893 |      9 kl/s |     1.05 s|
  |        /mnt/data/download/ldbc_snb_data/social_network/person_hasInterest_tag_0_0.csv |         229167 |    223 kl/s |     1.03 s|
  |      /mnt/data/download/ldbc_snb_data/social_network/person_isLocatedIn_place_0_0.csv |           9893 |      9 kl/s |     1.00 s|
  |           /mnt/data/download/ldbc_snb_data/social_network/person_knows_person_0_0.csv |         180624 |    169 kl/s |     1.06 s|
  |          /mnt/data/download/ldbc_snb_data/social_network/person_likes_comment_0_0.csv |        1438419 |    449 kl/s |     3.20 s|
  |             /mnt/data/download/ldbc_snb_data/social_network/person_likes_post_0_0.csv |         751678 |    331 kl/s |     2.27 s|
  |   /mnt/data/download/ldbc_snb_data/social_network/person_studyAt_organisation_0_0.csv |           7950 |      7 kl/s |     1.00 s|
  |    /mnt/data/download/ldbc_snb_data/social_network/person_workAt_organisation_0_0.csv |          21655 |     21 kl/s |     1.00 s|
  |                         /mnt/data/download/ldbc_snb_data/social_network/place_0_0.csv |           1461 |      1 kl/s |     1.00 s|
  |          /mnt/data/download/ldbc_snb_data/social_network/place_isPartOf_place_0_0.csv |           1455 |      1 kl/s |     1.00 s|
  |                          /mnt/data/download/ldbc_snb_data/social_network/post_0_0.csv |        1003606 |    195 kl/s |     5.14 s|
  |        /mnt/data/download/ldbc_snb_data/social_network/post_hasCreator_person_0_0.csv |        1003606 |    320 kl/s |     3.13 s|
  |               /mnt/data/download/ldbc_snb_data/social_network/post_hasTag_tag_0_0.csv |         713259 |    341 kl/s |     2.09 s|
  |        /mnt/data/download/ldbc_snb_data/social_network/post_isLocatedIn_place_0_0.csv |        1003606 |    327 kl/s |     3.07 s|
  |                           /mnt/data/download/ldbc_snb_data/social_network/tag_0_0.csv |          16081 |     16 kl/s |     1.00 s|
  |          /mnt/data/download/ldbc_snb_data/social_network/tag_hasType_tagclass_0_0.csv |          16081 |     16 kl/s |     1.00 s|
  |                      /mnt/data/download/ldbc_snb_data/social_network/tagclass_0_0.csv |             72 |      71 l/s |     1.00 s|
  |/mnt/data/download/ldbc_snb_data/social_network/tagclass_isSubclassOf_tagclass_0_0.csv |             71 |      70 l/s |     1.00 s|
  +----------------------------------------------------------------------------------------------------------------------------------+

After loading, you can check the graph's size using built-in REST endpoint calls.

Below we call two functions, stat_vertex_number and stat_edge_number to return the cardinality of each vertex and edge type.

Linux shell

curl -X POST 'http://localhost:9000/builtins/ldbc_snb' -d  '{"function":"stat_vertex_number","type":"*"}'  | jq .
curl -X POST 'http://localhost:9000/builtins/ldbc_snb' -d  '{"function":"stat_edge_number","type":"*"}' | jq .

One-hop patterns

Introduction

‌Pattern matching by nature is declarative. It enables users to focus on specifying what they want from a query without worrying about the underlying query processing.‌

A pattern usually appears in the FROM clause, the most fundamental part of the query structure. The pattern specifies sets of vertex types and how they are connected by edge types. A pattern can be refined further with conditions in the WHERE clause. In this tutorial, we'll start with simple one-hop path patterns, and then extend it multi-hop patterns and finally multiple-path patterns.

Currently, pattern matching may only be used in read-only queries. DML support will be added in the near future release.

Pattern matching queries support nested queries

1-Hop Pattern

The easiest way to understand patterns is to start with a simple 1-Hop pattern. Even a single hop has several options. After we've tackled single hops, then we'll see how to add repetition to make variable length patterns and how to connect single hops to form bigger patterns.

In pattern matching, we use the punctuation -( )- to denote a 1-hop pattern, where the edge type(s) is enclosed in the parentheses () and the hyphens - symbolize connection without specifying direction. Instead, directionality is explicitly stated for each edge type.

For an undirected edge E, no added decoration: E
For a directed edge E from left to right, use a suffix: E>
For a directed edge E from right to left, use a prefix: <E

‌For example, in the LDBC SNB schema, there are two directed relationships between Person and Message: person LIKES message, and message HAS_CREATOR person. Despite the fact that these relationships are in opposite directions, we can include both of them in the same pattern very concisely using an alternation separator | :

Edge Type Wildcards

The underscore _ is a wildcard meaning any edge type. Arrowheads are still used to indicate direction, e.g., _> or <_ or _ The empty parentheses () means any edge, directed or undirected.

Vertex Type Wildcards and Path Symmetry

Prior to TigerGraph 3.0, the source (leftmost) vertex set needed to be defined as an explicit set, prior to the SELECT statement. A typical approach is shown here.

Beginning in TigerGraph 3.0, SYNTAX V2 treats the source vertex set the same as the target vertex set. That is, the source or the target vertex set may be:

a vertex type SELECT t FROM Person:s -(IS_LOCATED_IN>:e) - City:t
an alternation of vertex types SELECT t FROM (Post|Comment):s -(IS_LOCATED_IN>:e) - Country:t
omitted, with only an alias, meaning any vertex type SELECT s FROM :s -(IS_LOCATED_IN>:e) - Country:t
omitted, without an alias, meaning any vertex type SELECT t FROM -(IS_LOCATED_IN>:e) - Country:t

Performance may be better when types are explicitly provided.

Examples of 1-Hop Patterns

FROM X:x - (E1:e1) - Y:y
- E1 is an undirected edge, x and y bind to the end points of E1, and e1 is the alias of E1.
FROM X:x - (E2>:e2) - Y:y
- Right directed edge x binds to the source of E2; y binds to the target of E2.
FROM X:x - (<E3:e3) - Y:y
- Left directed edge; y binds to the source of E3; x binds to the target of E3.
FROM X:x - (_:e) - Y:y
- Any undirected edge between a member of X and a member of Y.
FROM X:x - (_>:e) - Y:y
- Any right directed edge with source in X and target in Y.
FROM X:x - (<_:e) - Y:y
- Any left directed edge with source in Y and target in X.
FROM X:x - ((<_|_):e) - Y:y
- Any left directed or any undirected; "|" means OR, and parentheses enclose the group of edge descriptors; e is the alias for the edge pattern (<_|_).
FROM X:x - ((E1|E2>|<E3):e) - Y:y
- Any one of the three edge patterns.
FROM X:x - () - Y:y
- any edge (directed or undirected)
- Same as (<_|_>|_)

Entering Pattern Match Syntax Mode

To use pattern matching, you need to either set a session parameter or specify it in the query. There are currently two syntax versions for queries:

"v1" is the classic syntax, traversing one hop per SELECT statement. This is the default mode.
"v2" enhances the v1 syntax with pattern matching.

syntax_version Session Parameter

You can use the SET command to assign a value to the syntax_version session parameter: v1 for classic syntax; v2 for pattern matching. If the parameter is never set, the classic v1 syntax is enabled. Once set, the selection remains valid for the duration of the GSQL client session, or until it is changed with another SET command.

Query-Level SYNTAX option

You can also select the syntax by using the SYNTAX clause in the CREATE QUERY statement: v1 for classic syntax (default); v2 for pattern matching. The query-level SYNTAX option overrides the syntax_version session parameter.

Running Anonymous Queries Without Installing

In this tutorial, we will use Interpreted Mode for GSQL, introduced in TigerGraph 2.4. Interpreted mode lets us skip the INSTALL step, and even run a query as soon as we create it, to offer a more interactive experience. These one-step interpreted queries are unnamed (anonymous) and parameterless, just like SQL.

To run an anonymous query, replace the keyword CREATE with INTERPRET. Remember, no parameters:

Recommendation: Increase the query timeout threshold.

Interpreted queries may run slower than installed queries, so we recommend increasing the query timeout threshold:

Examples of 1-Hop Fixed Length Query

Example 1. Find persons who know the person named "Viktor Akhiezer" and return the top 3 oldest such persons.

Syntax Enhancement in TigerGraph 3.0+

In Example 1, "FOR GRAPH ldbc_snb" is not used after () in the query signature. It's an optional component in 3.0+ when "USE GRAPH graphName" is used; Or from the command line, "gsql -g graphName " precedes any query invocation.
In the FROM clause, we directly use vertex type Person as the starting vertex set. This syntax enhancement is available in syntax V2 only.

You can copy the above GSQL script to a file named example1.gsql and invoke this script file in Linux.

Example 2. Find the total number of comments and total number of posts liked by Viktor. A Person can reach Comments or Posts via a directed edge LIKES.

You can copy the above GSQL script to a file named example2.gsql, and invoke this script file in Linux.

Example 3. Solve the same problem as in Example 2, but use a left-directed edge pattern.

Note below (line 8) that the source vertex set are now Comment and Post, and the target is Person.

You can copy the above GSQL script to a file named example3.gsql, and invoke this script file in linux command line. The output should be the same as in Example 2.

Example 4. Find Viktor Akhiezer's total number of related comments and total number of related posts. That is, a comment or post is either created by Viktor or is liked by Viktor. Note that the HAS_CREATOR edge type starts from Comment|Post, and the LIKES edge type starts from Person.

You can copy the above GSQL script to a file named example4.gsql, and invoke this script file in Linux:

Example 5. Find the total number of comments or posts related to "Viktor Akhiezer". This time, we count them together and, we use wildcard "_" to represent the two types of edges: HAS_CREATOR and LIKES_REVERSE. Both are following the same direction.

You can copy the above GSQL script to a file named example5.gsql, and invoke this script file in Linux:

Repeating a 1-Hop Pattern

A common pattern is the two-step "Friend of a Friend". Or, how many entities might receive a message if it is passed up to three times? Do you have any known change of connections to a celebrity?

GSQL pattern matching makes it easy to express such variable-length patterns which repeat a single-hop. Everything else stays the same as introduced in the previous section, except we append an asterisk (or Kleene star for you regular expressionists) and an optional min..max range to an edge pattern.

(E*) means edge type E repeats any number of times (including zero!)
(E*1..3) means edge type E occurs one to three times.

Below are more illustrative examples:

1-hop star pattern — repetition of an edge pattern 0 or more times
1. FROM X:x - (E*) - Y:y
2. FROM X:x - (F>*) - Y:y
3. FROM X:x - (<G*) - Y:y
4. FROM X:x - (_*) - Y:y
  - Any undirected edge can be chosen at each repetition.
5. FROM X:x - (_>*) - Y:y
  - Any right-directed edge can be chosen at each repetition.
6. FROM X:x - (<_*) - Y:y
  - Any left-directed edge can be chosen at each repetition.
7. FROM X:x - ((E|F>|<G)*) - Y:y
  - Either E, F> or <G can be chosen at each repetition.
1-hop star pattern with bounds
1. FROM X:x - (E*2..) - Y:y
  - Lower bounds only. There is a chain of at least 2 E edges.
2. FROM X:x - (F>*..3) - Y:y
  - Upper bounds only. There is a chain of between 0 and 3 F edges.
3. FROM X:x - (<G*3..5) - Y:y
  - Both Lower and Upper bounds. There is a chain of 3 to 5 G edges.
4. FROM X:x - ((E|F>|<G)*3) - Y:y
  - Exact bound. There is a chain of exactly 3 edges, where each edge is either E, F>, or <G.

Remarks

No alias allowed for edge with Kleene star An edge alias may not be used when a Kleene star is used. The reason is that when there are a variable number of edges, we cannot associate or bind the alias to a specific edge in the pattern.
Shortest path semantics When an edge is repeated with a Kleene star, only the shortest matching occurrences are selected. See the example below:

In Figure 2, or Pattern 1 - (E>*) - 4, any of the following paths reach 4 from 1.

1->2->3->4
1->2->3->5->6->2->3->4
any path that goes through the cycle 2->3->5->6->2 two or more times and jumps out at 3.

The first path is shorter than the rest; it is considered the only match.

Examples of Variable Hop Queries

In this tutorial, we will use the Interpreted Mode for GSQL, introduced in TigerGraph 2.4. Interpreted mode lets us skip the INSTALL step, and even run a query as soon as we create it, to offer a more interactive experience. These one-step interpreted queries are unnamed (anonymous) and parameterless, just like SQL.

Example 1. Find the direct or indirect superclass (including the self class) of the TagClass whose name is "TennisPlayer".

You can copy the above GSQL script to a file named example1.gsql, and invoke this script file in a Linux shell.

Note below that the starting vertex s, whose name is TennisPlayer, is also a match, using a path with zero hops.

Example 2. Find the immediate superclass of the TagClass whose name is "TennisPlayer". (This is equivalent to a 1-hop non-repeating pattern.)

You can copy the above GSQL script to a file named example2.gsql, and invoke this script file in a Linux shell.

Example 3. Find the 1 to 2 hops direct and indirect superclasses of the TagClass whose name is "TennisPlayer".

You can copy the above GSQL script to a file named example3.gsql, and invoke this script file in a Linux shell.

Example 4. Find the superclasses within 2 hops of the TagClass whose name is "TennisPlayer".

You can copy the above GSQL script to a file named example4.gsql, and invoke this script file in a Linux shell.

Example 5. Find the superclasses at least one hop from the TagClass whose name is "TennisPlayer".

You can copy the above GSQL script to a file named example5.gsql, and invoke this script file in a Linux shell.

Example 6. Find the 3 most recent comments that are liked or created by Viktor Akhiezer, and the total number of comments related to (created or liked by) Viktor Akhiezer.

You can copy the above GSQL script to a file named example6.gsql, and invoke this script file in a Linux shell.

Example - A Recommender

We have demonstrated the basic pattern match syntax. You should have mastered the basics by this point. In this section, we show two end-to-end solutions using the pattern match syntax.

A Recommendation Application

In this example, we want to recommend some messages (comments or posts) to the person Viktor Akhiezer.

How do we do this?

One way is to find Others who likes the same messages Viktor likes. And then recommend the messages that Others like but Viktor have not seen. The pattern is roughly like below

Viktor - (Likes>) - Message - (<Likes) - Others
Others - (Likes>) - NewMessage
Recommend NewMessage to Viktor

However, this is too fine granularity, and we are overfitting the message level data with a collaborative filtering algorithm. The intuition is that two persons are similar to each other when their "liked" messages fall into the same category (tag). This makes more sense and common than finding two persons that "likes" the same set of messages. As a result, one way to avoid this overfitting is to go one level above. That is, instead of finding common messages as a similarity base, we find common messages' tags as a similarity base. Person A and Person B are similar if they like messages that belong to the same tag. This scheme fixes the overfitting problem. In pattern match vocabulary, we have

Viktor - (Likes>) - Message - (Has>) - Tag - (<Has) - Message - (<Likes) - Others
Others - (Likes>) - NewMessage
Recommend NewMessage to Viktor

GSQL. RecommendMessage Application.

This time, we create the query first, and interpret the query by calling the query name with parameters. If we are satisfied with this query, we can use "install query queryName" to get the compiled query installed which has the best performance.

You can copy the above GSQL script to a file named app1.gsql, and invoke this script file in linux command line.

Install the query

When you are satisfied with your query in the GSQL interpret mode, you can now install it as a generic service which has a much faster speed. Since we have been using "CREATE QUERY .." syntax, the query is added into the catalog, we can set the syntax version and install it.

The above use log-cosine as a similarity measurement. We can also use cosine similarity by using two persons liked messages.

Advanced Features

This section includes some advanced features related to pattern match. It includes using the PER clause to fine control the ACCUM execution, DML support in pattern match, and the conjunctive pattern match syntax which allows multiple patterns in one FROM clause. We dedicate a subsection for each topic.

Per Clause (Beta)

Introduction

Pattern matching produces a virtual match table, and the ACCUM clause acts like a FOREACH loop, executing the clause's statement once for each row of the match table.

Patterns are paths in the graphs, and each row in the match table is a distinct path. However, paths may share some vertices or edges. Some applications do not want to do aggregations per path. Instead, they want to execute the ACCUM clause per distinct group of vertex aliases.

For example, consider the following query which counts the number of paths in a simple 2-hop pattern:

SumAccum<int> @@cnt; 

S = SELECT t 
    FROM S:s - (E1:edge1) - M:m -(E2:edge2) - T:t
    ACCUM @@cnt += 1;

Suppose the query produces the following match table.

(s, edge1, m , edge2, t)//match table schema
v1, e1, v3, e2, v2 //match 1 
v1, e3, v4, e4, v2 //match 2 
v5, e5, v6, e6, v7 //match 3
v8, e7, v9, e8, v7 //match 4

By default, the ACCUM clause will execute the @@cnt += 1statement 4 times, for each row in the match table. The result will be @@cnt = 4.

For the same query, what if the user wants to

count the number of distinct path endings in the match table? For this case, we would want to iterate on the alias t.
count the number of distinct (start, end) pairs in the match table? For that case, we would want to iterate on distinct pairs of the aliases (s, t).

To provide users with this added flexibility and finer control over ACCUM iteration, TigerGraph 3.0 adds the PER clause to pattern matching (V2) syntax.

Syntax

The PER Clause is an optional clause that comes at the start of the ACCUM clause in a SELECT statement. As illustrated below, it starts with the keyword PER, and followed by a pair of parenthesis, in which user can put one or more distinct vertex aliases found in the FROM pattern.

selectBlock := SELECT alias 
               FROM pattern
               [sampleClause]
               [whereClause]
               [[perClause] accumClause]
               [postAccumClause]*
               [havingClause]
               [orderClause]
               [limitClause]
                    
perClause := PER (vertex_alias_1, vertex_alias_2, ...)

Examples. Below are multiple examples of the PER Clause using the same FROM clause.

S1 = SELECT s 
     FROM S:s - (E1:edge1) - M:m - (E2:edge2) - T:t
     PER (s) 
     ACCUM @@cnt += 1;
    
S2 = SELECT t 
     FROM S:s - (E1:edge1) - M:m - (E2:edge2) - T:t
     PER (t)
     ACCUM @@cnt += 1;

S3 = SELECT m 
     FROM S:s - (E1:edge1) - M:m - (E2:edge2) - T:t
     PER (m)
     ACCUM @@cnt += 1;
    
S4 = SELECT t 
     FROM S:s - (E1:edge1) - M:m - (E2:edge2) - T:t
     PER (s, t)
     ACCUM @@cnt += 1;
    
S5 = SELECT t 
     FROM S:s - (E1:edge1) - M:m - (E2:edge2) - T:t
     PER (s, m, t)
     ACCUM @@cnt += 1;

Semantics

The PER Clause specifies a list of vertex aliases, which are used to group the rows in the match table, one group per distinct value of the alias or of the alias list. If there are N distinct groups, we will execute the ACCUM clause N times, once per distinct vertex aliases' binding. Note that the PER clause has no effect on POST-ACCUM clauses semantic, except confining the POST-ACCUM vertex alias.

Suppose s, m, and t are vertex aliases in a pattern. Below are some interpretations of the PER Clause based on the graph element bindings found in the match table.

PER (s) ACCUM means that per each distinct s vertex, execute the ACCUM clause once.
PER (s,t) ACCUM means that per each distinct (s, t) pair, execute the ACCUM clause once.
PER (s,m,t) ACCUM means that per each distinct (s, m, t) tuple, execute the ACCUM clause once.

Examples to show PER clause semantics.

//match table
(s, edge1, m , edge2, t)//schema
v1, e1, v3, e2, v2 //match 1 
v1, e3, v4, e4, v2 //match 2 
v5, e5, v6, e6, v7 //match 3
v8, e7, v9, e8, v7 //match 4

//since we have v1, v5, and v8 three distinct vertices bind to s, 
//we execute ACCUM clause 3 times.  
S1 = SELECT s 
     FROM S:s - (E1:edge1) - M:m -(E2:edge2) - T:t
     PER (s) 
     ACCUM @@cnt += 1;
  
//since we have v2, v7 two distinct vertices bind to t, 
//we execute ACCUM clause twice.     
S2 = SELECT t 
     FROM S:s - (E1:edge1) - M:m -(E2:edge2) - T:t
     PER (t)
     ACCUM @@cnt += 1;
     
//since we have v3, v4, v6, v9 four distinct vertices bind to m, 
//we execute ACCUM clause 4 times. 
S3 = SELECT m
     FROM S:s - (E1:edge1) - M:m -(E2:edge2) - T:t
     PER (m)
     ACCUM @@cnt += 1;
     
//since we have (v1, v2), (v5, v7) and (v8, v7) three distinct vertex pairs 
//bind to (s,t), we execute ACCUM clause 3 times.     
S4 = SELECT t 
     FROM S:s - (E1:edge1) - M:m -(E2:edge2) - T:t
     PER (s, t)
     ACCUM @@cnt += 1;

    
//since we have (v1, v3, v2), (v1, v4, v2), (v5, v6, v7) and (v8, v9, v7) four
//distinct vertex groups bind to (s,m,t), we execute ACCUM clause 4 times.      
S5 = SELECT t 
     FROM S:s - (E1:edge1) - M:m -(E2:edge2) - T:t
     PER (s, m, t)
     ACCUM @@cnt += 1;

If the PER Clause is used in a SELECT query block, then the vertex aliases used in the SELECT, ACCUM , and POST-ACCUM clauses must be confined to the aliases that appear in the PER clause.

Below are some illegal cases.

//semantic error. SELECT t, but t doesn't appear in PER clause. 
S1 = SELECT t 
     FROM S:s - (E1:edge1) - M:m -(E2:edge2) - T:t
     PER (s, m)
     ACCUM @@cnt += 1;
     
//semantic error. ACCUM t.@cnt, but t doesn't appear in PER clause.
S2 = SELECT t 
     FROM S:s - (E1:edge1) - M:m -(E2:edge2) - T:t
     PER (s, m)
     ACCUM t.@cnt += 1;
     
//semantic error. POST-ACCUM t.@cnt, but t doesn't appear in PER clause.
S3 = SELECT s
     FROM S:s - (E1:edge1) - M:m -(E2:edge2) - T:t
     PER (s)
     ACCUM s.@cnt += 1
     POST-ACCUM t.@cnt =1;

PER Clause Examples

Example 1. Count the number of Countries that has a City which has a resident that likes a post.

//Example 1. 
USE GRAPH ldbc_snb

INTERPRET QUERY () SYNTAX v2 {
  SumAccum<int> @@cnt;

 R =   SELECT c
       FROM   Country:c -(<IS_PART_OF.<IS_LOCATED_IN.LIKES>)- Post:p
       PER    (c)
       ACCUM  @@cnt +=1;

 PRINT @@cnt;
}

//results
Using graph 'ldbc_snb'
The query AA is dropped.
{
  "error": false,
  "message": "",
  "version": {
    "schema": 0,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"@@cnt": 111}]
}

Example 2. Count the number of posts liked by a person who is located in a city that belongs to a country. (All cities are in a country, but humor us. We are reusing the same FROM pattern in several examples.)

//Example 2. 
USE GRAPH ldbc_snb

INTERPRET QUERY () SYNTAX v2 {
  SumAccum<int> @@cnt;

 R =   SELECT p
       FROM   Country:c -(<IS_PART_OF.<IS_LOCATED_IN.LIKES>)- Post:p
       PER    (p)
       ACCUM  @@cnt +=1;

 PRINT @@cnt;

//result
Using graph 'ldbc_snb'
{
  "error": false,
  "message": "",
  "version": {
    "schema": 0,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"@@cnt": 70668}]
}

Example 3. Find for each country in ("Dominican_Republic","Angola", "Cambodia") the number of posts that is liked by a person living in that country.

//Exmaple 3
USE GRAPH ldbc_snb

INTERPRET QUERY () SYNTAX v2{

 MapAccum<string, SumAccum<int>> @@postPerCountry;

 R =   SELECT p
       FROM   Country:c -(<IS_PART_OF.<IS_LOCATED_IN.LIKES>)- Post:p
       WHERE  c.name in  ("Dominican_Republic","Angola", "Cambodia")
       PER    (c, p)
       ACCUM  @@postPerCountry += (c.name -> 1);

 PRINT @@postPerCountry;
}

//results
Using graph 'ldbc_snb'
{
  "error": false,
  "message": "",
  "version": {
    "schema": 0,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"@@postPerCountry": {
    "Dominican_Republic": 395,
    "Angola": 12,
    "Cambodia": 4002
  }}]
}

Example 4. Find for each country in ("Dominican_Republic","Angola", "Cambodia") the number of posts that is liked by a person living in that country. Use local accumulators this time.

USE GRAPH ldbc_snb

INTERPRET QUERY () SYNTAX v2{

 SumAccum<int> @postCnt;

 R =   SELECT c
       FROM   Country:c -(<IS_PART_OF.<IS_LOCATED_IN.LIKES>)- Post:p
       WHERE  c.name in  ("Dominican_Republic","Angola", "Cambodia")
       PER    (c, p) //per (country, post), add 1 to c.@postCnt
       ACCUM  c.@postCnt += 1;

 PRINT R;
}

//results
Using graph 'ldbc_snb'
{
  "error": false,
  "message": "",
  "version": {
    "schema": 0,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"R": [
    {
      "v_id": "2",
      "attributes": {
        "@postCnt": 12,
        "name": "Angola",
        "id": 2,
        "url": "http://dbpedia.org/resource/Angola"
      },
      "v_type": "Country"
    },
    {
      "v_id": "67",
      "attributes": {
        "@postCnt": 4002,
        "name": "Cambodia",
        "id": 67,
        "url": "http://dbpedia.org/resource/Cambodia"
      },
      "v_type": "Country"
    },
    {
      "v_id": "11",
      "attributes": {
        "@postCnt": 395,
        "name": "Dominican_Republic",
        "id": 11,
        "url": "http://dbpedia.org/resource/Dominican_Republic"
      },
      "v_type": "Country"
    }
  ]}]
}

Performance and Best Practices

The PER Clause not only helps users to control the semantics of the ACCUM clause, it also boosts the performance of the pattern match query, as it uses the PER clause to optimize the query execution.

To get the best performance, we recommend three guidelines for writing efficient queries.

Use PER (target) If Possible

Per target is in general faster than Per source. In the example below, query q2 is faster than q1. The only difference between these two queries is q2's FROM pattern is the flip of q1's FROM pattern.

USE GRAPH ldbc_snb

# not recommended, since it does per (src).
CREATE QUERY q1 () SYNTAX v2 {

  SumAccum<int> @@cnt ;

  T = SELECT c
      FROM Comment:c - (<LIKES) - Person:ps - (IS_LOCATED_IN>) - City:city
      WHERE year(c.creationDate) >= 2006
      PER (c)
      ACCUM @@cnt += 1;

  PRINT @@cnt;
}

# recommended, since it does per (tgt)
CREATE QUERY q2 () SYNTAX v2 {

  SumAccum<int> @@cnt ;

  T = SELECT c
      FROM City:city - (<IS_LOCATED_IN) - Person:ps - (LIKES>) - Comment:c
      WHERE year(c.creationDate) >= 2006
      PER (c)
      ACCUM @@cnt += 1;

  PRINT @@cnt;
}

Write Patterns with Smallest Expected Vertex Set on the Left

The match table is built by traversing the pattern from left to right. Follow the basic principle of pruning early rather than late by orienting the query the smaller cardinality set on the left. This practice will result in producing the least number of candidate matches during the query computation. For example, if there are fewer distinct tags than persons, then query q4 is faster than q3.

USE GRAPH ldbc_snb

# not recommended, since the pattern starts from a large cardinality vertex type
# (Person), and ends at a small cardinality vertex type (Tag). 
CREATE QUERY q3 () SYNTAX v2 {

  SumAccum<int> @cnt;

  V = SELECT s
      FROM Person:s- (LIKES>)-Post:p - (<CONTAINER_OF)-:f - (HAS_TAG>) - :t
      PER (s)
      ACCUM s.@cnt += 1;

  PRINT V.size();
}

# recommended, start from small cardinality end (Tag), and use per tgt
CREATE QUERY q4 () SYNTAX v2 {

  SumAccum<int> @cnt;

  V = SELECT s
      FROM Tag:t-(<HAS_TAG)-Forum:f -(CONTAINER_OF>)-Post:p  - (<LIKES)- Person:s
      PER (s)
      ACCUM s.@cnt += 1;

  PRINT V.size();
}

Specify Complete Type Information

Specifying complete type information improves performance. For example, query q6 is faster than q5 even though they are known to be logically identical. Forum is the CONTAINER_OF Post, so it does not need to be specified in q5, but explicitly saying Forum in q6 speeds up performance.

USE GRAPH ldbc_snb

#we do not put Forum befoe :f, even if we know it.
CREATE QUERY q5 () SYNTAX v2 {

  SumAccum<int> @@person_cnt;

  V = SELECT s
      FROM Person:s- (LIKES>)-Post:p - (<CONTAINER_OF)-:f - (HAS_TAG>) - :t
      PER (s)
      ACCUM @@person_cnt += 1;

  PRINT @@person_cnt;
}


#recommended: we put Forum as the type info. 
CREATE QUERY q6 () SYNTAX v2 {

  SumAccum<int> @@person_cnt;

  V = SELECT s
      FROM Person:s- (LIKES>)-Post:p - (<CONTAINER_OF)-Forum:f - (HAS_TAG>) - :t
      PER (s)
      ACCUM @@person_cnt += 1;

  PRINT @@person_cnt;
}

LDBC Benchmark Queries

Using the PER clause and linear regular path pattern, we have translated all of the LDBC-SNB queries. You can find them on github at https://github.com/tigergraph/ecosys/tree/ldbc/ldbc_benchmark/tigergraph/queries_linear/queries. Most of the queries are installed as functions. You can find sample parameter(s) of the functions from https://github.com/tigergraph/ecosys/tree/ldbc/ldbc_benchmark/tigergraph/queries/seeds.

Data Modification

Pattern Matching GSQL supports Insert, Update, and Delete operations. The syntax is identical to that in classic GSQL (v1), though the full range of data modification operations are not yet support.

In general, data modification can be at two levels in GSQL:

Top level. The statement does not need to within any other statement.
Within a SELECT query statement. The FROM-WHERE clauses define a match table, and the data modification is performed based on the vertex and edge information in the match table. The GSQL specifications calls these within-SELECT statements DML-sub statements.

Insert, Update, and Delete currently work in compiled mode only (e.g., you must run INSTALL QUERY before RUN QUERY.) Data Modification in interpreted mode is not yet available.
SELECT queries with data modification may only have one POST-ACCUM clause.

Insert vertices and edges

Pattern matching Insert is supported at both the top-level and within-SELECT levels, using the same syntax as in classic GSQL. You can insert vertices and edges.

Example 1. Create a Person vertex, whose name is Tiger Woods. Next, find Viktor's favorite 2012 posts' authors, whose last name is prefixed with S. Finally, insert KNOWS edges connecting Tiger Woods with Viktor's favorite authors.

You can verify the result by running a simple built-in REST endpoint.

Check the inserted vertex.

Check the inserted edges.

Update data

To update vertex attributes, use assignment statements in a POST-ACCUM clause. To update edge attributes, use assignment statements in an ACCUM clause. In addition, data updates can only be performed if the FROM statement only contains a single-hop and fix-length pattern.

Query-body level UPDATE statements are not yet supported in syntax v2.

Example 2. For all KNOWS edges that connect Viktor Akhiezer and his friends whose lastName begins with "S", update the edge creationDate to "2020-10-01". Also, for the Person vertex (Tiger Woods) update the vertex's creationDate and language he speaks.

To verify the update, we can use REST calls.

Check Tiger Woods' creationDate and language he speaks.

Check KNOWS edges whose source is tiger woods.

Delete vertices and edges

You can use delete () function to delete edges and vertices in ACCUM and POST-ACCUM clauses.

Top-levels DELETE statements are not yet supported in SYNTAX v2.
Edges can only be deleted in the ACCUM clause.
For best performance, vertices should be deleted in the POST-ACCUM clause.
To perform within-SELECT deletes, the FROM pattern can only be a single hop, fixed length pattern.

Example 3. Delete vertex Tiger Woods and its KNOWS edges.

To verify the result, you can use built-in REST calls.

Conjunctive Pattern Matching (Beta)

What is Conjunctive Pattern Matching?

So far, we have described pattern matching as one path pattern in a FROM clause. In this section, we introduce GSQL's capability to match multiple patterns in one FROM clause. This extension is called Conjunctive Pattern Matching (CPM), because the query asks for the conjunction (logical AND) of the patterns. To get a match, all of the patterns must be satisfied, and the patterns can interrelate. Visually, you can think of patterns formed by a set of intersecting line segments. This feature, introduced as a Beta feature in TigerGraph 3.0, enables you to express complex patterns concisely in a single query block.

In general, a CPM query block consists of multiple patterns in the FROM clause. It has a structure illustrated below.

# Conjunctive Pattern Matching Syntax

SelectBlock := SELECT alias 
               FROM pattern
               [sampleClause]
               [whereClause]
               [ [perClause] accumClause]
               [postAccumClause]*
                    ...
                    
pattern := vertexPattern | edgePattern | (pathPattern ["," pathPattern])
# vertexPattern and edgePattern are from classic GSQL

We elaborate on each of the clause.

SELECT Clause

The SELECT clause selects only one vertex alias from all the patterns in the FROM clause.

FROM Clause - Conjunctive Matching

This is where the conjunctive matching is expressed. The FROM clause consists of a list of path patterns, which are separated by commas. Evaluating each pattern against the underlying graph data produces a match table. If two patterns share a vertex alias, then we form the natural join of the two match tables.

For example, consider this CPM:

FROM X:x - (E1:e1) - Y:y - (E2>:e2) - Z:z,
     Z:z - (E3:e3) - U:u - (E4>:e4) - V:v

The first pattern's variables are x, e1, y, e2, and z; the second pattern's variables are z, e3, u, e4, and v. Considering the two patterns independently would yield the follwing match table schemas:

#first match table
(x, e1, y, e2, z)
#second match table
(z, e3, u, e4, v)

Natural Join of Match Tables

Natural joining two match tables compares all the shared vertex aliases between the two tables, and the resulting joined table contains all non-shared variables plus one copy of each of the shared vertex variables. Here is the match table for the CPM above:

#natural join result; the shared vertex variable z appears once.
(x, e1, y, e2, z, e3, u, e4, v)

Valid Conjunctive Patterns

The match table of the conjunctive pattern match is the natural join of all the patterns' match tables. By design, a row in the CPM match table must simultaneously satisfy all the match tables.

If the match tables of the patterns in a FROM clause can be naturally joined into one match table, then the FROM clause has a valid CPM input. Otherwise, the FROM clause has an invalid pattern input list.

For example, below we show two valid CPM inputs and one invalid CPM input.

# a valid CPM, since the two patterns natrually join on :tgt
SELECT
FROM Person:p - (KNOWS) - :tgt, 
     Post:s -(<LIKES) - :tgt 

# a valid CPM, since the two patterns naturally join on :f
SELECT
FROM Person:p - (KNOWS) - :f - (LIKES>) - Post:tgt, 
     :f - (LIKES>) - Comment:c 

# an invalid CPM, since the two patterns do not share any vertex variables.
# they cannot be naturally joined. 
# One pattern has (p, tgt); the other has (s, t). 
SELECT
FROM Person:p - (KNOWS) - :tgt, 
     Post:s - (<LIKES) - Person:t

WHERE Clause

The predicates in the WHERE clause can use any of the vertex or edge aliases in any of the patterns, including predicates which combine variables from different constituent paths. CPM queries do not have any special restrictions on the WHERE predicate. Distance matters, however, for performance. Conditions that are local, measured both cross-path and within-path, can be resolved earlier and therefore are faster.

In the example below, x2.age > x4.age is a cross-pattern predicate, e1.timestamp < e3.timestamp is a cross-pattern predicate, and x1.gender == x4.gender is a local predicate of the second pattern.

FROM X1:x1-(E1:e1)-X2:x2-(E2:e2)-X3:x3,
     X1:x1-(E3:e3)-X4:x4
WHERE x2.age > x4.age AND e1.timestamp < e3.timestamp AND x1.gender == x4.gender

ACCUM Clause

You can ACCUM to any vertex variable in a CPM block.

The ACCUM clause by default will execute as many times as the row (match) count of the CPM match table; each execution uses one row from the match table.

ACCUM To The Three Vertex Variables of A CPM Pattern

#accum to x1, x2 and x4. 
FROM X1:x1-(E1:e1)-X2:x2-(E2:e2)-X3:x3, 
     X1:x1-(E3:e3)-X4:x4
ACCUM x1.@cnt +=1, x2.@cnt += x3.quantity, x4.@cnt += x3.quantity

POST-ACCUM Clause

POST-ACCUM for CPM behaves the same as POST-ACCUM for single path patterns. That is, each POST-ACCUM clause can refer to one vertex alias, and the clause executes iteratively over that vertex set (e.g. one vertex column in the matching table). Its statements can access the aggregated accumulator result computed in the ACCUM clause. The query can have multiple POST-ACCUM clauses, one for each vertex alias you wish to work on. The multiple POST-ACCUM clauses are processed in parallel; it doesn't matter in what order you write them. (For each binding, the statements within a clause are executed in order.)

For example, below we have three POST-ACCUM clauses. The first one iterates throughx1, and for each x1, we do @@cnt += x1.@cnt. The second and third POST-ACCUMs iterate through x2 and x3 respectively, and accumulates their @cnt accumulator value into @@cnt.

POST-ACCUM to a global accumulator @@cnt, using three CPM Vertex Variables


FROM X1:x1-(E1:e1)-X2:x2-(E2:e2)-X3:x3, 
     X1:x1-(E3:e3)-X4:x4
ACCUM x1.@cnt +=1, x2.@cnt += x3.quantity, x4.@cnt += x3.quantity
POST-ACCUM @@cnt += x1.@cnt
POST-ACCUM @@cnt += x2.@cnt
POST-ACCUm @@cnt += x3.@cnt;

Examples

Example 1. Find Viktor Akhiezer's liked messages (100+ days after their creation) whose author's last name begin with letter S. Output the message's forum.

USE GRAPH ldbc_snb

INTERPRET QUERY () SYNTAX v2 {

  SumAccum<int> @@cnt;

  F  =  SELECT f
        FROM Person:s - (LIKES>:e1) - :msg - (HAS_CREATOR>) - Person:t, 
             Forum:f - (CONTAINER_OF>:e2) - :msg
        WHERE s.firstName == "Viktor" AND s.lastName == "Akhiezer" 
              AND t.lastName LIKE "S%" 
              AND e1.creationDate >DATETIME_ADD(msg.creationDate, INTERVAL 100 DAY);

  PRINT F;
}

#result
{
  "error": false,
  "message": "",
  "version": {
    "schema": 0,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"F": [{
    "v_id": "962072688797",
    "attributes": {
      "id": 962072688797,
      "title": "Album 12 of Mario Santos",
      "creationDate": "2011-04-12 09:36:50"
    },
    "v_type": "Forum"
  }]}]
}

Example 2. Find any authors who wrote posts that Viktor Akhiezer's liked and whose last name begins with S. Find the country for each of these authors and report on the countries.

USE GRAPH ldbc_snb

INTERPRET QUERY () SYNTAX v2 {

  SumAccum<int> @@cnt;

  C  =  SELECT ctry
        FROM Person:s - (LIKES>:e1) - Post:msg - (HAS_CREATOR>) - Person:t,
             :t - (WORK_AT>:e2) - Company:c, 
             :c - (IS_LOCATED_IN>) - Country:ctry
        WHERE s.firstName == "Viktor" AND s.lastName == "Akhiezer"
              AND t.lastName LIKE "S%" ;

  PRINT C;
}

#result
{
  "error": false,
  "message": "",
  "version": {
    "schema": 0,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"C": [{
    "v_id": "93",
    "attributes": {
      "name": "Portugal",
      "id": 93,
      "url": "http://dbpedia.org/resource/Portugal"
    },
    "v_type": "Country"
  }]}]
}

Example 3. Given a TagClass and a Country, find all the Forums created in the given Country, containing at least one Post with Tags belonging directly to the given TagClass. The location of a Forum is identified by the location of the Forum’s moderator.

USE GRAPH ldbc_snb

DROP QUERY bi_4

CREATE QUERY bi_4(string tcName, string cName) for graph ldbc_snb syntax v2 {
  SetAccum<vertex<Post>> @postSet;
  SumAccum<int> @personId, @postCount;

  ForumSet =
    SELECT f
    FROM Forum:f -(HAS_MODERATOR>)- Person:a -(IS_LOCATED_IN>.IS_PART_OF>)- Country:c,
         :f -(CONTAINER_OF>)- Post:p -(HAS_TAG>.HAS_TYPE>)- TagClass:tc
    WHERE c.name == cName and tc.name == tcName
    ACCUM f.@personId = a.id, f.@postSet += p
    POST-ACCUM f.@postCount = f.@postSet.size(), f.@postSet.clear()
    ORDER BY f.@postCount DESC, f.id ASC
    LIMIT 3;

  PRINT ForumSet[ForumSet.id, ForumSet.title, ForumSet.creationDate,
                 ForumSet.@personId, ForumSet.@postCount];
}

INSTALL QUERY bi_4

RUN QUERY bi_4("MusicalArtist", "Burma")

#result
{
  "error": false,
  "message": "",
  "version": {
    "schema": 0,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"ForumSet": [
    {
      "v_id": "81903",
      "attributes": {
        "ForumSet.title": "Wall of Donald Steele-Perkins",
        "ForumSet.@personId": 5226,
        "ForumSet.id": 81903,
        "ForumSet.@postCount": 65,
        "ForumSet.creationDate": "2010-02-15 06:48:04"
      },
      "v_type": "Forum"
    },
    {
      "v_id": "137438953686",
      "attributes": {
        "ForumSet.title": "Wall of Eric Law-Yone",
        "ForumSet.@personId": 2199023262994,
        "ForumSet.id": 137438953686,
        "ForumSet.@postCount": 65,
        "ForumSet.creationDate": "2010-04-25 22:10:32"
      },
      "v_type": "Forum"
    },
    {
      "v_id": "687194810508",
      "attributes": {
        "ForumSet.title": "Wall of Hector Hugh Michie",
        "ForumSet.@personId": 10995116283784,
        "ForumSet.id": 687194810508,
        "ForumSet.@postCount": 39,
        "ForumSet.creationDate": "2010-12-19 15:33:30"
      },
      "v_type": "Forum"
    }
  ]}]
}

Example 4. For a given country, count all the distinct triples of Persons such that:

a is a friend of b.
b is a friend of c
c is a friend of a.

Distinct means that if a certain 3 vertices appear once in the results, it will not be repeated: it will appear only once. KNOWS is an undirected relationship, so it doesn't matter in what order we list the 3 vertices.


USE GRAPH ldbc_snb

CREATE QUERY bi_17(string cName) FOR GRAPH ldbc_snb SYNTAX v2 {
  TYPEDEF TUPLE <uint a, uint b, uint c> triplet;
  SetAccum<triplet> @@tripletSet;
  SumAccum<int> @@tripletCount;

  C =
    SELECT c
    FROM Country:c -(<IS_PART_OF.<IS_LOCATED_IN)- Person:p1,
         :c -(<IS_PART_OF.<IS_LOCATED_IN)- Person:p2,
         :c -(<IS_PART_OF.<IS_LOCATED_IN)- Person:p3,
         :p1 -(KNOWS)- :p2 -(KNOWS)- :p3 -(KNOWS)- :p1
    WHERE c.name == cName AND p1.id < p2.id AND p2.id < p3.id
    ACCUM @@tripletSet += triplet(p1.id, p2.id, p3.id);

  @@tripletCount = @@tripletSet.size();
  @@tripletSet.clear();
  PRINT @@tripletCount;
}


INSTALL QUERY bi_17

RUN QUERY bi_17("Spain")

#result
{
  "error": false,
  "message": "",
  "version": {
    "schema": 0,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"@@tripletCount": 242}]
}

More Examples. We translated LDBC-SNB BI and IC queries using CPM, and shared the translation in github. Please refer to the query translation here. Most of the queries are installed as functions, you can find sample parameter(s) of the functions from here.

Source Vertex Set Flexibility

As mentioned when we first described pattern matching, in One-hop patterns, the source (leftmost) vertex set can be a vertex type, an alternation of types, or even omitted.

Example 1. Find Viktor Akhiezer's favorite messages' creators whose last name begins with letter S. Count them.


USE GRAPH ldbc_snb

#start from a vertex type "Person"
INTERPRET QUERY () SYNTAX v2 {

  SumAccum<int> @@cnt;

  F  =  SELECT t
        FROM Person:s -(LIKES>:e1)- :msg -(HAS_CREATOR>)- Person:t
        WHERE s.firstName == "Viktor" AND s.lastName == "Akhiezer" 
              AND t.lastName LIKE "S%"
        POST-ACCUM @@cnt+=1;

  PRINT  @@cnt;

}
#result
{
  "error": false,
  "message": "",
  "version": {
    "schema": 0,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"@@cnt": 8}]
}

Example 2. Same query as example 1, but without beginning with vertex types. GSQL compiler can infer the types of :s.

USE GRAPH ldbc_snb

#both end points of the pattern do not have vertex types.
INTERPRET QUERY () SYNTAX v2 {

  SumAccum<int> @@cnt;

  F  =  SELECT t
        FROM :s -(LIKES>:e1)- :msg -(HAS_CREATOR>)- :t
        WHERE s.firstName == "Viktor" AND s.lastName == "Akhiezer" AND t.lastName LIKE "S%"
        POST-ACCUM @@cnt+=1;

  PRINT  @@cnt;

}
#result
{
  "error": false,
  "message": "",
  "version": {
    "schema": 0,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"@@cnt": 8}]
}

Example 3. Count the LIKES edge.

USE GRAPH ldbc_snb

# a pattern starts without any information.
INTERPRET QUERY () SYNTAX v2 {

  SumAccum<int> @@cnt;

  F  =  SELECT msg
        FROM  -(LIKES>:e1)- :msg
        ACCUM @@cnt+=1;

  PRINT  @@cnt;

}
#result
{
  "error": false,
  "message": "",
  "version": {
    "schema": 0,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"@@cnt": 2190095}]
}

Summary

We have covered a lot of territory in GSQL 102:

Showed how to invoke GSQL Pattern Matching syntax
Explained how Pattern Matching extends the classic FROM clause grammar:
- Each hop can be a choice of multiple, individually directed edge types
- The Kleene star and a min...max range enable each hop to be repeated.
- GSQL automatically finds the shortest paths that satisfy a variable length path.
- A virtual match table has a column for each vertex or edge alias in a multi-hop path, and a row for each graph path that satisfies the pattern.
- The ACCUM clause iterates on each row in the match table.
- A POST-ACCUM clause iterates on one vertex alias; a query can have multiple POST-ACCUM clauses.
Described the improvements to Pattern Matching in TigerGraph 3.0:
- The source (leftmost) vertex set can be specified with the same flexibility as the other vertex sets: a vertex type, an alteration of types, or omitted. Explicit seed sets are no longer needed
- Restrictions on which vertex aliases may be used in the ACCUM clause have been lifted.
- Described three major advanced options:
  - The PER <vertex alias> clause enables users to fine tune the ACCUM iteration.
  - Data modification (insert, update, delete) are now supported.
  - Conjunctive Pattern Matching let users express a complex pattern as a set of path patterns which must all be satisfied.
Provided best practices for writing queries, especially pattern matching queries:
- Put the smaller vertex set on the left end.
- Specify all vertex and edge types explicitly.
- Use the PER clause to reduce the match table size
Provided numerous examples and the full set of LDBC Social Network benchmark queries.

GSQL Demo Examples

Using the TigerGraph™ platform is as easy as 1-2-3. In this tutorial we will show you how to use the TigerGraph platform and the GSQL language by developing solutions for several use cases, using the following three-step method:

Create a Graph Model for the use case using the GSQL™ language, TigerGraph's high-level graph definition and manipulation language.
Load Initial Data : load and transform data to TigerGraph's graph engine.
Write a Graph-based Solution by writing queries in the GSQL language.

In addition, this guide will also show you how to update your data: load more data, revise your data, or delete selected data.

Each example involves a data set and simple example of a real-life query or task. We develop a graph model, a loading job to load the data, and one or more queries to answer the question at hand. The applications for graph-based queries are limitless. The goal of these examples is to demonstrate the expressive power of GSQL queries, as well as how business intelligence is a natural fit for the graph analytics world.

To start the GSQL Shell: type the command gsqlto exit, type exitor quit to run a command file from within the shell, precede the file name with "@":

You can also run GSQL commands directly from Linux:

For single-line commands, type "gsql" followed by the command enclosed in single-quotation marks:
For command files, just type "gsql" followed by the filename:

The loading jobs have been updated to v2.0 syntax. The output examples have been updated to JSON output API version "v2", which is the default output format for TigerGraph platform version 1.1 or higher.

Common Graph Schema of Demo Examples

The examples in Part 1 of this tutorial have been designed so that all them can be loaded together in one global graph, gsql_demo. This has several benefits:

You can quickly load several demo examples by running just one script.
After they are loaded, you can switch from one example to another with no delay.
The format is modular, so additional examples can be added easily.

If you want to learn how to design your own graph data analyses, we recommend reading and doing Example 1, then Example 2, etc., rather than running the entire batch of examples at once.

Classic Graph Algorithms

Example 8. Single Pair Shortest Path (unweighted)

The shortest path problem is to find the path(s) between two given vertices S and T in a graph such that the path's total edge weight is minimized. If the edge weights are all the same (e.g., weight=1), then the shortest paths are the ones with the fewest edges or steps from S to T. The classic solution to this graph problem is to start the search from one vertex S and walk one step at a time on the graph until it meets the other input vertex T (unidirectional Breadth-First Search). In addition, we present a more sophisticated way to solve this problem on the TigerGraph advanced graph computing platform. Instead of starting the search from one input vertex, our solution will launch the search agents from both input vertices, walking the graph concurrently until they meet. This greatly improves the algorithm performance. To simplify this problem, this article will assume the graph is undirected and unweighted.

1. Graph Schema and Data

The following examples will use the graph that is presented below. Before we show the algorithms, their implementation and examples, we present the graph schema and data used to create the graph. All files in this document are available here:

Graph Schema

First, we give the graph schema. This will create the graph with vertices of type company , persons and skill. It also creates undirected edges that go from person to company, from person to person, from any type to skill, and from any type to company.

Data Set

Data source for company vertices.

Data source for person vertices and skill vertices. The first line,

m1,i1,0,"s2|s3"

means that person m1 has skills s2 and s3.

Data source for person_work_company edges. The first line means that person m1 works for company c1.

Data source for person_person edges.

Data source for all_to_skill edges such as all_to_skill (m1, s2) or all_to_skill (c2, s3). While the schema supports all_to_company edges, this particular data set does not use any..

Loading the Data

To load all of this data into the graph, we can use the following GSQL command file (which also includes the graph schema creation commands).

To run a command file, simply enter gsql name_of_file

2. Unidirectional (BFS) Algorithm

If the edges are unweighted, then the shortest path can be found using the classic Breadth-First Search (BFS) algorithm. Below is an implementation in the GSQL Query Language:

The algorithm works by expanding the search path through all vertices that were seen in the previous step. Each step is taken by one iteration of the WHILE loop. In the first iteration of the WHILE loop, we start at vertex S and travel to all its neighbors. In each of the following iterations, we travel from previously reached vertices to their neighbors that have not already been seen by the path.

To install the query, run the following command:

Example of Unidirectional BFS Search

Let us show a running example of this algorithm. We will be trying to find the shortest path from c1 to c3. First, we have our initial graph, where we have not traveled along any edges yet.

Figure 2: The starting state for our graph. From here, we go on to the first step of the algorithm. We start at c1, and go along each of its edges.

Figure 3: This is the graph after one step. We have traveled from c1 to all of its neighbors, labeling them as visited. For each one that we visit, we update its @pathResult accumulator value in order to keep track of our path as we traverse the graph.

Figure 4: This graph shows where we have traveled after two steps. We traveled to our new vertices s1, s2, s3, c2 and m4 by traveling one edge away from the nodes that we had visited in step 1. Note that the blue edges also tell us how we can get from c1 to a vertex. For example, we notice that e21 is not labeled blue. This means that we did not travel along this edge. That is, we must have gotten to c2 using a different edge. Indeed, we can see that the path c1-m2-c2 is shorter than c1-m3-s3-c2. This explains why e9 is blue, but e21 is not.

Each time that the query travels from a starting vertex (m1, m2, or m3) to a target vertex (s1, s2, s3, c3, or m4), the target vertex's @pathResult ListAccum<string> is updated (Line 22 of the query). A new string is added to the list (the += operator), which means that there is a path string for each time that the target vertex is reached. The path string consists of the path string from the source vertex, followed by this target vertex. That is equivalent to the path from the query's starting vertex (e.g., c1) to the current target vertex.

Figure 5: At the third step of our algorithm, we have reached the nodes m8, m5 and c4. We got here by moving one edge away from the vertices that we reached in step 2.

Figure 6: Finally, we have reached the end of our algorithm. Note that when we travel one edge away from m8, we arrive at our target node of c3. Working backwards, we can reconstruct the shortest path. We reached c3 from m8, m8 from s1, s1 from m3 and m3 from c1. Thus, we get that the shortest path is indeed c1-m3-s1-m8-c3.

To run the query with starting vertex c1, ending vertex c3, and a maximum distance of 10:

This will give the following result.

As we can see, the algorithm tells us that the shortest path from c1 to c3 is going through m3, followed by s1, then m8, then finally arriving at c3. However, this result also tells us that this is the unique shortest path. Indeed, if we instead run:

Our results are:

Note that here we have two paths. The first is from c3 to m6, and then to c4. The other path is from c3, to m7, to c4. We are presented with both paths because each of these consists of the least possible weight: exactly two edges. As explained earlier, this is because we arrive at a vertex at the same time through two different paths. When we started at c3, we traveled to m6, m7 and m8. At the second step, both m6 and m7 arrive at c4 at the exact same time. That means that two path strings will be written to c4.@queryResult, recording two shortest paths.

3. Bi-Directional Shortest Path Search Algorithm

Bi-Directional search will launch two search agents, each from a given vertex. The two agents concurrently walk one step at a time, until they meet at an intermediate vertex. The shortest path length may be odd or even. For example, in Figure 7 below, Case II is an even-length case, and Case III is an odd-length case. Case I is a special case of an odd-length path.

The core of this solution is that in each step, a set of previously unvisited vertices will be discovered by the search frontiers of S and T. The newly visited vertices will become the new frontier of S or T. The algorithm will repeat this process until the frontiers of the two agents meet.

Because this algorithm is more complicated than one directional search, we first give pseudocode to help explain the algorithm.

This algorithm essentially works by running two versions of the algorithm from the first example at the same time, just with different starting vertices. The algorithm continues with these two paths until there is an intersection. Once the two paths cross, we know that the shortest path goes through this intersection, as explained in the previous section.

Below is an implementation in the GSQL Query Language.

Example of Bidirectional BFS Search

The following is a running example to demonstrate the algorithm of finding the shortest path in a bi-directional way. The graph below (Figure 8) shows vertices c1 and c3, with several other vertices between them. The algorithm will demonstrate the two search directions by using two different colors and border thicknesses:

Blue and thin border for c1's search frontier
Orange and thick border for c3's search frontier

Figure 8: Initialization - prepare to start the search process. The two given vertices (c1 and c3) are activated and colored as Blue and Orange respectively. The rest of the graph remains untouched.

Figure 9: The graph after the first step. The search process starts simultaneously from c1 and c3. If a vertex is seen by the agent starting from c1 (c3), we will say it is seen by c1 (c3).

From the vertex c1, the algorithm goes to the neighbors of c1 that have not yet been seen. As a result, the unseen vertices m1, m2 and m3 are discovered and become the frontier of c1's vertex group.
From the vertex c3, in a similar fashion, the vertices m5, m7 and m8 are discovered and become the frontier of c3's vertex group.

Figure 10: As the two groups have not been met yet, the search process continues.

From c1's search agent, the vertices m4, s2, c2, s3 and s1 are all discovered.
From c3's search agent, the vertices c4, m5 and s1 are all discovered.

Notice that both search agents have found the vertex s1. Thus, the algorithm should stop, and return the path going through s1. In this case, this path is c1-m3-s1-m8-c3.

In order to get this result in the TigerGraph Query Language (GSQL), first install the query, for which the code was given earlier.

Now, run the query using c1 as a starting node, c3 as the ending node, and a maximum distance of 10:

This will return the following result:

However, in order to demonstrate the odd-length case, assume that s1 does not exist.

Figure 11: 2nd Iteration in a modified graph in which s1 does not exist. We got here by traveling one edge away form the vertices that were visited in the previous step. However, as we do not yet have a crossing, we must complete one more iteration.

Figure 12: Here, the paths from c1 have finally found a vertex that was previously found by the paths from c3 (and vice versa). That is, the blue paths traveled from c2 to m5 and from m4 to c4. In Figure 11, m5 and c4 were both orange. In Figure 12, we change a vertex's color to purple when one frontier meets the other. This tells us that the shortest path from c1 to c3 either goes through e8 or e3. If we go through e8, we go along the path c1-m2-c2-m5-m7-c3. Note that if we go through e3, we are given two paths. This is almost identical to the multiple path example from the first algorithm. From c4, we can either take e4 or e12 to get to c3. Thus, when going from c1 to c3 through e3, we are actually given two paths. These paths are c1-m1-m4-c4-m6-c3 and c1-m1-m4-c4-m7-c3.

The * operator in Lines 41 and 63 handle the case of multiple paths from one direction merging with multiple paths from the other direction. For example, we know there are two shortest paths from c4 to c3. Pretend for a moment that there are 3 shortest paths from c1 to m4. Then, when m4 and c4 meet, there would then be (3 * 2) = 6 shortest paths from c1 to c3.

Once again, we can implement this alternate graph in GSQL by using the DELETE keyword. First, we delete the vertex s1 from the graph by doing the following:

Now, we can run our query once again:

Notice that this time, we are given the three paths that we previously described.

Common Applications

16KB

DemoExamples_2.0.tar.gz

Example 1. Collaborative Filtering

Here is an observation about social networks: If a set of persons likes me, and many of them also like another person Z, it is probably true that person Z and I have some things in common. The same observation works for products and services: if a set of customers likes product X, and many of them also like product Z, then product X and Z probably have something in common. We say X and Z are "co-liked". This observation can be turned around into a search for recommendations: Given a user X, find a set of of users Z which are highly co-liked. For social networks, this can be used as friend recommendation: find a highly co-liked person Z to introduce to X. For e-commerce, this can be used for purchase recommendation: someone who bought X may also be interested in buying Z. This technique of finding the top co-liked individuals is called collaborative filtering.

A graph analytics approach is a natural fit for collaborative filtering because the original problem is in a graph (social network), and the search criteria can easily be expressed as a path in the graph. We first find all people Y who like user X, then find other users Z who are liked by someone in group Y, and rank members of Z according to how many times they're liked by Y.

Figure 1 below shows a simple graph according to our model. The circles represent three User vertices with id values id1, id2, and id3. There are two directed edges labeled "Liked" which show that User id2 likes id1, and id2 also likes id3. (In this model, friendship is directional because in online social networks, one of the two persons initiates the friendship.) There are two more directed edges in the opposite directions labeled "Liked_By". Since id2 likes both id1 and id3, id1 and id3 are co-liked.

Quick Demo

To just see the basic operation of the TigerGraph system, follow the easy instructions below . You can then continue to read for the explanation of the command files so you can learn to design your own examples.

Quick Demo Instructions

This example uses the graph below and asks the following query: "Who are the top co-liked persons of id1"?

Step 1: Obtain the data and command files. Create a graph model.

This example uses 4 small files: 3 command files ( cf_model.gsql , cf_load.gsql , cf_query.gsql) and one data file ( cf_data.csv) . Their contents are shown below, so you can either copy from this document or download the files (look in the "cf" subfolder of Examples.zip)

> gsql 'DROP ALL'
> gsql cf_model.gsql
> gsql 'CREATE GRAPH gsql_demo(*)'

Step 2: Load data:

The command below loads our new data.

> gsql -g gsql_demo cf_load.gsql

Step 3: Install and execute the query:

The file cf_query.gsql creates a query called topCoLiked. Then we install the query. The creation step runs fast, but the installation (compiling) step may take about 1 minute. We then run the query, asking for the top 20 Users who are co-liked with User id1.

> gsql -g gsql_demo cf_query.gsql
> gsql -g gsql_demo 'INSTALL QUERY topCoLiked'
> gsql -g gsql_demo 'RUN QUERY topCoLiked("id1", 20)'

The query results should be the following. Interpretation: id4 has as score (@cnt) = 2, which means there are two persons who like both id1 and id4. Next, id2 and id3 each have 1 co-friend in common with id1.

{
  "error": false,
  "message": "",
  "version": {
    "edition": "developer",
    "schema": 0,
    "api": "v2"
  },
  "results": [{"L2": [
    {
      "v_id": "id4",
      "attributes": {"@cnt": 2},
      "v_type": "User"
    },
    {
      "v_id": "id3",
      "attributes": {"@cnt": 1},
      "v_type": "User"
    },
    {
      "v_id": "id2",
      "attributes": {"@cnt": 1},
      "v_type": "User"
    }
  ]}]
}

We now begin a tutorial-style explanation of this TigerGraph example and the workflow in general.

The figure below outlines the steps to progress from an empty graph to a query solution. Each of the blocks below corresponds to one of the steps in the Quick Demo above. The tutorial below will give you a deeper understanding of each step, so you can learn how it works and so you can design your own graph solutions.

Step 1: Create a graph model.

The first step is to create a model for your data which describes the types of vertices and edges you will have.

This example is written to be compatible with older TigerGraph platforms which support only one graph model at a time (though the user can make the model simple or complex, to handle multiple needs). To clear an existing model and old data, so you can install a new one, run the DROP ALLcommand.

The statements below describe the vertex types and edge types in our Co-Liked model

CREATE VERTEX User (PRIMARY_ID id string)
CREATE DIRECTED EDGE Liked (FROM User, TO User) WITH REVERSE_EDGE = "Liked_By"

The first CREATE statement creates one vertex type called User. The second statement creates one directed edge type called Liked. The WITH REVERSE_EDGE clause means that for every two vertices (x,y) connected by a Liked type of edge, the system will automatically generate a corresponding edge of type Liked_By pointing from y to x, and both edges will have the same edge attributes.

After defining all your vertex and edge types, execute the following command to create a graph which binds the vertices and edges into one graph model:

CREATE GRAPH command

CREATE GRAPH gsql_demo(*)

The name of the graph is gsql_demo. Within the parentheses, you can either list the specific vertex and edge types (User, Liked), or you can use *, which means include everything. We chose to use * so that the same command can be used for all of our examples.

The CREATE commands can be stored in one file and executed together.

cf_model.gsql

CREATE VERTEX User (PRIMARY_ID id string)
CREATE DIRECTED EDGE Liked (FROM User, TO User) WITH REVERSE_EDGE = "Liked_By"
#CREATE GRAPH gsql_demo(*)

The CREATE GRAPH command is commented out for the following reason:

Our examples have been designed to run either as individual graphs or merged together into one multi-application graph. The CREATE GRAPH command may be run only once, after all the vertex and edge types have been created. (Each of our demo examples uses unique vertex and edge names, to avoid conflicts.) In other words, we run CREATE GRAPH gsql_demo(*)as a separate command after creating all the vertex and edge types. If you decide you want to modify the schema after running CREATE GRAPH, you can create and run a SCHEMA_CHANGE JOB.

Newer TigerGraph platforms (i.e., version 1.1 or higher) can support multiple graphs, but this tutorial has been designed to be compatible with older single-graph platforms.

To execute these statements (DROP ALL, CREATE VERTEX, etc.), you can type them individually at the GSQL shell prompt, or you can first save them to a file, such as cf_model.gsql , and then run the command file. From within the shell, you would run @cf_model.gsql From outside the shell, you would run > gsql cf_model.gsql

Normally a user would put all their CREATE VERTEX, CREATE EDGE, and the final CREATE GRAPH statements in one file. In our example files, we have separated out the CREATE GRAPH statement because we want to merge all our example schemas together into one common graph.

The vertex, edge, and graph types become part of the catalog . To see what is currently in your catalog, type the ls command from within the GSQL shell to see a report as below:

Catalog contents, as reported by the "ls" command

Vertex Types: 
  - VERTEX User(PRIMARY_ID id STRING) WITH STATS="OUTDEGREE_BY_EDGETYPE"

Edge Types: 
  - directed edge Liked(from User, to User) with reverse_edge="Liked_By"
  - directed edge Liked_By(from User, to User) with reverse_edge="Liked"

Graphs: 

Jobs: 
Queries:

Json API version: v2

To remove a definition from the catalog, use some version of the DROPcommand. Use the helpcommand to see a summary of available GSQL commands.
In our examples, we typically show keywords in ALL UPPERCASE to distinguish them from user-defined identifiers. Identifiers are case-sensitive but keywords are not.

In this example, the vertices and edges don't have attributes. In general, a TigerGraph graph can have attributes on both vertices and edges, and it can also have different types of edges connecting the same two vertices. Please see GSQL Language Reference Part 1 - Defining Graphs and Loading Data which provides a more complete description of the graph schema definition language with additional examples.

Step 2: Load initial data.

Figure 2 shows a larger graph with five vertices and several edges. To avoid crowding the figure, only the Liked edges are shown: For every Liked edge, there is a corresponding Liked_By edge in the reverse direction.

The data file below describes the five vertices and seven edges of Figure 2.

cf_data.csv

id2,id1
id2,id3
id3,id1
id3,id4
id5,id1
id5,id2
id5,id4

The loading job below will read from a data file and create vertex and edge instances to put into the graph.

Per Example: Load data into the graph (file: cf_load.gsql)

# define the loading job
USE GRAPH gsql_demo # added for v1.2
CREATE LOADING JOB load_cf FOR GRAPH gsql_demo {
  DEFINE FILENAME f;
  LOAD f
    TO VERTEX User VALUES ($0),
    TO VERTEX User VALUES ($1),
    TO EDGE Liked VALUES ($0, $1);
}

# load the data
RUN LOADING JOB load_cf USING f="../cf/data/cf_data.csv"

Now that we have defined a graph (in Step 1), GSQL commands or sessions should specify that you want to use a particular graph. Line 2 (new for v1.2) sets the working graph to be gsql_demo. Another way to set the working graph is to specify each time you invoke the gsql command, e.g.,

gsql -g gsql_demo cf_load.gsql

The CREATE LOADING JOB statement (line 3) defines a job called load_cf. The job will read each line of the input file, creates one vertex based on the value in the first column (referenced with column name $0), another vertex based on the value in the second column ($1), and one Liked edge pointing from the first vertex to the second vertex. In addition, since the Liked edge type definition includes the WITH REVERSE_EDGE clause, a Liked_By edge pointing in the opposite direction is also created.

After the job has been created, we run the job (line 12). the RUN LOADING JOB command line includes details about the data source: the name of the file is cf_data.csv, commas are used to separate columns, and \n is used to end each line. (Data files should not contain any extra spaces before or after the separator character.)

The TigerGraph loader automatically filters out duplicates. If either of the two column values has already been seen before, that vertex won't be created. Instead the existing vertex will be used. For example, if we read the first two data lines in data file cf_data.csv , the first line will generate two User vertices, one edge type of Liked , and one edge type of Liked_By . For the second row, however, only one new vertex will be created since id2 has been seen already. Two edges will be created for the second row.

It is okay to run an LOADING JOB again, or to run a different loading job, to add more data to a graph store which already has some data. For example, you could do the following:

RUN LOADING JOB load_cf USING f="../cf/cf_data1.csv"
RUN LOADING JOB load_cf USING f="../cf/cf_data2.tsv"

2.After loading, you can use the GraphStudio UI to visually inspect your data. Refer to the TigerGraph GraphStudio UI Guide .

To clear all your data but to keep your graph model, run the "CLEAR GRAPH STORE -HARD" command. -HARD must be in all capital letters.

Be very careful using CLEAR GRAPH STORE; there is no UNDO command.

For the querying and updating examples in the remainder of this use case, we will assume that Figure 2 has been loaded.

This loading example is basic. The GSQL language can do complex data extraction and transformation, such as dealing with JSON input format and key-value list input, all in high-level syntax. Please see GSQL Language Reference Part 1 - Defining Graphs and Loading Data for more examples.

Step 3: Write a graph-based query solution

The GSQL language includes not only data definition and simple inspection of the data, but also advanced querying which traverses the graph and which supports aggregation and iteration.

Built-In Queries

First , we can run some simple queries to verify that the data were loaded correctly. Below are some examples of some built-in GSQL queries which can be run in GSQL shell:

GSQL > SELECT * FROM User LIMIT 5
[
  {
    "v_id": "id2",
    "attributes": {},
    "v_type": "User"
  },
  {
    "v_id": "id5",
    "attributes": {},
    "v_type": "User"
  },
  {
    "v_id": "id3",
    "attributes": {},
    "v_type": "User"
  },
  {
    "v_id": "id4",
    "attributes": {},
    "v_type": "User"
  },
  {
    "v_id": "id1",
    "attributes": {},
    "v_type": "User"
  }
]

Create a Query

Note on approx_count(*)

The approx_count(*) function relies on statistics which may not account for recent insertions and deletions. If there has been no recent activity, they will give accurate results. In contrast, the count(*) function insures that recent data insertions and deletions are processed, so that it returns an accurate count.

SELECT * displays information in JSON format. Below is an example of query output.

Now let's solve our original problem: find users who are co-liked with a user X. The following query demonstrates a 2-step traversal with aggregation.

The query below performs the co-liked collaborative filtering search. The concept behind this query is to describe a "graph path" which represents the relationship between a person (the starting point) and a person that is co-liked (the ending point). Figure 1 illustrates this path: id3 is a co-liked user of id1, because id2 likes both of them. The path from id1 to co-liked users is: (1) traverse a Liked_By edge to a User, and then (2) traverse a Liked edge to another User. This query also calculates the magnitude of the relationship between the starting point and each ending point. The more users there are such as id2 which connect id1 and id3, the stronger the co-like relationship between id1 and id3. Counting the number of paths that end at id3 serves to calculate this magnitude.

cf_query.gsql - Define the collaborative filtering query

CREATE QUERY topCoLiked( vertex<User> input_user,  INT topk) FOR GRAPH gsql_demo
{
  SumAccum<int>  @cnt = 0;
  # @cnt is a runtime attribute to be associated with each User vertex
  # to record how many times a user is liked.
    L0 = {input_user};
    L1 = SELECT tgt
    	FROM L0-(Liked_By)->User:tgt;
    L2 =  SELECT tgt
          FROM L1-(Liked)->:tgt
          WHERE  tgt != input_user
          ACCUM tgt.@cnt += 1
          ORDER BY tgt.@cnt DESC
          LIMIT topk;
  PRINT L2;
}

This query is structured like a procedure with two input parameters: an input vertex and value of k for the top-K ranking. The query contains three SELECT statements executed in order. The L0 statement defines our initial list of vertices: a set containing a single user supplied by the input_user parameter. Suppose the input user is id1 . Next, the L1 statement starts from every vertex in the set L0, traverses every connected edge of type Liked_By and returns every target vertex (that is, the other ends of the connected edges). As a result, L1 is the set of all users who liked the input user. Referring to the graph in Figure 2, the query travels backwards along every Liked edge which points to id1 , arriving at id2 , id3 , and id5 . These three vertices form L1. Next, the L2 statement starts from each user in L1, travels to every user liked by that starting user (via the Liked type of edges), and increments the count for each User reached. That is, the algorithm counts how many times each vertex is visited by a query path. The WHERE condition makes sure the original input user will not be returned in the result.ORDER BY and LIMIT have the same meaning as in SQL. Below, we show how the L2 step tallies the counts for each vertex encountered:

From id2, Liked edges lead to id1 and id3. id1 is excluded due to the WHERE clause. The cnt count for id3 is incremented from 0 to 1.
From id3, Liked edges lead to id1 and id4. id1 is excluded due to the WHERE clause. The cnt count for id4 is incremented from 0 to 1.
From id5, Liked edges lead to id1, id2, and id4. id1 is excluded to to the WHERE clause. The cnt count of id2 is incremented from 0 to 1. The cnt count of id4 is incremented from 1 to 2.

The three co-liked users and their cnt scores: id3 (cnt score = 1), id4 (cnt = 2), and id2 (cnt = 1). The ORDER BY clause indicates that the sorting should be in descending order, such that the LIMIT clause trims L2 to the 20 vertices with the highest (as opposed to lowest) cnt values. For the test graph, there are only 3 vertices which are co-liked, less than the limit of 20. id4 has the strongest co-liked relationship.

Install and Run a Query

After the query is defined (in the CREATE QUERY block), it needs to be installed. The INSTALL QUERY command compiles the query.

INSTALL QUERY topCoLiked

If you have several queries, you can wait to install them in one command, which runs faster than installed each one separately. E.g.,

INSTALL QUERY query1, query2

INSTALL QUERY ALL

is faster than

INSTALL QUERY query1
INSTALL QUERY query2

After a query has been installed, it can be run as many times has desired. The command RUN QUERY invokes the query, with the given input arguments.

Using "id1" as the starting point and allowing up to 5 vertices in the output, the RUN QUERY command and its output on our test graph is shown below:

GSQL > RUN QUERY topCoLiked("id1", 5)
{
  "error": false,
  "message": "",
  "version": {
    "edition": "developer",
    "schema": 0,
    "api": "v2"
  },
  "results": [{"L2": [
    {
      "v_id": "id4",
      "attributes": {"@cnt": 2},
      "v_type": "User"
    },
    {
      "v_id": "id3",
      "attributes": {"@cnt": 1},
      "v_type": "User"
    },
    {
      "v_id": "id2",
      "attributes": {"@cnt": 1},
      "v_type": "User"
    }
  ]}]
}

Instead of using the RUN QUERY command within the GSQL shell, the query can be invoked from the operating system via a RESTful GET endpoint (which is automatically created by the INSTALL QUERY command):

curl -X GET "http://hostName:port/query/gsql_demo/topCoLiked?input_user=id1&topk=5"

If you followed the standard installation instructions for the TigerGraph system, hostName for the REST server is localhostand port is 9000.

As of TigerGraph 1.2, the URL for query REST endpoints includes the graph name after query/. Prior to 1.2, the URL for the example above was http://hostName:port /query/topCoLiked

Step 4 (Optional): Update Your Data.

You can update the stored graph at any time, to add new vertices and edges, to remove some, or to update existing values. The GSQL language includes ADD, DROP, ALTER, UPSERT, and DELETE operations which are similar to the SQL operations of the same name. The UPSERT operation is a combined UPDATE-INSERT operation: If object exists, then UPDATE, else INSERT. Note that this is the default behavior for The GSQL language's 'smart' loading described above. There are three basic types of modifications to a graph:

Adding or deleting objects
Altering the schema of the graph
Modifying the attributes of existing objects

We'll give a quick example of each type. To show the effect each modification, we'll use the following simple built-in queries:

cf_mod_check.gsql

SELECT * FROM User LIMIT 1000
SELECT * FROM User-(Liked)->User WHERE from_id=="id2"

The current results, before making any modifications, are shown below.

Users vertices and Edges from id2, before any modifications

GSQL > SELECT * FROM User LIMIT 1000
[
  {
    "v_id": "id2",
    "attributes": {},
    "v_type": "User"
  },
  {
    "v_id": "id5",
    "attributes": {},
    "v_type": "User"
  },
  {
    "v_id": "id3",
    "attributes": {},
    "v_type": "User"
  },
  {
    "v_id": "id4",
    "attributes": {},
    "v_type": "User"
  },
  {
    "v_id": "id1",
    "attributes": {},
    "v_type": "User"
  }
]
GSQL > SELECT * FROM User-(Liked)->User WHERE from_id=="id2"
[
  {
    "from_type": "User",
    "to_type": "User",
    "directed": true,
    "from_id": "id2",
    "to_id": "id3",
    "attributes": {},
    "e_type": "Liked"
  },
  {
    "from_type": "User",
    "to_type": "User",
    "directed": true,
    "from_id": "id2",
    "to_id": "id1",
    "attributes": {},
    "e_type": "Liked"
  }
]

Graph modification operations are performed by a distributed computing model which satisfies Sequential Consistency. For these examples, a brief one second pause between the updating and querying the graph should be sufficient.

Modification Type 1: Adding or deleting

Adding is simply running a loading job again with a new data file. More details are in the GSQL Language Reference Part 1.

Deleting: Suppose we want to delete vertex id3 and all its connections:

cf_mod1.gsql

DELETE FROM User WHERE primary_id=="id3"

Users vertices and Edges from id2, after Modification 1

GSQL > SELECT * FROM User LIMIT 1000
[
  {
    "v_id": "id2",
    "attributes": {},
    "v_type": "User"
  },
  {
    "v_id": "id5",
    "attributes": {},
    "v_type": "User"
  },
  {
    "v_id": "id4",
    "attributes": {},
    "v_type": "User"
  },
  {
    "v_id": "id1",
    "attributes": {},
    "v_type": "User"
  }
]
GSQL > SELECT * FROM User-(Liked)->User WHERE from_id=="id2"
[{
  "from_type": "User",
  "to_type": "User",
  "directed": true,
  "from_id": "id2",
  "to_id": "id1",
  "attributes": {},
  "e_type": "Liked"
}]

Modification Type 2: Altering the schema

The GSQL DELETE operation is a cascading deletion. If a vertex is deleted, then all of the edges which connect to it are automatically deleted as well.

Result: one fewer vertex and one fewer edge from id2.

The GSQL language supports four types of schema alterations:

Adding a new type of vertex or edge: ADD VERTEX | DIRECTED EDGE | UNDIRECTED EDGE
Removing a type of vertex or edge: DROP VERTEX | DIRECTED EDGE | UNDIRECTED EDGE
Adding attributes to a vertex or edge type: ALTER VERTEX vertex_type | EDGE edge_type ADD ATTRIBUTE (name type)
Removing attributes of a vertex or edge type: ALTER VERTEX vertex_type | EDGE edge_type DROP ATTRIBUTE (name)

To make schema changes, create a SCHEMA_CHANGE job. Running the SCHEMA_CHANGE JOB will automatically stop all services, update the graph store, and restart the service. For example, suppose we wish to add a name for Users and a weight to Liked edges to indicatehow much User A likes User B.

cf_mod2.gsql

CREATE GLOBAL SCHEMA_CHANGE JOB cf_mod2 {
	ALTER VERTEX User ADD ATTRIBUTE (name string);
    ALTER EDGE Liked ADD ATTRIBUTE (weight float DEFAULT 1);
}
RUN JOB cf_mod2

As of v1.2, the schema_change job here needs to be GLOBAL because the User vertex and Liked edge are global types (they were defined before an active graph was set.)

Changing the schema may necessitate changing queries and other tasks, such as REST endpoints. In this example, the collaborative filtering query will still run with the the new weight attribute, but it will ignore the weight in its calculations.

Users vertices and Edges from id2, after Modification 2

GSQL > SELECT * FROM User LIMIT 1000
[
  {
    "v_id": "id2",
    "attributes": {"name": ""},
    "v_type": "User"
  },
  {
    "v_id": "id5",
    "attributes": {"name": ""},
    "v_type": "User"
  },
  {
    "v_id": "id4",
    "attributes": {"name": ""},
    "v_type": "User"
  },
  {
    "v_id": "id1",
    "attributes": {"name": ""},
    "v_type": "User"
  }
]
GSQL > SELECT * FROM User-(Liked)->User WHERE from_id=="id2"
[{
  "from_type": "User",
  "to_type": "User",
  "directed": true,
  "from_id": "id2",
  "to_id": "id1",
  "attributes": {"weight": 1},
  "e_type": "Liked"
}]

Modification Type 3: Modifying the attributes of existing objects

Now that we have added a weight attribute, we probably want to assign some weight values to the graph. The following example updates the weight values of two edges. For edge upserts, the first two arguments in the VALUES list specify the FROM vertex id and the TO vertex_id, respectively. Similarly, for vertex upserts, the first argument in the VALUES list specifies the PRIMARY_ID id. Since id values may not be updated, the GSQL shell implicitly applies a conditional test: "If the specified id value(s) exist, than update the non-id attributes in the VALUES list; otherwise, insert a new data record using these values."

cf_mod3.gsql

UPSERT User VALUES ("id1", "Aaron")
UPSERT User VALUES ("id2", "Bobbie")
UPSERT User-(Liked)->User VALUES ("id2","id1",2.5)

Users vertices and Edges from id2, after Modification 3

GSQL > SELECT * FROM User LIMIT 1000
[
  {
    "v_id": "id2",
    "attributes": {"name": "Bobbie"},
    "v_type": "User"
  },
  {
    "v_id": "id5",
    "attributes": {"name": ""},
    "v_type": "User"
  },
  {
    "v_id": "id4",
    "attributes": {"name": ""},
    "v_type": "User"
  },
  {
    "v_id": "id1",
    "attributes": {"name": "Aaron"},
    "v_type": "User"
  }
]
GSQL > SELECT * FROM User-(Liked)->User WHERE from_id=="id2"
[{
  "from_type": "User",
  "to_type": "User",
  "directed": true,
  "from_id": "id2",
  "to_id": "id1",
  "attributes": {"weight": 2.5},
  "e_type": "Liked"
}]

Other Modes for Graph Updates

In addition to making graph updates within the GSQL Shell, there are two other ways: sending a query string directly to the Standard Data Manipulation REST API, or writing a custom REST endpoint. For details about the first method, see the GET, POST, and DELETE /graphendpoints in the RESTPP API User Guide . The functionality in GSQL and in the Standard Query API is the same; GSQL commands are translated into REST GET, POST, and DELETE requests and submitted to the Standard Query API.

The REST API equivalent of the GSQL Modification 3 upsert example above is as follows:

curl -X POST --data-binary @ data/cf_mod3_input.json http://hostName:9000/graph

where serverIP is the IP address of your REST server (default = localhost) and data/cf_mod3_input.json is a text file containing the following JSON-encoded data:

cf_upsert.json

{
    "vertices": {
        "User":{
            "id1":{
                "name":{
                    "value":"Aaron"
                }
            }
        },
        "User":{
            "id2":{
                "name":{
                    "value":"Bobbie"
                }
            }
        }  
    },
    "edges": {
         "User":{
             "id2":{
                 "Liked":{
                     "User":{
                         "id1":{
                             "weight" : {
                                 "value":2.5
                             }
                         }
                     }
                 }
             }
        }
    }
}

Example 2. Page Rank

This example shows the use of WHILE loop iteration, global variables , and the built-in outdegree attribute.

It is recommended that you do the Collaborative Filtering Use Case first, because it contains additional tips on running the TigerGraph system.

Remember that if you have a text file containing GSQL commands (e.g., commands.gsql), you can run it one of two ways:

From Linux: gsql commands.gsql
From inside the GSQL shell: @commands.gsql

To run a single command (such as DROP ALL):

From Linux: gsql 'DROP ALL'
From inside the GSQL shell: DROP ALL

Setting the working graph

If a graph has been defined, then all subsequent gsql commands need to specify which graph is being used. If your command file does not contain a "USE GRAPH" statement, then you can specify the graph when invoking gsql:gsql -g graph_name commands.gsql

gsql -g graph_name commands.gsql

If you are always using the same graph, you can define a Linux alias to automatically include your graph name:

alias gsql='gsql -g graph_name'

You can add this line to the .bashrc in your home directory so that the alias is defined each time you open a bash shell.

Step 1: Create a graph model.

In this example, there is only one type of vertex and one type of edge, and edges are directed.

pagerank_model.gsql

CREATE VERTEX Page (PRIMARY_ID pid string, page_id string)
CREATE DIRECTED EDGE Linkto (FROM Page, TO Page)
#CREATE GRAPH gsql_demo(*)

Note how the Page vertex type has both a PRIMARY_ID and a page_id attribute. As will be seen in step 2, the same data will be loaded into both fields. While this seems redundant, this is a useful technique in TigerGraph graph stores. The PRIMARY_ID is not treated as an ordinary attribute. In exchange for high-performance storage, the PRIMARY_ID lacks some of the filtering and querying features available to regular attributes. The Linkto edge does not have any attributes. In general, a TigerGraph graph can have attributes on both vertices and edges, and it can also have different types of edges connecting the same two vertices.

The CREATE GRAPH command is commented out for the following reason:

Our examples have been designed to run either as individual graphs or merged together into one multi-application graph. The CREATE GRAPH command should be run only once, after all the vertex and edge types for all the examples have been created. (Naturally, every model uses unique vertex and edge names, to avoid conflicts.) In other words, run ' CREATE GRAPH gsql_demo(*) ' as a separate command after you have created all your vertex and edge types.

Please see the GSQL Language Reference which provides a more complete description of the graph schema definition language with additional examples .

Step 2: Load initial data

A similar graph to what was used for the Collaborative Filtering user-user network example can be used for an example here. That is, each row has two values which are node IDs, meaning that there is a connection from the first node to the second node. However, we will introduce a difference to demonstrate the flexibility of the TigerGraph loading system. We will modify the data file to use the tab character as a field separator instead of the comma.

pagerank_data.tsv

Loading job:

Create your loading job and load the data.

Per Example: Load data into the graph (file: pagerank_load.gsql)

# define the loading job
CREATE LOADING JOB load_pagerank FOR GRAPH gsql_demo {
  DEFINE FILENAME f;
  LOAD f
      TO VERTEX Page VALUES ($0, $0),
      TO VERTEX Page VALUES ($1, $1),
      TO EDGE Linkto VALUES ($0, $1)
      USING SEPARATOR="\t";
}

# load the data
RUN LOADING JOB load_pagerank USING f="../pagerank/pagerank_data.tsv"

The above loading job will read each line of the input file (pagerank_data.tsv), create one vertex based on the value in the first column (referenced as $0), another vertex based on the value in the second column ($1), and one edge pointing from the first vertex to the second vertex. If either of the two column values has already been seen before, that vertex won't be created. Instead the existing vertex will be used. For example, the first row of pagerank_data.tsv, will create two vertices, with ids 1 and 2, and one edge (1, 2). The second row, however, will create only one new vertex, id 3, and one edge (1, 3), because id 1 already exists.

Note how the LOAD statement specifies the SEPARATOR character is the tab character.

Step 3: Write a Graph-based query solution

GSQL includes not only data definition and simple inspection of the data, but also advanced querying which traverses the graph and which supports aggregation and iteration. This example uses iterations, repeating the computation block until the maximum score change at any vertex is no more than a user-provided threshold, or until it reaches a user-specified maximum number of allowed iterations. Note the arrow -> in the FROM clause used to represent the direction of a directed edge.

pagerank_query.gsql

CREATE QUERY pageRank (float maxChange, int maxIteration, float dampingFactor)
FOR GRAPH gsql_demo
{
  # In each iteration, compute a score for each vertex:
  #   score = dampingFactor + (1-dampingFactor)* sum(received scores from its neighbors).
  # The pageRank algorithm stops when either of the following is true:
  #  a) it reaches maxIterations iterations;
  #  b) max score difference of any vertex compared to the last iteration <= maxChange.
  #   @@ prefix means a global accumulator;
  #   @ prefix means an individual accumulator associated with each vertex

  MaxAccum<float> @@maxDifference = 9999; # max score change in an iteration
  SumAccum<float> @received_score = 0; # sum of scores each vertex receives from neighbors
  SumAccum<float> @score = 1;   # initial score for every vertex is 1.

  AllV = {Page.*};   #  Start with all vertices of type Page
  WHILE @@maxDifference > maxChange LIMIT maxIteration DO
    @@maxDifference = 0;
    S = SELECT s
         FROM AllV:s-(Linkto)->:t
         ACCUM t.@received_score += s.@score/s.outdegree()
         POST-ACCUM s.@score = dampingFactor + (1-dampingFactor) * s.@received_score,
                    s.@received_score = 0,
                    @@maxDifference +=   abs(s.@score - s.@score');
    PRINT @@maxDifference; # print to default json result
  END; # end while loop
  #PRINT AllV.page_id, AllV.@score;       # print the results, JSON output API version v1
  PRINT AllV[AllV.page_id, AllV.@score];  # print the results, JSON output API version v2
} # end query

For JSON output API v2, the PRINT syntax for a vertex set variable is different than the v1 syntax.

After executing the CREATE QUERY command, remember to install the query, either by itself or together with other queries:

Install the query

INSTALL QUERY pageRank

Run the query:

We will use the typical dampingFactor of 0.15, iterate until the pagerank values change by less than 0.001, up to a maximum of 100 iterations. For these conditions, the PageRank values for the 4 vertices (1,2,3,4) are ( 0.65551, 0.93379, 1.22156, 1.18914), respectively.

RUN QUERY pageRank(0.001, 100, 0.15)
{
  "error": false,
  "message": "",
  "version": {
    "edition": "developer",
    "schema": 0,
    "api": "v2"
  },
  "results": [
    {"@@maxDifference": 0.425},
    {"@@maxDifference": 0.36125},
    {"@@maxDifference": 0.15353},
    {"@@maxDifference": 0.19575},
    {"@@maxDifference": 0.16639},
    {"@@maxDifference": 0.09429},
    {"@@maxDifference": 0.08014},
    {"@@maxDifference": 0.05961},
    {"@@maxDifference": 0.04705},
    {"@@maxDifference": 0.03999},
    {"@@maxDifference": 0.017},
    {"@@maxDifference": 0.02},
    {"@@maxDifference": 0.017},
    {"@@maxDifference": 0.00953},
    {"@@maxDifference": 0.0081},
    {"@@maxDifference": 0.00616},
    {"@@maxDifference": 0.00479},
    {"@@maxDifference": 0.00407},
    {"@@maxDifference": 0.00178},
    {"@@maxDifference": 0.00205},
    {"@@maxDifference": 0.00174},
    {"@@maxDifference": 9.6E-4},
    {"AllV": [
      {
        "v_id": "2",
        "attributes": {
          "AllV.page_id": "2",
          "AllV.@score": 0.93379
        },
        "v_type": "Page"
      },
      {
        "v_id": "4",
        "attributes": {
          "AllV.page_id": "4",
          "AllV.@score": 1.18914
        },
        "v_type": "Page"
      },
      {
        "v_id": "1",
        "attributes": {
          "AllV.page_id": "1",
          "AllV.@score": 0.65551
        },
        "v_type": "Page"
      },
      {
        "v_id": "3",
        "attributes": {
          "AllV.page_id": "3",
          "AllV.@score": 1.22156
        },
        "v_type": "Page"
      }
    ]}
  ]
}

Step 4 (Optional): Update Your Data.

Details about updating were discussed in Use Case 1 (Collaborative Filtering). We will go right to the graph modification examples for the PageRank case.

To show the effect of each modification, we use two built-in queries. The first one lists all the Page vertices. The second one lists all the edges which start at Page 4.

pagerank_mod_check.gsql

SELECT * FROM Page LIMIT 1000
SELECT * FROM Page-(Linkto)->Page WHERE from_id=="4"

These are the results of the diagnostic queries, before any graph modifications. There are 4 vertices total and 2 edges which start at page 4.

Page vertices and Linkto edges from Page 4, before modifications

SELECT * FROM Page LIMIT 1000
[
  {
    "v_id": "2",
    "attributes": {"page_id": "2"},
    "v_type": "Page"
  },
  {
    "v_id": "4",
    "attributes": {"page_id": "4"},
    "v_type": "Page"
  },
  {
    "v_id": "1",
    "attributes": {"page_id": "1"},
    "v_type": "Page"
  },
  {
    "v_id": "3",
    "attributes": {"page_id": "3"},
    "v_type": "Page"
  }
]
SELECT * FROM Page-(Linkto)->Page WHERE from_id=="4"
[
  {
    "from_type": "Page",
    "to_type": "Page",
    "directed": true,
    "from_id": "4",
    "to_id": "2",
    "attributes": {},
    "e_type": "Linkto"
  },
  {
    "from_type": "Page",
    "to_type": "Page",
    "directed": true,
    "from_id": "4",
    "to_id": "1",
    "attributes": {},
    "e_type": "Linkto"
  }
]

Modification 1: Adding or deleting

Adding is simply running a loading job again with a new data file.

Deleting: Suppose we want to delete vertex url4 and all its connections:

pagerank_mod1.gsql

DELETE FROM Page WHERE page_id=="1"

The GSQL DELETE operation is a cascading deletion. If a vertex is deleted, then all of the edges which connect to it are automatically deleted as well.

Result: one fewer vertex and one fewer edge from Page 4.

Page vertices and Linkto edges from Page 4, after Modification 1

SELECT * FROM Page LIMIT 1000
[
  {
    "v_id": "2",
    "attributes": {"page_id": "2"},
    "v_type": "Page"
  },
  {
    "v_id": "4",
    "attributes": {"page_id": "4"},
    "v_type": "Page"
  },
  {
    "v_id": "3",
    "attributes": {"page_id": "3"},
    "v_type": "Page"
  }
]
SELECT * FROM Page-(Linkto)->Page WHERE from_id=="4"
[{
  "from_type": "Page",
  "to_type": "Page",
  "directed": true,
  "from_id": "4",
  "to_id": "2",
  "attributes": {},
  "e_type": "Linkto"
}]

Modification Type 2: Altering the schema

For example, suppose we wish to add an attribute to the Page vertices to classify what type of Page it is and also a date to the edges.

pagerank_mod2.gsql

CREATE GLOBAL SCHEMA_CHANGE JOB pagerank_mod2 {
    ALTER VERTEX Page ADD ATTRIBUTE (pageType string DEFAULT "");
    ALTER EDGE Linkto ADD ATTRIBUTE (dateLinked string DEFAULT "");
}
RUN JOB pagerank_mod2

Changing the schema may necessitate revising and reinstalling loading jobs and queries. In this case, adding the pageType attribute does not harm the pageRank query.

This schema_change job is GLOBAL because the Page vertex and Linkto edge types are global (defined before setting an active graph).

Page vertices and Linkto edges from Page 4, after Modification 2

SELECT * FROM Page LIMIT 1000
[
  {
    "v_id": "2",
    "attributes": {
      "page_id": "2",
      "pageType": ""
    },
    "v_type": "Page"
  },
  {
    "v_id": "4",
    "attributes": {
      "page_id": "4",
      "pageType": ""
    },
    "v_type": "Page"
  },
  {
    "v_id": "3",
    "attributes": {
      "page_id": "3",
      "pageType": ""
    },
    "v_type": "Page"
  }
]
SELECT * FROM Page-(Linkto)->Page WHERE from_id=="4"
[{
  "from_type": "Page",
  "to_type": "Page",
  "directed": true,
  "from_id": "4",
  "to_id": "2",
  "attributes": {"dateLinked": ""},
  "e_type": "Linkto"
}]

Modification Type 3: Modifying the attributes of existing objects

The following example updates the type values of two vertices and one edge.

UPSERT Page VALUES (2,2,"info")
UPSERT Page VALUES (3,3,"commerce")
UPSERT Page-(Linkto)->Page VALUES (4,2,"2016-08-31")

Page vertices and Linkto edges from Page 4, after Modification 3

SELECT * FROM Page LIMIT 1000
[
  {
    "v_id": "2",
    "attributes": {
      "page_id": "2",
      "pageType": "info"
    },
    "v_type": "Page"
  },
  {
    "v_id": "4",
    "attributes": {
      "page_id": "4",
      "pageType": ""
    },
    "v_type": "Page"
  },
  {
    "v_id": "3",
    "attributes": {
      "page_id": "3",
      "pageType": "commerce"
    },
    "v_type": "Page"
  }
]
SELECT * FROM Page-(Linkto)->Page WHERE from_id=="4"
[[{
  "from_type": "Page",
  "to_type": "Page",
  "directed": true,
  "from_id": "4",
  "to_id": "2",
  "attributes": {"dateLinked": "2016-08-31"},
  "e_type": "Linkto"
}]

Other Modes for Graph Updates

In addition to making graph updates within the GSQL Shell, there are two other ways: sending a query string directly to the Standard Data Manipulation REST API, or writing a custom REST endpoint. For details about the first method, see the Standard Data Manipulation REST API User Guide . The functionality in GSQL and in the Standard Query API is essentially the same; GSQL commands are translated into REST GET, POST, and DELETE requests and submitted to the Standard Query API.

The REST API equivalent of the GSQL Modification 3 upsert example above is as follows:

curl -X POST --data-binary @data/pagerank_mod3_input.json http://hostName:9000/graph

where hostName is the IP address of your REST server, and data /pagerank_mod3_input.json is a text file containing the following JSON-encoded data:

{
   "vertices": {
        "Page":{
            "2":{
                "pageType" : {
                    "value":"info"
                }
            }
        },
        "Page":{
            "3":{
                "pageType" : {
                    "value":"commerce"
                }
            }
        }
    },
    "edges": {
         "Page":{
             "4":{
                 "Linkto":{
                     "Page":{
                         "2":{
                             "dateLinked" : {
                                 "value":"2016-08-31"
                             }
                         }
                     }
                 }
             }
        }
    }
}

Example 3. Simple Product Recommendation

This example introduces the technique of flattening – splitting a data field which contains a set of elements into individual vertices and edges, one for each element.

Input Data: A list of products. Each Product has a 64-bit image hash value and a list of words describing the product. Query Task : Find the products which are most similar to a given product. Formally, given a product id P and an integer K,return the top K products similar to the product P. The similarity between a product P and another product Q is based on the number of words found in the product descriptions for both product P and product Q.

Step 1: Create a graph model for the use case, using the data definition language (DDL) aspect of the GSQL language.

simprod_model.gsql

CREATE VERTEX Product (PRIMARY_ID pid string, image_hash uint)
CREATE VERTEX DescWord (PRIMARY_ID id string)
CREATE UNDIRECTED EDGE Has_desc (FROM Product, TO DescWord)

Then run

CREATE GRAPH gsql_demo(*)

The above statements create two types of vertices, Product and DescWord, and one type of edge connecting the two vertex types. The edge is undirected so that you can just as easily traverse from a Product to its descriptive words or from a descriptive word to Products which are described by it.

The generated graph schema for this case is shown below. The GSQL Language Reference manual provides a more complete description of the language with more examples .

Step 2: Load Input Data.

In this example, the input data are all stored in a single file having a 3-column format with a header column. Below are the test data:

simprod_data.csv

id,hash,words
62abcax334,15243242,"word1,word2,word3"
dell laptop,1837845,"word2,word4,word5"
mac book, 128474373,"word4"
surface pro,8439828,"word1,word3,word6"
hp book,29398439828,"word2,word3,word1"
linux abc,298439234,"word4,word2,word1"
linux def,295839234,"word4,word2,word6,word7"

Column 1 is the product id; column 2 is the image hash code, and column 3 is a list of words describing the product. Note how double quotation marks are used to enclose the list of words. Each row from the input file may lead to the creation of one Product vertex, multiple DescWord vertices, and multiple edges, one edge connecting the Product to each DescWord vertex.

The loading job below makes use of several features of the loading language to intelligently transform this data file into the appropriate vertices and edges.

The HEADER="true" option tells the loader that the data file's first line contains column headings instead of data. It will read the column headings and permit these heading names to be used instead of index numbers $1, $2, etc.
DEFINE HEADER and USER_DEFINED_HEADER allow the loading job to define its own names for the columns ("id", "hash", "words"), instead of the index numbers ($0, $1, $2) and overriding the file's own headings.
QUOTE="double" informs the loader that double quotation marks enclose strings. This allows the separator character (e.g., comma) to appear in the string, without triggering the end of the token. QUOTE="single" is also available.
The special TEMP_TABLE and flatten() function are used to split the list of tokens into separate items and to store them temporarily. The temporary items are then used to assemble the final edge objects.

simprod_load.gsql

# define the loading job
CREATE LOADING JOB load_simprod FOR GRAPH gsql_demo {
  DEFINE HEADER head1 = "id","hash","words";
  DEFINE FILENAME f1;
  LOAD f1
    TO VERTEX Product values ($"id", $"hash"),
    TO TEMP_TABLE t (pid, description) VALUES ($"id", flatten($"words", ",", 1))
    USING QUOTE="double", HEADER="true", USER_DEFINED_HEADER="head1";
  LOAD TEMP_TABLE t
    TO VERTEX DescWord VALUES ($"description"),
    TO EDGE Has_desc VALUES ($"pid", $"description");
}

# load the data
RUN LOADING JOB load_simprod USING f1="../simprod/data/simprod_data.csv"

In general, the GSQL language can map and transform multiple input files to multiple vertex and edge types. More advanced data transformation and filtering features are also available. See the GSQL Language Reference manual for more information.

An example of the resulting data graph is shown below. Products (P1, P2, etc.) connect to various DescWords (Word1, Word2, etc.). Each Product connects to many DescWords, and each DescWord is used in multiple Products.

Step 3: Write a graph-based solution using TigerGraph's high-level GSQL query language, to solve the use case and auto-generate the REST GET/POST endpoints for real-time accesses to TigerGraph's system.

simprod_query.gsql

CREATE QUERY productSuggestion (vertex<Product> seed, int threshold_cnt, int k = 10)
FOR GRAPH gsql_demo
{
    # an accumulator variable attachable to any vertex
    SumAccum<int>     @cnt = 0;
 
    # T0 is the set of products from which we want to start traversal in the graph
    T0={seed};
 
    /**
    *  Compute the collection of words describing the input
    *  product. tgt is the alias of vertex type DescWord.
    *  In other words, for every edge of the given type (Has_desc)
    *  that has one vertex in the set T0 and the other vertex being of
    *  the DescWord type, add its DescWord vertex to the output set.
    */
    ProductWords = SELECT tgt
               FROM T0-(Has_desc)-DescWord:tgt; 
 
    /**
     *  The output set of the previous query (ProductWords) becomes the input set
     *  of this query. From each word in ProductWords, activate all other products
     *  which contain the word in their description, and accumulate a count for
     *  each activated product to record how many words it has in common with the
     *  input product.  Then rank each related product using the count of common
     *  words; the count must exceed the query parameter threshold_cnt.
     */
    Results = SELECT tgt
        FROM ProductWords-(Has_desc)->Product:tgt
        WHERE tgt != seed
        ACCUM tgt.@cnt += 1
        HAVING tgt.@cnt > threshold_cnt 
        ORDER BY tgt.@cnt DESC
        LIMIT k;
 
    PRINT Results; # default print output is the REST call response in JSON format.
}

Query Result:

For product id= 62abcax334, find the top 3 similar products, which have more than 1 descriptive word in common with product 62abcax334.

//INSTALL QUERY productSuggestion
RUN QUERY productSuggestion("62abcax334", 1, 3)
{
  "error": false,
  "message": "",
  "version": {
    "edition": "developer",
    "schema": 0,
    "api": "v2"
  },
  "results": [{"Results": [
    {
      "v_id": "hp book",
      "attributes": {
        "@cnt": 3,
        "image_hash": 29398439828
      },
      "v_type": "Product"
    },
    {
      "v_id": "surface pro",
      "attributes": {
        "@cnt": 2,
        "image_hash": 8439828
      },
      "v_type": "Product"
    },
    {
      "v_id": "linux abc",
      "attributes": {
        "@cnt": 2,
        "image_hash": 298439234
      },
      "v_type": "Product"
    }
  ]}]
}

When installing the above GSQL query, a REST GET endpoint for this query will automatically be generated. Instead of running the query as a GSQL command, clients can also invoke the query by formatting the query as a HTTP request query string and sending a GET request, e.g.,

Example 4. Same Name Search

This example introduces the CASE...WHEN...THEN structure, which can also be used as an if...then block.

Input Data: A social network, where each person has a first and last name and may also display a picture of themselves.

Query Task : Find the users who are most "similar" to a user X. Specifically, a user X searches for other users whose first or last name matches user X's name. The query returns the list of users (Y1,Y2,...Yk) within two steps (two steps means friend-of-friend), who have matching names, and who offer a picture. The list is sorted and ranked by the relevance score between X and another user Yi, where the score is a linear function of four factors:

curl -X GET "http://hostName:port/query/gsql_demo/productSuggestion?seed=62abcax334&threshold_cnt=1&k=3"

For the standard TigerGraph configuration, hostName:port is localhost:9000

depth : how far X is from Yi (the shortest distance)
count : the number of shortest paths between X and Yi
match : whether Yi matches the input first name (match=1), the input last name (match=2), or both input names (match=3)
profile : whether Yi has a profile picture

Using the graph above as an example, suppose we want to compute relevance scores for the social network of the Tom Smith on the left.

match=1 for Tom Lee (first names are the same)
match=2 for May Smith (last names are the same)
match=3 for Tom Smith on the right (both names are the same).

There is no direct connection to the other Tom Smith, but there are several paths:

Paths with a depth = 2:
- Tom Smith → Ming Wu → Tom Smith
- Tom Smith → Ron Glass → Tom Smith
- Tom Smith → Tom Lee → Tom Smith
- Tom Smith → May Smith → Tom Smith

There are also some longer paths (e.g., Tom Smith → Ron Glass → Tom Lee → Tom Smith), but since they are longer, they are not of interest. Therefore, for the relationship (Tom Smith, Tom Smith), depth = 2 and count = 4.

The four factors (depth, count, match, hasPicture) are combined to compute an overall relevance score:

score = match * 100 + (4-depth) * 50 + count + hasPicture? 200 : 0

The clause hasPicture? 200 : 0 uses the ternary conditional operator. If hasPicture is TRUE, evaluate to 200. Otherwise, evaluate to 0.

To design the graph schema, consider what attributes are needed for each vertex and attribute. The User vertices need to have a first name, a last name, and a profile picture. We assume that the social network is stored in two data files, one for vertices and one for edges.

Vertex file format: id, firstname, lastname, img_url Edge file format: user1, user2

The following code creates the schema and loads the data:

name_model.gsql

CREATE VERTEX NameUser (PRIMARY_ID id string, firstname string, lastname string, imag_url string)
CREATE UNDIRECTED EDGE NameConn (FROM NameUser, TO NameUser)

CREATE GRAPH gsql_demo(*)

name_load.gsql

# define the loading job
CREATE LOADING JOB load_nameV FOR GRAPH gsql_demo {
  DEFINE FILENAME f1;
  LOAD f1 TO VERTEX NameUser VALUES ($0, $1, $2, $3);
}
CREATE LOADING JOB load_nameE FOR GRAPH gsql_demo {
  DEFINE FILENAME f2;
  LOAD f2 TO EDGE NameConn VALUES ($0, $1);
}

# load the data
RUN lOADING JOB load_nameV USING f1="../name/data/name_search_vertex.csv"
RUN LOADING JOB load_nameE USING f2="../name/data/name_search_edge.csv"

Test data files

name_search_vertex.csv

0,michael,jackson,
1,michael,franklin,abc.com
2,michael,lili,def.com
3,franklin,lili,
4,lucia,franklin,lucia.org
5,michael,jackson,
6,michael,jackson,abc.com
7,lucia,jackson,
8,hahah,jackson,haha.net

name_search_edge.csv

0,1
0,3
0,4
1,5
1,3
1,2
2,6
2,7
2,1
2,0
3,0
3,1
3,5
3,7
4,5
5,6
5,7
6,7
6,1
6,2
6,0
6,4
7,5
8,5

The query algorithm is a bit long but straightforward:

Select all the depth=1 neighbors. For each neighbor:
1. Use a CASE structure to check for matching first and last names and assign a match value.
2. Check for an image.
3. We know depth=1 and count=1, so compute the relevance score.
Starting from the depth=1 neighbors, move to the depth=2 neighbors. For each such neighbor:
1. Use a CASE structure to check for matching first and last names and assign a match value.
2. Use ACCUM to count up the paths.
3. Check for an image.
4. Depth=2. Compute the relevance score.

The complete query is shown below:

name_query.gsql

CREATE QUERY namesSimilar (vertex<NameUser> seed, string firstName, string lastName, int k)
FOR GRAPH gsql_demo
{
  # define a tuple to store neighbor score
  typedef tuple<vertex<NameUser> uid, float score> neighbor;
  # runtime variables used to compute neighbor score
  SumAccum<int> @count = 0;
  SumAccum<int> @depth = 0;
  SumAccum<int> @match= 0;
  SumAccum<float> @score = 0.0;
  SumAccum<int> @hasImgURL = 0;
 
  # global heap variable used to store final top k users, sorted by score field
  #  in the neighbor tuple
  HeapAccum<neighbor>(k, score DESC) @@finalTopKUsers;
 
  # starting user
  StartP = {seed};
 
  # flag first level neighbor with @depth = 1
  # count number of incoming connections
  # flag match category
  # flag img_url count greater than 0
  # finally, push the user and their score into global top-k heap.
  FirstLevelConnection = SELECT u
		FROM   StartP -(NameConn)-> :u
		ACCUM u.@depth = 1, u.@count += 1,
			CASE WHEN u.firstname == firstName AND u.lastname == lastName
					THEN  u.@match = 3
				WHEN u.firstname != firstName AND u.lastname == lastName
                    THEN  u.@match = 2
                WHEN u.firstname == firstName AND u.lastname != lastName
                    THEN u.@match = 1
			END,
			CASE WHEN u.imag_url != ""
					THEN u.@hasImgURL = 1
			END
        POST-ACCUM @@finalTopKUsers += neighbor(u, u.@match * 100 + (4-u.@depth) * 50 + u.@count + u.@hasImgURL*200);
 
  # similarly, do the topk heap update using second level neighbor
  SecondLevelConnection = SELECT u2
		FROM   FirstLevelConnection -(NameConn)-> :u2
		WHERE  u2 != seed AND  u2.@depth != 1
		ACCUM  u2.@depth = 2, u2.@count +=1,
			CASE WHEN u2.firstname == firstName AND u2.lastname == lastName
					THEN  u2.@match = 3
				WHEN u2.firstname != firstName AND u2.lastname == lastName
					THEN  u2.@match = 2
				WHEN u2.firstname == firstName AND u2.lastname != lastName
					THEN u2.@match = 1
			END,
			CASE WHEN u2.imag_url !=""
					THEN u2.@hasImgURL = 1
			END
		POST-ACCUM @@finalTopKUsers += neighbor(u2, u2.@match*100 + (4-u2.@depth)*50 + u2.@count + u2.@hasImgURL*200);
 
    #  print the result
    PRINT @@finalTopKUsers;
}

Query result

Starting from user 0, who is named "Michael Jackson", find the top 100 most similar persons, according to the scoring function described above.

//INSTALL QUERY namesSimilar
RUN QUERY namesSimilar (0,"michael","jackson",100)
{
  "error": false,
  "message": "",
  "version": {
    "edition": "developer",
    "schema": 0,
    "api": "v2"
  },
  "results": [{"@@finalTopKUsers": [
    {
      "uid": "6",
      "score": 651
    },
    {
      "uid": "2",
      "score": 451
    },
    {
      "uid": "1",
      "score": 451
    },
    {
      "uid": "5",
      "score": 404
    },
    {
      "uid": "4",
      "score": 351
    },
    {
      "uid": "7",
      "score": 303
    },
    {
      "uid": "3",
      "score": 151
    }
  ]}]
}

Example 5. Content-Based Filtering Recommendation of Videos

This example demonstrates conditional loading to be selective about which data records to load into which vertices or edges.

Input Data: A network of video programs, a set of tags which describe each video, and a set of users who have watched and rated videos.

Query Task: Recommend video programs that a given user might like.

Step 1: Create Graph Schema

The principle behind content-based recommendation is that people are often interested in products which have attributes similar to the ones which they have selected in the past. Suppose we have a video store. If the store tracks what videos each customer has selected in the past, and also records attributes about its videos, it can use this data to recommend more videos to the customer. Formally, for an input user (seed), first find which videos the user has watched. Then, from all the watched videos, find the top k attributes. From the top k attributes, find the top n videos that the seed user has not watched.

This suggests that we should have a graph with three types of vertices: user, video, and attributes (of a video). The schema is shown below.

video_model.gsql

CREATE VERTEX VidUser (PRIMARY_ID user_id uint, content string, date_time uint) 
CREATE VERTEX Video (PRIMARY_ID content_id uint, content string, date_time uint)
CREATE VERTEX AttributeTag (PRIMARY_ID tag_id string, content string, att_type uint)
CREATE UNDIRECTED EDGE Video_AttributeTag (FROM Video, TO AttributeTag, weight float DEFAULT 1.0, date_time uint)
CREATE UNDIRECTED EDGE User_Video (FROM VidUser, TO Video, rating float DEFAULT 1.0, date_time uint)

Then run

CREATE GRAPH gsql_demo(*)

Step 2: Load Input Data

In this example, there is one data file which contains data for all three type of vertices – VidUser, Video, and AttributeTag. The first field of each line indicates the vertex type. Similarly, there is one edge data file for two types of edges – User_Video and Video_AttributeTag. The WHERE clause is used to conditionally load only certain data into each type of vertex or edge. Further, these data files do not contain information for every attribute. When "_" is used in the VALUES list of a LOAD statement, it means not to load data from the input. The default value will be written (or it will remain as it is, if there is already a vertex or edge with that ID).

Test data files

video_recommendation_v.csv

type,id,content
User,0,
User,1,
User,2,
User,3,
Video,0,v0
Video,1,v1
Video,2,v2
Video,3,v3
Video,4,v4
Tag,action,
Tag,comedy,
Tag,mystery,
Tag,technical,

video_recommendation_e.csv

type,from,to,rating
UV,0,0,6.8
UV,0,2,5.2
UV,0,3,10.0
UV,1,1,1.2
UV,2,0,7.4
UV,3,0,6.6
UV,3,4,8.4
VA,0,action,
VA,0,comedy,
VA,1,mystery,
VA,2,technical,
VA,2,mystery,
VA,2,action,
VA,3,comedy,
VA,4,technical,
VA,4,action,

Loading jobs

video_load.gsql

# define the loading job
CREATE LOADING JOB load_videoV FOR GRAPH gsql_demo {
  DEFINE FILENAME f1;
  LOAD f1
    TO VERTEX VidUser VALUES  ($1,_,_) WHERE $0 == "User",
    TO VERTEX Video VALUES ($1,$2,_) WHERE $0 == "Video",
    TO VERTEX AttributeTag VALUES ($1,_,_) WHERE $0 == "Tag";
}
CREATE LOADING JOB load_videoE FOR GRAPH gsql_demo {
  DEFINE FILENAME f2;
  LOAD f2
    TO EDGE User_Video VALUES  ($1,$2,$3, _) WHERE $0 == "UV",
    TO EDGE Video_AttributeTag VALUES  ($1,$2,_, _) WHERE $0 == "VA";
}

# load the data
RUN LOADING JOB load_videoV USING f1="../video/data/video_recommendation_v.csv"
RUN LOADING JOB load_videoE USING f2="../video/data/video_recommendation_e.csv

Step 3: Query the data

The query has the three stages characteristic of content-based recommendation:

Find products (videos) previously selected
Find the top attributes of those products
Find the products which have the most attributes in common with the seed products

video_query.gsql

CREATE QUERY videoRecommendation (vertex<VidUser> seed, int k, int n) FOR GRAPH gsql_demo
{
  OrAccum          @viewedBySeed;
  SumAccum<float>  @score;
  
  Start = {seed};
 
  # get viewed videos
  Viewed = SELECT v
           FROM   Start -(User_Video:e)-> Video:v
           ACCUM  v.@viewedBySeed += true,
                  v.@score += e.rating;
 
  # get attribute
  Attribute = SELECT att
              FROM Viewed:v -(Video_AttributeTag)-> AttributeTag:att
              ACCUM att.@score += v.@score
              ORDER BY  att.@score
              LIMIT k;
  
 # get recommended videos
  Recommend = SELECT v
              FROM Attribute:att -(Video_AttributeTag)-> Video:v
              WHERE v.@viewedBySeed != true
              ACCUM v.@score += att.@score
              ORDER BY v.@score DESC
              LIMIT n;
 
  PRINT Recommend;
}

Query result

Recommend up to 10 videos to user 0, using the top 10 attributes from the client's favorite videos.

//INSTALL QUERY videoRecommendation
RUN QUERY videoRecommendation (0, 10, 10)
{
  "error": false,
  "message": "",
  "version": {
    "edition": "developer",
    "schema": 0,
    "api": "v2"
  },
  "results": [{"Recommend": [
      {
      "v_id": "4",
      "attributes": {
        "date_time": 0,
        "@score": 17.2,
        "@viewedBySeed": false,
        "content": "v4"
      },
      "v_type": "Video"
    },
    {
      "v_id": "1",
      "attributes": {
        "date_time": 0,
        "@score": 5.2,
        "@viewedBySeed": false,
        "content": "v1"
      },
      "v_type": "Video"
    }
  ]}]
}

Example 6. People You May Know

This example shows a technique for passing intermediate results from one stage to another.

Input Data : A social network with weighted connections.

Query Task: Recommend the Top K people you may know but who are not yet in your set of connections. Scoring is based on a variation of cosine similarity of two users:

    tgt.@edge_strength = e.strength

This is a way to "transport" a value as the query travels through the graph .

The graph schema and loading jobs:

Step 1

people_model.gsql

CREATE VERTEX Person (PRIMARY_ID id uint)
CREATE UNDIRECTED EDGE PersonConn (FROM Person, TO Person, strength float)

This example shows that the computation of a moderately complex formula is simple in the GSQL language. It also demonstrates a technique of copying an attribute from an edge or a source vertex to the (temporary) accumulator of the edge's target vertex:

Then

CREATE GRAPH gsql_demo(*)

Step 2

people_load.gsql

# define the loading jobs
CREATE LOADING JOB load_peopleV FOR GRAPH gsql_demo {
  DEFINE FILENAME f1;
  LOAD f1 TO VERTEX Person VALUES ($0);
}
CREATE LOADING JOB load_peopleE FOR GRAPH gsql_demo {
  DEFINE FILENAME f2;
  LOAD f2 TO EDGE PersonConn VALUES ($0,$1,$2);
}

# load the data
RUN LOADING JOB load_peopleV USING f1="../people/data/people_user.dat"
RUN LOADING JOB load_peopleE USING f2="../people/data/people_conn.dat"

Test data:

people_user.dat

people_conn.dat

1,2,0.6
2,3,0.5
2,6,0.5
3,6,0.3
3,5,0.2
3,4,0.8
5,8,0.8
6,8,0.2

Step 3

If you have worked through the previous examples, you perhaps can now see that we need a two-stage query: from A to A's neighbors, and then from A's neighbors to their neighbors. Also, you may realize that we will use the ACCUM clause to perform summation in the second stage. But, how will we know during the second stage what was the strength of the first stage edge? By storing a copy of the edge's weight in an accumulator attached to the edge's target vertex, which becomes a source vertex in the second stage.

people_query.gsql

CREATE QUERY peopleYouMayKnow(vertex<Person> startP, int TopK) FOR GRAPH gsql_demo
{
  SumAccum<float> @edge_strength = 0;
  SumAccum<int>   @depth = 0;
  SumAccum<float> @sum = 0;
  SumAccum<float> @score = 0;
  SumAccum<int>   @@startPdegree = 0;
 
  Start = {startP};
  L1 = SELECT tgt
        FROM Start:src-(PersonConn:e)->Person:tgt
        ACCUM tgt.@edge_strength = e.strength, tgt.@depth=1, # copy edge strength to target vertex
                  @@startPdegree += src.outdegree(); #  save seed outdegree
 
  # second level connections
  L2 = SELECT tgt2
        FROM L1:u-(PersonConn:e)->Person:tgt2
        WHERE tgt2 != startP AND tgt2.@depth != 1
        ACCUM tgt2.@sum += u.@edge_strength*e.strength
        POST-ACCUM tgt2.@score += tgt2.@sum/(@@startPdegree * tgt2.outdegree() )
        ORDER BY tgt2.@score DESC
        LIMIT TopK;
  #PRINT L2.@score;         # JSON output API version v1
  PRINT L2 [L2.@score];    # JSON output API version v2
}

In JSON output API v2, the PRINT syntax for a vertex set variable is different than the v1 syntax.

Query result:

Recommend up to 10 persons whom Person 1 might like to get to know.

# INSTALL QUERY peopleYouMayKnow
RUN QUERY peopleYouMayKnow (1, 10)
{
  "error": false,
  "message": "",
  "version": {
    "edition": "developer",
    "schema": 0,
    "api": "v2"
  },
  "results": [{"L2": [
    {
      "v_id": "6",
      "attributes": {"L2.@score": 0.1},
      "v_type": "Person"
    },
    {
      "v_id": "3",
      "attributes": {"L2.@score": 0.075},
      "v_type": "Person"
    }
  ]}]
}

Input Data: A social network in which each user has two attributes (besides their name): the time that they joined the network, and a boolean flag which says whether they are active or not.

Query Tasks: We show several query examples, making use the the time attribute and directed links between users.

Part 1: Create Graph

social_model.gsql

CREATE VERTEX SocialUser (PRIMARY_ID uid string, name string, isActive bool, registration_timestamp uint)
CREATE DIRECTED EDGE SocialConn (FROM SocialUser, TO SocialUser) WITH REVERSE_EDGE="reverse_conn" #

Then

CREATE GRAPH gsql_demo(*)

Part 2: Load Data

Test data:

social_users.csv

id,name,active,timestamp
0,luke,1,1400000000
1,john,1,1410000000
2,matthew,0,1420000000
3,mark,1,143000000
4,paul,1,144000000
5,steven,0,145000000
6,peter,1,146000000
7,james,1,147000000
8,joseph,1,148000000
9,thomas,1,149000000

social_connections.csv

0,1
0,2
0,3
0,4
0,5
1,3
1,4
1,5
1,6
0,7
7,0
7,3
7,4
7,5
0,8
8,3
8,4
0,9
9,3

We have two data files. For variety, we will create two loading jobs, one for each file. Moreover, we will define the specific file locations in the loading jobs themselves. Then, it is not necessary to provide the filepaths in the RUN LOADING JOB statements. Also, the file social_users.csv has a header, so we can use the column headings to refer to the columns.

social_load.gsql

# define the loading job
CREATE LOADING JOB load_social1 FOR GRAPH gsql_demo {
  LOAD "../social/data/social_users.csv"
    TO VERTEX SocialUser VALUES ($"id",$"name",$"active",$"timestamp")
    USING HEADER="true", QUOTE="double";
}
CREATE LOADING JOB load_social2 FOR GRAPH gsql_demo {
  LOAD "../social/data/social_connection.csv"
    TO EDGE SocialConn VALUES ($0, $1);
}

# load the data
RUN LOADING JOB load_social1
RUN LOADING JOB load_social2

Part 3 : Create, install, and run queries.

This case study presents four queries and their results, one at a time, so there are four separate "INSTALL QUERY" commands. Alternately, all four can be installed at once, which will execute faster than separate install commands: INSTALL QUERY socialFromUser, socialToUser, socialMutualConnections, socialOneWay

INSTALL QUERY ALL

Q1 (socialFromUser): find users who have a direct connection from a given input user, with some filtering conditions on the candidate users' attributes

socialFromUser from social_query.gsql

CREATE QUERY socialFromUser(vertex<SocialUser> uid, bool is_active, int reg_time_min,
int reg_time_max, int k) FOR GRAPH gsql_demo
{
        L0={uid};
    RESULT = SELECT tgt
             FROM L0:u-(SocialConn)->SocialUser:tgt
             WHERE tgt.registration_timestamp >= reg_time_min AND
                   tgt.registration_timestamp <= reg_time_max AND
                   tgt.isActive == is_active
             LIMIT k;
    PRINT RESULT;
}

Test query and result:

#INSTALL QUERY socialFromUser
RUN QUERY socialFromUser("0", "true", 0, 147000000, 10)
{
  "error": false,
  "message": "",
  "version": {
    "edition": "developer",
    "schema": 0,
    "api": "v2"
  },
  "results": [{"RESULT": [
    {
      "v_id": "7",
      "attributes": {
        "registration_timestamp": 147000000,
        "name": "james",
        "isActive": true
      },
      "v_type": "SocialUser"
    },
    {
      "v_id": "4",
      "attributes": {
        "registration_timestamp": 144000000,
        "name": "paul",
        "isActive": true
      },
      "v_type": "SocialUser"
    },
    {
      "v_id": "3",
      "attributes": {
        "registration_timestamp": 143000000,
        "name": "mark",
        "isActive": true
      },
      "v_type": "SocialUser"
    }
  ]}]
}

Q2 (socialToUser): similar to Q1, but return users who have a connection pointing to the input user.

socialToUser from social_query.gsql

CREATE QUERY socialToUser(vertex<SocialUser> uid, bool is_active, int reg_time_min,
int reg_time_max, int K) FOR GRAPH gsql_demo
{
        L0={uid};
    Result = SELECT tgt
             FROM L0:u-(reverse_conn)->SocialUser:tgt
             WHERE tgt.registration_timestamp >= reg_time_min AND
                   tgt.registration_timestamp <= reg_time_max AND
                   tgt.isActive == is_active
             LIMIT K;
    PRINT Result;
}

Test query and result:

#INSTALL QUERY socialToUser
RUN QUERY socialToUser("4", "true", 0, 150000000, 10)
{
  "error": false,
  "message": "",
  "version": {
    "edition": "developer",
    "schema": 0,
    "api": "v2"
  },
  "results": [{"Result": [
    {
      "v_id": "8",
      "attributes": {
        "registration_timestamp": 148000000,
        "name": "joseph",
        "isActive": true
      },
      "v_type": "SocialUser"
    },
    {
      "v_id": "7",
      "attributes": {
        "registration_timestamp": 147000000,
        "name": "james",
        "isActive": true
      },
      "v_type": "SocialUser"
    }
  ]}]
}

Q3 (socialMutualConnections): return the set of users who have connections from both input user A and input user B.

socialMutualConnection from social_query.gsql

CREATE QUERY socialMutualConnections(vertex<SocialUser> uid1, vertex<SocialUser> uid2, bool is_active,
int reg_time_min, int reg_time_max, int k) FOR GRAPH gsql_demo
{
    SumAccum<int> @cnt =0;

    Start = {uid1,uid2};
    Result = SELECT tgt
             FROM  Start-(SocialConn)->SocialUser:tgt
             WHERE tgt.registration_timestamp >= reg_time_min AND
                   tgt.registration_timestamp <= reg_time_max AND
                   tgt.isActive == is_active
             ACCUM tgt.@cnt +=1
             HAVING tgt.@cnt == 2
             LIMIT k;
PRINT Result;
}

Test query and result:

#INSTALL QUERY socialMutualConnections
RUN QUERY socialMutualConnections("1", "7", "false", 0, 2000000000, 10)
{
  "error": false,
  "message": "",
  "version": {
    "edition": "developer",
    "schema": 0,
    "api": "v2"
  },
  "results": [{"Result": [{
    "v_id": "5",
    "attributes": {
      "registration_timestamp": 145000000,
      "@cnt": 2,
      "name": "steven",
      "isActive": false
    },
    "v_type": "SocialUser"
  }]}]
}

Q4 (socialOneWay): find all A->B user relationships such that there is an edge from A to B but there is no edge from B to A, and also requires that A and B connect to at least some number of common friends.

socialOneWay from social_query.gsql

CREATE QUERY socialOneWay(int mutual_contacts_min = 5, int mutual_contacts_max = 10)
FOR GRAPH gsql_demo
{
    typedef tuple<vertex<SocialUser> id, string name, int cnt> recTuple;
    # SumAccum<list<recTuple>> @recList; # v0.1 to v0.1.2
    ListAccum<recTuple> @recList; # v0.2

    Start = {SocialUser.*};

    Result = SELECT B
        FROM Start:A-(SocialConn)->SocialUser:B
          #  B.neighbors() is a built-in function which returns the list of neighbors which B points to.
          #  B.neighbors('edgeType1') returns only the neighbors connected by the given edge type.
        WHERE B NOT IN A.neighbors("reverse_conn") AND
            COUNT(A.neighbors("SocialConn") INTERSECT B.neighbors("SocialConn")) >= mutual_contacts_min AND
            COUNT(A.neighbors("SocialConn") INTERSECT B.neighbors("SocialConn")) <= mutual_contacts_max
        ACCUM  B.@recList += recTuple(A, A.name, COUNT(A.neighbors("SocialConn") INTERSECT B.neighbors("SocialConn")));
    PRINT Result; # the result includes B's static attributes and B.@fromName.
}

Test query and result: There are three such pairs

From vertex 0 to 1. Vertices 0 and 1 have 3 neighbors in common.
From vertex 0 to 8. Vertices 0 and 8 have 2 neighbors in common.
From vertex 0 to 9. Vertices 0 and 9 have 1 neighbor in common.

//INSTALL QUERY socialOneWay
RUN QUERY socialOneWay(1,10)
{
  "error": false,
  "message": "",
  "version": {
    "edition": "developer",
    "schema": 0,
    "api": "v2"
  },
  "results": [{"Result": [
    {
      "v_id": "8",
      "attributes": {
        "registration_timestamp": 148000000,
        "name": "joseph",
        "isActive": true,
        "@recList": [{
          "name": "luke",
          "cnt": 2,
          "id": "0"
        }]
      },
      "v_type": "SocialUser"
    },
    {
      "v_id": "9",
      "attributes": {
        "registration_timestamp": 149000000,
        "name": "thomas",
        "isActive": true,
        "@recList": [{
          "name": "luke",
          "cnt": 1,
          "id": "0"
        }]
      },
      "v_type": "SocialUser"
    },
    {
      "v_id": "1",
      "attributes": {
        "registration_timestamp": 1410000000,
        "name": "john",
        "isActive": true,
        "@recList": [{
          "name": "luke",
          "cnt": 3,
          "id": "0"
        }]
      },
      "v_type": "SocialUser"
    }
  ]}]
}

Suggested variant query:

socialTwoWay: Find all A↔B relationships such that there are connected edges both from A to B and from B to A, and A and B have some common neighbors.

Test query and result:

There is one such pair (0, 7), but the query reports it twice: first as (7, 0) and then as (0, 7). Vertices 7 and 0 have 3 neighbors in common.

RUN QUERY socialTwoWay(1,10)
{
  "error": false,
  "message": "",
  "version": {
    "edition": "developer",
    "schema": 0,
    "api": "v2"
  },
  "results": [{"Result": [
    {
      "v_id": "1",
      "attributes": {
        "registration_timestamp": 1400000000,
        "name": "luke",
        "isActive": true,
        "@recList": [{
          "name": "james",
          "cnt": 3,
          "id": "7"
        }]
      },
      "v_type": "SocialUser"
    },
    {
      "v_id": "7",
      "attributes": {
        "registration_timestamp": 147000000,
        "name": "james",
        "isActive": true,
        "@recList": [{
          "name": "luke",
          "cnt": 3,
          "id": "0"
        }]
      },
      "v_type": "SocialUser"
    }
  ]}]
}