In the original PageRank, the damping factor is the probability that the surfer continues browsing at each step. The surfer may also stop browsing and start again from a random vertex. In personalized PageRank, the surfer can only start browsing from a given set of source vertices, both at the beginning and after stopping.
We ran Personalized PageRank on the graph social10 using Friend edges with the following parameter values:
In this case, the random walker can only start or restart walking from Fiona. In the figure below, we see that Fiona has the highest PageRank score in the result. Ivy and George have the next highest scores because they are direct out-neighbors of Fiona and there are looping paths that lead back to them again. Half of the vertices have a score of 0 since they cannot be reached from Fiona.
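The restart behavior is easy to see in a small sketch. The following Python snippet is a conceptual illustration only (not the library's GSQL implementation; the toy graph is loosely modeled on the Fiona example above), with the teleport mass restricted to the source set:

```python
# Personalized PageRank sketch: teleportation returns only to the source set.
# Conceptual illustration, not the library's GSQL implementation.

def personalized_pagerank(out_nbrs, sources, damping=0.85, max_iter=25, max_change=0.001):
    verts = list(out_nbrs)
    score = {v: (1.0 if v in sources else 0.0) for v in verts}
    restart = 1.0 / len(sources)
    for _ in range(max_iter):
        received = {v: 0.0 for v in verts}
        for v in verts:
            if out_nbrs[v]:
                share = score[v] / len(out_nbrs[v])
                for t in out_nbrs[v]:
                    received[t] += share
        # (1 - damping) of the mass restarts at the source vertices only.
        new_score = {v: damping * received[v]
                        + ((1 - damping) * restart if v in sources else 0.0)
                     for v in verts}
        if max(abs(new_score[v] - score[v]) for v in verts) <= max_change:
            return new_score
        score = new_score
    return score

# Vertices unreachable from the source set keep a score of 0.
g = {"Fiona": ["George", "Ivy"], "George": ["Ivy"], "Ivy": ["Fiona"],
     "Alex": ["Bob"], "Bob": []}
print(personalized_pagerank(g, sources={"Fiona"}))
```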
Characteristic | Value |
Result | Computes a personalized PageRank value (FLOAT type) for each vertex. |
Input Parameters | SET<VERTEX> source: set of seed vertices. STRING e_type: name of the edge type to use. FLOAT max_change: PageRank stops iterating when the largest difference between any vertex's current score and its previous score is ≤ max_change; that is, the scores have become very stable and change by less than max_change from one iteration to the next. INT max_iter: maximum number of iterations. FLOAT damping: fraction of the score that is due to the score of neighbors; the balance (1 - damping) is a minimum baseline score that every vertex receives. INT top_k: sort the scores highest first and output only this many. BOOL print_accum: if true, output JSON to standard output. STRING result_attr: if not empty, store the PageRank values (FLOAT) in this attribute. STRING file_path: if not empty, write the output to this file. |
Result Size | V = number of vertices |
Time Complexity | O(E*k), E = number of edges, k = number of iterations. The number of iterations is data-dependent, but the user can set a maximum. Parallel processing reduces the time needed for computation. |
Graph Types | Directed edges |
The PageRank algorithm measures the influence of each vertex on every other vertex. PageRank influence is defined recursively: a vertex's influence is based on the influence of the vertices which refer to it. A vertex's influence tends to increase if (1) it has more referring vertices or if (2) its referring vertices have higher influence. The analogy to social influence is clear.
A common way of interpreting PageRank value is through the Random Network Surfer model. A vertex's PageRank score is proportional to the probability that a random network surfer will be at that vertex at any given time. A vertex with a high PageRank score is a vertex that is frequently visited, assuming that vertices are visited according to the following Random Surfer scheme:
Assume a person travels or surfs across a network's structure, moving from vertex to vertex in a long series of rounds.
The surfer can start anywhere. This start-anywhere property is part of the magic of PageRank, meaning the score is a truly fundamental property of the graph structure itself.
Each round, the surfer randomly picks one of the outward connections from the surfer's current location. The surfer repeats this random walk for a long time.
But wait. The surfer doesn't always follow the network's connection structure. There is a probability (1-damping, to be precise), that the surfer will ignore the structure and will magically teleport to a random vertex.
For more information, see the Google paper on PageRank.
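The random surfer scheme translates directly into power iteration. The sketch below is a conceptual illustration in Python (not the library's GSQL implementation); it uses the same convention the example below does, where every vertex gets the baseline (1 - damping):

```python
# Random-surfer PageRank sketch (power iteration).
# Conceptual illustration, not the library's GSQL implementation.

def pagerank(out_nbrs, damping=0.85, max_iter=25, max_change=0.001):
    verts = list(out_nbrs)
    score = {v: 1.0 for v in verts}            # this version starts every vertex at 1
    for _ in range(max_iter):
        received = {v: 0.0 for v in verts}
        for v in verts:
            for t in out_nbrs[v]:              # surfer follows a random out-edge
                received[t] += score[v] / len(out_nbrs[v])
        # With probability (1 - damping) the surfer teleports to a random vertex,
        # which gives every vertex the baseline (1 - damping).
        new_score = {v: (1 - damping) + damping * received[v] for v in verts}
        if max(abs(new_score[v] - score[v]) for v in verts) <= max_change:
            return new_score
        score = new_score
    return score
```

A vertex with no in-edges receives nothing, so its score settles at (1 - damping), which is the 0.15 seen for Alex in the example below.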
We ran pageRank on our test10 graph (using Friend edges) with the following parameter values: damping=0.85, max_change=0.001, and max_iter=25. We see that Ivy (center bottom) has the highest pageRank score (1.12). This makes sense since there are 3 neighboring persons who point to Ivy, more than for any other person. Eddie and Justin have scores of exactly 1 because they do not have any out-edges. This is an artifact of our particular version of pageRank. Likewise, Alex has a score of 0.15, which is (1-damping), because Alex has no in-edges.
Characteristic | Value |
Result | Computes a PageRank value (FLOAT type) for each vertex. |
Input Parameters | STRING v_type: name of the vertex type to use. STRING e_type: name of the edge type to use. FLOAT max_change: PageRank stops iterating when the largest difference between any vertex's current score and its previous score is ≤ max_change; that is, the scores have become very stable and change by less than max_change from one iteration to the next. INT max_iter: maximum number of iterations. FLOAT damping: fraction of the score that is due to the score of neighbors; the balance (1 - damping) is a minimum baseline score that every vertex receives. INT top_k: sort the scores highest first and output only this many. BOOL print_accum: if true, output JSON to standard output. STRING result_attr: if not empty, store the PageRank values (FLOAT) in this attribute. STRING file_path: if not empty, write the output to this file. BOOL display_edges: if true, include the graph's edges in the JSON output, so that the full graph can be displayed. |
Result Size | V = number of vertices |
Time Complexity | O(E*k), E = number of edges, k = number of iterations. The number of iterations is data-dependent, but the user can set a maximum. Parallel processing reduces the time needed for computation. |
Graph Types | Directed edges |
Degree centrality is defined as the number of edges incident upon a node (i.e., the number of ties that a node has). The degree can be interpreted in terms of the immediate risk of a node for catching whatever is flowing through the network (such as a virus, or some information).
The vertices with the highest degree centrality scores along with their scores.
Suppose we have the following graph:
Running the query on the graph will show that Dan has the highest degree centrality.
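Counting degrees is straightforward. The tiny Python sketch below (a conceptual illustration; the edge list is made up, and the in_degree/out_degree toggles mirror the parameters documented later) shows the idea:

```python
# Degree centrality sketch: count incident edges per vertex.
# Conceptual illustration; the edge list and names are made up.
from collections import Counter

edges = [("Dan", "Tom"), ("Dan", "Jenny"), ("Dan", "Amily"), ("Tom", "Jenny")]

in_deg, out_deg = Counter(), Counter()
for s, t in edges:              # treat each pair as a directed edge s -> t
    out_deg[s] += 1
    in_deg[t] += 1

# With in_degree and out_degree both true, the score is the total degree.
total = in_deg + out_deg
print(total.most_common(3))     # top_k = 3
```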
TigerGraph In-Database Graph Data Science Library is a collection of expertly written GSQL queries, each of which implements a standard graph algorithm. Each algorithm is ready to be installed and used, either as a stand-alone query or as a building block of a larger analytics application.
We renamed our library (formerly known as GSQL Graph Algorithm Library) to emphasize our focus on graph data science. As the world’s only scalable graph analytics platform, TigerGraph is committed to providing the best graph analytics framework for data scientists.
GSQL running on the TigerGraph platform is particularly well-suited for graph algorithms for several reasons:
Turing-complete with full support for imperative and procedural programming, ideal for algorithmic computation.
Parallel and Distributed Processing, enabling computations on larger graphs.
User-Extensible. Because the algorithms are written in standard GSQL and compiled by the user, they are easy to modify and customize.
Open-Source. Users can study the GSQL implementations to learn by example, and they can develop and submit additions to the library.
The library contains two folders: algorithms and graphs.

algorithms

The algorithms folder contains the GSQL implementations of all the graph algorithms offered by the library. Within the algorithms folder are six subfolders that group the algorithms into six classes:

graphs

The graphs folder contains small sample graphs that you can use to experiment with the algorithms. In this document, we use the test graphs to show you the expected result for each algorithm. The graphs are small enough that you can manually calculate and sometimes intuitively see what the answers should be.
Starting with TigerGraph product version 2.6, the Library has release branches:
Product version branches (2.6, 3.0, etc.) are snapshots created shortly after a product version is released. They contain the best version of the graph algorithm library at the time of that product version's initial release. They will not be updated, except to fix bugs.
Master branch: the newest released version. This should be at least as new as the newest product version branch, and it may contain new or improved algorithms.
Other branches are development branches.
It is possible to run newer algorithms on an older product version, as long as the algorithm does not rely on features available only in newer product versions.
All GSQL graph algorithms are schema-free, which means they are ready to use with any graph, regardless of the graph's data model or schema. The algorithms have run-time input parameters for the vertex type(s), edge type(s), and attributes which the user wishes to use.
Installing a query also creates a REST endpoint. The same query could be run thus:
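As a sketch of what such a call can look like, the Python snippet below uses the requests library against a hypothetical local instance (the graph name social10, the query name tg_pagerank, and the parameter values are illustrative; the default RESTPP port 9000 is assumed):

```python
# Calling an installed GSQL query through its REST endpoint.
# Hypothetical host, graph, and query names; assumes the default RESTPP port (9000).
import requests

resp = requests.get(
    "http://localhost:9000/query/social10/tg_pagerank",
    params={"v_type": "Person", "e_type": "Friend", "max_change": 0.001,
            "max_iter": 25, "damping": 0.85, "top_k": 10, "print_accum": "true"},
)
print(resp.json())   # the query's PRINT statements come back as JSON
```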
GSQL lets you run queries from within other queries. This means you can use a library algorithm as a building block for more complex analytics.
The Betweenness Centrality of a vertex is defined as the number of shortest paths that pass through this vertex, divided by the total number of shortest paths. That is,

BC(v) = \sum_{s \neq v \neq t} \delta_{st}(v), \qquad \delta_{st}(v) = \frac{\sigma_{st}(v)}{\sigma_{st}}

where \delta_{st}(v) is called the pair dependency, \sigma_{st} is the total number of shortest paths from node s to node t, and \sigma_{st}(v) is the number of those paths that pass through v.

The TigerGraph implementation is based on A Faster Algorithm for Betweenness Centrality by Ulrik Brandes, Journal of Mathematical Sociology 25(2):163-177, (2001). For every vertex s in the graph, the pair dependency starting from vertex s to all other vertices t via all other vertices v is computed first:

\delta_{s\bullet}(v) = \sum_{t \in V} \delta_{st}(v)

Then betweenness centrality is computed as

BC(v) = \sum_{s \in V} \delta_{s\bullet}(v)

According to Brandes, the accumulated pair dependency can be calculated as

\delta_{s\bullet}(v) = \sum_{w :\, v \in P_s(w)} \frac{\sigma_{sv}}{\sigma_{sw}} \left( 1 + \delta_{s\bullet}(w) \right)

where P_s(w), the set of predecessors of vertex w on shortest paths from s, is defined as

P_s(w) = \{ u \in V : (u, w) \in E,\; d(s, w) = d(s, u) + 1 \}

For every vertex, the algorithm works in two phases. The first phase calculates the number of shortest paths passing through each vertex. Then, starting from the vertices on the outermost layer with an initial pair dependency of 0, it traverses back toward the starting vertex, accumulating dependencies along the way.
This algorithm query employs a subquery called bc_subquery. Both queries are needed to run the algorithm.
In the example below, Claire is in the very center of the graph and has the highest betweenness centrality. Six shortest paths pass through Sam (i.e. paths from Victor to all other 6 people except for Sam and Victor), so the score of Sam is 6. David also has a score of 6, since Brian has 6 paths to other people that pass through David.
In the following example, both Charles and David have 9 shortest paths passing through them. Ellen is in a similar position as Charles, but her centrality is weakened due to the path between Frank and Jack.
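The two phases map naturally onto a short Python sketch. The following is a conceptual illustration of Brandes' method for an unweighted, undirected graph given as an adjacency dict (not the library's GSQL implementation):

```python
# Brandes' betweenness centrality sketch for an unweighted, undirected graph.
# Conceptual illustration of the two phases described above.
from collections import deque

def brandes(nbrs):
    bc = {v: 0.0 for v in nbrs}
    for s in nbrs:
        # Phase 1: BFS from s, counting shortest paths (sigma) and predecessors.
        sigma = {v: 0 for v in nbrs}; sigma[s] = 1
        dist = {v: -1 for v in nbrs}; dist[s] = 0
        preds = {v: [] for v in nbrs}
        order, q = [], deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in nbrs[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: walk back from the outermost layer, accumulating
        # pair dependencies (delta).
        delta = {v: 0.0 for v in nbrs}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # undirected: each pair counted twice
```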
You can download the library from Github:
To use an algorithm, the algorithm (GSQL query) must first be installed. If your database is on a distributed cluster, you should use the DISTRIBUTED option when installing the query to install it in distributed mode.

Running an algorithm is the same as running a GSQL query. For example, if you selected the JSON output option, you could run it from GSQL as below:
Name | Description | Data type |
v_type | A set of vertex types. | SET<STRING> |
e_type | A set of edge types. | SET<STRING> |
re_type | A set of reverse edge types. If an edge is undirected, put the edge name in the set as well. | SET<STRING> |
in_degree | Whether to count incoming edges as part of a vertex's degree centrality. | BOOL |
out_degree | Whether to count outgoing edges as part of a vertex's degree centrality. | BOOL |
top_k | The number of vertices with the highest scores to return. | INT |
print_accum | If true, print results to JSON output. | BOOL |
result_attr | If not empty, save the degree centrality score of each vertex to this attribute. | STRING |
file_attr | If not empty, save results in CSV to this file. | STRING |
Characteristic | Value |
Result | Computes a Betweenness Centrality value (FLOAT type) for each vertex. |
Required Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(E*V), E = number of edges, V = number of vertices. Considering the high time cost of running this algorithm on a big graph, users can set a maximum number of iterations. Parallel processing reduces the time needed for computation. |
Graph Types | Undirected edges, Unweighted edges |
Characteristic | Value |
Result | Computes a weighted PageRank value (FLOAT type) for each vertex. |
Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(E*k), E = number of edges, k = number of iterations. The number of iterations is data-dependent, but the user can set a maximum. Parallel processing reduces the time needed for computation. |
Graph Types | Directed edges |
Eigenvector centrality (also called eigencentrality or prestige score) is a measure of the influence of a vertex in a network. Relative scores are assigned to all vertices in the network based on the concept that connections to high-scoring vertices contribute more to the score of the vertex in question than equal connections to low-scoring vertices. A high eigenvector score means that a vertex is connected to many vertices who themselves have high scores.
For more information, see Eigenvector centrality.
The vertices with the highest eigenvector centrality scores, along with their scores.
Suppose we have the following graph:
Running the algorithm on the graph will show that Dan has the highest centrality score.
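Eigenvector centrality is typically computed by power iteration: repeatedly sum neighbor scores and rescale. The sketch below is a conceptual Python illustration (not the library's GSQL implementation; the convergence settings echo the maxIter and convLimit parameters documented later):

```python
# Eigenvector centrality sketch via power iteration.
# Conceptual illustration only.
import math

def eigenvector_centrality(nbrs, max_iter=100, conv_limit=1e-6):
    score = {v: 1.0 for v in nbrs}
    for _ in range(max_iter):
        # Each vertex receives the scores of its neighbors...
        new = {v: sum(score[u] for u in nbrs[v]) for v in nbrs}
        # ...then scores are rescaled to unit length.
        norm = math.sqrt(sum(x * x for x in new.values())) or 1.0
        new = {v: x / norm for v, x in new.items()}
        if sum(abs(new[v] - score[v]) for v in nbrs) < conv_limit:
            return new
        score = new
    return score
```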
In the Closeness Centrality algorithm, to obtain the closeness centrality score for a vertex, we measure the distance from the source vertex to every single vertex in the graph. In large graphs, running this calculation for every vertex can be highly time-consuming.
The Approximate Closeness Centrality algorithm (based on Cohen et al. 2014) calculates the approximate closeness centrality score for each vertex by combining two estimation approaches - sampling and pivoting. This hybrid estimation approach offers near-linear time processing and linear space overhead within a small relative error. It runs on graphs with unweighted edges (directed or undirected).
This query uses another subquery, closeness_cent_approx_sub, which needs to be installed before closeness_approx can be installed.
The result is a list of all vertices in the graph with their approximate closeness centrality score. It is available both in JSON and CSV format.
Below is an example of running the algorithm on the social10 test graph and an excerpt of the response.
We all have an intuitive understanding when we say a home, an office, or a store is "centrally located." Closeness Centrality provides a precise measure of how "centrally located" a vertex is. The steps below show the process for one vertex v:

This algorithm query employs a subquery called cc_subquery. Both queries are needed to run the algorithm.

Closeness centrality can be measured for either directed edges (from v to others) or for undirected edges. Directed graphs may seem less intuitive, however, because if the distance from Alex to Bob is 1, it does not mean the distance from Bob to Alex is also 1.

For our example, we wanted to use the topology of the Likes graph, but to have undirected edges. We emulated an undirected graph by using both Friend and Also_Friend (reverse-direction) edges.
This algorithm assigns a unique integer value known as its color to the vertices of a graph, such that no neighboring vertices share the same color. This task is called coloring because it is equivalent to assigning a color to each nation on a map so that no neighboring nations share the same color.

Given a set of k vertices, the algorithm first colors all vertices with the same color, the first color. It then starts from all the vertices and has each vertex send its own color to its neighbors. If two neighboring vertices have the same color, the algorithm reassigns colors where there is a conflict. The same process is repeated until all conflicts are resolved.
The algorithm has a worst-case time complexity of O(V^2 + E), where V is the number of vertices and E is the number of edges.
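A sequential conflict-resolution sketch of the same idea is shown below in Python (a conceptual illustration, not the library's GSQL implementation; the fixed vertex order is an assumption made to keep the loop deterministic):

```python
# Greedy conflict-resolution coloring sketch: start everyone on color 0,
# then recolor conflicting vertices to the smallest available color.
# Conceptual illustration, not the library's GSQL implementation.

def greedy_color(nbrs):
    color = {v: 0 for v in nbrs}            # everyone starts with the first color
    changed = True
    while changed:
        changed = False
        for v in sorted(nbrs):              # fixed order keeps the loop deterministic
            if any(color[u] == color[v] for u in nbrs[v]):
                used = {color[u] for u in nbrs[v]}
                # Pick the smallest color not used by any neighbor.
                color[v] = min(c for c in range(len(nbrs) + 1) if c not in used)
                changed = True
    return color
```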
TigerGraph's closeness centrality algorithm uses multi-source breadth-first search (MS-BFS) to traverse the graph and calculate the sum of a vertex's distance to every other vertex in the graph, which vastly improves the performance of the algorithm. The algorithm's implementation of MS-BFS is based on the paper .
On the graph, say we want to color the Person vertices in such a way that any two vertices that are connected by either a Friend edge or a Coworker edge do not have the same color. By running the greedy_graph_color algorithm, we get the following result:
Name | Description | Data type |
v_type | Vertex types to assign scores to. | SET<STRING> |
e_type | Edge types to traverse. | SET<STRING> |
maxIter | Maximum number of iterations. | INT |
convLimit | The convergence limit. | FLOAT |
top_k | The number of vertices with the highest scores to return. | INT |
print_accum | If true, print results to JSON output. | BOOL |
result_attr | If not empty, save the score of each vertex to this attribute. | STRING |
file_path | If not empty, print results in CSV to this file. | STRING |
Name | Description |
v_type | Vertex types to calculate approximate closeness centrality for. |
e_type | Edge types to traverse. |
k | Size of the sample. |
max_hops | Upper limit on how many hops the algorithm will take from each vertex. |
epsilon | The maximum relative error, between 0 and 1. A lower value produces a more accurate estimate but increases run time. |
print_accum | Whether to output to the console in JSON format. |
file_path | If provided, the algorithm will output to this file path in CSV format. |
debug | Controls the query's conditional logging. If the input is 0, nothing is logged. If the input is 1, everything is logged except the breadth-first search from the sample nodes. If the input is 2, everything is logged. |
sample_index | The algorithm partitions the graph based on the sample size. This index indicates which partition to use when estimating closeness centrality. |
maxsize | If the number of vertices in the graph is lower than maxsize, the exact closeness centrality is calculated instead and nothing is approximated. |
wf | Whether to use the Wasserman and Faust formula to calculate closeness centrality rather than the classic formula. |
Characteristic | Value |
Result | Computes a Closeness Centrality value (FLOAT type) for each vertex. |
Required Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(E), E = number of edges. Parallel processing reduces the time needed for computation. |
Graph Types | Directed or Undirected edges, Unweighted edges |
Name | Description |
| A set of all vertex types to color. |
| A set of all edge types to traverse. |
| The maximum number of colors that can be used. Use a large number like 999999 unless there is a strict limit. |
| If set to true, the total number of colors used will be displayed |
| If set to true, the output will display all vertices and their associated color |
| If a file path is provided, the output will be saved to the file indicated by the file path in CSV format. |
Step | Mathematical Formula |
1. Compute the average distance from vertex v to every other vertex: | avg_dist(v) = (\sum_{u \neq v} d(v, u)) / (n - 1) |
2. Invert the average distance, so we have the average closeness of v: | closeness(v) = (n - 1) / (\sum_{u \neq v} d(v, u)) |
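For unweighted edges, a single BFS from v yields every d(v, u), so the two steps above reduce to a few lines. The sketch below is a conceptual Python illustration (not the library's GSQL implementation; using the count of reached vertices in the numerator is an assumption in the spirit of the Wasserman and Faust adjustment mentioned in the parameter table):

```python
# Closeness centrality sketch: BFS from v gives d(v, u) for every reachable u;
# closeness(v) = (number reached) / (sum of distances).
from collections import deque

def closeness(nbrs, v):
    dist = {v: 0}
    q = deque([v])
    while q:                                # plain BFS over unweighted edges
        x = q.popleft()
        for y in nbrs[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                q.append(y)
    total = sum(d for u, d in dist.items() if u != v)
    return (len(dist) - 1) / total if total else 0.0
```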
The k-Nearest Neighbors (kNN) algorithm is one of the simplest classification algorithms. It assumes that some or all of the vertices in the graph have already been classified, with the classification stored as an attribute called the label. The goal is to predict the label of a given vertex by looking at the labels of its nearest vertices.
Given a source vertex in the dataset and a positive integer k, the algorithm calculates the distance between this vertex and all other vertices and selects the k vertices that are nearest. The prediction of the label of this node is the majority label among its k-nearest neighbors.
The algorithm will not output more than K vertex pairs, so the algorithm may arbitrarily choose to output one vertex pair over another if there are tied similarity scores.
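The prediction step can be sketched in a few lines. The Python below is a conceptual illustration only (not the library's GSQL implementation; the dict-of-dicts rating vectors are an assumption), using cosine similarity as the "nearness" measure, as this algorithm does:

```python
# kNN sketch: rank neighbors by cosine similarity and take the majority
# label among the labeled vertices in the top k.
from collections import Counter
import math

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = (math.sqrt(sum(x * x for x in a.values()))
           * math.sqrt(sum(x * x for x in b.values())))
    return num / den if den else 0.0

def knn_predict(source, vectors, labels, k):
    sims = [(cosine(vectors[source], vectors[v]), v)
            for v in vectors if v != source]
    # Take the top k, then drop unlabeled vertices and zero-similarity ones,
    # which are never used in the prediction.
    top = [v for s, v in sorted(sims, reverse=True)[:k]
           if s > 0 and labels.get(v)]
    votes = Counter(labels[v] for v in top)
    return votes.most_common(1)[0][0] if votes else None
```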
For the movie graph, we add the following labels to the Person vertices.
When we install the algorithm, we answer the installation questions like so:

We then run kNN, using Neil as the source person and k=3. This is the JSON output:
If we run cosine_nbor_ss, using Neil as the source person and k=3, we can see the persons with the top 3 similarity score:
Kat has a label "b", Kevin has a label "a", and Jing does not have a label. Since "a" and "b" are tied, the prediction for Neil is just one of the labels.
If Jing had label "b", then there would be 2 "b"s, so "b" would be the prediction.
If Jing had label "a", then there would be 2 "a"s, so "a" would be the prediction.
k-Nearest Neighbors (kNN) is often used for machine learning. You can choose the value for topK based on your experience, or use cross-validation to optimize the hyperparameters. Our library provides leave-one-out cross-validation for selecting the optimal k. Given a k value, we run the algorithm repeatedly, using every vertex with a known label as the source vertex, and predict its label. We assess the accuracy of the predictions for each value of k, and then repeat for different values of k in the given range. The goal is to find the value of k with the highest prediction accuracy in the given range, for that dataset.

Run knn_cosine_cv with min_k=2, max_k=5. The JSON result:
The distance can be a physical distance, or it can be the reciprocal of a similarity score, in which case "nearest" means "most similar". In our algorithm, the distance is the reciprocal of cosine neighbor similarity. The similarity calculation used here is the same as the calculation in . Note that in this algorithm, vertices with zero similarity to the source vertex are not considered in the prediction. For example, if there are 5 vertices with non-zero similarity to the source vertex and 5 vertices with zero similarity, then when we pick the top 7 neighbors, only the labels of the 5 vertices with non-zero similarity scores are used in the prediction.
This algorithm is a batch version of the . It makes a prediction for every vertex whose label is not known (i.e., the attribute for the known label is empty), based on its k nearest neighbors' labels.
Characteristic | Value |
Result | The predicted label for the source vertex. The result is available in three forms:
|
Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(D^2), D = outdegree of vertex v |
Graph Types | Undirected or directed edges, weighted edges |
Characteristic | Value |
Result | A list of prediction accuracy for every k value in the given range, and the value of k with the highest predicting accuracy in the given range. The result is available in JSON format |
Input Parameters |
|
Result Size | max_k-min_k+1 |
Time Complexity | O(max_k*E^2 / V), V = number of vertices, E = number of edges |
Graph Types | Undirected or directed edges, weighted edges |
Characteristic | Value |
Result | The predicted label for the vertices whose label attribute is empty. The result is available in three forms:
|
Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(E^2 / V), V = number of vertices, E = number of edges |
Graph Types | Undirected or directed edges, weighted edges |
A component is the maximal set of vertices, plus their connecting edges, which are interconnected. That is, you can reach each vertex from each other vertex. In the example figure below, there are three components.
This particular algorithm deals with undirected edges. If the same definition (each vertex can reach each other vertex) is applied to directed edges, then the components are called Strongly Connected Components. If you have directed edges but ignore the direction (permitting traversal in either direction), then the algorithm finds Weakly Connected Components.
It is easy to see in this small graph that the algorithm correctly groups the vertices:
Alex, Bob, and Justin all have Community ID = 2097152
Chase, Damon, and Eddie all have Community ID = 5242880
Fiona, George, Howard, and Ivy all have Community ID = 0
Our algorithm uses the TigerGraph engine's internal vertex ID numbers; they cannot be predicted.
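The grouping can be sketched as minimum-label propagation: every vertex starts with its own id and repeatedly adopts the smallest id in its neighborhood, so each component converges on one id. The Python below is a conceptual illustration (the real query uses internal engine ids; plain strings are used here):

```python
# Connected component sketch: every vertex starts with its own id, then
# repeatedly adopts the minimum id among itself and its neighbors.

def connected_components(nbrs):
    comp = {v: v for v in nbrs}          # initial label = own id
    changed = True
    while changed:                       # converges in O(component diameter) rounds
        changed = False
        for v in nbrs:
            best = min([comp[v]] + [comp[u] for u in nbrs[v]])
            if best != comp[v]:
                comp[v] = best
                changed = True
    return comp
```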
A k-core of a graph is a maximal connected subgraph in which every vertex is connected to at least k vertices in the subgraph. To obtain the k-core of a graph, the algorithm first deletes the vertices whose outdegree is less than k. It then updates the outdegree of the neighbors of the deleted vertices, and if that causes a vertex's outdegree to fall below k, it will also delete that vertex. The algorithm repeats this operation until every vertex left in the subgraph has an outdegree of at least k.
Our algorithm takes a range of values for k and returns the set of the vertices that constitute the k-core with the highest possible value of k within the range. It is an implementation of Algorithm 2 in Scalable K-Core Decomposition for Static Graphs Using a Dynamic Graph Data Structure, Tripathy et al., IEEE Big Data 2018.
O(E), where E is the number of edges in the graph.
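The peeling procedure described above can be sketched directly. The Python below is a conceptual illustration for a single value of k (not the library's GSQL implementation, which scans a range of k values):

```python
# k-core peeling sketch: repeatedly delete vertices with degree < k.

def k_core(nbrs, k):
    alive = set(nbrs)
    while True:
        doomed = {v for v in alive
                  if sum(1 for u in nbrs[v] if u in alive) < k}
        if not doomed:
            return alive                 # every survivor has >= k live neighbors
        alive -= doomed                  # deletions lower the neighbors' degrees
```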
In the example below, based on the social graph from GSQL 101, we can see that Dan, Tom, and Jenny make up a 2-core, which is the max-core of the graph.

If we run the kcore algorithm on this small graph like so:

Here is the returned JSON response, which includes a 2-core comprised of Dan, Jenny, and Tom:
In addition to the regular weakly connected component algorithm, we also provide a version that is optimized for small-world graphs. A small world graph means the graph has a hub community, where the vast majority of the vertices in the graph are weakly connected.
This version improves upon the performance of the original algorithm when dealing with small-world graphs by combining several different methods used to find connected components in a multi-step process proposed by Slota et al. in BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
The algorithm starts by selecting an initial pivot vertex v with a high product of indegree and outdegree. From the initial pivot vertex, the algorithm uses breadth-first search to determine the massive weakly connected component. The vertices that are not included in this WCC are passed to the next step.

After identifying the first WCC, the algorithm uses the coloring method to identify the WCCs among the remaining vertices.
For more details, see Slota et al., BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
If to_show_cc_count is set to true, the algorithm will return the number of vertices in each weakly connected component.

Suppose we have the following graph. We can see that there are three connected components. The first one has 5 vertices, while the other two have 3 vertices each.

Running the algorithm on the graph will show that there are three weakly connected components, which have 5, 3, and 3 vertices respectively.
A strongly connected component (SCC) is a subgraph such that there is a path from any vertex to every other vertex. A graph can contain more than one separate SCC. An SCC algorithm finds the maximal SCCs within a graph. Our implementation is based on the Divide-and-Conquer Strong Components (DCSC) algorithm [1]. In each iteration, pick a pivot vertex v at random and find its descendant and predecessor sets, where the descendant set D_v is the set of vertices reachable from v, and the predecessor set P_v is the set of vertices that can reach v (stated another way, reachable from v through reverse edges). The intersection of these two sets is a strongly connected component, SCC_v. The graph can then be partitioned into 4 sets: SCC_v, the descendants D_v excluding SCC_v, the predecessors P_v excluding SCC_v, and the remainder R_v. It has been proved that any SCC is a subset of one of these 4 sets [1]. Thus, we can divide the graph into different subsets and detect the SCCs independently and iteratively.

The problem with this algorithm is unbalanced load and slow convergence when there are many small SCCs, which is often the case in real-world graphs [3]. We added two trimming stages to improve the performance: size-1 SCC trimming [2] and weakly connected components [3].
The implementation of this algorithm requires reverse edges for all directed edges considered in the graph.
[1] Fleischer, Lisa K., Bruce Hendrickson, and Ali Pınar. "On identifying strongly connected components in parallel." International Parallel and Distributed Processing Symposium. Springer, Berlin, Heidelberg, 2000.
[2] Mclendon Iii, William, et al. "Finding strongly connected components in distributed graphs." Journal of Parallel and Distributed Computing 65.8 (2005): 901-910.
[3] Hong, Sungpack, Nicole C. Rodia, and Kunle Olukotun. "On fast parallel detection of strongly connected components (SCC) in small-world graphs." Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013.
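The divide-and-conquer scheme can be sketched recursively. The Python below is a conceptual illustration of plain DCSC without the trimming stages (fwd and rev are forward and reverse adjacency dicts, mirroring the reverse-edge requirement noted above):

```python
# Divide-and-conquer SCC (DCSC) sketch: pick a pivot, intersect its forward
# and backward reachable sets, then recurse on the three remainders.

def reach(nbrs, start, allowed):
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for u in nbrs.get(v, ()):        # DFS restricted to the current partition
            if u in allowed and u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

def dcsc(fwd, rev, verts=None):
    verts = set(fwd) if verts is None else verts
    if not verts:
        return []
    pivot = next(iter(verts))
    d = reach(fwd, pivot, verts)         # descendants D_v
    p = reach(rev, pivot, verts)         # predecessors P_v (via reverse edges)
    scc = d & p                          # the SCC containing the pivot
    rest = verts - d - p                 # remainder R_v
    return ([scc] + dcsc(fwd, rev, d - scc)
                  + dcsc(fwd, rev, p - scc)
                  + dcsc(fwd, rev, rest))
```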
We ran scc on the social26 graph. A portion of the JSON result is shown below.

The first element, "i"=1, means the whole graph was processed in just one iteration. The 5 "trim_set.size()" elements mean there were 5 rounds of size-1 SCC trimming. The final "@@.cluster_dist_heap" object reports the size distribution of SCCs: there is one SCC with 9 vertices, and 17 SCCs with only 1 vertex in the graph.
Characteristic | Value |
Result | Assigns a component id (INT) to each vertex, such that members of the same component have the same id value. |
Input Parameters | SET<STRING> v_type: names of vertex types to use. SET<STRING> e_type: names of edge types to use. INT output_limit: if >= 0, the maximum number of vertices to output to JSON. BOOL print_accum: if true, output JSON to standard output. STRING result_attr: if not empty, store the community ID values (INT) in this attribute. STRING file_path: if not empty, write the output to this file. |
Result Size | V = number of vertices |
Time Complexity | O(E*d), E = number of edges, d = max(diameter of components) |
Graph Types | Undirected edges |
Parameter | Description |
v_type | Vertex type to include in the k-core. |
e_type | Edge type to count for k-core connections. |
k_min | Minimum value of k. If the actual maximum core is below k_min, the algorithm will return an empty set. |
k_max | Maximum value of k. If k_max is smaller than k_min, the algorithm ignores this parameter and keeps looking for k-cores until it reaches a value of k for which no k-core can be found. |
show_membership | If true, the algorithm returns the k-cores found for every value of k within the range provided. For each k-core, the results include its member vertices. |
show_shells | The k-shell is the set of vertices that are part of the k-core but not part of the (k+1)-core. If true, the algorithm returns the k-shells found for every value of k within the range provided. For each k-shell, the results include its member vertices. |
print_accum | Whether the algorithm will return output in JSON. |
attr | Optional. A vertex attribute to which the core level of each vertex will be saved. |
file_path | Optional. If provided, the algorithm outputs results to this file in CSV format. |
Name | Description | Data type |
v_type | The vertex type to count as part of a connected component. | STRING |
e_type | The edge type to traverse. | STRING |
threshold | The threshold used to choose initial pivot vertices. Only vertices whose product of indegree and outdegree exceeds this threshold are considered candidates for the pivot vertex. This increases the chances that the initial pivot is contained within the largest WCC. The default value is 100000; it is suggested that you keep the default when running the algorithm. | UINT |
to_show_cc_count | If true, the algorithm will return the number of vertices in each connected component. | BOOL |
Characteristic | Value |
Result | Assigns a component id (INT) to each vertex, such that members of the same component have the same id value. |
Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(iter*d), d = max(diameter of components) |
Graph Types | Directed edges with reverse direction edges as well |
Label Propagation is a heuristic method for determining communities. The idea is simple: if a plurality of your neighbors bear the label X, then you should label yourself as a member of X as well. The algorithm begins with each vertex having its own unique label. Then we iteratively update labels based on the neighbor influence described above. It is important that the order for updating the vertices be random. The algorithm is favored for its efficiency and simplicity, but it is not guaranteed to produce the same results every time.

In a variant version, some vertices could initially be known to belong to the same community. If they are well-connected to one another, they are likely to preserve their common membership and influence their neighbors.
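The core loop is short enough to sketch directly. The Python below is a conceptual illustration (not the library's GSQL implementation); note the shuffled update order, which is why results can vary between runs:

```python
# Label Propagation sketch: adopt the most common label among neighbors.
import random
from collections import Counter

def label_propagation(nbrs, max_iter=10):
    label = {v: v for v in nbrs}          # unique starting labels
    verts = list(nbrs)
    for _ in range(max_iter):
        random.shuffle(verts)             # random update order matters
        changed = False
        for v in verts:
            if not nbrs[v]:
                continue
            counts = Counter(label[u] for u in nbrs[v])
            best, _ = counts.most_common(1)[0]
            if label[v] != best:
                label[v] = best
                changed = True
        if not changed:                   # steady state reached
            break
    return label
```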
This is the same graph that was used in the Connected Component example. The results are different, though. The quartet of Fiona, George, Howard, and Ivy have been split into 2 groups:
(George & Ivy) each connect to (Fiona & Howard) and to one another.
(Fiona & Howard) each connect to (George & Ivy) but not to one another.
Label Propagation tries to find natural clusters and separations within connected components. That is, it looks at the quality and pattern of connections. The Connected Component algorithm simply asks the yes-or-no question: are these two vertices connected?

We set max_iter to 10, but the algorithm reaches a steady state after 3 iterations:
Characteristic | Value |
Result | Assigns a community id (INT) to each vertex, such that members of the same community have the same id value. |
Input Parameters | SET<STRING> v_type: names of vertex types to use. SET<STRING> e_type: names of edge types to use. INT max_iter: maximum number of iterations of the algorithm. INT output_limit: if >= 0, the maximum number of vertices to output to JSON. BOOL print_accum: if true, output JSON to standard output. STRING attr: if not empty, store the community id values (INT) in this attribute. STRING file_path: if not empty, write the output to this file. |
Result Size | V = number of vertices |
Time Complexity | O(E*k), E = number of edges, k = number of iterations. |
Graph Types | Undirected edges |
The Local Clustering Coefficient algorithm computes the local clustering coefficient of every vertex in a graph. The local clustering coefficient of a vertex (node) in a graph quantifies how close its neighbors are to being a complete graph, where every two distinct vertices are connected by an edge. It is obtained by dividing the number of edges between a vertex's neighbors by the maximum number of edges that could possibly exist between them.
The algorithm does not report the local clustering coefficient for vertices that have only one neighbor.
O(n), where n is the number of vertices in the graph.
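The computation for one vertex is a small counting exercise. The Python below is a conceptual illustration (not the library's GSQL implementation; the adjacency dict is an assumption):

```python
# Local clustering coefficient sketch: edges among neighbors divided by
# the maximum possible number of such edges.

def local_clustering(nbrs, v):
    ns = list(nbrs[v])
    k = len(ns)
    if k < 2:
        return None                      # not reported for vertices with one neighbor
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if ns[j] in nbrs[ns[i]])
    return links / (k * (k - 1) / 2)     # e.g., Jenny below: 1 / 3
```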
Using the social graph below as an example, Jenny has three neighbors (Tom, Dan, and Amily), but only Tom and Dan are connected. Among the three neighbors, the maximum number of edges that could exist is 3. Therefore, the local clustering coefficient for Jenny is 1/3.
On the other hand, Tom has two neighbors that are connected by one edge. Since the maximum number of edges that could exist between two vertices is 1, Tom has a local clustering coefficient of 1/1 = 1.
By running the algorithm in GSQL, we can confirm that Tom has the highest local clustering coefficient with a score of 1 and Jenny has a score of 1/3.
This allows us to aggressively reduce the dimensionality of a dataset while preserving the distance information. By running the algorithm on a graph, we can preserve the similarity between nodes and their neighbors. Nodes that have similar neighborhoods are mapped to similar embedding vectors, while nodes that are not similar are mapped to dissimilar vectors.
The algorithm starts by assigning random vectors using very sparse random projection. It then iteratively averages the neighboring vectors from the previous iteration to construct intermediate embeddings. In each iteration, the algorithm also normalizes the intermediate embeddings using the standard Euclidean norm.
In the end, the embedding for each node is a weighted sum of the intermediate embeddings, where the weights are provided as a parameter of the algorithm.
This algorithm is especially useful in the following situations:
Your data is in such high dimension that it is too expensive to perform calculations.
The dimensionality is very high and the number of samples is too sparse to calculate covariances.
You don't have access to the entire dataset, such as when working with real-time data.
This algorithm requires an index attribute on the schema. Before running the algorithm, create an attribute on all vertex types to hold the index. You can use our provided schema change job script in the algorithm library.
The optimal values for the following parameters are dependent on your dataset. In order to obtain the best quality embeddings for your graph, it is a good idea to tune these parameters.
reduced_dimension

The reduced dimension (reduced_dimension) is the length of the produced vectors. A greater dimension offers greater precision, but is more costly to operate over.
input_weights and k

The algorithm iteratively constructs intermediate embeddings by averaging either the neighboring intermediate embeddings from the previous iteration, or the generated random vectors during the first iteration. The final embeddings are weighted sums of the intermediate embeddings from each iteration.

k is the number of iterations, and the input_weights parameter determines how much the set of intermediate embeddings from each iteration weighs. Make sure that the value of k you supply matches the number of weights you provide.
With each iteration, the algorithm will consider neighbors that are one step farther away.
normalization_strength
The normalization strength determines how node degrees influence the embedding. Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance.
sampling_constant
FastRP uses very sparse random projection to reduce the dimensionality of the data from an n*m matrix to an n*d matrix, where d <= m, by multiplying the original matrix with an m*d matrix. The m*d matrix is made up of independently and identically distributed entries sampled from:

+sqrt(s) with probability 1/(2s), 0 with probability 1 - 1/s, -sqrt(s) with probability 1/(2s)

where s is the sampling constant (sampling_constant). The higher the constant, the more zeros in the resulting matrix; this makes the matrix sparser but also speeds up the algorithm.
Node2Vec is a node embedding algorithm that uses random walks in the graph to create a vector representation of a node.
A random walk starts with a node, and the algorithm iteratively selects neighboring nodes to visit, and each neighboring node has an assigned probability. This transforms graph structure into a collection of linear sequences of nodes. For each node we will be left with a list of other nodes from their local or extended neighborhoods.
Once the above step is complete, the algorithm uses a variation of the word2vec model from the language modeling community to turn each node into a vector of probabilities. The probabilities represent the likelihood of visiting a given node in a random walk from each starting node.
The Speaker-listener Label Propagation Algorithm (SLPA) is a variation of the Label Propagation algorithm that is able to detect overlapping communities. The main difference between LPA and SLPA is that each node can only hold a single label in LPA while it is allowed to possess multiple labels in SLPA.
The algorithm begins with each vertex having its own unique label. It then iteratively records labels in a local accumulator based on specific speaking and listening rules. Then post-processing is applied to the recorded labels. Finally, the algorithm removes the nested communities and outputs all the communities. Note that it is not guaranteed to produce the same results every time.
In the example below, we run the tg_slpa algorithm on the social10 graph, with max_iter = 10 and threshold = 0.1.
Fast Random Projection (FastRP) is a scalable and performant node-embedding algorithm. It generates node embeddings (vectors) of low dimensionality through random projections from the graph's adjacency matrix (a high-dimensional matrix) to a low-dimensional matrix, significantly reducing the computing power required to process the data. The algorithm is theoretically backed by the Johnson-Lindenstrauss lemma, which states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.
After the attribute has been created, run the query to index all vertices on the graph.
Installing this query requires , which can be found in the . If you are running the query on a cluster, you need to manually install the UDF on every node of the cluster.
For more information, see .
Parameter | Description | Data type |
v_type | Vertex type to calculate local clustering coefficients for. | STRING |
e_type | Edge type to traverse. Only vertices that are connected by edges of this type are treated as connected in this algorithm. | STRING |
top_k | Number of the highest local clustering coefficients to report. | INT |
print_accum | If true, output JSON to standard output. | BOOL |
result_attr | If provided, the local clustering coefficient of each vertex will be saved to this attribute. | STRING |
file_path | If provided, write output to this file in CSV. | STRING |
display_edges | If true, the algorithm will also return edges, which helps produce better visualized results in GraphStudio. | BOOL |
Parameter | Description | Data type |
| The total number of edges in the graph. |
|
| The total number of vertices in the graph. |
|
| The sampling constant. A high |
|
| The length of the produced vectors. A greater dimension offers greater precision, but is more costly to operate over. |
|
| The number of iterations for the algorithm to construct intermediate embeddings. |
|
| The normalization strength determines how node degrees influence the embedding. Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance. |
|
| A list of floats that determines the weight of the intermediate embeddings constructed in each iteration. |
|
| The name of the index attribute created during preprocessing. |
|
Parameter | Description | Data type |
| Number of random walks per node |
|
| Number of hops per walk |
|
| File path to output results to |
|
| Edge types to traverse |
|
| Number of nodes to be used in the random sample |
|
Characteristic | Value |
Result | Assigns a list of component id (INT) to each vertex, such that members of the same component have the same id value. |
Required Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(E*k), E = number of edges, k = number of iterations. |
Graph Types | Undirected edges |
Why triangles? Think of it in terms of a social network:
If A knows B, and A also knows C, then we complete the triangle if B knows C. If this situation is common, it indicates a community with a lot of interaction.
The triangle is in fact the smallest multi-edge "complete subgraph," where every vertex connects to every other vertex.
Triangle count (or density) is a measure of community and connectedness. In particular, it addresses the question of transitive relationships: If A--> B and B-->C, then what is the likelihood of A--> C?
Note that it is computing a single number: How many triangles are in this graph? It is not finding communities within a graph.
It is not common to count triangles in directed graphs, though it is certainly possible. If you choose to do so, you need to be very specific about the direction of interest: In a directed graph, If A--> B and B--> C, then
if A-->C, we have a "shortcut".
if C-->A, then we have a feedback loop.
This algorithm can only be run on TigerGraph V3.1 or higher.
We present two different algorithms for counting triangles. The first, tri_count(), is the classic edge-iterator algorithm. For each edge and its two endpoint vertices S and T, count the overlap between S's neighbors and T's neighbors.
One side effect of the simple edge-iterator algorithm is that it ends up considering each of the three sides of a triangle. The count needs to be divided by 3, meaning we did 3 times more work than a smarter algorithm would have.
tri_count_fast() is a smarter algorithm which does two passes over the edges. In the first pass we mark which of the two endpoint vertices has fewer neighbors. In the second pass, we count the overlap only between marked vertices. The result is that we eliminate 1/3 of the neighborhood matching, the slowest 1/3, but at the cost of some additional memory.
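The classic edge-iterator idea can be sketched as follows. This Python snippet is a conceptual illustration of the simple version (not the library's GSQL implementation); the adjacency dict of sets is an assumption:

```python
# Edge-iterator triangle count sketch: for each undirected edge (s, t),
# count common neighbors; every triangle is seen once per edge, so
# the total is divided by 3.

def tri_count(nbrs):                 # nbrs: dict mapping vertex -> set of neighbors
    total = 0
    seen = set()
    for s in nbrs:
        for t in nbrs[s]:
            if (t, s) not in seen:   # visit each undirected edge once
                seen.add((s, t))
                total += len(nbrs[s] & nbrs[t])
    return total // 3
```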
In the social10 graph with Coworker edges, there are clearly 4 triangles.
Characteristic | Value |
Result | Returns the number of triangles in the graph. |
Input Parameters | v_type: vertex type to count. e_type: edge type to traverse. |
Result Size | 1 integer |
Time Complexity | O(V * E), V = number of vertices, E = number of edges |
Graph Types | Undirected edges |
Breadth-first Search (BFS) is an algorithm used to explore the vertices of a graph layer by layer. It starts at a given vertex and explores all vertices at the present depth before moving on to the vertices at the next depth level.
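The layer-by-layer exploration can be sketched with a queue. The Python below is a conceptual illustration (not the library's GSQL implementation; the max_hops cutoff mirrors the parameter documented below):

```python
# BFS sketch: explore layer by layer from a source vertex, recording the
# hop count at which each vertex is first reached.
from collections import deque

def bfs(nbrs, start, max_hops=None):
    depth = {start: 0}
    q = deque([start])
    while q:
        v = q.popleft()
        if max_hops is not None and depth[v] >= max_hops:
            continue                    # look only this far from the source
        for u in nbrs[v]:
            if u not in depth:          # first visit = shortest hop count
                depth[u] = depth[v] + 1
                q.append(u)
    return depth
```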
In the example below, we run the tg_bfs algorithm from the source vertex alex on the social10 graph.
Below is the visualized result of the query:
Characteristic | Value |
Result | Returns all the vertices that are reachable from the source vertex. |
Required Input Parameters | v_type: vertex types to traverse. e_type: edge types to traverse. max_hops: look only this far from the source vertex. v_start: source vertex for the traversal. print_accum: print JSON output. result_attr: INT attribute to store results to. file_path: file to write CSV output to. display_edges: output edges for visualization. |
Result Size | V = number of vertices |
Time Complexity | O(E + V), E = number of edges, V = number of vertices, since every vertex and every edge will be explored in the worst case. |
Graph Types | Directed or Undirected edges, Weighted or Unweighted edges |
Random Walk is an algorithm that generates random paths in a graph. A random walk starts at every vertex that has an outgoing edge and moves to another vertex at random. The random walk algorithm writes all paths of the specified size, drawn from the walks that were performed, to a file in CSV format.
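The sliding-window relationship between steps and output paths is easy to see in a sketch. The Python below is a conceptual illustration only (not the library's GSQL implementation); parameter names mirror the table further below:

```python
# Random walk sketch: from every vertex with an out-edge, take `step`
# random hops, then emit every window of `path_size` vertices.
import random

def random_walks(nbrs, step=2, path_size=3, sample_num=1):
    paths = []
    for v in nbrs:
        if not nbrs[v]:
            continue                    # walks start only where an out-edge exists
        for _ in range(sample_num):
            walk = [v]
            for _ in range(step):
                if not nbrs[walk[-1]]:
                    break               # dead end: stop this walk early
                walk.append(random.choice(list(nbrs[walk[-1]])))
            # A full walk of `step` hops has step + 1 vertices and therefore
            # contains step - path_size + 2 windows of `path_size` vertices.
            for i in range(len(walk) - path_size + 1):
                paths.append(walk[i:i + path_size])
    return paths
```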
Suppose we have the following social graph:
If we run the random walk algorithm on the graph with a path size of 3, a step count of 2, and a sample number of 1, then from each vertex there is one two-step random walk, for a total of 6 three-vertex paths:

We can also perform a 3-step walk and still ask for three-vertex paths, in which case the number of paths would double, because each 3-step walk contains two possible three-vertex paths:
If a graph has unweighted edges, then finding the shortest path from one vertex to another is the same as finding the path with the fewest hops. Think of Six Degrees of Separation and Friend of a Friend. Unweighted Shortest Path answers the question "How are you two related?" The two entities do not have to be persons. Shortest Path is useful in a host of applications, from estimating influences or knowledge transfer, to criminal investigation.
When the graph is unweighted, we can use a "greedy" approach to find the shortest path. In computer science, a greedy algorithm makes intermediate choices based on the data being considered at the moment, and then does not revisit those choices later on. In this case, once the algorithm finds any path to a vertex T, it is certain that that is a shortest path.
This algorithm finds an unweighted shortest path from one source vertex to each possible destination vertex in the graph. That is, it finds n paths.
If your graph has weighted edges, see the next algorithm. With weighted edges, it is necessary to search the whole graph, whether you want the path for just one destination or for all destinations.
In the below graph, we do not consider the weight on edges. Using vertex A as the source vertex, the algorithm discovers that the shortest path from A to B is A-B, and the shortest path from A to C is A-D-C, etc.
Finding shortest paths in a graph with weighted edges is algorithmically harder than in an unweighted graph because even after you find a path to a vertex T, you cannot be certain that it is a shortest path. If edge weights are always positive, then you must keep trying until you have considered every in-edge to T. If edge weights can be negative, then it's even harder. You must consider all possible paths.
A classic application for weighted shortest path is finding the shortest travel route to get from A to B. (Think of route planning "GPS" apps.) In general, any application where you are looking for the cheapest route is a possible fit.
The shortest path algorithm can be optimized if we know all the weights are nonnegative. If there can be negative weights, then sometimes a longer path will have a lower cumulative weight. Therefore, we have two versions of this algorithm:

The shortest_path_any_wt query is an implementation of the Bellman-Ford algorithm. If there is more than one path with the same total weight, the algorithm returns one of them.

Currently, shortest_path_pos_wt also uses Bellman-Ford. The well-known Dijkstra's algorithm is designed for serial computation and cannot work with GSQL's parallel processing.
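For reference, the Bellman-Ford relaxation scheme looks like this in a compact Python sketch (a conceptual illustration, not the GSQL query; it assumes an edge list of (source, target, weight) triples and no negative cycles):

```python
# Bellman-Ford sketch: relax every edge up to V-1 times; works with
# negative weights as long as there is no negative cycle.

def bellman_ford(edges, verts, source):
    INF = float("inf")
    dist = {v: INF for v in verts}
    dist[source] = 0
    for _ in range(len(verts) - 1):
        changed = False
        for s, t, w in edges:           # every edge is a candidate for relaxation
            if dist[s] + w < dist[t]:
                dist[t] = dist[s] + w
                changed = True
        if not changed:                 # early exit once distances stabilize
            break
    return dist
```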
The graph below has only positive edge weights. Using vertex A as the source vertex, the algorithm discovers that the shortest weighted path from A to B is A-D-B, with distance 8. The shortest weighted path from A to C is A-D-B-C with distance 9.
The graph below has both positive and negative edge weights. Using vertex A as the source vertex, the algorithm discovers that the shortest weighted path from A to E is A-D-C-B-E, with a cumulative score of 7 - 3 - 2 - 4 = -2.
Name | Description | Data type |
step | The number of hops performed in each random walk. | INT |
path_size | The length of the paths to output, measured in the number of vertices in a path. For example, if a walk has two steps, A -> B -> C, then it contains paths of both length 2 and length 3. If a path size of 2 is supplied, the algorithm outputs two paths: A -> B and B -> C. If the path size is 3, there is one path: A -> B -> C. | INT |
filepath | The file path to output the paths to. | STRING |
edge_types | The edge types that the random walk will traverse. | SET<STRING> |
sample_num | The number of sample walks to perform from each starting vertex. | INT |
Characteristic | Value |
Result | Computes a shortest distance (INT) and shortest path (STRING) from vertex source to each other vertex. |
Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(E), E = number of edges |
Graph Types | Directed or Undirected edges, Unweighted edges |
Characteristic | Value |
Result | Computes a shortest distance (INT) and shortest path (STRING) from vertex source to each other vertex. |
Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(V*E), V = number of vertices, E = number of edges |
Graph Types | Directed or Undirected edges, Weighted edges |
A* (pronounced "A-star") is a graph traversal and path search algorithm, which achieves better performance by using heuristics to guide its search.
The algorithm starts from a source node, and at each iteration of its main loop, it selects the path that minimizes f(n) = g(n) + h(n), where n is the next node on the path, g(n) is the cost of the path from the source node to n, and h(n) is a heuristic function that estimates the cost of the cheapest path from n to the target node.
The algorithm terminates when the path it chooses to extend is a path from start to goal or if there are no paths eligible to be extended. The heuristic function is problem-specific. If the heuristic function is admissible, meaning that it never overestimates the actual cost to get to the target, the algorithm is guaranteed to return a least-cost path from source to target.
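The best-first expansion driven by f(n) = g(n) + h(n) can be sketched with a priority queue. The Python below is a conceptual illustration (not the library's GSQL implementation); the caller supplies the problem-specific heuristic h, such as the haversine distance used in the station example below:

```python
# A* sketch: expand the frontier vertex with the smallest f(n) = g(n) + h(n).
# The heuristic h must never overestimate the true remaining cost.
import heapq

def a_star(nbrs, source, target, h):
    g = {source: 0.0}
    prev = {}
    frontier = [(h(source), source)]
    while frontier:
        f, v = heapq.heappop(frontier)
        if v == target:                  # the chosen path reaches the goal
            path = [v]
            while v in prev:
                v = prev[v]
                path.append(v)
            return g[target], path[::-1]
        for u, w in nbrs[v]:             # (neighbor, edge weight) pairs
            cand = g[v] + w
            if cand < g.get(u, float("inf")):
                g[u] = cand
                prev[u] = v
                heapq.heappush(frontier, (cand + h(u), u))
    return None                          # no paths eligible to be extended
```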
In the example below, we run the tg_astar algorithm to find the shortest path between the source vertex "Kings Cross" and the target vertex "Kentish Town" on a graph that is a transport network of stations. Each station has geometric attributes (i.e., latitude and longitude), and the edge weight represents the distance between stations in kilometers. The heuristic function used to guide the search is the haversine formula, which computes the distance between two points on a sphere given their longitudes and latitudes.
Below is the visualized result of the query:
Start with a set A = { a single vertex seed }
For all vertices in A, select a vertex y such that
y is not in A, and
There is an edge from y to a vertex x in A, and
The weight of the edge e(x,y) is the smallest among all eligible pairs (x,y).
Add y to A, and add the edge (x,y) to MST.
Repeat steps 2 and 3 until A has all vertices in the graph.
If the user specifies a source vertex, this will be used as the seed. Otherwise, the algorithm will select a random seed vertex.
If the graph contains multiple components (i.e., some vertices are disconnected from the rest of the graph), then the algorithm will span only the component of the seed vertex.

If you do not have a preferred vertex, and the graph might have more than one component, then you should use the Minimum Spanning Forest (MSF) algorithm instead.
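The steps above are Prim's algorithm, which a priority queue makes compact. The Python below is a conceptual illustration (not the library's GSQL implementation; the adjacency dict maps each vertex to (neighbor, weight) pairs):

```python
# Prim's MST sketch: grow set A from a seed, always adding the lightest
# edge that leaves A.  Spans only the seed's component.
import heapq

def prim_mst(nbrs, seed):
    in_tree = {seed}                     # A = { a single vertex seed }
    mst = []
    edges = [(w, seed, y) for y, w in nbrs[seed]]
    heapq.heapify(edges)
    while edges:
        w, x, y = heapq.heappop(edges)   # smallest-weight eligible edge (x, y)
        if y in in_tree:
            continue
        in_tree.add(y)
        mst.append((x, y, w))
        for z, wz in nbrs[y]:
            if z not in in_tree:
                heapq.heappush(edges, (wz, y, z))
    return mst
```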
In the graph social10, we consider only the undirected Coworker edges.
This graph has 3 components. Minimum Spanning Tree finds a tree for one component, so which component it will work on depends on what vertex we give as the starting point. If we select Fiona, George, Howard, or Ivy as the start vertex, then it works on the 4-vertex component on the left. You can start from any vertex in the component and get the same or an equivalent MST result.
The figure below shows the result of
Note that the value for the one vertex is ("Ivy", "Person"). In GSQL, this 2-tuple format, which explicitly gives the vertex type, is used when the query is written to accept a vertex of any type.

File output:

The attribute version requires a boolean attribute on the edge, which the algorithm sets to true if that edge is selected in the MST:
Given an undirected and connected graph, a minimum spanning tree is a set of edges that can connect all the vertices in the graph with the minimal sum of edge weights. The library implements a parallel version of Prim's algorithm:
The Single-Pair Shortest Path task seeks the shortest path between a source vertex S and a target vertex T. If the edges are unweighted, then use the query in our tutorial document.
If the edges are weighted, then use the algorithm. In the worst case, it takes the same computational effort to find the shortest path for one pair as to find the shortest paths for all pairs from the same source S. The reason is that you cannot know whether you have found the shortest (least weight) path until you have explored the full graph. If the weights are always positive, however, then a more efficient algorithm is possible. You can stop searching when you have found paths that use each of the in-edges to T.
Characteristic | Value |
Result | Computes a shortest distance (INT) and shortest path (STRING) from the source vertex to the target vertex. |
Required Input Parameters | source_vertex: start vertex. target_vertex: target vertex. e_type: edge types to traverse. wt_type: weight data type (INT, FLOAT, DOUBLE). wt_attr: attribute for edge weights. display: output edges for visualization. |
Result Size | 1 |
Time Complexity | O(V^2), V = number of vertices. |
Graph Types | Directed or Undirected edges, Weighted edges |
Characteristic | Value |
Result | Computes a minimum spanning tree. If the JSON or file output option is selected, the output is the set of edges that form the MST. If the result_attr option is selected, the edges that are part of the MST are tagged true; other edges are tagged false. |
Input Parameters |
|
Result Size | V - 1, V = number of vertices |
Time Complexity | O(V^2) |
Graph Types | Undirected edges and connected |
The All-Pairs Shortest Path algorithm is costly for large graphs because the computation time is O(V^3) and the output size is O(V^2). Be cautious about running this on very large graphs.
The All-Pairs Shortest Path (APSP) task seeks to find the shortest paths between every pair of vertices in the entire graph. In principle, this task can be handled by running the Single-Source Shortest Path (SSSP) algorithm for each input vertex, e.g.,
This example highlights one of the strengths of GSQL: treating queries as stored procedures that can be called from within other queries. We only show the result_attr and file_path options, because subqueries cannot send their JSON output.
For large graphs (with millions of vertices or more), however, this is an enormous task. While the massively parallel processing of the TigerGraph platform can speed up the computation by 10x or 100x, consider what it takes just to store or report the results. If there are 1 million vertices, then there are nearly 1 trillion output values.
There are more efficient methods than calling the single-source shortest path algorithm n times, such as the Floyd-Warshall algorithm, which computes APSP in O(V^3) time.
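As a point of comparison, here is a minimal Python sketch of Floyd-Warshall (illustrative only; the library's APSP approach instead calls its SSSP query once per vertex):

```python
def floyd_warshall(n, edges):
    """n vertices labeled 0..n-1; edges: [(u, v, w), ...] for a directed graph.
    Returns an n x n matrix of shortest-path distances."""
    INF = float("inf")
    dist = [[INF] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)   # keep the lightest parallel edge
    for k in range(n):                    # allow k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```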
Our recommendation:
If you have a smaller graph (perhaps thousands or tens of thousands of vertices), the APSP task may be tractable.
If you have a large graph, avoid using APSP.
The Cycle Detection problem seeks to find all the cycles (loops) in a graph. We apply the usual restriction that the cycles must be "simple cycles", that is, they are paths that start and end at the same vertex but otherwise never visit any vertex twice.
There are two versions of the task: for directed graphs and undirected graphs. The GSQL algorithm library currently supports only directed cycle detection. The Rocha–Thatte algorithm is an efficient distributed algorithm, which detects all the cycles in a directed graph. The algorithm will self-terminate, but it is also possible to stop at k iterations, which finds all the cycles having lengths up to k edges.
The basic idea of the algorithm is to (potentially) traverse every edge in parallel, again and again, forming all possible paths. At each step, if a path forms a cycle, it records it and stops extending it. More specifically:
Initialization: For each vertex, record one path consisting of its own id. Mark the vertex as Active.
Iteration steps: For each Active vertex v:
Send its list of paths to each of its out-neighbors.
Inspect each path P in the list of the paths received:
If the first id in P is also id(v), a cycle has been found:
Remove P from its list.
If id(v) is the least id of any id in P, then add P to the Cycle List. (The purpose is to count each cycle only once.)
Else, if id(v) is somewhere else in the path, then remove P from the path list (because this cycle must have been counted already).
Else, append id(v) to the end of each of the remaining paths in its list.
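The following minimal Python sketch mirrors the iteration logic described above, but runs it sequentially; the real algorithm processes all active vertices in parallel, and the function name and adjacency-map input are our own assumptions:

```python
def rocha_thatte_cycles(adj, max_depth):
    """adj: {vertex_id: [out_neighbor_ids]}, with every vertex present as a key.
    Returns simple cycles as lists of vertex ids."""
    paths = {v: [[v]] for v in adj}          # initialization: one path per vertex
    cycles = []
    for _ in range(max_depth):               # optional cap = depth parameter
        inbox = {v: [] for v in adj}
        for v, plist in paths.items():       # send paths to out-neighbors
            for nbr in adj[v]:
                for p in plist:
                    inbox[nbr].append(p)
        paths = {v: [] for v in adj}
        for v, received in inbox.items():
            for p in received:
                if p[0] == v:                # cycle found
                    if v == min(p):          # count each cycle only once
                        cycles.append(p)
                elif v in p:
                    pass                     # cycle counted elsewhere; drop path
                else:
                    paths[v].append(p + [v]) # extend the path with id(v)
        if not any(paths.values()):
            break                            # self-termination: no active paths
    return cycles
```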
In the social10 graph, there are 5 cycles, all within the Fiona-George-Howard-Ivy cluster.
Characteristic
Value
Result
Computes a list of vertex id lists, each of which is a cycle. The result is available in 2 forms:
streamed out in JSON format
written to a file in tabular format
Input Parameters
SET<STRING> v_type
: Names of vertex types to use
SET<STRING> e_type
: Names of edge types to use
INT depth
: Maximum cycle length to search for = maximum number of iterations
BOOL print_accum
: If True, output JSON to standard output
STRING file_path
: If not empty, write output to this file.
Result Size
Number of cycles * average cycle length. Neither of these measures is known in advance.
Time Complexity
O(E*k), E = number of edges, k = min(maximum cycle length, depth parameter)
Graph Types
Directed
An independent set of vertices is one that contains no pair of neighboring vertices, i.e., vertices with an edge between them. A maximal independent set is an independent set that cannot be enlarged: every vertex outside the set is adjacent to some vertex in it, so you cannot improve upon it unless you start over with a different independent set. However, the search for the largest possible independent set (the maximum independent set, as opposed to a maximal independent set) is an NP-hard problem: there is no known algorithm that can find that answer in polynomial time. So we settle for a maximal independent set.
This algorithm finds use in applications wanting to find the most efficient configuration which "covers" all the necessary cases. For example, it has been used to optimize delivery or transit routes, where each vertex is one transit segment and each edge connects two segments that can NOT be covered by the same vehicle.
Consider our social10 graph, with three components.
It is clear that for each of the two triangles -- (Alex, Bob, Justin) and (Chase, Damon, Eddie) -- we can select one vertex from each triangle to be part of the MIS. For the 4-vertex component (Fiona, George, Howard, Ivy), it is less clear what will happen. If the algorithm selects either George or Ivy, then no other independent vertices remain in the component. However, the algorithm could select both Fiona and Howard; they are independent of one another.
This demonstrates the uncertainty of the Maximal Independent Set algorithm and how it differs from Maximum Independent Set. A maximum independent set algorithm would always select Fiona and Howard, plus 2 others, for a total of 4 vertices. The maximal independent set algorithm relies on chance. It could return either 3 or 4 vertices.
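A minimal Python sketch of one common randomized greedy approach (illustrative; the library's parallel implementation may differ in detail) shows why the result depends on chance:

```python
import random

def maximal_independent_set(adj, seed=None):
    """adj: {vertex: set(neighbors)} for an undirected graph."""
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)                 # random order => result varies run to run
    mis, blocked = set(), set()
    for v in order:
        if v not in blocked:
            mis.add(v)                 # v is independent of everything chosen so far
            blocked |= adj[v]          # its neighbors can no longer be chosen
    return mis
```

On the 4-vertex component of social10, depending on the shuffle, this sketch picks either one vertex (George or Ivy) or two (Fiona and Howard), matching the 3-or-4 outcome described above.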
Characteristic
Value
Result
A set of vertices that form a maximal independent set.
Input Parameters
STRING v_type
: Name of vertex type to use
STRING e_type
: Name of edge type to use
INT max_iter
: maximum number of iterations for the search
BOOL print_accum
: If True, output JSON to standard output
STRING file_path
: If not empty, write output to this file.
Result Size
Size of the MIS: unknown. Worst case: If the graph is a set of N unconnected vertices, then the MIS is all N vertices.
Time Complexity
O(E), E = number of edges
Graph Types
Undirected edges
Euclidean distance measures the straight-line distance between two points in n-dimensional space. The algorithm takes two vectors, each represented by a ListAccum, and returns the Euclidean distance between them.
This algorithm is implemented as a user-defined function. You need to follow the steps in Add a User-Defined Function to add the function to GSQL. After adding the function, you can call it in any GSQL query in the same way as a built-in GSQL function.
The Euclidean distance between the two vectors.
Name
Description
Data type
A
An n-dimensional vector denoted by a ListAccum of length n
ListAccum<INT/UINT/FLOAT/DOUBLE>
B
An n-dimensional vector denoted by a ListAccum of length n
ListAccum<INT/UINT/FLOAT/DOUBLE>
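The computation itself is the standard formula; a minimal Python equivalent (illustrative, not the GSQL UDF itself):

```python
from math import sqrt

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    assert len(a) == len(b), "vectors must have the same dimension n"
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# e.g., euclidean_distance([0, 0], [3, 4]) == 5.0
```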
The Pearson correlation coefficient is a measure of linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations.
The formula for calculating the Pearson correlation coefficient is as follows:
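ρ(A, B) = cov(A, B) / (σ_A · σ_B)

where cov(A, B) is the covariance of the two vectors and σ_A, σ_B are their standard deviations.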
This algorithm is implemented as a user-defined function. You need to follow the steps in Add a User-Defined Function to add the function to GSQL. After adding the function, you can call it in any GSQL query in the same way as a built-in GSQL function.
The Pearson correlation coefficient between the two vectors.
To compare two vertices by cosine similarity, the selected properties of each vertex are first represented as a vector. For example, a property vector for a Person vertex could have the elements age, height, and weight. Then the cosine function is applied to the two vectors.
The cosine similarity of two vectors A and B is defined as follows:
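cos(A, B) = (A · B) / (‖A‖ · ‖B‖)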
If A and B are identical, then cos(A, B) = 1. As expected for a cosine function, the value can also be negative or zero. In fact, cosine similarity is closely related to the Pearson correlation coefficient.
For this library function, the feature vector is the set of edge weights between the two vertices and their neighbors.
In the movie graph shown in the figure below, there are Person vertices and Movie vertices. Every person may give a rating to some of the movies. The rating score is stored on the Likes edge using the weight attribute. For example, in the graph below, Alex gives a rating of 10 to the movie "Free Solo".
The output size is at most K (when K ≤ N), and the algorithm may arbitrarily choose to output one vertex over another if there are tied similarity scores.
Given one person's name, this algorithm calculates the cosine similarity between this person and each other person with whom there is at least one movie they have both rated.
In the previous example, if the source vertex is Alex and top_k is set to 5, then we calculate the cosine similarity between him and the two other persons who qualify, Jing and Kevin. The JSON output shows the top 5 similar vertices and their similarity scores in descending order. The output limit is 5 persons, but only 2 persons qualify:
The FILE version output is not necessarily in descending order. It looks like the following:
The ATTR version inserts an edge into the graph with the similarity score as an edge attribute whenever the score is larger than zero. The result looks like this:
This algorithm computes the same similarity scores as the Cosine similarity of neighborhoods, all pairs algorithm, except that it treats every vertex as a source vertex and computes its similarity scores with its neighbors, for all vertices in parallel. Since this is a memory-intensive operation, the workload is split into batches to reduce peak memory usage; the user can specify the number of batches.
This algorithm has a time complexity of O(E), where E is the number of edges, and runs on graphs with weighted edges (directed or undirected).
The result of this algorithm is the top k cosine similarity scores and their corresponding pair for each vertex. The score is only included if it is greater than 0.
The result can be output in JSON format, in CSV to a file, or saved as a similarity edge in the graph itself.
Using the social10 graph, we can calculate the cosine similarity of every person to every other person connected by a Friend edge, and print out the top k most similar pairs for each vertex.
The Common Neighbors algorithm calculates the number of common neighbors between two vertices.
The number of common neighbors between two vertices.
Suppose we have the following graph:
Running the algorithm between Dan and Jenny will show that they have 1 common neighbor:
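Conceptually, the computation is just a set intersection; a minimal Python sketch (illustrative, assuming an adjacency-map input):

```python
def common_neighbors(adj, a, b):
    """adj: {vertex: set(neighbors)}. Returns the number of shared neighbors."""
    return len(adj[a] & adj[b])

# e.g., common_neighbors(adj, "Dan", "Jenny") -> 1 in the example graph above
```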
The algorithm will not output more than K vertex pairs, so it may arbitrarily choose to output one vertex pair over another if there are tied similarity scores.
For the movie graph, calculate the Jaccard similarity between all pairs and show the 5 most similar pairs: jaccard_nbor_ap(5). This is the JSON output:
This algorithm has a time complexity of O(E), where E is the number of edges, and runs on graphs with unweighted edges (directed or undirected).
The result contains the top k Jaccard similarity scores for each vertex and its corresponding pair. A pair is only included if its similarity is greater than 0, meaning there is at least one common neighbor between the pair. The result is available in JSON format, or can be output to a file in CSV, or it can be saved as an edge on the graph itself. A JSON formatted result could look like this:
This algorithm computes the same similarity scores as the single-source Jaccard similarity algorithm, except that it considers ALL pairs of vertices in the graph (for the vertex and edge types selected by the user). Naturally, this algorithm will take longer to run. For very large and very dense graphs, this algorithm may not be a practical choice.
This algorithm computes the same similarity scores as the Jaccard similarity of neighborhoods, all pairs algorithm, except that it treats every vertex as a source vertex and computes its similarity scores with its neighbors, for all vertices in parallel. Since this is a memory-intensive operation, the workload is split into batches to reduce peak memory usage; the user can specify the number of batches.
Name
Description
Data type
A
An n-dimensional vector denoted by a ListAccum of length n
ListAccum<INT/UINT/FLOAT/DOUBLE>
B
An n-dimensional vector denoted by a ListAccum of length n
ListAccum<INT/UINT/FLOAT/DOUBLE>
Characteristic
Value
Result
The top K vertices in the graph that have the highest similarity scores, along with their scores.
The result is available in three forms:
streamed out in JSON format
written to a file in tabular format
stored as a vertex attribute value
Input Parameters
VERTEX source
: Source vertex
SET<STRING> e_type
: Edge type to traverse
SET<STRING> re_type
: Reverse edge type to traverse
STRING weight
: The edge attribute to use as the weight of the edge
INT top_k
: Number of vertices
INT output_limit:
BOOL print_accum
: If true, output JSON to standard output.
STRING filepath
: If provided, the output will be written to this file path in CSV format.
STRING similarity_edge
: If provided, the similarity score will be saved to this edge.
Result Size
top_k
Time Complexity
O(D^2), D = outdegree of vertex v
Graph Types
Undirected or directed edges, weighted edges
Name
Description
v_type
Vertex type to calculate similarity for
e_type
Directed edge type to traverse
edge_attribute
Name of the attribute on the edge type to use as the weight
topK
Number of top scores to report for each vertex
print_accum
If true, output JSON to standard output.
similarity_edge
If provided, the similarity score will be saved to this edge.
file_path
If not empty, write output to this file in CSV.
num_of_batches
Number of batches to divide the query into
Name | Description | Data type |
| A vertex. |
|
| A vertex. |
|
| Edge types to traverse. |
|
Characteristic | Value |
Result | The top k vertex pairs in the graph that have the highest similarity scores, along with their scores. The result is available in three forms:
|
Input Parameters |
|
Result Size |
|
Time Complexity | O(E^2 / V), V = number of vertices, E = number of edges |
Graph Types | Undirected or directed edges, unweighted edges |
Name | Description |
| Vertex type to calculate similarity for |
| Directed edge type to traverse |
| Reverse edge type to traverse |
| Number of top scores to report for each vertex |
| If |
| If provided, the similarity scores will be saved to this edge type. |
| If a file path is provided, the algorithm will output to a file specified by the file path in CSV format |
| Number of batches to divide the query into |
Preferential Attachment is a measure of the closeness of two vertices based on the number of their neighbors. The algorithm returns the product of the two vertices' neighbor counts.
For more information, see Preferential Attachment.
The product of the number of neighbors of the two vertices.
Suppose we have the following graph:
Since Dan has four neighbors and Jenny has three, the return value of the algorithm is 3 × 4 = 12:
Resource Allocation is used to compute the closeness of nodes based on their shared neighbors. It is computed by the following formula:
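RA(u, v) = Σ_{w ∈ N(u) ∩ N(v)} 1 / |N(w)|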
where N(u) is the set of nodes adjacent to u.
Suppose we have the following graph:
Since Dan and Jenny have one shared neighbor, Tom, who has two neighbors, running the algorithm between Dan and Jenny with friendship edges gives a result of 1/2 = 0.5.
Name
Description
Data type
a
A vertex.
VERTEX
b
A vertex.
VERTEX
e_type
Edge types to traverse.
SET<STRING>
Name
Description
Data type
a
A vertex.
VERTEX
b
A vertex.
VERTEX
e_type
Edge types to traverse.
SET<STRING>
This algorithm takes two vertices and returns 1 if the two vertices are in the same community, or 0 if they are not.
The algorithm assumes that community detection has already completed and that a community ID is stored in an integer attribute on each vertex.
Returns 1 if the two vertices are in the same community.
Returns 0 if the two vertices are not in the same community.
Suppose we have the following vertices:
Their community IDs were generated by running the Weakly Connected Component algorithm on the graph. If we run the algorithm between Kevin and Jenny, we get 1, because the two vertices are in the same community, as indicated by their community attribute:
ArticleRank is an algorithm that has been derived from the PageRank algorithm to measure the influence of journal articles.
PageRank assumes that relationships originating from low-degree nodes have a higher influence than relationships from high-degree nodes. ArticleRank modifies the formula in such a way that it retains the basic PageRank methodology but lowers the influence of low-degree nodes.
The ArticleRank of a node v at iteration i is defined as:
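AR_i(v) = (1 - d) + d · Σ_{u ∈ Nin(v)} AR_{i-1}(u) / (|Nout(u)| + Nout)

(This is the standard ArticleRank formulation; the symbols are explained below.)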
Within the formula:
Nin(v) are the incoming neighbors and Nout(v) are the outgoing neighbors of node v.
d is a damping factor in [0, 1], usually set to 0.85.
Nout (without an argument) is the average out-degree over all nodes.
The ArticleRank score for each vertex.
Suppose we have the following graph:
Running ArticleRank on the graph, we see that the vertex with the highest score is Dan:
Name
Description
Data type
a
A vertex.
VERTEX
b
A vertex.
VERTEX
communityAttribute
The community attribute used to store a vertex’s community ID.
STRING
Name | Description | Data type |
| A vertex type. |
|
| An edge type. |
|
| Article Rank will stop iterating when the largest difference between any vertex's current score and its previous score ≤ |
|
| Maximum number of iterations. |
|
| The damping factor. Usually set to 0.85. |
|
| The number of results with the highest scores to return. |
|
| If true, print JSON output. |
|
| If true, store the article rank score of each vertex in this attribute. |
|
| If true, output CSV to this file. |
|
Influence maximization is the problem of finding a small subset of vertices in a social network that could maximize the spread of influence.
There are two versions of the Influence Maximization algorithm. Both versions find k vertices that maximize the expected spread of influence in the network. The CELF version improves upon the efficiency of the greedy version and should be preferred for analyzing large networks.
The two versions of the algorithm are based on the following papers:
The CELF version and the greedy version of the algorithms share the same set of parameters.
The IDs of the vertices with the highest influence scores, along with their scores.
CELF:
Greedy:
Name | Description | Data type |
| A vertex type |
|
| An edge type |
|
| The name of the weight attribute on the edge type |
|
| The number of vertices with the highest influence score to return |
|
| If true, print results to JSON output. |
|
| If not empty, save results in CSV to this file. |
|
The Harmonic Centrality algorithm calculates the harmonic centrality of each vertex in the graph. Harmonic Centrality is a variant of Closeness Centrality. In a (not necessarily connected) graph, the harmonic centrality reverses the sum and reciprocal operations in the definition of closeness centrality:
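H(v) = Σ_{u ≠ v} 1 / d(u, v)

where d(u, v) is the shortest-path distance, and 1/d(u, v) is taken to be 0 when u cannot reach v.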
If your graph has many unconnected clusters, the harmonic centrality could be a better indicator of centrality than closeness centrality.
For more information, see Harmonic Centrality.
If we have the following graph, we can see that Ivy is the most central of the five vertices. Running the algorithm on the graph shows that Ivy has the highest centrality score:
Name
Description
Data type
v_type
Vertex types to calculate harmonic centrality for
SET<STRING>
e_type
Edge types to traverse
SET<STRING>
re_type
Reverse edge types. For undirected edges, fill in the edge type name in this parameter as well as in the e_type parameter.
SET<STRING>
max_hops
The maximum number of hops the algorithm would consider for each vertex. If set to a non-positive value, the limit is ignored.
INT
top_k
Sort the scores high to low and output the highest k scores
INT
wf
If true, use the Wasserman-Faust normalization for multi-component graphs
BOOL
print_accum
If true, output JSON to standard output
BOOL
result_attr
If not empty, store centrality values (FLOAT) to this attribute
STRING
file_path
If not empty, write output to this file in CSV.
STRING
display_edges
If true, include the graph's edges in the JSON output, so that the full graph can be displayed.
BOOL
In addition to the regular Strongly Connected Component algorithm, we also provide a version that is optimized for small-world graphs. A small-world graph in this context means a graph with a hub community, in which a vast majority of the vertices are weakly connected.
This version improves upon the performance of the original algorithm when dealing with small-world graphs by combining several different methods used to find connected components in a multi-step process proposed by Slota et al. in BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
The algorithm starts by trimming the graph, which removes all vertices whose indegree or outdegree is 0. In the second phase, the algorithm selects an initial pivot vertex v with a high product of indegree and outdegree. From the initial pivot vertex, the algorithm uses one iteration of the Forward-Backward method to identify all vertices reachable from v (descendants) and all vertices that can reach v (predecessors). The intersection of the descendants and the predecessors forms a strongly connected component (SCC). The vertices not included in this SCC are passed on to the next step.
After identifying the first SCC, the algorithm uses the coloring method and Tarjan's serial algorithm to identify the SCCs in the remaining vertices.
For more details, see Slota et al., BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
The algorithm has a time complexity of O(V + E), where V is the number of vertices and E is the number of edges in the graph.
When to_show_cc_count is set to true, the algorithm returns the number of strongly connected components in the graph.
Suppose we have the following graph. We can see there are 7 strongly connected components, two of which contain more than one vertex. The five vertices on the left each form a strongly connected component on their own.
Running the algorithm on the graph will return a result of 7:
Name
Description
Data type
v_type
The vertex type to count as part of a strongly connected component
STRING
e_type
The edge type to traverse
STRING
re_type
The reverse edge type to traverse. If the graph is undirected, fill in the name of the undirected edge here as well as for e_type.
STRING
threshold
The threshold used to choose initial pivot vertices. Only vertices whose product of indegree and outdegree exceeds this threshold will be considered candidates for the pivot vertex. This is an attempt to increase the chances that the initial pivot is contained within the largest SCC.
The default value for this parameter is 100000. It is suggested that you keep this default value when running the algorithm.
UINT
to_show_cc_count
If set to TRUE, the algorithm returns the number of vertices in each strongly connected component.
BOOL
The Louvain Method for community detection [1] partitions the vertices in a graph by approximately maximizing the graph's modularity score. The modularity score for a partitioned graph assesses the difference in density of links within a partition vs. the density of links crossing from one partition to another. The assumption is that if a partitioning is good (that is, dividing up the graph into communities or clusters), then the within-density should be high and the inter-density should be low.
The most efficient and empirically effective method for calculating modularity was published by a team of researchers at the University of Louvain. The Louvain method uses agglomeration and hierarchical optimization:
Optimize modularity for small local communities.
Treat each optimized local group as one unit, and repeat the modularity operation for groups of these condensed units.
The original Louvain Method contains two phases. The first phase incrementally calculates the modularity change of moving a vertex into every other community and moves the vertex to the community with the highest modularity change. The second phase coarsens the graph by aggregating the vertices assigned to the same community into one vertex. The first phase and second phase make up a pass. The Louvain Method performs the passes iteratively. In other words, the algorithm assigns an initial community label to every vertex, then performs the first phase, during which the community labels are changed until there is no modularity gain. Then it aggregates the vertices with the same labels into one vertex and calculates the aggregated edge weights between the new vertices. For the coarsened graph, the algorithm conducts the first phase again to move the vertices into new communities. The algorithm continues until the modularity stops increasing, or until it reaches the preset iteration limit.
However, phase one is sequential, and thus slow for large graphs. An improved Parallel Louvain Method (PLM) calculates the best community to move to for each vertex in parallel [2]. In PLM, a positive modularity gain is not guaranteed, and two vertices may swap into each other's communities. After the passes finish, there is an additional refinement phase, which runs the first phase again on each vertex to make small adjustments to the resulting communities [3].
[1] Blondel, Vincent D., et al. "Fast unfolding of communities in large networks." Journal of statistical mechanics: theory and experiment 2008.10 (2008): P10008.
[2] Staudt, Christian L., and Henning Meyerhenke. "Engineering parallel algorithms for community detection in massive networks." IEEE Transactions on Parallel and Distributed Systems 27.1 (2016): 171-184.
[3] Lu, Hao, Mahantesh Halappanavar, and Ananth Kalyanaraman. "Parallel heuristics for scalable community detection." Parallel Computing 47 (2015): 19-37.
If we use louvain_parallel on the social10 graph, it gives the same result as the connected components algorithm. The social26 graph is a densely connected graph: the connected components algorithm groups all the vertices into the same community, and label propagation does not consider the edge weight. In contrast, louvain_parallel detects 7 communities in total, and the cluster distribution is shown below (csize is cluster size):
Characteristic
Value
Result
Assigns a community id (INT) to each vertex, such that members of the same community have the same id value. The JSON output lists every vertex with its community id value. It also lists community id values, sorted by community size.
Input Parameters
SET<STRING> v_type
: Names of vertex types to use
SET<STRING> e_type
: Names of edge types to use
STRING wt_attr
: Name of edge weight attribute (must be FLOAT)
INT iter1
: Max number of iterations for the first phase. Default value is 10
INT iter2
: Max number of iterations for the second phase. Default value is 10
INT iter3
: Max number of iterations for the refinement phase. Default value is 10
INT split
: Number of splits in phase 1. Increase the number to save memory, at the expense of having a longer running time. Default value is 10.
BOOL print_accum
: If True, output JSON to standard output
STRING result_attr
: If not empty, store community id values (INT) to this attribute
STRING file_path
: If not empty, write output to this file.
BOOL comm_by_size
: If True, and if print_accum is True, output the membership of each community, with communities arranged by size.
Result Size
V = number of vertices
Time Complexity
O(V^2*L), V = number of vertices, L = (iter1 * iter2 + iter3) = total number of iterations
Graph Types
Undirected, weighted edges
An edge weight attribute is required.
The diameter of a graph is the worst-case length of a shortest path between any pair of vertices in a graph. It is the farthest distance to travel, to get from one vertex to another, if you always take the shortest path. Finding the diameter requires calculating (the lengths of) all shortest paths, which can be quite slow.
This algorithm uses a simple heuristic to estimate the diameter: rather than calculating the distance from each vertex to every other vertex, it selects K vertices at random, where K is a user-provided parameter, and calculates the distances from each of these K vertices to all other vertices. So, instead of calculating V*(V-1) distances, this algorithm calculates only K*(V-1) distances. The higher the value of K, the greater the likelihood of hitting the actual longest shortest path.
The current version only computes unweighted distances.
This algorithm query employs a subquery called max_BFS_depth. Both queries are needed to run the algorithm.
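A minimal Python sketch of this heuristic (illustrative only; the library's version is a GSQL query plus the max_BFS_depth subquery, and the function name and adjacency-map input here are assumptions):

```python
import random
from collections import deque

def estimate_diameter(adj, k, seed=None):
    """adj: {vertex: [neighbors]}. BFS from k random seeds; the largest
    eccentricity found is a lower bound on the true diameter."""
    rng = random.Random(seed)
    seeds = rng.sample(list(adj), k)
    best = 0
    for s in seeds:
        depth = {s: 0}
        queue = deque([s])
        while queue:                        # unweighted BFS from s
            v = queue.popleft()
            for w in adj[v]:
                if w not in depth:
                    depth[w] = depth[v] + 1
                    queue.append(w)
        best = max(best, max(depth.values()))
    return best
```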
Given an undirected graph with one or more connected components, a minimum spanning forest is a set of minimum spanning trees, one for each component. The library implements the algorithm in section 6.2 of Qin et al. 2014.
The overlap coefficient, or Szymkiewicz–Simpson coefficient, is a similarity measure that measures the overlap between two finite sets.
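For two finite sets A and B, it is defined as:

overlap(A, B) = |A ∩ B| / min(|A|, |B|)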
This algorithm is implemented as a user-defined function. You need to follow the steps in to add the function to GSQL. After adding the function, you can call it in any GSQL query in the same way as a built-in GSQL function.
Characteristic
Value
Result
Returns the estimated value for the diameter of the graph
Input Parameters
SET<STRING> v_type
: Names of vertex types to use
SET<STRING> e_type
: Names of edge types to use
INT seed_set_length
: The number (K) of random seed vertices to use
BOOL print_accum
: If True, output JSON to standard output
STRING file_path
: If not empty, write output to this file.
Result Size
one integer
Time Complexity
O(k*E), E = number of edges, k = number of seed vertices
Graph Types
Directed
Characteristic | Value |
Result | Computes a minimum spanning forest. If JSON or file output is selected, the output is the set of edges that form the MSF. If the result_attr option is selected, the edges that are part of the MSF are tagged True; other edges are tagged False. |
Input Parameters |
|
Result Size | V - c, V = number of vertices, c = number of components |
Time Complexity | O((V+E) * logV) |
Graph Types | Undirected edges |
Name | Description | Data type |
| An n-dimensional vector denoted by a |
|
| An n-dimensional vector denoted by a |
|
The Jaccard index measures the relative overlap between two sets. To compare two vertices by Jaccard similarity, first select a set of values for each vertex. For example, a set of values for a Person could be the cities the Person has lived in. Then the Jaccard index is computed for the two vectors.
The Jaccard index of two sets A and B is defined as follows:
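Jaccard(A, B) = |A ∩ B| / |A ∪ B|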
The value ranges from 0 to 1. If A and B are identical, then Jaccard(A, B) = 1. If both A and B are empty, we define the value to be 0.
In the current
The algorithm will not output more than K vertices, so the algorithm may arbitrarily choose to output one vertex over another if there are tied similarity scores.
Using the movie graph, we run jaccard_nbor_ss("Neil", 5):
If the source vertex (person) doesn't have any common neighbors (movies) with any other vertex (person), such as Elena in our example, the result will be an empty list:
This algorithm computes the same similarity scores as the single-source Cosine similarity algorithm (cosine_nbor_ss), except that it considers ALL pairs of vertices in the graph (for the vertex and edge types selected by the user). Naturally, this algorithm will take longer to run. For very large and very dense graphs, this may not be a practical choice.
Characteristic
Value
Result
The top k vertices in the graph that have the highest similarity scores, along with their scores.
The result is available in three forms:
streamed out in JSON format
written to a file in tabular format, or
stored as a vertex attribute value.
Input Parameters
SET<STRING> v_type
: Vertex type to calculate similarity score for
SET<STRING> e_type
: Edge type to traverse
SET<STRING> re_type
: Reverse edge type to traverse
INT top_k
: the number of vertex pairs with the highest similarity scores to return
BOOL print_accum
: Boolean value that decides whether to output to console
STRING similarity_edge
: If provided, the similarity score will be saved to this edge
STRING filepath:
If provided, the algorithm will output to the file path in CSV format
Result Size
top_k
Time Complexity
O(D^2), D = outdegree of vertex v
Graph Types
Undirected or directed edges, unweighted edges
Characteristic | Value |
Result | The top k vertex pairs in the graph which have the highest similarity scores, along with their scores. The result is available in three forms:
|
Input Parameters |
|
Result Size |
|
Time Complexity | O(E), E = number of edges |
Graph Types | Undirected or directed edges, weighted edges |
The Total Neighbors algorithm calculates the total number of neighbors of two vertices.
The total number of neighbors of two vertices.
Suppose we have the following graph.
Dan and Jenny together have 6 neighbors in total, so running the algorithm between Dan and Jenny gives a result of 6. Note that since Jenny and Dan are neighbors themselves, the union of their neighbors includes both Jenny and Dan:
The Adamic/Adar index is a measure introduced in 2003 by Lada Adamic and Eytan Adar to predict links in a social network based on the number of shared links between two nodes. It is defined as the sum of the inverse logarithmic degree centrality of the neighbors shared by the two nodes.
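A(u, v) = Σ_{w ∈ N(u) ∩ N(v)} 1 / log |N(w)|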
where N(u) is the set of nodes adjacent to u.
The Adamic/Adar index between two vertices. If the two vertices have no common neighbors, the algorithm returns a division-by-0 error.
Suppose we have the graph below:
Running the algorithm between Jenny and Dan gives a result of 1/log(2) = 3.321931.
Name | Description | Data type |
| A vertex. |
|
| A vertex. |
|
| Edge types to traverse. |
|
Name | Description | Data type |
| A vertex. |
|
| A vertex. |
|
| Edge types to traverse. |
|