GSQL Graph Algorithm Library

Graph algorithms are functions for measuring characteristics of graphs, vertices, or relationships. Graph algorithms can provide insights into the role or relevance of individual entities in a graph. For example: How centrally located is this vertex? How much influence does this vertex exert over the others?

Some graph algorithms measure or identify global characteristics: What are the natural community groupings in the graph? What is the density of connections?

Mar 4, 2021 Updates:

  • New algorithms, over the past few months:

    • Approximate Closeness Centrality

    • Greedy Graph Coloring

    • Jaccard Similarity, All-Pairs in Batch mode

    • Cosine Similarity, All-Pairs in Batch mode

Sept 27, 2020 Updates:

Using the GSQL Graph Algorithm Library

The GSQL Graph Algorithm Library is a collection of expertly written GSQL queries, each of which implements a standard graph algorithm. Each algorithm is ready to be installed and used, either as a stand-alone query or as a building block of a larger analytics application.

GSQL running on the TigerGraph platform is particularly well-suited for graph algorithms for several reasons:

  • Turing-complete with full support for imperative and procedural programming, ideal for algorithmic computation.

  • Parallel and Distributed Processing, enabling computations on larger graphs.

  • User-Extensible. Because the algorithms are written in standard GSQL and compiled by the user, they are easy to modify and customize.

  • Open-Source. Users can study the GSQL implementations to learn by example, and they can develop and submit additions to the library.

Library Structure

You can download the library from Github: https://github.com/tigergraph/gsql-graph-algorithm

The library contains two main sections: algorithms and tests. Within the algorithms folder are four subfolders:

  • schema-free: This contains algorithms that are ready to use (e.g. INSTALL QUERY) as-is.

  • templates: This contains algorithms that need to be prepared with the install.sh script to target them for a specific graph schema.

  • generated: This contains the algorithms converted from template to graph-specific format by the install.sh script.

  • examples: This contains examples of generated algorithms

The tests folder contains small sample graphs that you can use to experiment with the algorithms. In this document, we use the test graphs to show you the expected result for each algorithm. The graphs are small enough that you can manually calculate and sometimes intuitively see what the answers should be.

Release Branches

Starting with TigerGraph product version 2.6, the GSQL Graph Algorithm Library has release branches:

  • Product version branches (2.6, 3.0, etc.) are snapshots created shortly after a product version is released. They contain the best version of the graph algorithm library at the time of that product version's initial release. They will not be updated, except to fix bugs.

  • Master branch: the newest released version. This should be at least as new as the newest. It may contain new or improved algorithms.

  • Other branches are development branches.

It is possible to run newer algorithms on an older product version, as long as the algorithm does not rely on features available only in newer product versions.

Schema-Free Algorithms

Most GSQL graph algorithms are schema-free, which means they are ready to use with any graph, regardless of the graph's data model or schema. Schema-free algorithms have run-time input parameters for the vertex type(s), edge type(s), and attributes which the user wishes to use.

To use a schema-free algorithm, the algorithm (GSQL query) must first be installed. If your database is on a distributed cluster, you should use the DISTRIBUTED option when installing the query to install it in Distributed Query Mode.

Installing Template Algorithms

Remember that GSQL graph algorithms are simply GSQL queries. A few algorithms make use of GSQL features that do not yet accept run-time parameters. Instead, these algorithms are in template format. A script is needed to personalize these algorithms before they are installed.

Make sure that install.sh is owned by the tigergraph user.

  1. Within the algorithms folder is a script install.sh. When you run the script, it will first ask you which graph schema you wish to work on. (The TigerGraph platform supports multiple concurrent graphs.)

  2. It then asks you to choose from a menu of available algorithms.

  3. After knowing your graph schema and your algorithm, the installer will ask you some questions for that particular algorithm:

    • the installer will guide you in selecting appropriate vertex types and edge types. Note this does not have to be all the vertex or edge types in your graph. For example, you may have a social graph with three categories of persons and five types of relationships. You might decide to compute PageRank using Member and Guest vertices and Recommended edges.

    • Some algorithms use edge weights as input information (such as Shortest Path where each edge has a weight meaning the "length" of that edge. The installer will ask for the name of that edge attribute.

  4. Single Node Mode or Distributed Mode? Queries that analyze the entire graph (such as PageRank and Community Detection) will run better in Distributed Mode if you have a cluster of machines.

  5. It will then ask you what type of output you would like. It will proceed to create up to three versions of your algorithm, based on the three ways of receiving the algorithm's output:

    • Stream the output in JSON format, the default behavior for most GSQL queries.

    • Save the output value(s) in CSV format to a file. For some algorithms, this option will add an input parameter to the query, to let the user specify how many total values to output.

    • Store the results as vertex or edge attribute values. The attributes must already exist in the graph schema, and the installer will ask you which attributes to use.

  6. After creating queries for one algorithm, the installer will loop back to let you choose another algorithm (returning to step 2 above).

  7. If you choose to exit, the installer makes a last request: Do you want to install your queries? Installation is when the code is compiled and bound into the query engine. It takes a few minutes, so it is best to create all your personalized queries at once and then install them as a group.

Example:

$ bash install.sh
*** GSQL Graph Algorithm Installer ***
Available graphs:
- Graph social(Person:v, Friend:e, Also_Friend:e, Coworker:e)
Graph name? social
Please enter the number of the algorithm to install:
1) EXIT
2) Weighted PageRank
3) Personalized PageRank
4) Triangle Counting(minimal memory)
5) Triangle Counting(fast, more memory)
6) Cosine Neighbor Similarity (single vertex)
7) Cosine Neighbor Similarity (all vertices)
8) Jaccard Neighbor Similarity (single vertex)
9) Jaccard Neighbor Similarity (all vertices)
#? 2
Weighted pageRank() works on directed edges
Available vertex and edge types:
- VERTEX Person(PRIMARY_ID id STRING, name STRING, score FLOAT, tag STRING) WITH STATS="OUTDEGREE_BY_EDGETYPE"
- DIRECTED EDGE Friend(FROM Person, TO Person, weight FLOAT, tag STRING) WITH REVERSE_EDGE="Also_Friend"
- DIRECTED EDGE Also_Friend(FROM Person, TO Person, weight FLOAT, tag STRING) WITH REVERSE_EDGE="Friend"
- UNDIRECTED EDGE Coworker(FROM Person, TO Person, weight FLOAT, tag STRING)
Please enter the vertex type(s) and edge type(s) for running PageRank.
Use commas to separate multiple types [ex: type1, type2]
Leaving this blank will select all available types
Similarity algorithms only take single vertex type
Vertex types: Person
Edge types: Friend
The query pageRank is dropped.
The query pageRank_file is dropped.
The query pageRank_attr is dropped.
Please choose query mode:
1) Single Node Mode
2) Distributed Mode
#? 1
Please choose a way to show result:
1) Show JSON result 3) Save to Attribute/Insert Edge
2) Write to File 4) All of the above
#? 4
gsql -g social ./templates/pageRank.gsql
The query pageRank has been added!
gsql -g social ./templates/pageRank_file.gsql
The query pageRank_file has been added!
If your graph schema has appropriate vertex or edge attributes,
you can update the graph with your results.
Do you want to update the graph [yn]? y
Vertex attribute to store FLOAT result (e.g. pageRank): score
gsql -g social ./templates/pageRank_attr.gsql
The query pageRank_attr has been added!
Created the following algorithms:
- pageRank(float maxChange, int maxIter, float damping, bool display, int outputLimit)
- pageRank_attr(float maxChange, int maxIter, float damping, bool display)
- pageRank_file(float maxChange, int maxIter, float damping, bool display, file f)
Please enter the number of the algorithm to install:
1) EXIT
2) Closeness Centrality
3) Connected Components
4) Label Propagation
5) Community detection: Louvain
6) PageRank
7) Shortest Path, Single-Source, Any Weight
8) Triangle Counting(minimal memory)
9) Triangle Counting(fast, more memory)
#? 1
Exiting
Algorithm files have been created. Do want to install them now [yn]? y
Start installing queries, about 1 minute ...
c
pageRank query: curl -X GET 'http://127.0.0.1:9000/query/social/pageRank?maxChange=VALUE&maxIter=VALUE&damping=VALUE&display=VALUE&outputLimit=VALUE'. Add -H "Authorization: Bearer TOKEN" if authentication is enabled.
pageRank_file query: curl -X GET 'http://127.0.0.1:9000/query/social/pageRank_file?maxChange=VALUE&maxIter=VALUE&damping=VALUE&display=VALUE&f=VALUE'. Add -H "Authorization: Bearer TOKEN" if authentication is enabled.
pageRank_attr query: curl -X GET 'http://127.0.0.1:9000/query/social/pageRank_attr?maxChange=VALUE&maxIter=VALUE&damping=VALUE&display=VALUE'. Add -H "Authorization: Bearer TOKEN" if authentication is enabled.
[======================================================================================================] 100% (3/3)
$

After the algorithms are installed, you will see them listed among the rest of your GSQL queries.

GSQL > ls
...
Queries:
- cc_subquery(vertex v, int numVert, int maxHops) (installed v2)
- closeness_cent(bool display, int outputLimit) (installed v2)
- closeness_cent_attr(bool display) (installed v2)
- closeness_cent_file(bool display, file f) (installed v2)
- conn_comp() (installed v2)
- conn_comp_attr() (installed v2)
- conn_comp_file(file f) (installed v2)
- label_prop(int maxIter) (installed v2)
- label_prop_attr(int maxIter) (installed v2)
- label_prop_file(int maxIter, file f) (installed v2)
- louvain() (installed v2)
- louvain_attr() (installed v2)
- louvain_file(file f) (installed v2)
- pageRank(float maxChange, int maxIter, float damping, bool display, int outputLimit) (installed v2)
- pageRank_attr(float maxChange, int maxIter, float damping, bool display) (installed v2)
- pageRank_file(float maxChange, int maxIter, float damping, bool display, file f) (installed v2)
- tri_count() (installed v2)
- tri_count_fast() (installed v2)

Running an Algorithm

Running an algorithm is the same as running a GSQL query. For example, if you selected the JSON option for pageRank, you could run it from GSQL as below:

GSQL > RUN QUERY pageRank("Page","Links_to",_,30,_,50,_,_,_)

Installing a query also creates a REST endpoint. The same query could be run thus:

curl -X GET 'http://127.0.0.1:9000/query/alg_graph/pageRank?v_type=Page&e_type=Links_to&max_iter=30&top_k=50'

GSQL lets you run queries from within other queries. This means you can use a library algorithm as a building block for more complex analytics.

Library Overview

The following algorithms are currently available. The algorithms are grouped into five classes:

  • Path

  • Centrality

  • Community

  • Similarity

  • Classification

Some algorithms are only appropriate for certain types of graphs. For example, Strong Connected Components (SCC) is designed for graphs with directed edges.

  • Coming soon means that TigerGraph plans to release this variant of the algorithm soon.

  • n/a means that this variant of the algorithm is typically not used

Algorithm

Class

Undirected

Edges

Directed

Edges

Weighted

Edges

Single-Source Shortest Path

Path

Yes

Yes

Yes

All Pairs Shortest Path

Path

Yes

Yes

Yes

Minimum Spanning Tree

Path

Yes

n/a

Yes

Minimum Spanning Forest

Path

Yes

n/a

Yes

Maximal Independent Set

Path

Yes

Coming Soon

n/a

Cycle Detection

Path

no

Yes

n/a

Estimated Diameter

Path

Yes

n/a

n/a

PageRank

Centrality

n/a

Yes

n/a

Weighted PageRank

Centrality

n/a

Yes

Yes

Personalized PageRank

Centrality

n/a

Yes

Coming soon

Closeness Centrality

Centrality

Yes

n/a

Coming soon

Approximate Closeness Centrality (NEW)

Centrality

Yes

n/a

Coming soon

Betweenness Centrality

Centrality

Yes

n/a

Coming soon

Connected Components

Community

Yes

n/a

n/a

Strongly Connected Components

Community

n/a

Yes

n/a

K-Core

Community

Yes

n/a

n/a

Label Propagation

Community

Yes

n/a

n/a

Louvain Modularity

Community

Yes

n/a

n/a

Triangle Counting

Community

Yes

n/a

n/a

Cosine Similarity of Neighborhoods (single-source, all-pairs and batch (NEW))

Similarity

Yes

Yes

Yes

Jaccard Similarity of Neighborhoods (single-source, all-pairs and batch (NEW))

Similarity

Yes

Yes

No

Greedy Graph Coloring (NEW)

Classification

Yes

Yes

Yes

K-Nearest Neighbors (with cosine similarity for "nearness")

Classification

Yes

Yes

Yes

Computational Complexity

Computational Complexity is a formal mathematical term, referring to how an algorithm's requirements scale according to the size of the data or other key parameters. Computational complexity is useful for comparing one algorithm to another, but it does not describe speed in absolute terms.

For graphs, there are two key data parameters:

  • V (or sometimes n), the number of vertices

  • E (or sometimes m), the number of edges

The notation O(V^2) (read "big O V squared") means that when V is large, the computational time is proportional to V^2.

Time complexity describes how the execution time is expected to vary with the data size and other key parameters. Normally, time complexity is based on simplified and idealized computer architecture: memory accesses and arithmetic operations always take one unit of time.

Memory complexity describes how the run-time memory usage scales with the data size and other key parameters.

Standard Parameters

The GSQL Algorithm library has consistent parameter names and order. Input parameters come first, parameters for the body of the algorithm come next, and output configuration parameters come last.

In GSQL, to accept a default parameter value, use _ E.g., closeness_cent(["Person", "Organization"], ["Likes"], _, _, _, _, _, _, _)

Parameters for the Graph Schema

Schema-free algorithms need to know the name of the vertex types, edge types, and the edge weight attribute for weighted edge algorithms.

Parameter Type and Name

Description

SET<STRING> v_type

The name(s) of the vertex types to include.

SET<STRING> e_type

The name(s) of the edge types to include.

STRING wt_edge

The name of the edge weight attribute to use.

STRING wt_type

The data type of the edge weight. Must be INT, FLOAT, or DOUBLE.

Set notation for GSQL parameters

Use square brackets to enclose a set-type parameter, even if there is just a single item in the set, e.g. closeness_cent(["Person", "Organization"], ["Likes"], _, _, _, _, _, _, _)

Parameters for Output Options

There are usually three options for output:

  • Send JSON output to standard output.

  • Write results to an output file in CSV format.

  • Store the output values in a user-specified attribute of a vertex or edge type.

Beginning with v3.0, each of the options is selected independently by setting appropriate parameters. More than one option may be selected:

Parameter type and name

Default

Description

BOOL print_accum

true

If true, the output will be in JSON format

INT output_limit

-1

If output_limit >= 0, limit the number of vertices in the JSON output to this value.

If output_limit < 0, then do not limit JSON output.

STRING result_attr

Empty

string

The name of an attribute. If not the empty string, take the algorithm's output values and store them in the given attribute.

STRING file_path

Empty

string

The path to the output file. If not the empty string, write output to this file.

BOOL display_edges

false

If true, and if print_accum is true, include relevant edges in the JSON output, so that the graph can be displayed.

Path Algorithms

These algorithms help find the shortest path or evaluate the availability and quality of routes.

Single-Source Shortest Path, Unweighted

This algorithm finds an unweighted shortest path from one source vertex to each possible destination vertex in the graph. That is, it finds n paths.

If your graph has weighted edges, see the next algorithm. With weighted edges, it is necessary to search the whole graph, whether you want the path for just one destination or for all destinations.

Description and Uses

If a graph has unweighted edges, then finding the shortest path from one vertex to another is the same as finding the path with the fewest hops. Think of Six Degrees of Separation and Friend of a Friend. Unweighted Shortest Path answers the question "How are you two related?" The two entities do not have to be persons. Shortest Path is useful in a host of applications, from estimating influences or knowledge transfer, to criminal investigation.

When the graph is unweighted, we can use a "greedy" approach to find the shortest path. In computer science, a greedy algorithm makes intermediate choices based on the data being considered at the moment, and then does not revisit those choices later on. In this case, once the algorithm finds any path to a vertex T, it is certain that that is a shortest path.

Specifications

CREATE QUERY shortest_ss_no_wt (VERTEX source, SET<STRING> v_type,
SET<STRING> e_type, INT output_limit = -1, BOOL print_accum =TRUE,
STRING result_attr ="", STRING file_path ="", BOOL display_edges =FALSE)

Characteristic

Value

Result

Computes a shortest distance (INT) and shortest path (STRING) from vertex source to each other vertex.

Input Parameters

  • VERTEX source: ID of the source vertex

  • SET<STRING> v_type: Names of vertex types to use

  • SET<STRING> e_type: Names of edge types to use

  • INT output_limit: If >=0, max number of vertices to output to JSON.

  • BOOL print_accum: If True, output JSON to standard output

  • STRING result_attr: If not empty, store distance values (INT) to this attribute

  • STRING file_path: If not empty, write output to this file.

  • BOOL display_edges: If true, include the graph's edges in the JSON output, so that the full graph can be displayed.

Result Size

V = number of vertices

Time Complexity

O(E), E = number of edges

Graph Types

Directed or Undirected edges, Unweighted edges

Example

In the below graph, we do not consider the weight on edges. Using vertex A as the source vertex, the algorithm discovers that the shortest path from A to B is A-B, and the shortest path from A to C is A-D-C, etc.

Generic graph with unweighted edges
[
{
"ResultSet": [
{
"v_id": "B",
"v_type": "Node",
"attributes": {
"A",
"B"
]
}
},
{
"v_id": "A",
"v_type": "Node",
"attributes": {
"A"
]
}
},
{
"v_id": "C",
"v_type": "Node",
"attributes": {
"A",
"D",
"C"
]
}
},
{
"v_id": "E",
"v_type": "Node",
"attributes": {
"A",
"D",
"E"
]
}
},
{
"v_id": "D",
"v_type": "Node",
"attributes": {
"A",
"D"
]
}
}
]
}
]

Single-Source Shortest Path, Weighted

Description and Uses

Finding shortest paths in a graph with weighted edges is algorithmically harder than in an unweighted graph because even after you find a path to a vertex T, you cannot be certain that it is a shortest path. If edge weights are always positive, then you must keep trying until you have considered every in-edge to T. If edge weights can be negative, then it's even harder. You must consider all possible paths.

A classic application for weighted shortest path is finding the shortest travel route to get from A to B. (Think of route planning "GPS" apps.) In general, any application where you are looking for the cheapest route is a possible fit.

Specifications

The shortest path algorithm can be optimized if we know all the weights are nonnegative. If there can be negative weights, then sometimes a longer path will have a lower cumulative weight. Therefore, we have two versions of this algorithm

shortest_ss_pos_wt (VERTEX source, SET<STRING> v_type, SET<STRING> e_type,
STRING wt_attr, STRING wt_type, INT output_limit = -1, BOOL print_accum = TRUE,
STRING result_attr = "", STRING file_path = "", BOOL display_edges = FALSE)
shortest_ss_any_wt (VERTEX source, SET<STRING> v_type, SET<STRING> e_type,
STRING wt_attr, STRING wt_type, INT output_limit = -1, BOOL print_accum = TRUE,
STRING result_attr = "", STRING file_path = "", BOOL display_edges = FALSE)

Characteristic

Value

Result

Computes a shortest distance (INT) and shortest path (STRING) from vertex source to each other vertex.

Input Parameters

  • VERTEX source: Id of the source vertex

  • SET<STRING> v_type: Names of vertex types to use

  • SET<STRING> e_type: Names of edge types to use

  • STRING wt_attr: Name of edge weight attribute

  • STRING wt_type: Data type of edge weight attribute: "INT", "FLOAT", or "DOUBLE"

  • INT output_limit: If >=0, max number of vertices to output to JSON.

  • BOOL print_accum: If True, output JSON to standard output

  • STRING result_attr: If not empty, store distance values (INT) to this attribute

  • STRING file_path: If not empty, write output to this file.

  • BOOL display_edges: If true, include the graph's edges in the JSON output, so that the full graph can be displayed.

Result Size

V = number of vertices

Time Complexity

O(V*E), V = number of vertices, E = number of edges

Graph Types

Directed or Undirected edges, Weighted edges

The shortest_path_any_wt query is an implementation of the Bellman-Ford algorithm. If there is more than one path with the same total weight, the algorithm returns one of them.

Currently, shortest_path_pos_wt also uses Bellman-Ford. The well-known Dijsktra's algorithm is designed for serial computation and cannot work with GSQL's parallel processing.

Example

The graph below has only positive edge weights. Using vertex A as the source vertex, the algorithm discovers that the shortest weighted path from A to B is A-D-B, with distance 8. The shortest weighted path from A to C is A-D-B-C with distance 9.

Generic graph with only positive weights
[
{
"ResultSet": [
{
"v_id": "B",
"v_type": "Node",
"attributes": {
"D",
"B"
]
}
},
{
"v_id": "A",
"v_type": "Node",
"attributes": {
}
},
{
"v_id": "C",
"v_type": "Node",
"attributes": {
"D",
"B",
"C"
]
}
},
{
"v_id": "E",
"v_type": "Node",
"attributes": {
"D",
"E"
]
}
},
{
"v_id": "D",
"v_type": "Node",
"attributes": {
"D"
]
}
}
]
}
]

The graph below has both positive and negative edge weights. Using vertex A as the source vertex, the algorithm discovers that the shortest weighted path from A to E is A-D-C-B-E, with a cumulative score of 7 - 3 - 2 - 4 = -2.

Example results on a graph with negative weights on edges

Single-Pair Shortest Path

The Single-Pair Shortest Path task seeks the shortest path between a source vertex S and a target vertex T. If the edges are unweighted, then use the query in our tutorial document GSQL Demo Examples.

If the edges are weighted, then use the Single-Source Shortest Path algorithm. In the worst case, it takes the same computational effort to find the shortest path for one pair as to find the shortest paths for all pairs from the same source S. The reason is that you cannot know whether you have found the shortest (least weight) path until you have explored the full graph. If the weights are always positive, however, then a more efficient algorithm is possible. You can stop searching when you have found paths that use each of the in-edges to T.

All-Pairs Shortest Path

The All-Pairs Shortest Path algorithm is costly for large graphs because the computation time is O(V^3) and the output size is O(V^2). Be cautious about running this on very large graphs.

The All-Pairs Shortest Path (APSP) task seeks to find the shortest paths between every pair of vertices in the entire graph. In principle, this task can be handled by running the Single-Source Shortest Path (SSSP) algorithm for each input vertex, e.g.,

CREATE QUERY all_pairs_shortest(SET<STRING> v_type, SET<STRING> e_type,
STRING wt_attr, STRING wt_type, STRING result_attr = "", STRING file_path = "")
{
Start = {v_type};
Result = SELECT s FROM Start:s
POST-ACCUM
shortest_ss_any_wt(s, v_type, e_type, wt_attr, wt_type,
result_attr, file_path+s);
}

This example highlights one of the strengths of GSQL: treating queries as stored procedures that can be called from within other queries. We only show the result_attr and file_path options, because subqueries cannot send their JSON output.

For large graphs (with millions of vertices or more), however, this is an enormous task. While the massively parallel processing of the TigerGraph platform can speed up the computation by 10x or 100x, consider what it takes just to store or report the results. If there are 1 million vertices, then there are nearly 1 trillion output values.

There are more efficient methods than calling the single-source shortest path algorithm n times, such as the Floyd-Warshall algorithm, which computes APSP in O(V^3) time.

Our recommendation:

  • If you have a smaller graph (perhaps thousands or tens of thousands of vertices), the APSP task may be tractable.

  • If you have a large graph, avoid using APSP.

Minimum Spanning Tree (MST)

Description and Uses

Given an undirected and connected graph, a minimum spanning tree is a set of edges that can connect all the vertices in the graph with the minimal sum of edge weights. The library implements a parallel version of Prim's algorithm:

  1. Start with a set A = { a single vertex seed }

  2. For all vertices in A, select a vertex y such that

    1. y is not in A, and

    2. There is an edge from y to a vertex x in A, and

    3. The weight of the edge e(x,y) is the smallest among all eligible pairs (x,y).

  3. Add y to A, and add the edge (x,y) to MST.

  4. Repeat steps 2 and 3 until A has all vertices in the graph.

If the user specifies a source vertex, this will be used as the seed. Otherwise, the algorithm will select a random seed vertex.

If the graph contains multiple components (i.e., some vertices are disconnected from the rest of the graph, then the algorithm will span only the component of the seed vertex.

If you do not have a preferred vertex, and the graph might have more than one component, then you should use the Minimum Spanning Forest (MDF) algorithm instead.

Specifications

mst (VERTEX opt_source, SET<STRING> v_type, SET<STRING> e_type,
STRIN wt_attr, STRING wt_type, INT max_iter = -1,
BOOL print_accum = TRUE, STRING result_attr = "", STRING file_path = "")

Characteristic

Value

Result

Computes a minimum spanning tree. If the JSON or file output selected, the output is the set of edges that form the MST. If the result_attr option is selected, the edges which are part of the MST are tagged True; other edges are tagged False.

Input Parameters

  • VERTEX opt_source: ID of a source vertex (optional)

  • SET<STRING> v_type: Names of vertex types to use

  • SET<STRING> e_type: Names of edge types to use

  • STRING wt_attr: Name of edge weight attribute

  • STRING wt_type: Data type of edge weight attribute: "INT", "FLOAT", or "DOUBLE"

  • INT max_iter: Maximum of edges to include. If less than (V-1), then the result is not a true spanning tree.

  • BOOL print_accum: If True, output JSON to standard output

  • STRING result_attr: If not empty, store result values (BOOL) to this attribute

  • STRING file_path: If not empty, write output to this file.

Result Size

V - 1 = number of vertices - 1

Time Complexity

O(V^2)

Graph Types

Undirected edges and connected

Example

In the graph social10, we consider only the undirected Coworker edges.

Visualized results of example graph social10 graph with Coworker edges

This graph has 3 components. Minimum Spanning Tree finds a tree for one component, so which component it will work on depends on what vertex we give as the starting point. If we select Fiona, George, Howard, or Ivy as the start vertex, then it works on the 4-vertex component on the left. You can start from any vertex in the component and get the same or an equivalent MST result.

The figure below shows the result of

# Use _ for default values
RUN QUERY mst(("Ivy", "Person"), ["Person"], ["Coworker"] "weight", "INT",
_, _, _, _)

Note that the value for the one vertex is ("Ivy", "Person"). In GSQL, this 2-tuple format which explicitly gives the vertex type is used when the query is written to accept a vertex of any type.

Visualized results of example query on social10 graph

File output:

From,To,Weight
Ivy,Fiona,6
Ivy,Howard,4
Ivy,George,4

The attribute version requires a boolean attribute on the edge, and it will assign the attribute to "true" if that edge is selected in the MST:

Visualized results of example query on social10 graph, with Coworker edges & edge attribute "flag"

Minimum Spanning Forest (MSF)

Description and Uses

Given an undirected graph with one or more connected components, a minimum spanning forest is a set of minimum spanning trees, one for each component. The library implements the algorithm in section 6.2 of Qin et al. 2014: http://www-std1.se.cuhk.edu.hk/~hcheng/paper/SIGMOD2014qin.pdf.

Specifications

msf (SET<STRING> v_type, SET<STRING> e_type, STRING wt_attr, STRING wt_type,
BOOL print_accum = TRUE, STRING result_attr = "", STRING file_path = "")

Characteristic

Value

Result

Computes a minimum spanning forest. If the JSON or file output selected, the output is the set of edges that form the MSF. If the result_attr option is selected, the edges which are part of the MSF are tagged True; other edges are tagged False.

Input Parameters

  • SET<STRING> v_type: Names of vertex types to use

  • SET<STRING> e_type: Names of edge types to use

  • STRING wt_attr: Name of edge weight attribute

  • STRING wt_type: Data type of edge weight attribute: "INT", "FLOAT", or "DOUBLE"

  • BOOL print_accum: If True, output JSON to standard output

  • STRING result_attr: If not empty, store result values (BOOL) to this attribute

  • STRING file_path: If not empty, write output to this file.

Result Size

V - c,

V = number of vertices, c = number of components

Time Complexity

O((V+E) * logV)

Graph Types

Undirected edges

Example

Refer to the example for the MST algorithm. This graph has 3 components. MSF will find an MST for each of the three components.

Maximal Independent Set

Description and Uses

An independent set of vertices does not contain any pair of vertices that are neighbors, i.e., ones which have an edge between them. A maximal independent set is the largest independent set that contains those vertices; you cannot improve upon it unless you start over with a different independent set. However, the search for the largest possible independent set (the maximum independent set, as opposed to the maximal independent set) is an NP-hard problem: there is no known algorithm that can find that answer in polynomial time. So we settle for the maximal independent set.

This algorithm finds use in applications wanting to find the most efficient configuration which "covers" all the necessary cases. For example, it has been used to optimize delivery or transit routes, where each vertex is one transit segment and each edge connects two segments that can NOT be covered by the same vehicle.

Specifications

maximal_indep_set(STRING v_type, STRING e_type,
INT max_iter = 100, BOOL print_accum = TRUE, STRING file_path = "")

Characteristic

Value

Result

A set of vertices that form a maximal independent set.

Input Parameters

  • STRING v_type: Name of vertex type to use

  • STRING e_type: Name of edge type to use

  • INT max_iter: maximum number of iterations for the search

  • BOOL print_accum: If True, output JSON to standard output

  • STRING file_path: If not empty, write output to this file.

Result Size

Size of the MIS: unknown. Worst case: If the graph is a set of N unconnected vertices, then the MIS is all N vertices.

Time Complexity

O(E), E = number of edges

Graph Types

Undirected edges

Example

Consider our social10 graph, with three components.

It is clear that for each of the two triangles -- (Alex, Bob, Justin) and (Chase, Damon, Eddie) -- we can select one vertex from each triangle to be part of the MIS. For the 4-vertex component (Fiona, George, Howard, Ivy), it is less clear what will happen. If the algorithm selects either George or Ivy, then no other independent vertices remain in the component. However, the algorithm could select both Fiona and Howard; they are independent of one another.

This demonstrates the uncertainty of the Maximal Independent Set algorithm and how it differs from Maximum Independent Set. A maximum independent set algorithm would always select Fiona and Howard, plus 2 others, for a total of 4 vertices. The maximal independent set algorithm relies on chance. It could return either 3 or 4 vertices.

Cycle Detection

Description and Uses

The Cycle Detection problem seeks to find all the cycles (loops) in a graph. We apply the usual restriction that the cycles must be "simple cycles", that is, they are paths that start and end at the same vertex but otherwise never visit any vertex twice.

There are two versions of the task: for directed graphs and undirected graphs. The GSQL algorithm library currently supports only directed cycle detection. The Rocha–Thatte algorithm is an efficient distributed algorithm, which detects all the cycles in a directed graph. The algorithm will self-terminate, but it is also possible to stop at k iterations, which finds all the cycles having lengths up to k edges.

The basic idea of the algorithm is to (potentially) traverse every edge in parallel, again and again, forming all possible paths. At each step, if a path forms a cycle, it records it and stops extending it. More specifically:

Initialization: For each vertex, record one path consisting of its own id. Mark the vertex as Active.

Iteration steps: For each Active vertex v:

  1. Send its list of paths to each of its out-neighbors.

  2. Inspect each path P in the list of the paths received:

    • If the first id in P is also id(v), a cycle has been found:

      • Remove P from its list.

      • If id(v) is the least id of any id in P, then add P to the Cycle List. (The purpose is to count each cycle only once.)

    • Else, if id(v) is somewhere else in the path, then remove P from the path list (because this cycle must have been counted already).

    • Else, append id(v) to the end of each of the remaining paths in its list.

Specifications

cycle_detection (SET<STRING> v_type, SET<STRING> e_type, INT depth,
BOOL print_accum = TRUE, STRING file_path = "")

Characteristic

Value

Result

Computes a list of vertex id lists, each of which is a cycle. The result is available in 2 forms:

  • streamed out in JSON format

  • written to a file in tabular format

Input Parameters

  • SET<STRING> v_type: Names of vertex types to use

  • SET<STRING> e_type: Names of edge types to use

  • INT depth: Maximum cycle length to search for = maximum number of iterations

  • BOOL print_accum: If True, output JSON to standard output

  • STRING file_path: If not empty, write output to this file.

Result Size

Number of cycles * average cycle length

Both of these measures are not known in advance.

Time Complexity

O(E *k), E = number of edges.

k = min(max. cycle length, depth parameter)

Graph Types

Directed

Example

In the social10 graph, there are 5 cycles, all with the Fiona-George-Howard-Ivy cluster.

Visualized results of cycle_detection("Person", "Friend", 10) on social10 graph
[
{
"@@cycles": [
[
"Fiona",
"Ivy"
],
[
"George",
"Ivy"
],
[
"Fiona",
"George",
"Ivy"
],
[
"George",
"Howard",
"Ivy"
],
[
"Fiona",
"George",
"Howard",
"Ivy"
]
]
}
]

Estimated Diameter

Description and Uses

The diameter of a graph is the worst-case length of a shortest path between any pair of vertices in a graph. It is the farthest distance to travel, to get from one vertex to another, if you always take the shortest path. Finding the diameter requires calculating (the lengths of) all shortest paths, which can be quite slow.

This algorithm uses a simple heuristic to estimate the diameter. rather than calculating the distance from each vertex to every other vertex, it selects K vertices randomly, where K is a user-provided parameter. It calculates the distances from each of these K vertices to all other vertices. So, instead of calculating V*(V-1) distances, this algorithm only calculates K*(V-1) distances. The higher the value of K, the greater the likelihood of hitting the actual longest shortest path.

The current version only computes unweighted distances.

This algorithm query employs a subquery called max_BFS_depth. Both queries are needed to run the algorithm.

Specifications

estimate_diameter (SET<STRING> v_type, SET<STRING> e_type, INT seed_set_length,
BOOL print_accum = TRUE, STRING file_path = "", BOOL display = FALSE)

Characteristic

Value

Result

Returns the estimated value for the diameter of the graph

Input Parameters

  • SET<STRING> v_type: Names of vertex types to use

  • SET<STRING> e_type: Names of edge types to use

  • INT seed_set_length: The number (K) of random seed vertices to use

  • BOOL print_accum: If True, output JSON to standard output

  • STRING file_path: If not empty, write output to this file.

Result Size

one integer

Time Complexity

O(k*E), E = number of edges, k = number of seed vertices

Graph Types

Directed

Node embeddings

Node2Vec

Description

Node2Vec is a node embedding algorithm that uses random walks in the graph to create a vector representation of a node.

A random walk starts with a node, and the algorithm iteratively selects neighboring nodes to visit, and each neighboring node has an assigned probability. This transforms graph structure into a collection of linear sequences of nodes. For each node, we will be left with a list of other nodes from their local or extended neighborhoods.

Once the above step is complete, the algorithm uses a variation of the word2vec model from the language modeling community to turn each node into a vector of probabilities. The probabilities represent the likelihood of visiting a given node in a random walk from each starting node.

Specification

random_walk(INT step = 8, INT path_size = 4,
STRING filepath = "/home/tigergraph/path.csv", SET<STRING> edge_types,
INT sample_num)
node2vec_query(STRING filepath = "/home/tigergraph/path.csv",
STRING output_file = "/home/tigergraph/embedding.csv",
INT dimension)

Installing this query requires installing a UDF, which can be found in the Github repository of the query. If you are running the query on a cluster, you need to manually install the UDF on every node of the cluster.

Parameters

Parameter

Description

Data type

step

Number of random walks per node

INT

path_size

Number of hops per walk

INT

filepath

File path to output results to

STRING

edge_types

Edge types to traverse

SET<STRING>

sample_num

Number of nodes to be used in the random sample

INT

Centrality Algorithms

Centrality algorithms determine the importance of each vertex within a network. Typical applications:

PageRank is designed for directed edges. The classic interpretation is to find the most "important" web pages, based on hyperlink referrals, but it can be used for another network where entities make positive referrals of one another.

Closeness Centrality and Betweenness Centrality both deal with the idea of "centrally located."

PageRank

Description and Uses

The PageRank algorithm measures the influence of each vertex on every other vertex. PageRank influence is defined recursively: a vertex's influence is based on the influence of the vertices which refer to it. A vertex's influence tends to increase if (1) it has more referring vertices or if (2) its referring vertices have higher influence. The analogy to social influence is clear.

A common way of interpreting PageRank value is through the Random Network Surfer model. A vertex's PageRank score is proportional to the probability that a random network surfer will be at that vertex at any given time. A vertex with a high PageRank score is a vertex that is frequently visited, assuming that vertices are visited according to the following Random Surfer scheme:

  • Assume a person travels or surfs across a network's structure, moving from vertex to vertex in a long series of rounds.

  • The surfer can start anywhere. This start-anywhere property is part of the magic of PageRank, meaning the score is a truly fundamental property of the graph structure itself.

  • Each round, the surfer randomly picks one of the outward connections from the surfer's current location. The surfer repeats this random walk for a long time.

  • But wait. The surfer doesn't always follow the network's connection structure. There is a probability (1-damping, to be precise), that the surfer will ignore the structure and will magically teleport to a random vertex.

Specifications

pageRank (STRING v_type, STRING e_type,
FLOAT max_change=0.001, INT max_iter=25, FLOAT damping=0.85, INT top_k = 100,
BOOL print_accum = TRUE, STRING result_attr = "", STRING file_path = "",
BOOL display_edges = FALSE)

Characteristic

Value

Result

Computes a PageRank value (FLOAT type) for each vertex.

Input Parameters

  • STRING v_type: Names of vertex type to use

  • STRING e_type: Names of edge type to use

  • FLOAT max_change: PageRank will stop iterating when the largest difference between any vertex's current score and its previous score ≤ max_change. That is, the scores have become very stable and are changing by less than max_change from one iteration to the next.

  • INT max_iter: Maximum number of iterations.

  • FLOAT damping: Fraction of score that is due to the score of neighbors. The balance (1 - damping) is a minimum baseline score that every vertex receives.

  • INT top_k: Sort the scores highest first and output only this many scores

  • BOOL print_accum: If True, output JSON to standard output

  • STRING result_attr: If not empty, store PageRank values (FLOAT) to this attribute

  • STRING file_path: If not empty, write output to this file.

  • BOOL display_edges: If true, include the graph's edges in the JSON output, so that the full graph can be displayed.

Result Size

V = number of vertices

Time Complexity

O(E*k), E = number of edges, k = number of iterations.

The number of iterations is data-dependent, but the user can set a maximum. Parallel processing reduces the time needed for computation.

Graph Types

Directed edges

Example

# Use _ for default values
RUN QUERY pageRank("Person", "Friend", 0.001, 25, 0.85, 100
_, _, _, _)

We ran pageRank on our test10 graph (using Friend edges) with the following parameter values: damping=0.85, max_change=0.001, and max_iter=25. We see that Ivy (center bottom) has the highest pageRank score (1.12). This makes sense since there are 3 neighboring persons who point to Ivy, more than for any other person. Eddie and Justin have scores of exactly 1 because they do not have any out-edges. This is an artifact of our particular version pageRank. Likewise, Alex has a score of 0.15, which is (1-damping), because Alex has no in-edges.

Visualized results of example query on social10 graph, with Friend edges

Weighted PageRank

Description and Uses

The only difference between weighted PageRank and standard PageRank is that edges have weights, and the influence that a vertex receives from an in-neighbor is multiplied by the weight of the in-edge.

Specifications

pageRank_wt (SET<STRING> v_type, SET<STRING> e_type, STRING wt_attr,
FLOAT max_change=0.001, INT max_iter=25, FLOAT damping=0.85, INT top_k=100,
BOOL print_accum = TRUE, STRING result_attr = "", STRING file_path = "",
BOOL display_edges = FALSE)

Characteristic

Value

Result

Computes a weighted PageRank value (FLOAT type) for each vertex.

Input Parameters

<b></b>

  • STRING v_type: Names of vertex type to use

  • STRING e_type: Names of edge type to use

  • STRING wt_attr: Name of edge weight attribute

  • FLOAT max_change: PageRank will stop iterating when the largest difference between any vertex's current score and its previous score ≤ max_change. That is, the scores have become very stable and are changing by less than max_change from one iteration to the next.

  • INT max_iter: Maximum number of iterations.

  • FLOAT damping: Fraction of score that is due to the score of neighbors. The balance (1 - damping) is a minimum baseline score that every vertex receives.

  • INT top_k: Sort the scores highest first and output only this many scores

  • BOOL print_accum: If True, output JSON to standard output

  • STRING result_attr: If not empty, store PageRank values (FLOAT) to this attribute

  • STRING file_path: If not empty, write output to this file.

  • BOOL display_edges: If true, include the graph's edges in the JSON output, so that the full graph can be displayed.

Result Size

V = number of vertices

Time Complexity

O(E*k), E = number of edges, k = number of iterations.

The number of iterations is data-dependent, but the user can set a maximum. Parallel processing reduces the time needed for computation.

Graph Types

Directed edges

Personalized PageRank

Description and Uses

In the original PageRank, the damping factor is the probability of the surfer continues browsing at each step. The surfer may also stop browsing and start again from a random vertex. In personalized PageRank, the surfer can only start browsing from a given set of source vertices both at the beginning and after stopping.

Specifications

pageRank_pers(SET<VERTEX> source, STRING e_type,
FLOAT max_change=0.001, INT max_iter=25, FLOAT damping = 0.85, INT top_k = 100
BOOL print_accum = TRUE, STRING result_attr = "", STRING file_path = "")

Characteristic

Value

Result

Computes a personalized PageRank value (FLOAT type) for each vertex.

Input Parameters

  • SET<VERTEX> source: Set of seed vertices

  • STRING e_type: Names of edge type to use

  • FLOAT max_change: PageRank will stop iterating when the largest difference between any vertex's current score and its previous score ≤ max_change. That is, the scores have become very stable and are changing by less than max_change from one iteration to the next.

  • INT max_iter: Maximum number of iterations.

  • FLOAT damping: Fraction of score that is due to the score of neighbors. The balance (1 - damping) is a minimum baseline score that every vertex receives.

  • INT top_k: Sort the scores highest first and output only this many scores

  • BOOL print_accum: If True, output JSON to standard output

  • STRING result_attr: If not empty, store PageRank values (FLOAT) to this attribute

  • STRING file_path: If not empty, write output to this file.

Result Size

V = number of vertices

Time Complexity

O(E*k), E = number of edges, k = number of iterations.

The number of iterations is data-dependent, but the user can set a maximum. Parallel processing reduces the time needed for computation.

Graph Types

Directed edges

Example

We ran Personalized PageRank on the graph social10 using Friend edges with the following parameter values:

# Using "_" to use default values
RUN QUERY pageRank_pers([("Fiona","Person")], "Friend", _, _, _, _, _, _,
_)

In this case, the random walker can only start or restart walking from Fiona. In the figure below, we see that Fiona has the highest PageRank score in the result. Ivy and George have the next highest scores because they are direct out-neighbors of Ivy and there are looping paths that lead back to them again. Half of the vertices have a score of 0 since they can not be reached from Fiona.

Visualized results of example query on social10 graph, with Friend edges

Closeness Centrality

We all have an intuitive understanding when we say a home, an office, or a store is "centrally located." Closeness Centrality provides a precise measure of how "centrally located" a vertex is. The steps below show the steps for one vertex v:

Step

Mathematical Formula

1. Compute the average distance from vertex v to every other vertex:

davg(v)=uvdist(v,u)/(n1)d_{avg}(v) = \sum_{u \ne v} dist(v,u)/(n-1)

2. Invert the average distance, so we have average closeness of v:

CC(v)=1/davg(v)CC(v) = 1/d_{avg}(v)

TigerGraph's closeness centrality algorithm uses multi-source breadth-first search (MS-BFS) to traverse the graph and calculate the sum of a vertex's distance to every other vertex in the graph, which vastly improves the performance of the algorithm. The algorithm's implementation of MS-BFS is based on the paper The More the Merrier: Efficient Multi-source Graph Traversal by Then et al.

This algorithm query employs a subquery called cc_subquery. Both queries are needed to run the algorithm.

Specifications

closeness_cent (SET<STRING> v_type, SET<STRING> e_type, INT max_hops=10,
INT top_k=100, BOOL wf = TRUE, BOOL print_accum = True, STRING result_attr = "",
STRING file_path = "", BOOL display_edges = FALSE)

Parameters

Characteristic

Value

Result

Computes a Closeness Centrality value (FLOAT type) for each vertex.

Required Input Parameters

  • SET<STRING> v_type: Names of vertex types to use

  • SET<STRING> e_type: Names of edge types to use

  • INT max_hops: If >=0, look only this far from each vertex

  • INT top_k: Sort the scores highest first and output only this many scores

  • BOOL wf: If True, use Wasserman-Faust normalization for multi-component graphs

  • BOOL print_accum: If true, output JSON to standard output

  • STRING result_attr: If not empty, store centrality values (FLOAT) to this attribute

  • STRING file_path: If not empty, write output to this file in CSV.

  • BOOL display_edges: If true, include the graph's edges in the JSON output, so that the full graph can be displayed.

Result Size

V = number of vertices

Time Complexity

O(E), E = number of edges.

Parallel processing reduces the time needed for computation.

Graph Types

Directed or Undirected edges, Unweighted edges

Example

Closeness centrality can be measured for either directed edges (from v to others) or for undirected edges. Directed graphs may seem less intuitive, however, because if the distance from Alex to Bob is 1, it does not mean the distance from Bob to Alex is also 1.

For our example, we wanted to use the topology of the Likes graph, but to have undirected edges. We emulated an undirected graph by using both Friend and Also_Friend (reverse-direction) edges.

# Use _ for default values
RUN QUERY closeness_cent(["Person"], ["Friend", "Also_Friend"], _, _,
_, _, _, _, _)
Visualized results of example query on social10 graph, with Friend and Also_Friend edges

Approximate Closeness Centrality

In the Closeness Centrality algorithm, to obtain the closeness centrality score for a vertex, we measure the distance from the source vertex to every single vertex in the graph. In large graphs, running this calculation for every vertex can be highly time-consuming.

The Approximate Closeness Centrality algorithm (based on Cohen et al. 2014) calculates the approximate closeness centrality score for each vertex by combining two estimation approaches - sampling and pivoting. This hybrid estimation approach offers near-linear time processing and linear space overhead within a small relative error. It runs on graphs with unweighted edges (directed or undirected).

This query uses another subquerycloseness_cent_approx_sub, which needs to be installed before closeness_approx can be installed.

Specifications

closeness_approx (
SET<STRING> v_type,
SET<STRING> e_type,
INT k = 100, # sample num
INT max_hops = 10, # max BFS explore steps
DOUBLE epsilon = 0.1, # error parameter
BOOL print_accum = true, # output to console
STRING file_path = "", # output file
INT debug = 0, # debug flag -- 0: No LOG;1: LOG without the sample-node bfs loop;2: ALL LOG.
INT sample_index = 0, # random sample group
INT maxsize = 1000, # max size of connected components using exact closeness algorithm
BOOL wf = True # Wasserman and Faust formula
)

Parameters:

Name

Description

v_type

Vertex types to calculate approximate closeness centrality for.

e_type

Edge types to traverse.

k

Size of the sample.

max_hops

Upper limit of how many jumps the algorithm will perform from each vertex.

epsilon

The maximum relative error, between 0 and 1. Setting a lower value produces a more accurate estimate but increases run time.

print_accum

Boolean value that indicates whether or not to output to console in JSON format.

file_path

If provided, the algorithm will output to this file path in CSV format

debug

There are many conditional logging statements inside the query. If the input is 0, nothing will be logged. If the input is 1, everything else but the breadth-first-search process from the sample-node. If the input is 2, everything will be logged.

sample_index

The algorithm will partition the graph based on the sample size. This index indicates which partition to use when estimating closeness centrality.

maxsize

If the number of vertices in the graph is lower than maxsize, the exact closeness centrality is calculated instead and nothing will be approximated.

wf

Boolean value that indicates whether to use the Wasserman and Faust formula to calculate closeness centrality rather than the classic formula.

Result

The result is a list of all vertices in the graph with their approximate closeness centrality score. It is available both in JSON and CSV format.

Example

Below is an example of running the algorithm on the social10 test graph and an excerpt of the response.

RUN QUERY closeness_aprox(["Person"], ["Friend", "Coworker"], 6, 3 \
0.1, true, "", 0, 0, 100, false)
[
{
"Start": [
{
"attributes": {
"[email protected]": 0.58333
},
"v_id": "Fiona",
"v_type": "Person"
},
{
"attributes": {
"[email protected]": 0.44444
},
"v_id": "Justin",
"v_type": "Person"
},
{
"attributes": {
"[email protected]": 0.53333
},
"v_id": "Bob",
"v_type": "Person"
}
]

Betweenness Centrality

The Betweenness Centrality of a vertex is defined as the number of shortest paths that pass through this vertex, divided by the total number of shortest paths. That is

BC(v)=svtPDst(v)=svtSPst(v)/SPst,BC(v) =\sum_{s \ne v \ne t}PD_{st}(v)= \sum_{s \ne v \ne t} SP_{st}(v)/SP_{st} ,

where PDPDis called the pair dependency, SPstSP_{st}is the total number of shortest paths from node s to node t and SPst(v)SP_{st}(v)is the number of those paths that pass through v.

The TigerGraph implementation is based on A Faster Algorithm for Betweenness Centrality by Ulrik Brandes, Journal of Mathematical Sociology 25(2):163-177, (2001). For every vertex s in the graph, the pair dependency starting from vertex s to all other vertices t via all other vertices v is computed first,

PDs(v)=t:sVPDst(v)PD_{s*}(v) = \sum_{t:s \in V} PD_{st}(v).

Then betweenness centrality is computed as

BC(v)=s:sVPDs(v)/2BC(v) =\sum_{s:s \in V}PD_{s*}(v)/2.

According to Brandes, the accumulated pair dependency can be calculated as

PDs(v)=w:vPs(w)SPsv(v)/SPsw(1+PDs(w)),PD_{s*}(v) =\sum_{w:v \in P_s(w)} SP_{sv}(v)/SP_{sw} \cdot (1+PD_{s*}(w)) ,

wherePs(w)P_s(w), the set of predecessors of vertex w on shortest paths from s, is defined as

Ps(w)={uV:{u,w}E,dist(s,w)=dist(s,u)+dist(u,w)}.P_s(w) = \{u \in V: \{u, w\} \in E, dist(s,w) = dist(s,u)+dist(u,w) \} .

For every vertex, the algorithm works in two phases. The first phase calculates the number of shortest paths passing through each vertex. Then starting from the vertex on the most outside layer in a non-incremental order with pair dependency initial value of 0, traverse back to the starting vertex

This algorithm query employs a subquery called bc_subquery. Both queries are needed to run the algorithm.

Specifications

betweenness_cent(SET<STRING> v_type, SET<STRING> e_type, INT max_hops = 10,
INT top_k=100, BOOL print_accum=TRUE, STRING result_attr="", STRING file_path="",
BOOL display_edges = FALSE)

Parameters

Characteristic

Value

Result

Computes a Betweenness Centrality value (FLOAT type) for each vertex.

Required Input Parameters

  • SET<STRING> v_type: Names of vertex types to use

  • SET<STRING> e_type: Names of edge types to use

  • INT max_hops: If >=0, look only this far from each vertex

  • INT top_k: Sort the scores highest first and output only this many scores

  • BOOL print_accum: If True, output JSON to standard output

  • STRING result_attr: If not empty, store centrality values (FLOAT) to this attribute

  • STRING file_path: If not empty, write output to this file.

  • BOOL display_edges: If true, include the graph's edges in the JSON output, so that the full graph can be displayed.

Result Size

V = number of vertices

Time Complexity

O(E*V), E = number of edges, V = number of vertices.

Considering the high time cost of running this algorithm on a big graph, the users can set a maximum number of iterations. Parallel processing reduces the time needed for computation.

Graph Types

Undirected edges, Unweighted edges

Example

In the example below, Claire is in the very center of the graph and has the highest betweenness centrality. Six shortest paths pass through Sam (i.e. paths from Victor to all other 6 people except for Sam and Victor), so the score of Sam is 6. David also has a score of 6, since Brian has 6 paths to other people that pass through David.

# Use _ for default values
RUN QUERY (["Person"], ["Friend"], _, _, _, _, _, _)
Visualized results of example query on a social graph with undirected edges Friend
[
{
"@@BC": {
"Alice": 0,
"Frank": 0,
"Claire": 17,
"Sam": 6,
"Brian": 0,
"David": 6,
"Richard": 0,
"Victor": 0
}
}
]

In the following example, both Charles and David have 9 shortest paths passing through them. Ellen is in a similar position as Charles, but her centrality is weakened due to the path between Frank and Jack.

Visualized results of example query on a social graph with undirected edges Friend
[
{
"@@BC": {
"Alice": 0,
"Frank": 0,
"Charles": 9,
"Ellen": 8,
"Brian": 0,
"David": 9,
"Jack": 0
}
}
]

Community Algorithms

These algorithms evaluate how a group is clustered or partitioned, as well as its tendency to strengthen or break apart.

Connected Components

Description and Uses

A component is the maximal set of vertices, plus their connecting edges, which are interconnected. That is, you can reach each vertex from each other vertex. In the example figure below, there are three components.

This particular algorithm deals with undirected edges. If the same definition (each vertex can reach each other vertex) is applied to directed edges, then the components are called Strongly Connected Components. If you have directed edges but ignore the direction (permitting traversal in either direction), then the algorithm finds Weakly Connected Components.

Specifications

conn_comp (SET<STRING> v_type, SET<STRING> e_type, INT output_limit = 100,
BOOL print_accum = TRUE, STRING result_attr = "", STRING file_path = "")

Characteristic

Value

Result

Assigns a component id (INT) to each vertex, such that members of the same component have the same id value.

Input Parameters

  • SET<STRING> v_type: Names of vertex types to use

  • SET<STRING> e_type: Names of edge types to use

  • INT output_limit: If >=0, max number of vertices to output to JSON.

  • BOOL print_accum: If True, output JSON to standard output

  • STRING result_attr: If not empty, store community ID values (INT) to this attribute

  • STRING file_path: If not empty, write output to this file.

Result Size

V = number of vertices

Time Complexity

O(E*d), E = number of edges, d = max(diameter of components)

Graph Types

Undirected edges

Example

It is easy to see in this small graph that the algorithm correctly groups the vertices:

  • Alex, Bob, and Justin all have Community ID = 2097152

  • Chase, Damon, and Eddie all have Community ID = 5242880

  • Fiona, George, Howard, and Ivy all have Community ID = 0

Our algorithm uses the TigerGraph engine's internal vertex ID numbers; they cannot be predicted.

RUN QUERY conn_comp(["Person"], ["Coworker"], _, _, _, _)
Visualized results of example query on social10 graph with Coworker edges

K-core Decomposition

A k-core of a graph is a maximal connected subgraph in which every vertex is connected to at least k vertices in the subgraph. To obtain the k-core of a graph, the algorithm first deletes the vertices whose outdegree is less than k. It then updates the outdegree of the neighbors of the deleted vertices, and if that causes a vertex's outdegree to fall below k, it will also delete that vertex. The algorithm repeats this operation until every vertex left in the subgraph has an outdegree of at least k.

Our algorithm takes a range of values for k and returns the set of the vertices that constitute the k-core with the highest possible value of k within the range. It is an implementation of Algorithm 2 in Scalable K-Core Decomposition for Static Graphs Using a Dynamic Graph Data Structure, Tripathy et al., IEEE Big Data 2018.

Time complexity

O(E), where E is the number of edges in the graph.

Specifications

kcore(STRING v_type, STRING e_type, INT k_min = 0, INT k_max = -1,
BOOL print_accum = TRUE, STRING attr = "", STRING file_path = "",
BOOL show_membership = FALSE, BOOL show_shells=FALSE)

Parameters

Parameter

Description

v_type

Vertex type to include in the k-core

e_type

Edge type to count for k-core connections

k_min

Minimum value of k. If the actual maximum core is below k_min, the algorithm will return an empty set.

k_max

Maximum value of k. If k_max is smaller than k_min, the algorithm will ignore this parameter and keep looking for k-cores until it reaches a value of k where a k-core cannot be found.

show_membership

If show_membership is true, the algorithm will return the k-cores found for every value of k within the range provided. For each k-core, the results will include its member vertices.

show_shells

The k-shell is the set of vertices that are part of the k-core but not part of the (k+1)-core. If show_shells is true, the algorithm will return the k-shells found for every value of k. within the range provided. For each k-shell, the results will include its member vertices.

print_accum

Boolean value that decides whether the algorithm will return output in JSON

attr

Optional. An attribute of the vertex to save the core level of the vertex to. If attr is provided, the core level of the vertex will be saved to this attribute of the vertex.

file_path

Optional. If file_path is provided, the algorithm will output results to a file specified by the file path in CSV format.

Example

In the example below based on the social graph from GSQL 101, we can see that Dan, Tom, and Jenny make up a 2-core, which is the max-core of the graph:

Strongly Connected Components

If we run the kcore algorithm on this small graph like so:

RUN QUERY kcore("person", "friendship", 0, -1, TRUE, "", "", FALSE, FALSE)

Here is the returned JSON response, which includes a 2-core that is comprised of Dan, Jenny, and Tom:

[
{
"core_size": 3,
"k": 2, // the k-core with the highest possible k is returned
"max_core": [
{
"attributes": {
"@core": 2,
"@deg": 0,
"age": 40,
"gender": "male",
"name": "Tom",
"state": "ca"
},
"v_id": "Tom",
"v_type": "person"
},
{
"attributes": {
"@core": 2,
"@deg": 0,
"age": 34,
"gender": "male",
"name": "Dan",
"state": "ny"
},
"v_id": "Dan",
"v_type": "person"
},
{
"attributes": {
"@core": 2,
"@deg": 0,
"age": 25,
"gender": "female",
"name": "Jenny",
"state": "tx"
},
"v_id": "Jenny",
"v_type": "person"
}
]
}
]

Strongly Connected Components

Description and Uses

A strongly connected component (SCC) is a subgraph such that there is a path from any vertex to every other vertex. A graph can contain more than one separate SCC. An SCC algorithm finds the maximal SCCs within a graph. Our implementation is based on the Divide-and-Conquer Strong Components (DCSC) algorithm[1]. In each iteration, pick a pivot vertex v randomly, and find its descendant and predecessor sets, where descendant set D_v is the vertex reachable from v, and predecessor set P_v is the vertices which can reach v (stated another way, reachable from v through reverse edges). The intersection of these two sets is a strongly connected component SCC_v. The graph can be partitioned into 4 sets: SCC_v, descendants D_v excluding SCC_v, predecessors P_v excluding SCC, and the remainders R_v. It is proved that any SCC is a subset of one of the 4 sets [1]. Thus, we can divide the graph into different subsets and detect the SCCs independently and iteratively.

The problem of this algorithm is unbalanced load and slow convergence when there are a lot of small SCCs, which is often the case in real-world use cases [3]. We added two trimming stages to improve the performance: size-1 SCC trimming[2] and weakly connected components[3].

The implementation of this algorithm requires reverse edges for all directed edges considered in the graph.

[1] Fleischer, Lisa K., Bruce Hendrickson, and Ali Pınar. "On identifying strongly connected components in parallel." International Parallel and Distributed Processing Symposium. Springer, Berlin, Heidelberg, 2000.

[2] Mclendon Iii, William, et al. "Finding strongly connected components in distributed graphs." Journal of Parallel and Distributed Computing 65.8 (2005): 901-910.

[3] Hong, Sungpack, Nicole C. Rodia, and Kunle Olukotun. "On fast parallel detection of strongly connected components (SCC) in small-world graphs." Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013.

Specifications

scc (SET<STRING> v_type, SET<STRING> e_type, SET<STRING> rev_e_type,
INT top_k_dist, INT output_limit, INT max_iter = 500, INT iter_wcc = 5,
BOOL print_accum = TRUE, STRING attr= "", STRING file_path="")

Characteristic

Value

Result

Assigns a component id (INT) to each vertex, such that members of the same component have the same id value.

Input Parameters

  • SET<STRING> v_type: Names of vertex types to use

  • SET<STRING> e_type: Names of edge types to use

  • SET<STRING> rev_e_type: Names of reverse direction edge types to use

  • INT top_k_dist: top k result in SCC distribution

  • INT output_limit: If >=0, max number of vertices to output to JSON.

  • INT max_iter: number of maximum iteration of the algorithm

  • INT iter_wcc: find weakly connected components for the active vertices in this iteration, since the largest SCCs are already found after several iterations; usually a small number(3 to 10)

  • BOOL print_accum: If True, output JSON to standard output

  • STRING result_attr: If not empty, store community ID values (INT) to this attribute

  • STRING file_path: If not empty, write output to this file.

Result Size

V = number of vertices

Time Complexity

O(iter*d), d = max(diameter of components)

Graph Types

Directed edges with reverse direction edges as well

Example

We ran scc on the social26 graph. A portion of the JSON result is shown below.

[
{
"i": 1
},
{
"trim_set.size()": 8
},
{
"trim_set.size()": 5
},
{
"trim_set.size()": 2
},
{
"trim_set.size()": 2
},
{
"trim_set.size()": 0
},
{
"@@cluster_dist_heap": [
{
"csize": 9,
"num": 1
},
{
"csize": 1,
"num": 17
}
]
},

The first element "i"=1 means the whole graph is processed in just one iteration. The 5 "trim_set.size()" elements mean there were 5 rounds of size-1 SCC trimming. The final "@@.cluster_dist_heap" object" reports on the size distribution of SCCs.There is one SCC with 9 vertices, and 17 SCCs with only 1 vertex in the graph.

Label Propagation

Description and Uses

Label Propagation is a heuristic method for determining communities. The idea is simple: If the plurality of your neighbors all bear the label X, then you should label yourself as also a member of X. The algorithm begins with each vertex having its own unique label. Then we iteratively update labels based on the neighbor influence described above. It is important that the order for updating the vertices be random. The algorithm is favored for its efficiency and simplicity, but it is not guaranteed to produce the same results every time.

In a variant version, some vertices could initially be known to belong to the same community. If they are well-connected to one another, they are likely to preserve their common membership and influence their neighbors,

Specifications

label_prop (SET<STRING> v_type, SET<STRING> e_type, INT max_iter, INT output_limit,
BOOL print_accum = TRUE, STRING file_path = "", STRING attr = "")

Characteristic

Value

Result

Assigns a component id (INT) to each vertex, such that members of the same component have the same id value.

Input Parameters

  • SET<STRING> v_type: Names of vertex types to use

  • SET<STRING> e_type: Names of edge types to use

  • INT max_iter: Number of maximum iteration of the algorithm

  • INT output_limit: If >=0, max number of vertices to output to JSON.

  • BOOL print_accum: If True, output JSON to standard output

  • STRING attr: If not empty, store community id values (INT) to this attribute

  • STRING file_path: If not empty, write output to this file.

Result Size

V = number of vertices

Time Complexity

O(E*k), E = number of edges, k = number of iterations.

Graph Types

Undirected edges

Example

This is the same graph that was used in the Connected Component example. The results are different, though. The quartet of Fiona, George, Howard, and Ivy have been split into 2 groups:

  • (George & Ivy) each connect to (Fiona & Howard) and to one another.

  • (Fiona & Howard) each connect to (George & Ivy) but not to one another.

Label Propagation tries to find natural clusters and separations within connected components. That is, it looks at the quality and pattern of connections. The Component Component algorithm simply asks the Yes or No question: Are these two vertices connected?

We set max_iter to 10, but the algorithm reaches a steady state after 3 iterations:

# Use _ for default/empty values
RUN QUERY(["Person"], ["Coworker"], 10, -1, _, _, _)
Visualized results of example query on social10 graph with Coworker edges

Local Clustering Coefficient

Description

The Local Clustering Coefficient algorithm computes the local clustering coefficient of every vertex in a graph. The local clustering coefficient of a vertex (node) in a graph quantifies how close its neighbors are to being a complete graph, where every two distinct vertices are connected by an edge. It is obtained by dividing the number edges between a vertex's neighbors by the number of edges that could possibly exist.

The algorithm does not report the local clustering coefficient for vertices that have only one neighbor.

Time complexity

O(n), where n is the number of vertices in the graph.

Specifications

Local_clustering_coefficient(STRING v_type, STRING e_type,