In the original PageRank, the damping factor is the probability that the surfer continues browsing at each step. The surfer may also stop browsing and start again from a random vertex. In personalized PageRank, the surfer can only start browsing from a given set of source vertices, both at the beginning and after stopping.
We ran Personalized PageRank on the graph social10 using Friend edges with the following parameter values:
In this case, the random walker can only start or restart walking from Fiona. In the figure below, we see that Fiona has the highest PageRank score in the result. Ivy and George have the next highest scores because they are direct out-neighbors of Fiona and there are looping paths that lead back to them again. Half of the vertices have a score of 0 since they cannot be reached from Fiona.
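The restart behavior is easy to see in a small sketch. The following Python snippet is a conceptual illustration only (not the library's GSQL implementation; the toy graph is loosely modeled on the Fiona example above), with the teleport mass restricted to the source set:

```python
# Personalized PageRank sketch: teleportation returns only to the source set.
# Conceptual illustration, not the library's GSQL implementation.

def personalized_pagerank(out_nbrs, sources, damping=0.85, max_iter=25, max_change=0.001):
    verts = list(out_nbrs)
    score = {v: (1.0 if v in sources else 0.0) for v in verts}
    restart = 1.0 / len(sources)
    for _ in range(max_iter):
        received = {v: 0.0 for v in verts}
        for v in verts:
            if out_nbrs[v]:
                share = score[v] / len(out_nbrs[v])
                for t in out_nbrs[v]:
                    received[t] += share
        # (1 - damping) of the mass restarts at the source vertices only.
        new_score = {v: damping * received[v]
                        + ((1 - damping) * restart if v in sources else 0.0)
                     for v in verts}
        if max(abs(new_score[v] - score[v]) for v in verts) <= max_change:
            return new_score
        score = new_score
    return score

# Vertices unreachable from the source set keep a score of 0.
g = {"Fiona": ["George", "Ivy"], "George": ["Ivy"], "Ivy": ["Fiona"],
     "Alex": ["Bob"], "Bob": []}
print(personalized_pagerank(g, sources={"Fiona"}))
```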
Characteristic | Value |
Result | Computes a personalized PageRank value (FLOAT type) for each vertex. |
Input Parameters | SET<VERTEX> source: set of seed vertices. STRING e_type: name of the edge type to use. FLOAT max_change: PageRank stops iterating when the largest difference between any vertex's current score and its previous score is ≤ max_change; that is, the scores have become very stable and change by less than max_change from one iteration to the next. INT max_iter: maximum number of iterations. FLOAT damping: fraction of the score that is due to the score of neighbors; the balance (1 - damping) is a minimum baseline score that every vertex receives. INT top_k: sort the scores highest first and output only this many. BOOL print_accum: if true, output JSON to standard output. STRING result_attr: if not empty, store the PageRank values (FLOAT) in this attribute. STRING file_path: if not empty, write the output to this file. |
Result Size | V = number of vertices |
Time Complexity | O(E*k), E = number of edges, k = number of iterations. The number of iterations is data-dependent, but the user can set a maximum. Parallel processing reduces the time needed for computation. |
Graph Types | Directed edges |
The PageRank algorithm measures the influence of each vertex on every other vertex. PageRank influence is defined recursively: a vertex's influence is based on the influence of the vertices which refer to it. A vertex's influence tends to increase if (1) it has more referring vertices or if (2) its referring vertices have higher influence. The analogy to social influence is clear.
A common way of interpreting PageRank value is through the Random Network Surfer model. A vertex's PageRank score is proportional to the probability that a random network surfer will be at that vertex at any given time. A vertex with a high PageRank score is a vertex that is frequently visited, assuming that vertices are visited according to the following Random Surfer scheme:
Assume a person travels or surfs across a network's structure, moving from vertex to vertex in a long series of rounds.
The surfer can start anywhere. This start-anywhere property is part of the magic of PageRank, meaning the score is a truly fundamental property of the graph structure itself.
Each round, the surfer randomly picks one of the outward connections from the surfer's current location. The surfer repeats this random walk for a long time.
But wait. The surfer doesn't always follow the network's connection structure. There is a probability (1-damping, to be precise), that the surfer will ignore the structure and will magically teleport to a random vertex.
For more information, see the Google paper on PageRank.
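The random surfer scheme translates directly into power iteration. The sketch below is a conceptual illustration in Python (not the library's GSQL implementation); it uses the same convention the example below does, where every vertex gets the baseline (1 - damping):

```python
# Random-surfer PageRank sketch (power iteration).
# Conceptual illustration, not the library's GSQL implementation.

def pagerank(out_nbrs, damping=0.85, max_iter=25, max_change=0.001):
    verts = list(out_nbrs)
    score = {v: 1.0 for v in verts}            # this version starts every vertex at 1
    for _ in range(max_iter):
        received = {v: 0.0 for v in verts}
        for v in verts:
            for t in out_nbrs[v]:              # surfer follows a random out-edge
                received[t] += score[v] / len(out_nbrs[v])
        # With probability (1 - damping) the surfer teleports to a random vertex,
        # which gives every vertex the baseline (1 - damping).
        new_score = {v: (1 - damping) + damping * received[v] for v in verts}
        if max(abs(new_score[v] - score[v]) for v in verts) <= max_change:
            return new_score
        score = new_score
    return score
```

A vertex with no in-edges receives nothing, so its score settles at (1 - damping), which is the 0.15 seen for Alex in the example below.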
We ran pageRank on our test10 graph (using Friend edges) with the following parameter values: damping=0.85, max_change=0.001, and max_iter=25. We see that Ivy (center bottom) has the highest pageRank score (1.12). This makes sense since there are 3 neighboring persons who point to Ivy, more than for any other person. Eddie and Justin have scores of exactly 1 because they do not have any out-edges. This is an artifact of our particular version of pageRank. Likewise, Alex has a score of 0.15, which is (1-damping), because Alex has no in-edges.
Characteristic | Value |
Result | Computes a PageRank value (FLOAT type) for each vertex. |
Input Parameters | STRING v_type: name of the vertex type to use. STRING e_type: name of the edge type to use. FLOAT max_change: PageRank stops iterating when the largest difference between any vertex's current score and its previous score is ≤ max_change; that is, the scores have become very stable and change by less than max_change from one iteration to the next. INT max_iter: maximum number of iterations. FLOAT damping: fraction of the score that is due to the score of neighbors; the balance (1 - damping) is a minimum baseline score that every vertex receives. INT top_k: sort the scores highest first and output only this many. BOOL print_accum: if true, output JSON to standard output. STRING result_attr: if not empty, store the PageRank values (FLOAT) in this attribute. STRING file_path: if not empty, write the output to this file. BOOL display_edges: if true, include the graph's edges in the JSON output, so that the full graph can be displayed. |
Result Size | V = number of vertices |
Time Complexity | O(E*k), E = number of edges, k = number of iterations. The number of iterations is data-dependent, but the user can set a maximum. Parallel processing reduces the time needed for computation. |
Graph Types | Directed edges |
Degree centrality is defined as the number of edges incident upon a node (i.e., the number of ties that a node has). The degree can be interpreted in terms of the immediate risk of a node for catching whatever is flowing through the network (such as a virus, or some information).
The vertices with the highest degree centrality scores along with their scores.
Suppose we have the following graph:
Running the query on the graph will show that Dan has the highest degree centrality.
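Counting degrees is straightforward. The tiny Python sketch below (a conceptual illustration; the edge list is made up, and the in_degree/out_degree toggles mirror the parameters documented later) shows the idea:

```python
# Degree centrality sketch: count incident edges per vertex.
# Conceptual illustration; the edge list and names are made up.
from collections import Counter

edges = [("Dan", "Tom"), ("Dan", "Jenny"), ("Dan", "Amily"), ("Tom", "Jenny")]

in_deg, out_deg = Counter(), Counter()
for s, t in edges:              # treat each pair as a directed edge s -> t
    out_deg[s] += 1
    in_deg[t] += 1

# With in_degree and out_degree both true, the score is the total degree.
total = in_deg + out_deg
print(total.most_common(3))     # top_k = 3
```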
TigerGraph In-Database Graph Data Science Library is a collection of expertly written GSQL queries, each of which implements a standard graph algorithm. Each algorithm is ready to be installed and used, either as a stand-alone query or as a building block of a larger analytics application.
We renamed our library (formerly known as GSQL Graph Algorithm Library) to emphasize our focus on graph data science. As the world’s only scalable graph analytics platform, TigerGraph is committed to providing the best graph analytics framework for data scientists.
GSQL running on the TigerGraph platform is particularly well-suited for graph algorithms for several reasons:
Turing-complete with full support for imperative and procedural programming, ideal for algorithmic computation.
Parallel and Distributed Processing, enabling computations on larger graphs.
User-Extensible. Because the algorithms are written in standard GSQL and compiled by the user, they are easy to modify and customize.
Open-Source. Users can study the GSQL implementations to learn by example, and they can develop and submit additions to the library.
The library contains two folders: algorithms and graphs.

algorithms

The algorithms folder contains the GSQL implementations of all the graph algorithms offered by the library. Within the algorithms folder are six subfolders that group the algorithms into six classes:

graphs

The graphs folder contains small sample graphs that you can use to experiment with the algorithms. In this document, we use the test graphs to show you the expected result for each algorithm. The graphs are small enough that you can manually calculate and sometimes intuitively see what the answers should be.
Starting with TigerGraph product version 2.6, the Library has release branches:
Product version branches (2.6, 3.0, etc.) are snapshots created shortly after a product version is released. They contain the best version of the graph algorithm library at the time of that product version's initial release. They will not be updated, except to fix bugs.
Master branch: the newest released version. This should be at least as new as the newest product version branch, and it may contain new or improved algorithms.
Other branches are development branches.
It is possible to run newer algorithms on an older product version, as long as the algorithm does not rely on features available only in newer product versions.
All GSQL graph algorithms are schema-free, which means they are ready to use with any graph, regardless of the graph's data model or schema. The algorithms have run-time input parameters for the vertex type(s), edge type(s), and attributes which the user wishes to use.
Installing a query also creates a REST endpoint. The same query could be run thus:
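As a sketch of what such a call can look like, the Python snippet below uses the requests library against a hypothetical local instance (the graph name social10, the query name tg_pagerank, and the parameter values are illustrative; the default RESTPP port 9000 is assumed):

```python
# Calling an installed GSQL query through its REST endpoint.
# Hypothetical host, graph, and query names; assumes the default RESTPP port (9000).
import requests

resp = requests.get(
    "http://localhost:9000/query/social10/tg_pagerank",
    params={"v_type": "Person", "e_type": "Friend", "max_change": 0.001,
            "max_iter": 25, "damping": 0.85, "top_k": 10, "print_accum": "true"},
)
print(resp.json())   # the query's PRINT statements come back as JSON
```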
GSQL lets you run queries from within other queries. This means you can use a library algorithm as a building block for more complex analytics.
The Betweenness Centrality of a vertex is defined as the number of shortest paths that pass through this vertex, divided by the total number of shortest paths. That is,

BC(v) = \sum_{s \neq v \neq t} \delta_{st}(v), \qquad \delta_{st}(v) = \frac{\sigma_{st}(v)}{\sigma_{st}}

where \delta_{st}(v) is called the pair dependency, \sigma_{st} is the total number of shortest paths from node s to node t, and \sigma_{st}(v) is the number of those paths that pass through v.

The TigerGraph implementation is based on A Faster Algorithm for Betweenness Centrality by Ulrik Brandes, Journal of Mathematical Sociology 25(2):163-177, (2001). For every vertex s in the graph, the pair dependency starting from vertex s to all other vertices t via all other vertices v is computed first:

\delta_{s\bullet}(v) = \sum_{t \in V} \delta_{st}(v)

Then betweenness centrality is computed as

BC(v) = \sum_{s \in V} \delta_{s\bullet}(v)

According to Brandes, the accumulated pair dependency can be calculated as

\delta_{s\bullet}(v) = \sum_{w :\, v \in P_s(w)} \frac{\sigma_{sv}}{\sigma_{sw}} \left( 1 + \delta_{s\bullet}(w) \right)

where P_s(w), the set of predecessors of vertex w on shortest paths from s, is defined as

P_s(w) = \{ u \in V : (u, w) \in E,\; d(s, w) = d(s, u) + 1 \}

For every vertex, the algorithm works in two phases. The first phase calculates the number of shortest paths passing through each vertex. Then, starting from the vertices on the outermost layer with an initial pair dependency of 0, it traverses back toward the starting vertex, accumulating dependencies along the way.
This algorithm query employs a subquery called bc_subquery. Both queries are needed to run the algorithm.
In the example below, Claire is in the very center of the graph and has the highest betweenness centrality. Six shortest paths pass through Sam (i.e. paths from Victor to all other 6 people except for Sam and Victor), so the score of Sam is 6. David also has a score of 6, since Brian has 6 paths to other people that pass through David.
In the following example, both Charles and David have 9 shortest paths passing through them. Ellen is in a similar position as Charles, but her centrality is weakened due to the path between Frank and Jack.
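The two phases map naturally onto a short Python sketch. The following is a conceptual illustration of Brandes' method for an unweighted, undirected graph given as an adjacency dict (not the library's GSQL implementation):

```python
# Brandes' betweenness centrality sketch for an unweighted, undirected graph.
# Conceptual illustration of the two phases described above.
from collections import deque

def brandes(nbrs):
    bc = {v: 0.0 for v in nbrs}
    for s in nbrs:
        # Phase 1: BFS from s, counting shortest paths (sigma) and predecessors.
        sigma = {v: 0 for v in nbrs}; sigma[s] = 1
        dist = {v: -1 for v in nbrs}; dist[s] = 0
        preds = {v: [] for v in nbrs}
        order, q = [], deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in nbrs[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: walk back from the outermost layer, accumulating
        # pair dependencies (delta).
        delta = {v: 0.0 for v in nbrs}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # undirected: each pair counted twice
```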
You can download the library from Github:
To use an algorithm, the algorithm (GSQL query) must first be installed. If your database is on a distributed cluster, you should use the DISTRIBUTED option when installing the query to install it in distributed mode.

Running an algorithm is the same as running a GSQL query. For example, if you selected the JSON output option, you could run it from GSQL as below:
Name | Description | Data type |
v_type | A set of vertex types. | SET<STRING> |
e_type | A set of edge types. | SET<STRING> |
re_type | A set of reverse edge types. If an edge is undirected, put the edge name in the set as well. | SET<STRING> |
in_degree | Whether to count incoming edges as part of a vertex's degree centrality. | BOOL |
out_degree | Whether to count outgoing edges as part of a vertex's degree centrality. | BOOL |
top_k | The number of vertices with the highest scores to return. | INT |
print_accum | If true, print results to JSON output. | BOOL |
result_attr | If not empty, save the degree centrality score of each vertex to this attribute. | STRING |
file_attr | If not empty, save results in CSV to this file. | STRING |
Characteristic | Value |
Result | Computes a Betweenness Centrality value (FLOAT type) for each vertex. |
Required Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(E*V), E = number of edges, V = number of vertices. Considering the high time cost of running this algorithm on a big graph, users can set a maximum number of iterations. Parallel processing reduces the time needed for computation. |
Graph Types | Undirected edges, Unweighted edges |
Characteristic | Value |
Result | Computes a weighted PageRank value (FLOAT type) for each vertex. |
Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(E*k), E = number of edges, k = number of iterations. The number of iterations is data-dependent, but the user can set a maximum. Parallel processing reduces the time needed for computation. |
Graph Types | Directed edges |
Eigenvector centrality (also called eigencentrality or prestige score) is a measure of the influence of a vertex in a network. Relative scores are assigned to all vertices in the network based on the concept that connections to high-scoring vertices contribute more to the score of the vertex in question than equal connections to low-scoring vertices. A high eigenvector score means that a vertex is connected to many vertices who themselves have high scores.
For more information, see Eigenvector centrality.
The vertices with the highest eigenvector centrality scores, along with their scores.
Suppose we have the following graph:
Running the algorithm on the graph will show that Dan has the highest centrality score.
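Eigenvector centrality is typically computed by power iteration: repeatedly sum neighbor scores and rescale. The sketch below is a conceptual Python illustration (not the library's GSQL implementation; the convergence settings echo the maxIter and convLimit parameters documented later):

```python
# Eigenvector centrality sketch via power iteration.
# Conceptual illustration only.
import math

def eigenvector_centrality(nbrs, max_iter=100, conv_limit=1e-6):
    score = {v: 1.0 for v in nbrs}
    for _ in range(max_iter):
        # Each vertex receives the scores of its neighbors...
        new = {v: sum(score[u] for u in nbrs[v]) for v in nbrs}
        # ...then scores are rescaled to unit length.
        norm = math.sqrt(sum(x * x for x in new.values())) or 1.0
        new = {v: x / norm for v, x in new.items()}
        if sum(abs(new[v] - score[v]) for v in nbrs) < conv_limit:
            return new
        score = new
    return score
```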
In the Closeness Centrality algorithm, to obtain the closeness centrality score for a vertex, we measure the distance from the source vertex to every single vertex in the graph. In large graphs, running this calculation for every vertex can be highly time-consuming.
The Approximate Closeness Centrality algorithm (based on Cohen et al. 2014) calculates the approximate closeness centrality score for each vertex by combining two estimation approaches - sampling and pivoting. This hybrid estimation approach offers near-linear time processing and linear space overhead within a small relative error. It runs on graphs with unweighted edges (directed or undirected).
This query uses another subquery, closeness_cent_approx_sub, which needs to be installed before closeness_approx can be installed.
The result is a list of all vertices in the graph with their approximate closeness centrality score. It is available both in JSON and CSV format.
Below is an example of running the algorithm on the social10 test graph and an excerpt of the response.
We all have an intuitive understanding when we say a home, an office, or a store is "centrally located." Closeness Centrality provides a precise measure of how "centrally located" a vertex is. The steps below show the process for one vertex v:

This algorithm query employs a subquery called cc_subquery. Both queries are needed to run the algorithm.

Closeness centrality can be measured for either directed edges (from v to others) or for undirected edges. Directed graphs may seem less intuitive, however, because if the distance from Alex to Bob is 1, it does not mean the distance from Bob to Alex is also 1.

For our example, we wanted to use the topology of the Likes graph, but to have undirected edges. We emulated an undirected graph by using both Friend and Also_Friend (reverse-direction) edges.
This algorithm assigns a unique integer value known as its color to the vertices of a graph, such that no neighboring vertices share the same color. This task is called coloring because it is equivalent to assigning a color to each nation on a map so that no neighboring nations share the same color.

Given a set of k vertices, the algorithm first colors all vertices with the same color, the first color. It then starts from all the vertices and has each vertex send its own color to its neighbors. If two neighboring vertices have the same color, the algorithm reassigns colors where there is a conflict. The same process is repeated until all conflicts are resolved.
The algorithm has a worst-case time complexity of O(V^2 + E), where V is the number of vertices and E is the number of edges.
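A sequential conflict-resolution sketch of the same idea is shown below in Python (a conceptual illustration, not the library's GSQL implementation; the fixed vertex order is an assumption made to keep the loop deterministic):

```python
# Greedy conflict-resolution coloring sketch: start everyone on color 0,
# then recolor conflicting vertices to the smallest available color.
# Conceptual illustration, not the library's GSQL implementation.

def greedy_color(nbrs):
    color = {v: 0 for v in nbrs}            # everyone starts with the first color
    changed = True
    while changed:
        changed = False
        for v in sorted(nbrs):              # fixed order keeps the loop deterministic
            if any(color[u] == color[v] for u in nbrs[v]):
                used = {color[u] for u in nbrs[v]}
                # Pick the smallest color not used by any neighbor.
                color[v] = min(c for c in range(len(nbrs) + 1) if c not in used)
                changed = True
    return color
```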
TigerGraph's closeness centrality algorithm uses multi-source breadth-first search (MS-BFS) to traverse the graph and calculate the sum of a vertex's distance to every other vertex in the graph, which vastly improves the performance of the algorithm. The algorithm's implementation of MS-BFS is based on the paper .
On the graph, say we want to color the Person vertices in such a way that any two vertices that are connected by either a Friend edge or a Coworker edge do not have the same color. By running the greedy_graph_color algorithm, we get the following result:
Name | Description | Data type |
v_type | Vertex types to assign scores to. | SET<STRING> |
e_type | Edge types to traverse. | SET<STRING> |
maxIter | Maximum number of iterations. | INT |
convLimit | The convergence limit. | FLOAT |
top_k | The number of vertices with the highest scores to return. | INT |
print_accum | If true, print results to JSON output. | BOOL |
result_attr | If not empty, save the score of each vertex to this attribute. | STRING |
file_path | If not empty, print results in CSV to this file. | STRING |
Name | Description |
v_type | Vertex types to calculate approximate closeness centrality for. |
e_type | Edge types to traverse. |
k | Size of the sample. |
max_hops | Upper limit on how many hops the algorithm will take from each vertex. |
epsilon | The maximum relative error, between 0 and 1. A lower value produces a more accurate estimate but increases run time. |
print_accum | Whether to output to the console in JSON format. |
file_path | If provided, the algorithm will output to this file path in CSV format. |
debug | Controls the query's conditional logging. If the input is 0, nothing is logged. If the input is 1, everything is logged except the breadth-first search from the sample nodes. If the input is 2, everything is logged. |
sample_index | The algorithm partitions the graph based on the sample size. This index indicates which partition to use when estimating closeness centrality. |
maxsize | If the number of vertices in the graph is lower than maxsize, the exact closeness centrality is calculated instead and nothing is approximated. |
wf | Whether to use the Wasserman and Faust formula to calculate closeness centrality rather than the classic formula. |
Characteristic | Value |
Result | Computes a Closeness Centrality value (FLOAT type) for each vertex. |
Required Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(E), E = number of edges. Parallel processing reduces the time needed for computation. |
Graph Types | Directed or Undirected edges, Unweighted edges |
Name | Description |
| A set of all vertex types to color. |
| A set of all edge types to traverse. |
| The maximum number of colors that can be used. Use a large number like 999999 unless there is a strict limit. |
| If set to true, the total number of colors used will be displayed |
| If set to true, the output will display all vertices and their associated color |
| If a file path is provided, the output will be saved to the file indicated by the file path in CSV format. |
Step | Mathematical Formula |
1. Compute the average distance from vertex v to every other vertex: | avg_dist(v) = (\sum_{u \neq v} d(v, u)) / (n - 1) |
2. Invert the average distance, so we have the average closeness of v: | closeness(v) = (n - 1) / (\sum_{u \neq v} d(v, u)) |
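For unweighted edges, a single BFS from v yields every d(v, u), so the two steps above reduce to a few lines. The sketch below is a conceptual Python illustration (not the library's GSQL implementation; using the count of reached vertices in the numerator is an assumption in the spirit of the Wasserman and Faust adjustment mentioned in the parameter table):

```python
# Closeness centrality sketch: BFS from v gives d(v, u) for every reachable u;
# closeness(v) = (number reached) / (sum of distances).
from collections import deque

def closeness(nbrs, v):
    dist = {v: 0}
    q = deque([v])
    while q:                                # plain BFS over unweighted edges
        x = q.popleft()
        for y in nbrs[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                q.append(y)
    total = sum(d for u, d in dist.items() if u != v)
    return (len(dist) - 1) / total if total else 0.0
```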
The k-Nearest Neighbors (kNN) algorithm is one of the simplest classification algorithms. It assumes that some or all of the vertices in the graph have already been classified, with the classification stored as an attribute called the label. The goal is to predict the label of a given vertex by looking at the labels of its nearest vertices.
Given a source vertex in the dataset and a positive integer k, the algorithm calculates the distance between this vertex and all other vertices and selects the k vertices that are nearest. The prediction of the label of this node is the majority label among its k-nearest neighbors.
The algorithm will not output more than K vertex pairs, so the algorithm may arbitrarily choose to output one vertex pair over another if there are tied similarity scores.
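The prediction step can be sketched in a few lines. The Python below is a conceptual illustration only (not the library's GSQL implementation; the dict-of-dicts rating vectors are an assumption), using cosine similarity as the "nearness" measure, as this algorithm does:

```python
# kNN sketch: rank neighbors by cosine similarity and take the majority
# label among the labeled vertices in the top k.
from collections import Counter
import math

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = (math.sqrt(sum(x * x for x in a.values()))
           * math.sqrt(sum(x * x for x in b.values())))
    return num / den if den else 0.0

def knn_predict(source, vectors, labels, k):
    sims = [(cosine(vectors[source], vectors[v]), v)
            for v in vectors if v != source]
    # Take the top k, then drop unlabeled vertices and zero-similarity ones,
    # which are never used in the prediction.
    top = [v for s, v in sorted(sims, reverse=True)[:k]
           if s > 0 and labels.get(v)]
    votes = Counter(labels[v] for v in top)
    return votes.most_common(1)[0][0] if votes else None
```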
For the movie graph, we add the following labels to the Person vertices.
When we install the algorithm, we answer the installation questions like so:

We then run kNN, using Neil as the source person and k=3. This is the JSON output:
If we run cosine_nbor_ss, using Neil as the source person and k=3, we can see the persons with the top 3 similarity score:
Kat has a label "b", Kevin has a label "a", and Jing does not have a label. Since "a" and "b" are tied, the prediction for Neil is just one of the labels.
If Jing had label "b", then there would be 2 "b"s, so "b" would be the prediction.
If Jing had label "a", then there would be 2 "a"s, so "a" would be the prediction.
k-Nearest Neighbors (kNN) is often used for machine learning. You can choose the value for topK based on your experience, or use cross-validation to optimize the hyperparameters. Our library provides leave-one-out cross-validation for selecting the optimal k. Given a k value, we run the algorithm repeatedly, using every vertex with a known label as the source vertex, and predict its label. We assess the accuracy of the predictions for each value of k, and then repeat for different values of k in the given range. The goal is to find the value of k with the highest prediction accuracy in the given range, for that dataset.

Run knn_cosine_cv with min_k=2, max_k=5. The JSON result:
The distance can be a physical distance, or it can be the reciprocal of a similarity score, in which case "nearest" means "most similar". In our algorithm, the distance is the reciprocal of cosine neighbor similarity. The similarity calculation used here is the same as the calculation in . Note that in this algorithm, vertices with zero similarity to the source vertex are not considered in the prediction. For example, if there are 5 vertices with non-zero similarity to the source vertex and 5 vertices with zero similarity, then when we pick the top 7 neighbors, only the labels of the 5 vertices with non-zero similarity scores are used in the prediction.
This algorithm is a batch version of the . It makes a prediction for every vertex whose label is not known (i.e., the attribute for the known label is empty), based on its k nearest neighbors' labels.
Characteristic | Value |
Result | The predicted label for the source vertex. The result is available in three forms:
|
Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(D^2), D = outdegree of vertex v |
Graph Types | Undirected or directed edges, weighted edges |
Characteristic | Value |
Result | A list of prediction accuracy for every k value in the given range, and the value of k with the highest predicting accuracy in the given range. The result is available in JSON format |
Input Parameters |
|
Result Size | max_k-min_k+1 |
Time Complexity | O(max_k*E^2 / V), V = number of vertices, E = number of edges |
Graph Types | Undirected or directed edges, weighted edges |
Characteristic | Value |
Result | The predicted label for the vertices whose label attribute is empty. The result is available in three forms:
|
Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(E^2 / V), V = number of vertices, E = number of edges |
Graph Types | Undirected or directed edges, weighted edges |
A component is the maximal set of vertices, plus their connecting edges, which are interconnected. That is, you can reach each vertex from each other vertex. In the example figure below, there are three components.
This particular algorithm deals with undirected edges. If the same definition (each vertex can reach each other vertex) is applied to directed edges, then the components are called Strongly Connected Components. If you have directed edges but ignore the direction (permitting traversal in either direction), then the algorithm finds Weakly Connected Components.
It is easy to see in this small graph that the algorithm correctly groups the vertices:
Alex, Bob, and Justin all have Community ID = 2097152
Chase, Damon, and Eddie all have Community ID = 5242880
Fiona, George, Howard, and Ivy all have Community ID = 0
Our algorithm uses the TigerGraph engine's internal vertex ID numbers; they cannot be predicted.
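The grouping can be sketched as minimum-label propagation: every vertex starts with its own id and repeatedly adopts the smallest id in its neighborhood, so each component converges on one id. The Python below is a conceptual illustration (the real query uses internal engine ids; plain strings are used here):

```python
# Connected component sketch: every vertex starts with its own id, then
# repeatedly adopts the minimum id among itself and its neighbors.

def connected_components(nbrs):
    comp = {v: v for v in nbrs}          # initial label = own id
    changed = True
    while changed:                       # converges in O(component diameter) rounds
        changed = False
        for v in nbrs:
            best = min([comp[v]] + [comp[u] for u in nbrs[v]])
            if best != comp[v]:
                comp[v] = best
                changed = True
    return comp
```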
A k-core of a graph is a maximal connected subgraph in which every vertex is connected to at least k vertices in the subgraph. To obtain the k-core of a graph, the algorithm first deletes the vertices whose outdegree is less than k. It then updates the outdegree of the neighbors of the deleted vertices, and if that causes a vertex's outdegree to fall below k, it will also delete that vertex. The algorithm repeats this operation until every vertex left in the subgraph has an outdegree of at least k.
Our algorithm takes a range of values for k and returns the set of the vertices that constitute the k-core with the highest possible value of k within the range. It is an implementation of Algorithm 2 in Scalable K-Core Decomposition for Static Graphs Using a Dynamic Graph Data Structure, Tripathy et al., IEEE Big Data 2018.
O(E), where E is the number of edges in the graph.
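The peeling procedure described above can be sketched directly. The Python below is a conceptual illustration for a single value of k (not the library's GSQL implementation, which scans a range of k values):

```python
# k-core peeling sketch: repeatedly delete vertices with degree < k.

def k_core(nbrs, k):
    alive = set(nbrs)
    while True:
        doomed = {v for v in alive
                  if sum(1 for u in nbrs[v] if u in alive) < k}
        if not doomed:
            return alive                 # every survivor has >= k live neighbors
        alive -= doomed                  # deletions lower the neighbors' degrees
```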
In the example below, based on the social graph from GSQL 101, we can see that Dan, Tom, and Jenny make up a 2-core, which is the max-core of the graph.

If we run the kcore algorithm on this small graph like so:

Here is the returned JSON response, which includes a 2-core comprised of Dan, Jenny, and Tom:
In addition to the regular weakly connected component algorithm, we also provide a version that is optimized for small-world graphs. A small world graph means the graph has a hub community, where the vast majority of the vertices in the graph are weakly connected.
This version improves upon the performance of the original algorithm when dealing with small-world graphs by combining several different methods used to find connected components in a multi-step process proposed by Slota et al. in BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
The algorithm starts by selecting an initial pivot vertex v with a high product of indegree and outdegree. From the initial pivot vertex, the algorithm uses breadth-first search to determine the massive weakly connected component. The vertices that are not included in this WCC are passed to the next step.

After identifying the first WCC, the algorithm uses the coloring method to identify the WCCs among the remaining vertices.
For more details, see Slota et al., BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
If to_show_cc_count is set to true, the algorithm will return the number of vertices in each weakly connected component.

Suppose we have the following graph. We can see that there are three connected components. The first one has 5 vertices, while the other two have 3 vertices each.

Running the algorithm on the graph will show that there are three weakly connected components, which have 5, 3, and 3 vertices respectively.
A strongly connected component (SCC) is a subgraph such that there is a path from any vertex to every other vertex. A graph can contain more than one separate SCC. An SCC algorithm finds the maximal SCCs within a graph. Our implementation is based on the Divide-and-Conquer Strong Components (DCSC) algorithm [1]. In each iteration, pick a pivot vertex v at random and find its descendant and predecessor sets, where the descendant set D_v is the set of vertices reachable from v, and the predecessor set P_v is the set of vertices that can reach v (stated another way, reachable from v through reverse edges). The intersection of these two sets is a strongly connected component, SCC_v. The graph can then be partitioned into 4 sets: SCC_v, the descendants D_v excluding SCC_v, the predecessors P_v excluding SCC_v, and the remainder R_v. It has been proved that any SCC is a subset of one of these 4 sets [1]. Thus, we can divide the graph into different subsets and detect the SCCs independently and iteratively.

The problem with this algorithm is unbalanced load and slow convergence when there are many small SCCs, which is often the case in real-world graphs [3]. We added two trimming stages to improve the performance: size-1 SCC trimming [2] and weakly connected components [3].
The implementation of this algorithm requires reverse edges for all directed edges considered in the graph.
[1] Fleischer, Lisa K., Bruce Hendrickson, and Ali Pınar. "On identifying strongly connected components in parallel." International Parallel and Distributed Processing Symposium. Springer, Berlin, Heidelberg, 2000.
[2] Mclendon Iii, William, et al. "Finding strongly connected components in distributed graphs." Journal of Parallel and Distributed Computing 65.8 (2005): 901-910.
[3] Hong, Sungpack, Nicole C. Rodia, and Kunle Olukotun. "On fast parallel detection of strongly connected components (SCC) in small-world graphs." Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013.
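The divide-and-conquer scheme can be sketched recursively. The Python below is a conceptual illustration of plain DCSC without the trimming stages (fwd and rev are forward and reverse adjacency dicts, mirroring the reverse-edge requirement noted above):

```python
# Divide-and-conquer SCC (DCSC) sketch: pick a pivot, intersect its forward
# and backward reachable sets, then recurse on the three remainders.

def reach(nbrs, start, allowed):
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for u in nbrs.get(v, ()):        # DFS restricted to the current partition
            if u in allowed and u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

def dcsc(fwd, rev, verts=None):
    verts = set(fwd) if verts is None else verts
    if not verts:
        return []
    pivot = next(iter(verts))
    d = reach(fwd, pivot, verts)         # descendants D_v
    p = reach(rev, pivot, verts)         # predecessors P_v (via reverse edges)
    scc = d & p                          # the SCC containing the pivot
    rest = verts - d - p                 # remainder R_v
    return ([scc] + dcsc(fwd, rev, d - scc)
                  + dcsc(fwd, rev, p - scc)
                  + dcsc(fwd, rev, rest))
```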
We ran scc on the social26 graph. A portion of the JSON result is shown below.

The first element, "i"=1, means the whole graph was processed in just one iteration. The 5 "trim_set.size()" elements mean there were 5 rounds of size-1 SCC trimming. The final "@@.cluster_dist_heap" object reports the size distribution of SCCs: there is one SCC with 9 vertices, and 17 SCCs with only 1 vertex in the graph.
Characteristic | Value |
Result | Assigns a component id (INT) to each vertex, such that members of the same component have the same id value. |
Input Parameters | SET<STRING> v_type: names of vertex types to use. SET<STRING> e_type: names of edge types to use. INT output_limit: if >= 0, the maximum number of vertices to output to JSON. BOOL print_accum: if true, output JSON to standard output. STRING result_attr: if not empty, store the community ID values (INT) in this attribute. STRING file_path: if not empty, write the output to this file. |
Result Size | V = number of vertices |
Time Complexity | O(E*d), E = number of edges, d = max(diameter of components) |
Graph Types | Undirected edges |
Parameter | Description |
v_type | Vertex type to include in the k-core. |
e_type | Edge type to count for k-core connections. |
k_min | Minimum value of k. If the actual maximum core is below k_min, the algorithm will return an empty set. |
k_max | Maximum value of k. If k_max is smaller than k_min, the algorithm ignores this parameter and keeps looking for k-cores until it reaches a value of k for which no k-core can be found. |
show_membership | If true, the algorithm returns the k-cores found for every value of k within the range provided. For each k-core, the results include its member vertices. |
show_shells | The k-shell is the set of vertices that are part of the k-core but not part of the (k+1)-core. If true, the algorithm returns the k-shells found for every value of k within the range provided. For each k-shell, the results include its member vertices. |
print_accum | Whether the algorithm will return output in JSON. |
attr | Optional. A vertex attribute to which the core level of each vertex will be saved. |
file_path | Optional. If provided, the algorithm outputs results to this file in CSV format. |
Name | Description | Data type |
v_type | The vertex type to count as part of a connected component. | STRING |
e_type | The edge type to traverse. | STRING |
threshold | The threshold used to choose initial pivot vertices. Only vertices whose product of indegree and outdegree exceeds this threshold are considered candidates for the pivot vertex. This increases the chances that the initial pivot is contained within the largest WCC. The default value is 100000; it is suggested that you keep the default when running the algorithm. | UINT |
to_show_cc_count | If true, the algorithm will return the number of vertices in each connected component. | BOOL |
Characteristic | Value |
Result | Assigns a component id (INT) to each vertex, such that members of the same component have the same id value. |
Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(iter*d), d = max(diameter of components) |
Graph Types | Directed edges with reverse direction edges as well |
Label Propagation is a heuristic method for determining communities. The idea is simple: if a plurality of your neighbors bear the label X, then you should label yourself as a member of X as well. The algorithm begins with each vertex having its own unique label. Then we iteratively update labels based on the neighbor influence described above. It is important that the order for updating the vertices be random. The algorithm is favored for its efficiency and simplicity, but it is not guaranteed to produce the same results every time.

In a variant version, some vertices could initially be known to belong to the same community. If they are well-connected to one another, they are likely to preserve their common membership and influence their neighbors.
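The core loop is short enough to sketch directly. The Python below is a conceptual illustration (not the library's GSQL implementation); note the shuffled update order, which is why results can vary between runs:

```python
# Label Propagation sketch: adopt the most common label among neighbors.
import random
from collections import Counter

def label_propagation(nbrs, max_iter=10):
    label = {v: v for v in nbrs}          # unique starting labels
    verts = list(nbrs)
    for _ in range(max_iter):
        random.shuffle(verts)             # random update order matters
        changed = False
        for v in verts:
            if not nbrs[v]:
                continue
            counts = Counter(label[u] for u in nbrs[v])
            best, _ = counts.most_common(1)[0]
            if label[v] != best:
                label[v] = best
                changed = True
        if not changed:                   # steady state reached
            break
    return label
```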
This is the same graph that was used in the Connected Component example. The results are different, though. The quartet of Fiona, George, Howard, and Ivy have been split into 2 groups:
(George & Ivy) each connect to (Fiona & Howard) and to one another.
(Fiona & Howard) each connect to (George & Ivy) but not to one another.
Label Propagation tries to find natural clusters and separations within connected components. That is, it looks at the quality and pattern of connections. The Connected Component algorithm simply asks the yes-or-no question: are these two vertices connected?

We set max_iter to 10, but the algorithm reaches a steady state after 3 iterations:
Characteristic | Value |
Result | Assigns a community id (INT) to each vertex, such that members of the same community have the same id value. |
Input Parameters | SET<STRING> v_type: names of vertex types to use. SET<STRING> e_type: names of edge types to use. INT max_iter: maximum number of iterations of the algorithm. INT output_limit: if >= 0, the maximum number of vertices to output to JSON. BOOL print_accum: if true, output JSON to standard output. STRING attr: if not empty, store the community id values (INT) in this attribute. STRING file_path: if not empty, write the output to this file. |
Result Size | V = number of vertices |
Time Complexity | O(E*k), E = number of edges, k = number of iterations. |
Graph Types | Undirected edges |
The Local Clustering Coefficient algorithm computes the local clustering coefficient of every vertex in a graph. The local clustering coefficient of a vertex (node) in a graph quantifies how close its neighbors are to being a complete graph, where every two distinct vertices are connected by an edge. It is obtained by dividing the number of edges between a vertex's neighbors by the maximum number of edges that could possibly exist between them.
The algorithm does not report the local clustering coefficient for vertices that have only one neighbor.
O(n), where n is the number of vertices in the graph.
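The computation for one vertex is a small counting exercise. The Python below is a conceptual illustration (not the library's GSQL implementation; the adjacency dict is an assumption):

```python
# Local clustering coefficient sketch: edges among neighbors divided by
# the maximum possible number of such edges.

def local_clustering(nbrs, v):
    ns = list(nbrs[v])
    k = len(ns)
    if k < 2:
        return None                      # not reported for vertices with one neighbor
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if ns[j] in nbrs[ns[i]])
    return links / (k * (k - 1) / 2)     # e.g., Jenny below: 1 / 3
```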
Using the social graph below as an example, Jenny has three neighbors (Tom, Dan, and Amily), but only Tom and Dan are connected. Among the three neighbors, the maximum number of edges that could exist is 3. Therefore, the local clustering coefficient for Jenny is 1/3.
On the other hand, Tom has two neighbors that are connected by one edge. Since the maximum number of edges that could exist between two vertices is 1, Tom has a local clustering coefficient of 1/1 = 1.
By running the algorithm in GSQL, we can confirm that Tom has the highest local clustering coefficient with a score of 1 and Jenny has a score of 1/3.
This allows us to aggressively reduce the dimensionality of a dataset while preserving the distance information. By running the algorithm on a graph, we can preserve the similarity between nodes and their neighbors. Nodes that have similar neighborhoods are mapped to similar embedding vectors, while nodes that are not similar are mapped to dissimilar vectors.
The algorithm starts by assigning random vectors using very sparse random projection. It then iteratively averages the neighboring vectors from the previous iteration to construct intermediate embeddings. In each iteration, the algorithm also normalizes the intermediate embeddings using the standard Euclidean norm.
In the end, the embedding for each node is a weighted sum of the intermediate embeddings, where the weights are provided as a parameter of the algorithm.
This algorithm is especially useful in the following situations:
Your data is in such high dimension that it is too expensive to perform calculations.
The dimensionality is very high and the number of samples is too sparse to calculate covariances.
You don't have access to the entire dataset, such as when working with real-time data.
This algorithm requires an index attribute on the schema. Before running the algorithm, create an attribute on all vertex types to hold the index. You can use our provided schema change job script in the algorithm library.
The optimal values for the following parameters are dependent on your dataset. In order to obtain the best quality embeddings for your graph, it is a good idea to tune these parameters.
reduced_dimension

The reduced dimension (reduced_dimension) is the length of the produced vectors. A greater dimension offers greater precision, but is more costly to operate over.
input_weights and k

The algorithm iteratively constructs intermediate embeddings by averaging either the neighboring intermediate embeddings from the previous iteration, or the generated random vectors during the first iteration. The final embeddings are weighted sums of the intermediate embeddings from each iteration.

k is the number of iterations, and the input_weights parameter determines how much the set of intermediate embeddings from each iteration weighs. Make sure that the value of k you supply matches the number of weights you provide.
With each iteration, the algorithm will consider neighbors that are one step farther away.
normalization_strength
The normalization strength determines how node degrees influence the embedding. Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance.
sampling_constant
FastRP uses very sparse random projection to reduce the dimensionality of the data from an n*m matrix to an n*d matrix, where d <= m, by multiplying the original matrix with an m*d matrix. The m*d matrix is made up of independently and identically distributed entries sampled from:

+sqrt(s) with probability 1/(2s), 0 with probability 1 - 1/s, -sqrt(s) with probability 1/(2s)

where s is the sampling constant (sampling_constant). The higher the constant, the more zeros in the resulting matrix; this makes the matrix sparser but also speeds up the algorithm.
Node2Vec is a node embedding algorithm that uses random walks in the graph to create a vector representation of a node.
A random walk starts with a node, and the algorithm iteratively selects neighboring nodes to visit, and each neighboring node has an assigned probability. This transforms graph structure into a collection of linear sequences of nodes. For each node we will be left with a list of other nodes from their local or extended neighborhoods.
Once the above step is complete, the algorithm uses a variation of the word2vec model from the language modeling community to turn each node into a vector of probabilities. The probabilities represent the likelihood of visiting a given node in a random walk from each starting node.
The Speaker-listener Label Propagation Algorithm (SLPA) is a variation of the Label Propagation algorithm that is able to detect overlapping communities. The main difference between LPA and SLPA is that each node can only hold a single label in LPA while it is allowed to possess multiple labels in SLPA.
The algorithm begins with each vertex having its own unique label. It then iteratively records labels in a local accumulator based on specific speaking and listening rules. Then post-processing is applied to the recorded labels. Finally, the algorithm removes the nested communities and outputs all the communities. Note that it is not guaranteed to produce the same results every time.
In the example below, we run the tg_slpa algorithm on the social10 graph, with max_iter = 10 and threshold = 0.1.
Fast Random Projection (FastRP) is a scalable and performant node-embedding algorithm. It generates node embeddings (vectors) of low dimensionality through random projections from the graph's adjacency matrix (a high-dimensional matrix) to a low-dimensional matrix, significantly reducing the computing power required to process the data. The algorithm is theoretically backed by the Johnson-Lindenstrauss lemma, which states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.
After the attribute has been created, run the query to index all vertices on the graph.
Installing this query requires , which can be found in the . If you are running the query on a cluster, you need to manually install the UDF on every node of the cluster.
For more information, see .
Parameter | Description | Data type |
v_type | Vertex type to calculate local clustering coefficients for. | STRING |
e_type | Edge type to traverse. Only vertices that are connected by edges of this type are treated as connected in this algorithm. | STRING |
top_k | Number of the highest local clustering coefficients to report. | INT |
print_accum | If true, output JSON to standard output. | BOOL |
result_attr | If provided, the local clustering coefficient of each vertex will be saved to this attribute. | STRING |
file_path | If provided, write output to this file in CSV. | STRING |
display_edges | If true, the algorithm will also return edges, which helps produce better visualized results in GraphStudio. | BOOL |
Parameter | Description | Data type |
| The total number of edges in the graph. |
|
| The total number of vertices in the graph. |
|
| The sampling constant. A high |
|
| The length of the produced vectors. A greater dimension offers greater precision, but is more costly to operate over. |
|
| The number of iterations for the algorithm to construct intermediate embeddings. |
|
| The normalization strength determines how node degrees influence the embedding. Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance. |
|
| A list of floats that determines the weight of the intermediate embeddings constructed in each iteration. |
|
| The name of the index attribute created during preprocessing. |
|
Parameter | Description | Data type |
| Number of random walks per node |
|
| Number of hops per walk |
|
| File path to output results to |
|
| Edge types to traverse |
|
| Number of nodes to be used in the random sample |
|
Characteristic | Value |
Result | Assigns a list of component id (INT) to each vertex, such that members of the same component have the same id value. |
Required Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(E*k), E = number of edges, k = number of iterations. |
Graph Types | Undirected edges |
Why triangles? Think of it in terms of a social network:
If A knows B, and A also knows C, then we complete the triangle if B knows C. If this situation is common, it indicates a community with a lot of interaction.
The triangle is in fact the smallest multi-edge "complete subgraph," where every vertex connects to every other vertex.
Triangle count (or density) is a measure of community and connectedness. In particular, it addresses the question of transitive relationships: If A--> B and B-->C, then what is the likelihood of A--> C?
Note that it is computing a single number: How many triangles are in this graph? It is not finding communities within a graph.
It is not common to count triangles in directed graphs, though it is certainly possible. If you choose to do so, you need to be very specific about the direction of interest: In a directed graph, If A--> B and B--> C, then
if A-->C, we have a "shortcut".
if C-->A, then we have a feedback loop.
This algorithm can only be run on TigerGraph V3.1 or higher.
We present two different algorithms for counting triangles. The first, tri_count(), is the classic edge-iterator algorithm. For each edge and its two endpoint vertices S and T, count the overlap between S's neighbors and T's neighbors.
One side effect of the simple edge-iterator algorithm is that it ends up considering each of the three sides of a triangle. The count needs to be divided by 3, meaning we did 3 times more work than a smarter algorithm would have.
tri_count_fast() is a smarter algorithm which does two passes over the edges. In the first pass we mark which of the two endpoint vertices has fewer neighbors. In the second pass, we count the overlap only between marked vertices. The result is that we eliminate 1/3 of the neighborhood matching, the slowest 1/3, but at the cost of some additional memory.
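The classic edge-iterator idea can be sketched as follows. This Python snippet is a conceptual illustration of the simple version (not the library's GSQL implementation); the adjacency dict of sets is an assumption:

```python
# Edge-iterator triangle count sketch: for each undirected edge (s, t),
# count common neighbors; every triangle is seen once per edge, so
# the total is divided by 3.

def tri_count(nbrs):                 # nbrs: dict mapping vertex -> set of neighbors
    total = 0
    seen = set()
    for s in nbrs:
        for t in nbrs[s]:
            if (t, s) not in seen:   # visit each undirected edge once
                seen.add((s, t))
                total += len(nbrs[s] & nbrs[t])
    return total // 3
```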
In the social10 graph with Coworker edges, there are clearly 4 triangles.
Characteristic | Value |
Result | Returns the number of triangles in the graph. |
Input Parameters | v_type: vertex type to count. e_type: edge type to traverse. |
Result Size | 1 integer |
Time Complexity | O(V * E), V = number of vertices, E = number of edges |
Graph Types | Undirected edges |
Breadth-first Search (BFS) is an algorithm used to explore the vertices of a graph layer by layer. It starts at a given vertex and explores all vertices at the present depth before moving on to the vertices at the next depth level.
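The layer-by-layer exploration can be sketched with a queue. The Python below is a conceptual illustration (not the library's GSQL implementation; the max_hops cutoff mirrors the parameter documented below):

```python
# BFS sketch: explore layer by layer from a source vertex, recording the
# hop count at which each vertex is first reached.
from collections import deque

def bfs(nbrs, start, max_hops=None):
    depth = {start: 0}
    q = deque([start])
    while q:
        v = q.popleft()
        if max_hops is not None and depth[v] >= max_hops:
            continue                    # look only this far from the source
        for u in nbrs[v]:
            if u not in depth:          # first visit = shortest hop count
                depth[u] = depth[v] + 1
                q.append(u)
    return depth
```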
In the example below, we run the tg_bfs algorithm from the source vertex alex on the social10 graph.
Below is the visualized result of the query:
Characteristic | Value |
Result | Returns all the vertices that are reachable from the source vertex. |
Required Input Parameters | v_type: vertex types to traverse. e_type: edge types to traverse. max_hops: look only this far from the source vertex. v_start: source vertex for the traversal. print_accum: print JSON output. result_attr: INT attribute to store results to. file_path: file to write CSV output to. display_edges: output edges for visualization. |
Result Size | V = number of vertices |
Time Complexity | O(E + V), E = number of edges, V = number of vertices, since every vertex and every edge will be explored in the worst case. |
Graph Types | Directed or Undirected edges, Weighted or Unweighted edges |
Random Walk is an algorithm that generates random paths in a graph. A random walk starts at every vertex that has an outgoing edge and moves to another vertex at random. The random walk algorithm writes all paths of the specified size, drawn from the walks that were performed, to a file in CSV format.
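The sliding-window relationship between steps and output paths is easy to see in a sketch. The Python below is a conceptual illustration only (not the library's GSQL implementation); parameter names mirror the table further below:

```python
# Random walk sketch: from every vertex with an out-edge, take `step`
# random hops, then emit every window of `path_size` vertices.
import random

def random_walks(nbrs, step=2, path_size=3, sample_num=1):
    paths = []
    for v in nbrs:
        if not nbrs[v]:
            continue                    # walks start only where an out-edge exists
        for _ in range(sample_num):
            walk = [v]
            for _ in range(step):
                if not nbrs[walk[-1]]:
                    break               # dead end: stop this walk early
                walk.append(random.choice(list(nbrs[walk[-1]])))
            # A full walk of `step` hops has step + 1 vertices and therefore
            # contains step - path_size + 2 windows of `path_size` vertices.
            for i in range(len(walk) - path_size + 1):
                paths.append(walk[i:i + path_size])
    return paths
```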
Suppose we have the following social graph:
If we run the random walk algorithm on the graph with a path size of 3, a step count of 2, and a sample number of 1, then from each vertex there is one two-step random walk, for a total of 6 three-vertex paths:

We can also perform a 3-step walk and still ask for three-vertex paths, in which case the number of paths would double, because each 3-step walk contains two possible three-vertex paths:
If a graph has unweighted edges, then finding the shortest path from one vertex to another is the same as finding the path with the fewest hops. Think of Six Degrees of Separation and Friend of a Friend. Unweighted Shortest Path answers the question "How are you two related?" The two entities do not have to be persons. Shortest Path is useful in a host of applications, from estimating influences or knowledge transfer, to criminal investigation.
When the graph is unweighted, we can use a "greedy" approach to find the shortest path. In computer science, a greedy algorithm makes intermediate choices based on the data being considered at the moment, and then does not revisit those choices later on. In this case, once the algorithm finds any path to a vertex T, it is certain that that is a shortest path.
This algorithm finds an unweighted shortest path from one source vertex to each possible destination vertex in the graph. That is, it finds n paths.
If your graph has weighted edges, see the next algorithm. With weighted edges, it is necessary to search the whole graph, whether you want the path for just one destination or for all destinations.
In the below graph, we do not consider the weight on edges. Using vertex A as the source vertex, the algorithm discovers that the shortest path from A to B is A-B, and the shortest path from A to C is A-D-C, etc.
Finding shortest paths in a graph with weighted edges is algorithmically harder than in an unweighted graph because even after you find a path to a vertex T, you cannot be certain that it is a shortest path. If edge weights are always positive, then you must keep trying until you have considered every in-edge to T. If edge weights can be negative, then it's even harder. You must consider all possible paths.
A classic application for weighted shortest path is finding the shortest travel route to get from A to B. (Think of route planning "GPS" apps.) In general, any application where you are looking for the cheapest route is a possible fit.
The shortest path algorithm can be optimized if we know all the weights are nonnegative. If there can be negative weights, then sometimes a longer path will have a lower cumulative weight. Therefore, we have two versions of this algorithm:

The shortest_path_any_wt query is an implementation of the Bellman-Ford algorithm. If there is more than one path with the same total weight, the algorithm returns one of them.

Currently, shortest_path_pos_wt also uses Bellman-Ford. The well-known Dijkstra's algorithm is designed for serial computation and cannot work with GSQL's parallel processing.
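For reference, the Bellman-Ford relaxation scheme looks like this in a compact Python sketch (a conceptual illustration, not the GSQL query; it assumes an edge list of (source, target, weight) triples and no negative cycles):

```python
# Bellman-Ford sketch: relax every edge up to V-1 times; works with
# negative weights as long as there is no negative cycle.

def bellman_ford(edges, verts, source):
    INF = float("inf")
    dist = {v: INF for v in verts}
    dist[source] = 0
    for _ in range(len(verts) - 1):
        changed = False
        for s, t, w in edges:           # every edge is a candidate for relaxation
            if dist[s] + w < dist[t]:
                dist[t] = dist[s] + w
                changed = True
        if not changed:                 # early exit once distances stabilize
            break
    return dist
```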
The graph below has only positive edge weights. Using vertex A as the source vertex, the algorithm discovers that the shortest weighted path from A to B is A-D-B, with distance 8. The shortest weighted path from A to C is A-D-B-C with distance 9.
The graph below has both positive and negative edge weights. Using vertex A as the source vertex, the algorithm discovers that the shortest weighted path from A to E is A-D-C-B-E, with a cumulative score of 7 - 3 - 2 - 4 = -2.
Name | Description | Data type |
step | The number of hops performed in each random walk. | INT |
path_size | The length of the paths to output, measured in the number of vertices in a path. For example, if a walk has two steps, A -> B -> C, then it contains paths of both length 2 and length 3. If a path size of 2 is supplied, the algorithm outputs two paths: A -> B and B -> C. If the path size is 3, there is one path: A -> B -> C. | INT |
filepath | The file path to output the paths to. | STRING |
edge_types | The edge types that the random walk will traverse. | SET<STRING> |
sample_num | The number of sample walks to perform from each starting vertex. | INT |
Characteristic | Value |
Result | Computes a shortest distance (INT) and shortest path (STRING) from vertex source to each other vertex. |
Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(E), E = number of edges |
Graph Types | Directed or Undirected edges, Unweighted edges |
Characteristic | Value |
Result | Computes a shortest distance (INT) and shortest path (STRING) from vertex source to each other vertex. |
Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(V*E), V = number of vertices, E = number of edges |
Graph Types | Directed or Undirected edges, Weighted edges |
A* (pronounced "A-star") is a graph traversal and path search algorithm, which achieves better performance by using heuristics to guide its search.
The algorithm starts from a source node, and at each iteration of its main loop, it selects the path that minimizes f(n) = g(n) + h(n), where n is the next node on the path, g(n) is the cost of the path from the source node to n, and h(n) is a heuristic function that estimates the cost of the cheapest path from n to the target node.
The algorithm terminates when the path it chooses to extend is a path from start to goal or if there are no paths eligible to be extended. The heuristic function is problem-specific. If the heuristic function is admissible, meaning that it never overestimates the actual cost to get to the target, the algorithm is guaranteed to return a least-cost path from source to target.
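The best-first expansion driven by f(n) = g(n) + h(n) can be sketched with a priority queue. The Python below is a conceptual illustration (not the library's GSQL implementation); the caller supplies the problem-specific heuristic h, such as the haversine distance used in the station example below:

```python
# A* sketch: expand the frontier vertex with the smallest f(n) = g(n) + h(n).
# The heuristic h must never overestimate the true remaining cost.
import heapq

def a_star(nbrs, source, target, h):
    g = {source: 0.0}
    prev = {}
    frontier = [(h(source), source)]
    while frontier:
        f, v = heapq.heappop(frontier)
        if v == target:                  # the chosen path reaches the goal
            path = [v]
            while v in prev:
                v = prev[v]
                path.append(v)
            return g[target], path[::-1]
        for u, w in nbrs[v]:             # (neighbor, edge weight) pairs
            cand = g[v] + w
            if cand < g.get(u, float("inf")):
                g[u] = cand
                prev[u] = v
                heapq.heappush(frontier, (cand + h(u), u))
    return None                          # no paths eligible to be extended
```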
In the example below, we run the tg_astar algorithm to find the shortest path between the source vertex "Kings Cross" and the target vertex "Kentish Town" on a graph that is a transport network of stations. Each station has geometric attributes (i.e., latitude and longitude), and the edge weight represents the distance between stations in kilometers. The heuristic function used to guide the search is the haversine formula, which computes the distance between two points on a sphere given their longitudes and latitudes.
Below is the visualized result of the query:
Start with a set A = { a single vertex seed }
For all vertices in A, select a vertex y such that
y is not in A, and
There is an edge from y to a vertex x in A, and
The weight of the edge e(x,y) is the smallest among all eligible pairs (x,y).
Add y to A, and add the edge (x,y) to MST.
Repeat steps 2 and 3 until A has all vertices in the graph.
If the user specifies a source vertex, this will be used as the seed. Otherwise, the algorithm will select a random seed vertex.
If the graph contains multiple components (i.e., some vertices are disconnected from the rest of the graph), then the algorithm will span only the component of the seed vertex.

If you do not have a preferred vertex, and the graph might have more than one component, then you should use the Minimum Spanning Forest (MSF) algorithm instead.
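The steps above are Prim's algorithm, which a priority queue makes compact. The Python below is a conceptual illustration (not the library's GSQL implementation; the adjacency dict maps each vertex to (neighbor, weight) pairs):

```python
# Prim's MST sketch: grow set A from a seed, always adding the lightest
# edge that leaves A.  Spans only the seed's component.
import heapq

def prim_mst(nbrs, seed):
    in_tree = {seed}                     # A = { a single vertex seed }
    mst = []
    edges = [(w, seed, y) for y, w in nbrs[seed]]
    heapq.heapify(edges)
    while edges:
        w, x, y = heapq.heappop(edges)   # smallest-weight eligible edge (x, y)
        if y in in_tree:
            continue
        in_tree.add(y)
        mst.append((x, y, w))
        for z, wz in nbrs[y]:
            if z not in in_tree:
                heapq.heappush(edges, (wz, y, z))
    return mst
```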
In the graph social10, we consider only the undirected Coworker edges.
This graph has 3 components. Minimum Spanning Tree finds a tree for one component, so which component it will work on depends on what vertex we give as the starting point. If we select Fiona, George, Howard, or Ivy as the start vertex, then it works on the 4-vertex component on the left. You can start from any vertex in the component and get the same or an equivalent MST result.
The figure below shows the result of
Note that the value for the one vertex is ("Ivy", "Person"). In GSQL, this 2-tuple format, which explicitly gives the vertex type, is used when the query is written to accept a vertex of any type.

File output:

The attribute version requires a boolean attribute on the edge, which the algorithm sets to true if that edge is selected in the MST:
Given an undirected and connected graph, a minimum spanning tree is a set of edges that can connect all the vertices in the graph with the minimal sum of edge weights. The library implements a parallel version of Prim's algorithm:
The Single-Pair Shortest Path task seeks the shortest path between a source vertex S and a target vertex T. If the edges are unweighted, then use the query in our tutorial document.
If the edges are weighted, then use the algorithm. In the worst case, it takes the same computational effort to find the shortest path for one pair as to find the shortest paths for all pairs from the same source S. The reason is that you cannot know whether you have found the shortest (least weight) path until you have explored the full graph. If the weights are always positive, however, then a more efficient algorithm is possible. You can stop searching when you have found paths that use each of the in-edges to T.
Characteristic | Value |
Result | Computes a shortest distance (INT) and shortest path (STRING) from the source vertex to the target vertex. |
Required Input Parameters | source_vertex: start vertex. target_vertex: target vertex. e_type: edge types to traverse. wt_type: weight data type (INT, FLOAT, DOUBLE). wt_attr: attribute for edge weights. display: output edges for visualization. |
Result Size | 1 |
Time Complexity | O(V^2), V = number of vertices. |
Graph Types | Directed or Undirected edges, Weighted edges |
Characteristic | Value |
Result | Computes a minimum spanning tree. If the JSON or file output option is selected, the output is the set of edges that form the MST. If the result_attr option is selected, the edges that are part of the MST are tagged true; other edges are tagged false. |
Input Parameters |
|
Result Size | V - 1, V = number of vertices |
Time Complexity | O(V^2) |
Graph Types | Undirected edges and connected |
The All-Pairs Shortest Path algorithm is costly for large graphs because the computation time is O(V^3) and the output size is O(V^2). Be cautious about running this on very large graphs.
The All-Pairs Shortest Path (APSP) task seeks to find the shortest paths between every pair of vertices in the entire graph. In principle, this task can be handled by running the Single-Source Shortest Path (SSSP) algorithm for each input vertex, e.g.,
This example highlights one of the strengths of GSQL: treating queries as stored procedures that can be called from within other queries. We only show the result_attr and file_path options, because subqueries cannot send their JSON output.
For large graphs (with millions of vertices or more), however, this is an enormous task. While the massively parallel processing of the TigerGraph platform can speed up the computation by 10x or 100x, consider what it takes just to store or report the results. If there are 1 million vertices, then there are nearly 1 trillion output values.
There are more efficient methods than calling the single-source shortest path algorithm n times, such as the Floyd-Warshall algorithm, which computes APSP in O(V^3) time.
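As a point of comparison, here is a minimal Python sketch of Floyd-Warshall (illustrative only; the library's APSP approach instead calls its SSSP query once per vertex):

```python
def floyd_warshall(n, edges):
    """n vertices labeled 0..n-1; edges: [(u, v, w), ...] for a directed graph.
    Returns an n x n matrix of shortest-path distances."""
    INF = float("inf")
    dist = [[INF] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)   # keep the lightest parallel edge
    for k in range(n):                    # allow k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```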
Our recommendation:
If you have a smaller graph (perhaps thousands or tens of thousands of vertices), the APSP task may be tractable.
If you have a large graph, avoid using APSP.
The Cycle Detection problem seeks to find all the cycles (loops) in a graph. We apply the usual restriction that the cycles must be "simple cycles", that is, they are paths that start and end at the same vertex but otherwise never visit any vertex twice.
There are two versions of the task: for directed graphs and undirected graphs. The GSQL algorithm library currently supports only directed cycle detection. The Rocha–Thatte algorithm is an efficient distributed algorithm, which detects all the cycles in a directed graph. The algorithm will self-terminate, but it is also possible to stop at k iterations, which finds all the cycles having lengths up to k edges.
The basic idea of the algorithm is to (potentially) traverse every edge in parallel, again and again, forming all possible paths. At each step, if a path forms a cycle, it records it and stops extending it. More specifically:
Initialization: For each vertex, record one path consisting of its own id. Mark the vertex as Active.
Iteration steps: For each Active vertex v:
Send its list of paths to each of its out-neighbors.
Inspect each path P in the list of the paths received:
If the first id in P is also id(v), a cycle has been found:
Remove P from its list.
If id(v) is the least id of any id in P, then add P to the Cycle List. (The purpose is to count each cycle only once.)
Else, if id(v) is somewhere else in the path, then remove P from the path list (because this cycle must have been counted already).
Else, append id(v) to the end of each of the remaining paths in its list.
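The following minimal Python sketch mirrors the iteration logic described above, but runs it sequentially; the real algorithm processes all active vertices in parallel, and the function name and adjacency-map input are our own assumptions:

```python
def rocha_thatte_cycles(adj, max_depth):
    """adj: {vertex_id: [out_neighbor_ids]}, with every vertex present as a key.
    Returns simple cycles as lists of vertex ids."""
    paths = {v: [[v]] for v in adj}          # initialization: one path per vertex
    cycles = []
    for _ in range(max_depth):               # optional cap = depth parameter
        inbox = {v: [] for v in adj}
        for v, plist in paths.items():       # send paths to out-neighbors
            for nbr in adj[v]:
                for p in plist:
                    inbox[nbr].append(p)
        paths = {v: [] for v in adj}
        for v, received in inbox.items():
            for p in received:
                if p[0] == v:                # cycle found
                    if v == min(p):          # count each cycle only once
                        cycles.append(p)
                elif v in p:
                    pass                     # cycle counted elsewhere; drop path
                else:
                    paths[v].append(p + [v]) # extend the path with id(v)
        if not any(paths.values()):
            break                            # self-termination: no active paths
    return cycles
```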
In the social10 graph, there are 5 cycles, all within the Fiona-George-Howard-Ivy cluster.
Characteristic
Value
Result
Computes a list of vertex id lists, each of which is a cycle. The result is available in 2 forms:
streamed out in JSON format
written to a file in tabular format
Input Parameters
SET<STRING> v_type
: Names of vertex types to use
SET<STRING> e_type
: Names of edge types to use
INT depth
: Maximum cycle length to search for = maximum number of iterations
BOOL print_accum
: If True, output JSON to standard output
STRING file_path
: If not empty, write output to this file.
Result Size
Number of cycles * average cycle length. Neither of these measures is known in advance.
Time Complexity
O(E*k), E = number of edges, k = min(maximum cycle length, depth parameter)
Graph Types
Directed
An independent set of vertices is one that contains no pair of neighboring vertices, i.e., vertices with an edge between them. A maximal independent set is an independent set that cannot be enlarged: every vertex outside the set is adjacent to some vertex in it, so you cannot improve upon it unless you start over with a different independent set. However, the search for the largest possible independent set (the maximum independent set, as opposed to a maximal independent set) is an NP-hard problem: there is no known algorithm that can find that answer in polynomial time. So we settle for a maximal independent set.
This algorithm finds use in applications wanting to find the most efficient configuration which "covers" all the necessary cases. For example, it has been used to optimize delivery or transit routes, where each vertex is one transit segment and each edge connects two segments that can NOT be covered by the same vehicle.
Consider our social10 graph, with three components.
It is clear that for each of the two triangles -- (Alex, Bob, Justin) and (Chase, Damon, Eddie) -- we can select one vertex from each triangle to be part of the MIS. For the 4-vertex component (Fiona, George, Howard, Ivy), it is less clear what will happen. If the algorithm selects either George or Ivy, then no other independent vertices remain in the component. However, the algorithm could select both Fiona and Howard; they are independent of one another.
This demonstrates the uncertainty of the Maximal Independent Set algorithm and how it differs from Maximum Independent Set. A maximum independent set algorithm would always select Fiona and Howard, plus 2 others, for a total of 4 vertices. The maximal independent set algorithm relies on chance. It could return either 3 or 4 vertices.
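A minimal Python sketch of one common randomized greedy approach (illustrative; the library's parallel implementation may differ in detail) shows why the result depends on chance:

```python
import random

def maximal_independent_set(adj, seed=None):
    """adj: {vertex: set(neighbors)} for an undirected graph."""
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)                 # random order => result varies run to run
    mis, blocked = set(), set()
    for v in order:
        if v not in blocked:
            mis.add(v)                 # v is independent of everything chosen so far
            blocked |= adj[v]          # its neighbors can no longer be chosen
    return mis
```

On the 4-vertex component of social10, depending on the shuffle, this sketch picks either one vertex (George or Ivy) or two (Fiona and Howard), matching the 3-or-4 outcome described above.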
Characteristic
Value
Result
A set of vertices that form a maximal independent set.
Input Parameters
STRING v_type
: Name of vertex type to use
STRING e_type
: Name of edge type to use
INT max_iter
: maximum number of iterations for the search
BOOL print_accum
: If True, output JSON to standard output
STRING file_path
: If not empty, write output to this file.
Result Size
Size of the MIS: unknown. Worst case: If the graph is a set of N unconnected vertices, then the MIS is all N vertices.
Time Complexity
O(E), E = number of edges
Graph Types
Undirected edges
Euclidean distance measures the straight-line distance between two points in n-dimensional space. The algorithm takes two vectors, each represented by a ListAccum, and returns the Euclidean distance between them.
This algorithm is implemented as a user-defined function. You need to follow the steps in Add a User-Defined Function to add the function to GSQL. After adding the function, you can call it in any GSQL query in the same way as a built-in GSQL function.
The Euclidean distance between the two vectors.
Name
Description
Data type
A
An n-dimensional vector denoted by a ListAccum of length n
ListAccum<INT/UINT/FLOAT/DOUBLE>
B
An n-dimensional vector denoted by a ListAccum of length n
ListAccum<INT/UINT/FLOAT/DOUBLE>
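The computation itself is the standard formula; a minimal Python equivalent (illustrative, not the GSQL UDF itself):

```python
from math import sqrt

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    assert len(a) == len(b), "vectors must have the same dimension n"
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# e.g., euclidean_distance([0, 0], [3, 4]) == 5.0
```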
The Pearson correlation coefficient is a measure of linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations.
The formula for calculating the Pearson correlation coefficient is as follows:
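ρ(A, B) = cov(A, B) / (σ_A · σ_B)

where cov(A, B) is the covariance of the two vectors and σ_A, σ_B are their standard deviations.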
This algorithm is implemented as a user-defined function. You need to follow the steps in Add a User-Defined Function to add the function to GSQL. After adding the function, you can call it in any GSQL query in the same way as a built-in GSQL function.
The Pearson correlation coefficient between the two vectors.
To compare two vertices by cosine similarity, the selected properties of each vertex are first represented as a vector. For example, a property vector for a Person vertex could have the elements age, height, and weight. Then the cosine function is applied to the two vectors.
The cosine similarity of two vectors A and B is defined as follows:
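cos(A, B) = (A · B) / (‖A‖ · ‖B‖)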
If A and B are identical, then cos(A, B) = 1. As expected for a cosine function, the value can also be negative or zero. In fact, cosine similarity is closely related to the Pearson correlation coefficient.
For this library function, the feature vector is the set of edge weights between the two vertices and their neighbors.
In the movie graph shown in the figure below, there are Person vertices and Movie vertices. Every person may give a rating to some of the movies. The rating score is stored on the Likes edge using the weight attribute. For example, in the graph below, Alex gives a rating of 10 to the movie "Free Solo".
The output size is at most K (when K ≤ N), and the algorithm may arbitrarily choose to output one vertex over another if there are tied similarity scores.
Given one person's name, this algorithm calculates the cosine similarity between this person and each other person with whom there is at least one movie they have both rated.
In the previous example, if the source vertex is Alex and top_k is set to 5, then we calculate the cosine similarity between him and the two other persons who qualify, Jing and Kevin. The JSON output shows the top 5 similar vertices and their similarity scores in descending order. The output limit is 5 persons, but only 2 persons qualify:
The FILE version output is not necessarily in descending order. It looks like the following:
The ATTR version inserts an edge into the graph with the similarity score as an edge attribute whenever the score is larger than zero. The result looks like this:
This algorithm computes the same similarity scores as the Cosine similarity of neighborhoods, all pairs algorithm, except that it treats every vertex as a source vertex and computes its similarity scores with its neighbors, for all vertices in parallel. Since this is a memory-intensive operation, the workload is split into batches to reduce peak memory usage; the user can specify the number of batches.
This algorithm has a time complexity of O(E), where E is the number of edges, and runs on graphs with weighted edges (directed or undirected).
The result of this algorithm is the top k cosine similarity scores and their corresponding pair for each vertex. The score is only included if it is greater than 0.
The result can be output in JSON format, in CSV to a file, or saved as a similarity edge in the graph itself.
Using the social10 graph, we can calculate the cosine similarity of every person to every other person connected by a Friend edge, and print out the top k most similar pairs for each vertex.
The Common Neighbors algorithm calculates the number of common neighbors between two vertices.
The number of common neighbors between two vertices.
Suppose we have the following graph:
Running the algorithm between Dan and Jenny will show that they have 1 common neighbor:
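Conceptually, the computation is just a set intersection; a minimal Python sketch (illustrative, assuming an adjacency-map input):

```python
def common_neighbors(adj, a, b):
    """adj: {vertex: set(neighbors)}. Returns the number of shared neighbors."""
    return len(adj[a] & adj[b])

# e.g., common_neighbors(adj, "Dan", "Jenny") -> 1 in the example graph above
```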
The algorithm will not output more than K vertex pairs, so it may arbitrarily choose to output one vertex pair over another if there are tied similarity scores.
For the movie graph, calculate the Jaccard similarity between all pairs and show the 5 most similar pairs: jaccard_nbor_ap(5). This is the JSON output:
This algorithm has a time complexity of O(E), where E is the number of edges, and runs on graphs with unweighted edges (directed or undirected).
The result contains the top k Jaccard similarity scores for each vertex and its corresponding pair. A pair is only included if its similarity is greater than 0, meaning there is at least one common neighbor between the pair. The result is available in JSON format, or can be output to a file in CSV, or it can be saved as an edge on the graph itself. A JSON formatted result could look like this:
This algorithm computes the same similarity scores as the single-source Jaccard similarity algorithm, except that it considers ALL pairs of vertices in the graph (for the vertex and edge types selected by the user). Naturally, this algorithm will take longer to run. For very large and very dense graphs, this algorithm may not be a practical choice.
This algorithm computes the same similarity scores as the Jaccard similarity of neighborhoods, all pairs algorithm, except that it treats every vertex as a source vertex and computes its similarity scores with its neighbors, for all vertices in parallel. Since this is a memory-intensive operation, the workload is split into batches to reduce peak memory usage; the user can specify the number of batches.
Name
Description
Data type
A
An n-dimensional vector denoted by a ListAccum of length n
ListAccum<INT/UINT/FLOAT/DOUBLE>
B
An n-dimensional vector denoted by a ListAccum of length n
ListAccum<INT/UINT/FLOAT/DOUBLE>
Characteristic
Value
Result
The top K vertices in the graph that have the highest similarity scores, along with their scores.
The result is available in three forms:
streamed out in JSON format
written to a file in tabular format
stored as a vertex attribute value
Input Parameters
VERTEX source
: Source vertex
SET<STRING> e_type
: Edge type to traverse
SET<STRING> re_type
: Reverse edge type to traverse
STRING weight
: The edge attribute to use as the weight of the edge
INT top_k
: Number of vertices
INT output_limit:
BOOL print_accum
: If true, output JSON to standard output.
STRING filepath
: If provided, the output will be written to this file path in CSV format.
STRING similarity_edge
: If provided, the similarity score will be saved to this edge.
Result Size
top_k
Time Complexity
O(D^2), D = outdegree of vertex v
Graph Types
Undirected or directed edges, weighted edges
Name
Description
v_type
Vertex type to calculate similarity for
e_type
Directed edge type to traverse
edge_attribute
Name of the attribute on the edge type to use as the weight
topK
Number of top scores to report for each vertex
print_accum
If true, output JSON to standard output.
similarity_edge
If provided, the similarity score will be saved to this edge.
file_path
If not empty, write output to this file in CSV.
num_of_batches
Number of batches to divide the query into
Name | Description | Data type |
| A vertex. |
|
| A vertex. |
|
| Edge types to traverse. |
|
Characteristic | Value |
Result | The top k vertex pairs in the graph that have the highest similarity scores, along with their scores. The result is available in three forms:
|
Input Parameters |
|
Result Size |
|
Time Complexity | O(E^2 / V), V = number of vertices, E = number of edges |
Graph Types | Undirected or directed edges, unweighted edges |
Name | Description |
| Vertex type to calculate similarity for |
| Directed edge type to traverse |
| Reverse edge type to traverse |
| Number of top scores to report for each vertex |
| If |
| If provided, the similarity scores will be saved to this edge type. |
| If a file path is provided, the algorithm will output to a file specified by the file path in CSV format |
| Number of batches to divide the query into |
Preferential Attachment is a measure of the closeness of two vertices based on the number of their neighbors. The algorithm returns the product of the two vertices' neighbor counts.
For more information, see Preferential Attachment.
The product of the number of neighbors of the two vertices.
Suppose we have the following graph:
Since Dan has four neighbors and Jenny has three, the return value of the algorithm is 3 × 4 = 12:
Resource Allocation is used to compute the closeness of nodes based on their shared neighbors. It is computed by the following formula:
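RA(u, v) = Σ_{w ∈ N(u) ∩ N(v)} 1 / |N(w)|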
where N(u) is the set of nodes adjacent to u.
Suppose we have the following graph:
Since Dan and Jenny have one shared neighbor, Tom, who has two neighbors, running the algorithm between Dan and Jenny with friendship edges gives a result of 1/2 = 0.5.
Name
Description
Data type
a
A vertex.
VERTEX
b
A vertex.
VERTEX
e_type
Edge types to traverse.
SET<STRING>
Name
Description
Data type
a
A vertex.
VERTEX
b
A vertex.
VERTEX
e_type
Edge types to traverse.
SET<STRING>
This algorithm takes two vertices and returns 1 if the two vertices are in the same community, or 0 if they are not.
The algorithm assumes that community detection has already completed and that a community ID is stored in an integer attribute on each vertex.
Returns 1 if the two vertices are in the same community.
Returns 0 if the two vertices are not in the same community.
Suppose we have the following vertices:
Their community IDs were generated by running the Weakly Connected Component algorithm on the graph. If we run the algorithm between Kevin and Jenny, we get 1, because the two vertices are in the same community, as indicated by their community attribute:
ArticleRank is an algorithm that has been derived from the PageRank algorithm to measure the influence of journal articles.
PageRank assumes that relationships originating from low-degree nodes have a higher influence than relationships from high-degree nodes. ArticleRank modifies the formula in such a way that it retains the basic PageRank methodology but lowers the influence of low-degree nodes.
The ArticleRank of a node v at iteration i is defined as:
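AR_i(v) = (1 - d) + d · Σ_{u ∈ Nin(v)} AR_{i-1}(u) / (|Nout(u)| + Nout)

(This is the standard ArticleRank formulation; the symbols are explained below.)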
Within the formula:
Nin(v) are the incoming neighbors and Nout(v) are the outgoing neighbors of node v.
d is a damping factor in [0, 1], usually set to 0.85.
Nout (without an argument) is the average out-degree over all nodes.
The ArticleRank score for each vertex.
Suppose we have the following graph:
Running ArticleRank on the graph, we see that the vertex with the highest score is Dan:
Name
Description
Data type
a
A vertex.
VERTEX
b
A vertex.
VERTEX
communityAttribute
The community attribute used to store a vertex’s community ID.
STRING
Name | Description | Data type |
| A vertex type. |
|
| An edge type. |
|
| Article Rank will stop iterating when the largest difference between any vertex's current score and its previous score ≤ |
|
| Maximum number of iterations. |
|
| The damping factor. Usually set to 0.85. |
|
| The number of results with the highest scores to return. |
|
| If true, print JSON output. |
|
| If true, store the article rank score of each vertex in this attribute. |
|
| If true, output CSV to this file. |
|
Influence maximization is the problem of finding a small subset of vertices in a social network that could maximize the spread of influence.
There are two versions of the Influence Maximization algorithm. Both versions find k vertices that maximize the expected spread of influence in the network. The CELF version improves upon the efficiency of the greedy version and should be preferred for analyzing large networks.
The two versions of the algorithm are based on the following papers:
The CELF version and the greedy version of the algorithms share the same set of parameters.
The IDs of the vertices with the highest influence scores, along with their scores.
CELF:
Greedy:
Name | Description | Data type |
| A vertex type |
|
| An edge type |
|
| The name of the weight attribute on the edge type |
|
| The number of vertices with the highest influence score to return |
|
| If true, print results to JSON output. |
|
| If not empty, save results in CSV to this file. |
|
The Harmonic Centrality algorithm calculates the harmonic centrality of each vertex in the graph. Harmonic Centrality is a variant of Closeness Centrality. In a (not necessarily connected) graph, the harmonic centrality reverses the sum and reciprocal operations in the definition of closeness centrality:
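H(v) = Σ_{u ≠ v} 1 / d(u, v)

where d(u, v) is the shortest-path distance, and 1/d(u, v) is taken to be 0 when u cannot reach v.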
If your graph has many unconnected clusters, the harmonic centrality could be a better indicator of centrality than closeness centrality.
For more information, see Harmonic Centrality.
If we have the following graph, we can see that Ivy is the most central of the five vertices. Running the algorithm on the graph shows that Ivy has the highest centrality score:
Name
Description
Data type
v_type
Vertex types to calculate harmonic centrality for
SET<STRING>
e_type
Edge types to traverse
SET<STRING>
re_type
Reverse edge types. For undirected edges, fill in the edge type name in this parameter as well as in the e_type parameter.
SET<STRING>
max_hops
The maximum number of hops the algorithm would consider for each vertex. If set to a non-positive value, the limit is ignored.
INT
top_k
Sort the scores high to low and output the highest k scores
INT
wf
If true, use the Wasserman-Faust normalization for multi-component graphs
BOOL
print_accum
If true, output JSON to standard output
BOOL
result_attr
If not empty, store centrality values (FLOAT) to this attribute
STRING
file_path
If not empty, write output to this file in CSV.
STRING
display_edges
If true, include the graph's edges in the JSON output, so that the full graph can be displayed.
BOOL
In addition to the regular Strongly Connected Component algorithm, we also provide a version that is optimized for small-world graphs. A small-world graph in this context means a graph with a hub community, in which a vast majority of the vertices are weakly connected.
This version improves upon the performance of the original algorithm when dealing with small-world graphs by combining several different methods used to find connected components in a multi-step process proposed by Slota et al. in BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
The algorithm starts by trimming the graph, which removes all vertices whose indegree or outdegree is 0. In the second phase, the algorithm selects an initial pivot vertex v with a high product of indegree and outdegree. From the initial pivot vertex, the algorithm uses one iteration of the Forward-Backward method to identify all vertices reachable from v (descendants) and all vertices that can reach v (predecessors). The intersection of the descendants and the predecessors forms a strongly connected component (SCC). The vertices not included in this SCC are passed on to the next step.
After identifying the first SCC, the algorithm uses the coloring method and Tarjan's serial algorithm to identify the SCCs in the remaining vertices.
For more details, see Slota et al., BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
The algorithm has a time complexity of O(V + E), where V is the number of vertices and E is the number of edges in the graph.
When to_show_cc_count is set to true, the algorithm returns the number of strongly connected components in the graph.
Suppose we have the following graph. We can see there are 7 strongly connected components, two of which contain more than one vertex. The five vertices on the left each form a strongly connected component on their own.
Running the algorithm on the graph will return a result of 7:
Name
Description
Data type
v_type
The vertex type to count as part of a strongly connected component
STRING
e_type
The edge type to traverse
STRING
re_type
The reverse edge type to traverse. If the graph is undirected, fill in the name of the undirected edge here as well as for e_type.
STRING
threshold
The threshold used to choose initial pivot vertices. Only vertices whose product of indegree and outdegree exceeds this threshold will be considered candidates for the pivot vertex. This is an attempt to increase the chances that the initial pivot is contained within the largest SCC.
The default value for this parameter is 100000. It is suggested that you keep this default value when running the algorithm.
UINT
to_show_cc_count
If set to TRUE, the algorithm returns the number of vertices in each strongly connected component.
BOOL
The Louvain Method for community detection [1] partitions the vertices in a graph by approximately maximizing the graph's modularity score. The modularity score for a partitioned graph assesses the difference in density of links within a partition vs. the density of links crossing from one partition to another. The assumption is that if a partitioning is good (that is, dividing up the graph into communities or clusters), then the within-density should be high and the inter-density should be low.
The most efficient and empirically effective method for calculating modularity was published by a team of researchers at the University of Louvain. The Louvain method uses agglomeration and hierarchical optimization:
Optimize modularity for small local communities.
Treat each optimized local group as one unit, and repeat the modularity operation for groups of these condensed units.
The original Louvain Method contains two phases. The first phase incrementally calculates the modularity change of moving a vertex into every other community and moves the vertex to the community with the highest modularity change. The second phase coarsens the graph by aggregating the vertices assigned to the same community into one vertex. The first phase and second phase make up a pass. The Louvain Method performs the passes iteratively. In other words, the algorithm assigns an initial community label to every vertex, then performs the first phase, during which the community labels are changed until there is no modularity gain. Then it aggregates the vertices with the same labels into one vertex and calculates the aggregated edge weights between the new vertices. For the coarsened graph, the algorithm conducts the first phase again to move the vertices into new communities. The algorithm continues until the modularity stops increasing, or until it reaches the preset iteration limit.
However, phase one is sequential, and thus slow for large graphs. An improved Parallel Louvain Method (PLM) calculates the best community to move to for each vertex in parallel [2]. In PLM, a positive modularity gain is not guaranteed, and two vertices may swap into each other's communities. After the passes finish, there is an additional refinement phase, which runs the first phase again on each vertex to make small adjustments to the resulting communities [3].
[1] Blondel, Vincent D., et al. "Fast unfolding of communities in large networks." Journal of statistical mechanics: theory and experiment 2008.10 (2008): P10008.
[2] Staudt, Christian L., and Henning Meyerhenke. "Engineering parallel algorithms for community detection in massive networks." IEEE Transactions on Parallel and Distributed Systems 27.1 (2016): 171-184.
[3] Lu, Hao, Mahantesh Halappanavar, and Ananth Kalyanaraman. "Parallel heuristics for scalable community detection." Parallel Computing 47 (2015): 19-37.
If we use louvain_parallel on the social10 graph, it gives the same result as the connected components algorithm. The social26 graph is a densely connected graph: the connected components algorithm groups all the vertices into the same community, and label propagation does not consider the edge weight. In contrast, louvain_parallel detects 7 communities in total, and the cluster distribution is shown below (csize is cluster size):
Characteristic
Value
Result
Assigns a community id (INT) to each vertex, such that members of the same community have the same id value. The JSON output lists every vertex with its community id value. It also lists community id values, sorted by community size.
Input Parameters
SET<STRING> v_type
: Names of vertex types to use
SET<STRING> e_type
: Names of edge types to use
STRING wt_attr
: Name of edge weight attribute (must be FLOAT)
INT iter1
: Max number of iterations for the first phase. Default value is 10
INT iter2
: Max number of iterations for the second phase. Default value is 10
INT iter3
: Max number of iterations for the refinement phase. Default value is 10
INT split
: Number of splits in phase 1. Increase the number to save memory, at the expense of having a longer running time. Default value is 10.
BOOL print_accum
: If True, output JSON to standard output
STRING result_attr
: If not empty, store community id values (INT) to this attribute
STRING file_path
: If not empty, write output to this file.
BOOL comm_by_size
: If True, and if print_accum is True, output the membership of each community, with communities arranged by size.
Result Size
V = number of vertices
Time Complexity
O(V^2*L), V = number of vertices, L = (iter1 * iter2 + iter3) = total number of iterations
Graph Types
Undirected, weighted edges
An edge weight attribute is required.
The diameter of a graph is the worst-case length of a shortest path between any pair of vertices in a graph. It is the farthest distance to travel, to get from one vertex to another, if you always take the shortest path. Finding the diameter requires calculating (the lengths of) all shortest paths, which can be quite slow.
This algorithm uses a simple heuristic to estimate the diameter: rather than calculating the distance from each vertex to every other vertex, it selects K vertices at random, where K is a user-provided parameter, and calculates the distances from each of these K vertices to all other vertices. So, instead of calculating V*(V-1) distances, this algorithm calculates only K*(V-1) distances. The higher the value of K, the greater the likelihood of hitting the actual longest shortest path.
The current version only computes unweighted distances.
This algorithm query employs a subquery called max_BFS_depth. Both queries are needed to run the algorithm.
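A minimal Python sketch of this heuristic (illustrative only; the library's version is a GSQL query plus the max_BFS_depth subquery, and the function name and adjacency-map input here are assumptions):

```python
import random
from collections import deque

def estimate_diameter(adj, k, seed=None):
    """adj: {vertex: [neighbors]}. BFS from k random seeds; the largest
    eccentricity found is a lower bound on the true diameter."""
    rng = random.Random(seed)
    seeds = rng.sample(list(adj), k)
    best = 0
    for s in seeds:
        depth = {s: 0}
        queue = deque([s])
        while queue:                        # unweighted BFS from s
            v = queue.popleft()
            for w in adj[v]:
                if w not in depth:
                    depth[w] = depth[v] + 1
                    queue.append(w)
        best = max(best, max(depth.values()))
    return best
```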
Given an undirected graph with one or more connected components, a minimum spanning forest is a set of minimum spanning trees, one for each component. The library implements the algorithm in section 6.2 of Qin et al. 2014.
The overlap coefficient, or Szymkiewicz–Simpson coefficient, is a similarity measure that measures the overlap between two finite sets.
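For two finite sets A and B, it is defined as:

overlap(A, B) = |A ∩ B| / min(|A|, |B|)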
This algorithm is implemented as a user-defined function. You need to follow the steps in to add the function to GSQL. After adding the function, you can call it in any GSQL query in the same way as a built-in GSQL function.
Characteristic
Value
Result
Returns the estimated value for the diameter of the graph
Input Parameters
SET<STRING> v_type
: Names of vertex types to use
SET<STRING> e_type
: Names of edge types to use
INT seed_set_length
: The number (K) of random seed vertices to use
BOOL print_accum
: If True, output JSON to standard output
STRING file_path
: If not empty, write output to this file.
Result Size
one integer
Time Complexity
O(k*E), E = number of edges, k = number of seed vertices
Graph Types
Directed
Characteristic | Value |
Result | Computes a minimum spanning forest. If JSON or file output is selected, the output is the set of edges that form the MSF. If the result_attr option is selected, the edges that are part of the MSF are tagged True; other edges are tagged False. |
Input Parameters |
|
Result Size | V - c, V = number of vertices, c = number of components |
Time Complexity | O((V+E) * logV) |
Graph Types | Undirected edges |
Name | Description | Data type |
| An n-dimensional vector denoted by a |
|
| An n-dimensional vector denoted by a |
|
The Jaccard index measures the relative overlap between two sets. To compare two vertices by Jaccard similarity, first select a set of values for each vertex. For example, a set of values for a Person could be the cities the Person has lived in. Then the Jaccard index is computed for the two vectors.
The Jaccard index of two sets A and B is defined as follows:
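Jaccard(A, B) = |A ∩ B| / |A ∪ B|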
The value ranges from 0 to 1. If A and B are identical, then Jaccard(A, B) = 1. If both A and B are empty, we define the value to be 0.
In the current
The algorithm will not output more than K vertices, so the algorithm may arbitrarily choose to output one vertex over another if there are tied similarity scores.
Using the movie graph, we run jaccard_nbor_ss("Neil", 5):
If the source vertex (person) doesn't have any common neighbors (movies) with any other vertex (person), such as Elena in our example, the result will be an empty list:
This algorithm computes the same similarity scores as the single-source Cosine similarity algorithm (cosine_nbor_ss), except that it considers ALL pairs of vertices in the graph (for the vertex and edge types selected by the user). Naturally, this algorithm will take longer to run. For very large and very dense graphs, this may not be a practical choice.
Characteristic
Value
Result
The top k vertices in the graph that have the highest similarity scores, along with their scores.
The result is available in three forms:
streamed out in JSON format
written to a file in tabular format, or
stored as a vertex attribute value.
Input Parameters
SET<STRING> v_type
: Vertex type to calculate similarity score for
SET<STRING> e_type
: Edge type to traverse
SET<STRING> re_type
: Reverse edge type to traverse
INT top_k
: the number of vertex pairs with the highest similarity scores to return
BOOL print_accum
: Boolean value that decides whether to output to console
STRING similarity_edge
: If provided, the similarity score will be saved to this edge
STRING filepath:
If provided, the algorithm will output to the file path in CSV format
Result Size
top_k
Time Complexity
O(D^2), D = outdegree of vertex v
Graph Types
Undirected or directed edges, unweighted edges
Characteristic | Value |
Result | The top k vertex pairs in the graph which have the highest similarity scores, along with their scores. The result is available in three forms:
|
Input Parameters |
|
Result Size |
|
Time Complexity | O(E), E = number of edges |
Graph Types | Undirected or directed edges, weighted edges |
The Total Neighbors algorithm calculates the total number of neighbors of two vertices.
The total number of neighbors of two vertices.
Suppose we have the following graph.
Dan and Jenny together have 6 neighbors in total, so running the algorithm between Dan and Jenny gives a result of 6. Note that since Jenny and Dan are neighbors themselves, the union of their neighbors includes both Jenny and Dan:
The Adamic/Adar index is a measure introduced in 2003 by Lada Adamic and Eytan Adar to predict links in a social network based on the number of shared links between two nodes. It is defined as the sum of the inverse logarithmic degree centrality of the neighbors shared by the two nodes.
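A(u, v) = Σ_{w ∈ N(u) ∩ N(v)} 1 / log |N(w)|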
where N(u) is the set of nodes adjacent to u.
The Adamic/Adar index between two vertices. If the two vertices have no common neighbors, the algorithm returns a division-by-0 error.
Suppose we have the graph below:
Running the algorithm between Jenny and Dan gives a result of 1/log(2) = 3.321931.
Name | Description | Data type |
| A vertex. |
|
| A vertex. |
|
| Edge types to traverse. |
|
Name | Description | Data type |
| A vertex. |
|
| A vertex. |
|
| Edge types to traverse. |
|