Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
In addition to the regular strongly connected component algorithm, we also provide a version that is optimized for small-world graphs. A small world graph in this context means the graph has a hub community, where a vast majority of the vertices of the graph are weakly connected.
This version improves upon the performance of the original algorithm when dealing with small-world graphs by combining several different methods used to find connected components in a multi-step process proposed by Slota et al. in BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
The algorithm starts by trimming the graph, which removes all vertices whose indegree or outdegree is 0. Then in the second phase, the algorithm selects an initial pivot vertex v with a high product of indegree and outdegree. From the initial pivot vertex , the algorithm uses one iteration of the Forward-Backward method to identify all vertices reachable by v (descendants) and all vertices that can reach v (predecessors). The intersection of the descendants and the predecessors form a strongly connected component (SCC). The vertices that are not included in this SCC are passed off to the next step.
After identifying the first SCC, the algorithm uses the coloring method and Tarjan's serial algorithm to identify the SCCs in the remaining vertices.
For more details, see Slota et al., BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
The algorithm has a time complexity of O(V + E), where V is the number of vertices and E is the number of edges in the graph.
When to_show_cc_count
is set to true, the algorithm will return the number of strongly connected components in the graph.
Suppose we have the following graph. We can see there are 7 strongly connected components. With two of them containing more than 1 vertices. The five vertices on the left are each a strongly connected component individually.
Running the algorithm on the graph will return a result of 7:
Name
Description
Data type
v_type
The vertex type to count as part of a strongly connected component
STRING
e_type
The edge type to traverse
STRING
re_type
The reverse edge type to traverse. If the graph is undirected, fill in the name of the undirected edge here as well as for e_type.
STRING
threshold
The threshold used to choose initial pivot vertices. Only vertices whose product of indegree and outdegree exceed this threshold will be considered candidates for the pivot vertex. This is an attempt to increase the chances that the initial pivot is contained within the largest SCC.
The default value for this parameter is 100000. It is suggested that you keep this default value when running the algorithm.
UINT
to_show_cc_count
If set to TRUE
, the algorithm will return the number of vertices in each strongly connected component.
BOOL
The Louvain Method for community detection [1] partitions the vertices in a graph by approximately maximizing the graph's modularity score. The modularity score for a partitioned graph assesses the difference in density of links within a partition vs. the density of links crossing from one partition to another. The assumption is that if a partitioning is good (that is, dividing up the graph into communities or clusters), then the within-density should be high and the inter-density should be low.
The most efficient and empirically effective method for calculating modularity was published by a team of researchers at the University of Louvain. The Louvain method uses agglomeration and hierarchical optimization:
Optimize modularity for small local communities.
Treat each optimized local group as one unit, and repeat the modularity operation for groups of these condensed units.
The original Louvain Method contains two phases. The first phase incrementally calculates the modularity change of moving a vertex into every other community and moves the vertex to the community with the highest modularity change. The second phase coarsens the graph by aggregating the vertices which are assigned in the same community into one vertex. The first phase and second phase make up a pass. The Louvain Method performs the passes iteratively. In other words, the algorithm assigns an initial community label to every vertex, then performs the first phase, during which the community labels are changed until there is no modularity gain. Then it aggregates the vertices with the same labels into one vertex and calculates the aggregated edge weights between new vertices. For the coarsened graph, the algorithm conducts the first phase again to move the vertices into new communities. The algorithm continues until the modularity is not increasing, or runs to the preset iteration limits.
However, phase one is sequential, and thus slow for large graphs. An improved Parallel Louvain Method Louvain Method (PLM) calculates the best community to move to for each vertex in parallel [2]. In Parallel Louvain Method(PLM), the positive modularity gain is not guaranteed, and it may also swap two vertices to each other’s community. After finishing the passes, there is an additional refinement phase, which is running the first phase again on each vertex to do some small adjustments for the resulting communities. [3].
[1] Blondel, Vincent D., et al. "Fast unfolding of communities in large networks." Journal of statistical mechanics: theory and experiment 2008.10 (2008): P10008.
[2] Staudt, Christian L., and Henning Meyerhenke. "Engineering parallel algorithms for community detection in massive networks." IEEE Transactions on Parallel and Distributed Systems 27.1 (2016): 171-184.
[3] Lu, Hao, Mahantesh Halappanavar, and Ananth Kalyanaraman. "Parallel heuristics for scalable community detection." Parallel Computing 47 (2015): 19-37.
If we use louvain_parallel
for social10 graph, it will give the same result as the connected components algorithm. The social26 graph is a densely connected graph. The connected components algorithm groups all the vertices into the same community and label propagation does not consider the edge weight. On the contrary, louvain_parallel
detects 7 communities in total, and the cluster distribution is shown below (csize
is cluster size):
Characteristic
Value
Result
Assigns a component id (INT) to each vertex, such that members of the same component have the same id value. The JSON output lists every vertex with its community ID value. It also lists community id values, sorted by community size.
Input Parameters
SET<STRING> v_type
: Names of vertex types to use
SET<STRING> e_type
: Names of edge types to use
STRING wt_attr
: Name of edge weight attribute (must be FLOAT)
INT iter1
: Max number of iterations for the first phase. Default value is 10
INT iter2
: Max number of iterations for the second phase. Default value is 10
INT iter3
: Max number of iterations for the refinement phase. Default value is 10
INT split
: Number of splits in phase 1. Increase the number to save memory, at the expense of having a longer running time. Default value is 10.
BOOL print_accum
: If True, output JSON to standard output
STRING result_attr
: If not empty, store community id values (INT) to this attribute
STRING file_path
: If not empty, write output to this file.
BOOL comm_by_size
: If True, and if print_accum
is True, output the membership of each community, with communities arranged by size.
Result Size
V = number of vertices
Time Complexity
O(V^2*L), V = number of vertices, L = (iter1 * iter2 + iter3) = total number of iterations
Graph Types
Undirected, weighted edges
An edge weight attribute is required.
In addition to the regular weakly connected component algorithm, we also provide a version that is optimized for small-world graphs. A small world graph means the graph has a hub community, where the vast majority of the vertices in the graph are weakly connected.
This version improves upon the performance of the original algorithm when dealing with small-world graphs by combining several different methods used to find connected components in a multi-step process proposed by Slota et al. in BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
The algorithm starts by selecting an initial pivot vertex v with a high product of indegree and outdegree. From the initial pivot vertex , the algorithm uses Breadth-First Search to determine the massive weakly connected component. The vertices that are not included in this SCC are passed off to the next step.
After identifying the first WCC, the algorithm uses the coloring method to idenify the WCCs in the remaining vertices.
For more details, see Slota et al., BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
If to_show_cc_count
is set to true, the algorithm will return the number of vertices in each weakly connected component.
Suppose we have the following graph. We can see that there are three connected components. The first one has 5 vertices, while the two others have 3 vertices.
Running the algorithm on the graph will show that there are three weakly connected components, and have 5, 3, and 3 vertices respectively.
Name
Description
Data type
v_type
The vertex type to count as part of a connected component
STRING
e_type
The edge type to traverse
STRING
threshold
The threshold used to choose initial pivot vertices. Only vertices whose product of indegree and outdegree exceed this threshold will be considered candidates for the pivot vertex. This is an attempt to increase the chances that the initial pivot is contained within the largest WCC.
The default value for this parameter is 100000. It is suggested that you keep this default value when running the algorithm.
UINT
to_show_cc_count
If true, the algorithm will return the number of vertices in each connected component.
BOOL
A component is the maximal set of vertices, plus their connecting edges, which are interconnected. That is, you can reach each vertex from each other vertex. In the example figure below, there are three components.
This particular algorithm deals with undirected edges. If the same definition (each vertex can reach each other vertex) is applied to directed edges, then the components are called Strongly Connected Components. If you have directed edges but ignore the direction (permitting traversal in either direction), then the algorithm finds Weakly Connected Components.
It is easy to see in this small graph that the algorithm correctly groups the vertices:
Alex, Bob, and Justin all have Community ID = 2097152
Chase, Damon, and Eddie all have Community ID = 5242880
Fiona, George, Howard, and Ivy all have Community ID = 0
Our algorithm uses the TigerGraph engine's internal vertex ID numbers; they cannot be predicted.
A k-core of a graph is a maximal connected subgraph in which every vertex is connected to at least k vertices in the subgraph. To obtain the k-core of a graph, the algorithm first deletes the vertices whose outdegree is less than k. It then updates the outdegree of the neighbors of the deleted vertices, and if that causes a vertex's outdegree to fall below k, it will also delete that vertex. The algorithm repeats this operation until every vertex left in the subgraph has an outdegree of at least k.
O(E), where E is the number of edges in the graph.
If we run the kcore
algorithm on this small graph like so:
Here is the returned JSON response, which includes a 2-core that is comprised of Dan, Jenny, and Tom:
Our algorithm takes a range of values for k and returns the set of the vertices that constitute the k-core with the highest possible value of k within the range. It is an implementation of Algorithm 2 in .
In the example below based on the graph from GSQL 101, we can see that Dan, Tom, and Jenny make up a 2-core, which is the max-core of the graph:
Characteristic
Value
Result
Assigns a component id (INT) to each vertex, such that members of the same component have the same id value.
Input Parameters
SET<STRING> v_type
: Names of vertex types to use
SET<STRING> e_type
: Names of edge types to use
INT output_limit
: If >=0, max number of vertices to output to JSON.
BOOL print_accum
: If True, output JSON to standard output
STRING result_attr
: If not empty, store community ID values (INT) to this attribute
STRING file_path
: If not empty, write output to this file.
Result Size
V = number of vertices
Time Complexity
O(E*d), E = number of edges, d = max(diameter of components)
Graph Types
Undirected edges
Parameter | Description |
| Vertex type to include in the k-core |
| Edge type to count for k-core connections |
| Minimum value of k. If the actual maximum core is below |
| Maximum value of k. If |
| If |
| The k-shell is the set of vertices that are part of the k-core but not part of the (k+1)-core. If |
| Boolean value that decides whether the algorithm will return output in JSON |
| Optional. An attribute of the vertex to save the core level of the vertex to. If |
| Optional. If |
Label Propagation is a heuristic method for determining communities. The idea is simple: If the plurality of your neighbors all bear the label X, then you should label yourself as also a member of X. The algorithm begins with each vertex having its own unique label. Then we iteratively update labels based on the neighbor influence described above. It is important that the order for updating the vertices be random. The algorithm is favored for its efficiency and simplicity, but it is not guaranteed to produce the same results every time.
In a variant version, some vertices could initially be known to belong to the same community. If they are well-connected to one another, they are likely to preserve their common membership and influence their neighbors,
This is the same graph that was used in the Connected Component example. The results are different, though. The quartet of Fiona, George, Howard, and Ivy have been split into 2 groups:
(George & Ivy) each connect to (Fiona & Howard) and to one another.
(Fiona & Howard) each connect to (George & Ivy) but not to one another.
Label Propagation tries to find natural clusters and separations within connected components. That is, it looks at the quality and pattern of connections. The Component Component algorithm simply asks the Yes or No question: Are these two vertices connected?
We set max_iter
to 10, but the algorithm reaches a steady state after 3 iterations:
A strongly connected component (SCC) is a subgraph such that there is a path from any vertex to every other vertex. A graph can contain more than one separate SCC. An SCC algorithm finds the maximal SCCs within a graph. Our implementation is based on the Divide-and-Conquer Strong Components (DCSC) algorithm[1]. In each iteration, pick a pivot vertex v
randomly, and find its descendant and predecessor sets, where descendant set D_v
is the vertex reachable from v
, and predecessor set P_v
is the vertices which can reach v
(stated another way, reachable from v
through reverse edges). The intersection of these two sets is a strongly connected component SCC_v
. The graph can be partitioned into 4 sets: SCC_v
, descendants D_v
excluding SCC_v
, predecessors P_v
excluding SCC
, and the remainders R_v
. It is proved that any SCC is a subset of one of the 4 sets [1]. Thus, we can divide the graph into different subsets and detect the SCCs independently and iteratively.
The problem of this algorithm is unbalanced load and slow convergence when there are a lot of small SCCs, which is often the case in real-world use cases [3]. We added two trimming stages to improve the performance: size-1 SCC trimming[2] and weakly connected components[3].
The implementation of this algorithm requires reverse edges for all directed edges considered in the graph.
[1] Fleischer, Lisa K., Bruce Hendrickson, and Ali Pınar. "On identifying strongly connected components in parallel." International Parallel and Distributed Processing Symposium. Springer, Berlin, Heidelberg, 2000.
[2] Mclendon Iii, William, et al. "Finding strongly connected components in distributed graphs." Journal of Parallel and Distributed Computing 65.8 (2005): 901-910.
[3] Hong, Sungpack, Nicole C. Rodia, and Kunle Olukotun. "On fast parallel detection of strongly connected components (SCC) in small-world graphs." Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013.
We ran scc
on the social26 graph. A portion of the JSON result is shown below.
The first element "i"=1
means the whole graph is processed in just one iteration. The 5 "trim_set.size()"
elements mean there were 5 rounds of size-1 SCC trimming. The final "@@.cluster_dist_heap" object"
reports on the size distribution of SCCs.There is one SCC with 9 vertices, and 17 SCCs with only 1 vertex in the graph.
The Speaker-listener Label Propagation Algorithm (SLPA) is a variation of the Label Propagation algorithm that is able to detect overlapping communities. The main difference between LPA and SLPA is that each node can only hold a single label in LPA while it is allowed to possess multiple labels in SLPA.
The algorithm begins with each vertex having its own unique label. It then iteratively records labels in a local accumulator based on specific speaking and listening rules. Then the post-processing of the record labels is applied. Finally, the algorithm removes the nested communities and outputs all the communities. Note that it is not guaranteed to produce the same results every time.
In the example below, we run the tg_slpa
algorithm on the social10 graph. We set max_iter = 10 and threshold = 0.1.
The Local Clustering Coefficient algorithm computes the local clustering coefficient of every vertex in a graph. The local clustering coefficient of a vertex (node) in a graph quantifies how close its neighbors are to being a complete graph, where every two distinct vertices are connected by an edge. It is obtained by dividing the number edges between a vertex's neighbors by the number of edges that could possibly exist.
The algorithm does not report the local clustering coefficient for vertices that have only one neighbor.
O(n)
, where n
is the number of vertices in the graph.
Using the social
graph below as an example, Jenny has three neighbors - Tom, Dan and Amily, but only Tom and Dan are connected. Between the three neighbors, the maximum number of edges that could exist is 3. Therefore, the local clustering coefficient for Jenny is 1/3.
On the other hand, Tom has two neighbors that are connected by one edge. Since the maximum number of edges that could exist between two vertices is 1, Tom has a local clustering coefficient of 1/1 = 1.
By running the algorithm in GSQL, we can confirm that Tom has the highest local clustering coefficient with a score of 1 and Jenny has a score of 1/3.
Why triangles? Think of it in terms of a social network:
If A knows B, and A also knows C, then we complete the triangle if B knows C. If this situation is common, it indicates a community with a lot of interaction.
The triangle is in fact the smallest multi-edge "complete subgraph," where every vertex connects to every other vertex.
Triangle count (or density) is a measure of community and connectedness. In particular, it addresses the question of transitive relationships: If A--> B and B-->C, then what is the likelihood of A--> C?
Note that it is computing a single number: How many triangles are in this graph? It is not finding communities within a graph.
It is not common to count triangles in directed graphs, though it is certainly possible. If you choose to do so, you need to be very specific about the direction of interest: In a directed graph, If A--> B and B--> C, then
if A-->C, we have a "shortcut".
if C-->A, then we have a feedback loop.
This algorithm can only be run on TigerGraph V3.1 or higher.
We present two different algorithms for counting triangles. The first, tri_count(), is the classic edge-iterator algorithm. For each edge and its two endpoint vertices S and T, count the overlap between S's neighbors and T's neighbors.
One side effect of the simple edge-iterator algorithm is that it ends up considering each of the three sides of a triangle. The count needs to be divided by 3, meaning we did 3 times more work than a smaller algorithm would have.
tri_count_fast() is a smarter algorithm which does two passes over the edges. In the first pass we mark which of the two endpoint vertices has fewer neighbors. In the second pass, we count the overlap only between marked vertices. The result is that we eliminate 1/3 of the neighborhood matching, the slowest 1/3, but at the cost of some additional memory.
In the social10 graph with Coworker edges, there are clearly 4 triangles.
For more information, see .
Characteristic
Value
Result
Assigns a component id (INT) to each vertex, such that members of the same component have the same id value.
Input Parameters
SET<STRING> v_type
: Names of vertex types to use
SET<STRING> e_type
: Names of edge types to use
INT max_iter
: Number of maximum iteration of the algorithm
INT output_limit
: If >=0, max number of vertices to output to JSON.
BOOL print_accum
: If True, output JSON to standard output
STRING attr
: If not empty, store community id values (INT) to this attribute
STRING file_path
: If not empty, write output to this file.
Result Size
V = number of vertices
Time Complexity
O(E*k), E = number of edges, k = number of iterations.
Graph Types
Undirected edges
Characteristic
Value
Result
Assigns a component id (INT) to each vertex, such that members of the same component have the same id value.
Input Parameters
SET<STRING> v_type
: Names of vertex types to use
SET<STRING> e_type
: Names of edge types to use
SET<STRING> rev_e_type
: Names of reverse direction edge types to use
INT top_k_dist
: top k result in SCC distribution
INT output_limit
: If >=0, max number of vertices to output to JSON.
INT max_iter
: number of maximum iteration of the algorithm
INT iter_wcc
: find weakly connected components for the active vertices in this iteration, since the largest SCCs are already found after several iterations; usually a small number(3 to 10)
BOOL print_accum
: If True, output JSON to standard output
STRING result_attr
: If not empty, store community ID values (INT) to this attribute
STRING file_path
: If not empty, write output to this file.
Result Size
V = number of vertices
Time Complexity
O(iter*d), d = max(diameter of components)
Graph Types
Directed edges with reverse direction edges as well
Characteristic | Value |
Result | Assigns a list of component id (INT) to each vertex, such that members of the same component have the same id value. |
Required Input Parameters |
|
Result Size | V = number of vertices |
Time Complexity | O(E*k), E = number of edges, k = number of iterations. |
Graph Types | Undirected edges |
Parameter | Description | Data type |
| Vertex type to calculate local clustering coefficient for |
|
| Edge type to traverse. Only vertices that are connected by edges of this type are treated as connected in this algorithm. |
|
| Number of the highest local clustering coefficients to report. |
|
| If |
|
| If provided, the local clustering coefficient of a vertex will be saved to this attribute. |
|
| If provided, write output to this file in CSV. |
|
| If |
|
Characteristic | Value |
Result | Returns the number of triangles in the graph. |
Input Parameters |
|
Result Size | 1 integer |
Time Complexity | O(V * E), V = number of vertices, E = number of edges |
Graph Types | Undirected edges |