Strongly Connected Components

Supported Graph Characteristics
 Unweighted edges Directed edges Undirected edges Homogeneous vertex types Heterogeneous vertex types

A strongly connected component (SCC) is a subgraph such that there is a path from any vertex to every other vertex. A graph can contain more than one separate SCC. This SCC algorithm finds the maximal SCCs within a graph.

Our implementation is based on the Divide-and-Conquer Strong Components (DCSC) algorithm:

• In each iteration, pick a pivot vertex `v` randomly.

• Find its descendant and predecessor sets:

• Descendant set `D_v` is the set of vertices reachable from `v` using regular edges.

• Predecessor set `P_v` is the set of vertices reachable from `v` using reverse edges.

• The intersection of these two sets is a strongly connected component `SCC_v`.

The graph can be partitioned into 4 sets:

• `SCC_v`

• Descendants `D_v` excluding `SCC_v`

• Predecessors `P_v` excluding `SCC_v`

• Remainders `R_v`.

It is proved that any SCC is a subset of one of the 4 sets.

Thus, we can divide the graph into different subsets and detect the SCCs independently and iteratively.

Notes

The implementation of this algorithm requires reverse edges for all directed edges considered in the graph.

The problem of this algorithm is unbalanced load and slow convergence when there are a lot of small SCCs, which is often the case in real-world use cases.

We added two trimming stages to improve the performance: Size-1 SCC Trimming and Weakly Connected Components.

References

DCSC Algorithm: Fleischer, Lisa K., Bruce Hendrickson, and Ali Pınar. "On identifying strongly connected components in parallel." International Parallel and Distributed Processing Symposium. Springer, Berlin, Heidelberg, 2000.

Size-1 SCC Trimming: Mclendon III, William, et al. "Finding strongly connected components in distributed graphs." Journal of Parallel and Distributed Computing 65.8 (2005): 901-910.

Weakly Connected Components: Hong, Sungpack, Nicole C. Rodia, and Kunle Olukotun. "On fast parallel detection of strongly connected components (SCC) in small-world graphs." Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013.

Specifications

``````tg_scc (SET<STRING> v_type, SET<STRING> e_type, SET<STRING> rev_e_type,
INT top_k_dist, INT output_limit, INT max_iter = 500, INT iter_wcc = 5,
BOOL print_accum = TRUE, STRING attr= "", STRING file_path="")``````

Parameters

Parameter Description Default Value

`SET<STRING> v_type`

The vertex types to use

(empty set of strings)

`SET<STRING> e_type`

The edge types to use

(empty set of strings)

`SET<STRING> rev_e_type`

The reverse edge types to use

(empty set of strings)

`INT top_k_dist`

The top k results in the SCC distribution

N/A

`INT output_limit`

The maximum number of vertices to output in JSON format.

N/A

`INT max_iter`

The maximum number of iterations of the algorithm.

500

`INT iter_wcc`

Find weakly connected components for the active vertices in this iteration, since the largest SCCs are already found after several iterations. Usually a small number (3 to 10).

5

`BOOL print_accum`

If true, print output in JSON format to the standard output.

True

`STRING result_attr`

If not empty, store community values in `INT` format to this vertex attribute

(empty string)

`STRING file_path`

If not empty, write output to this file.

(empty string)

Output

Assigns a component id (INT) to each vertex, such that members of the same component have the same ID value.

Time complexity

This algorithm has a time complexity of O(k * d), where k is equal to the number of iterations and d is equal to the maximum component diameter.

Example

Consider a graph with Person vertices and Friend and Coworker edges. The following illustration shows a group of these vertices and edges. Not shown are a number of other vertices that have no connection to other vertices.

We run the SCC query with default values, making sure to include both types of edges in the query.

A portion of the JSON result is shown below.

``````[
{
"@@cluster_dist_heap": [
{
"csize": 9,
"num": 1
},
{
"csize": 1,
"num": 17
}
]
}
]``````

The `@@.cluster_dist_heap` object reports on the size distribution of SCCs.

There is one SCC with 9 vertices, and 17 SCCs with only 1 vertex in the graph.