A strongly connected component (SCC) is a subgraph such that there is a path from any vertex to every other vertex. A graph can contain more than one separate SCC. An SCC algorithm finds the maximal SCCs within a graph. Our implementation is based on the Divide-and-Conquer Strong Components (DCSC) algorithm[1]. In each iteration, pick a pivot vertex v
randomly, and find its descendant and predecessor sets, where descendant set D_v
is the vertex reachable from v
, and predecessor set P_v
is the vertices which can reach v
(stated another way, reachable from v
through reverse edges). The intersection of these two sets is a strongly connected component SCC_v
. The graph can be partitioned into 4 sets: SCC_v
, descendants D_v
excluding SCC_v
, predecessors P_v
excluding SCC
, and the remainders R_v
. It is proved that any SCC is a subset of one of the 4 sets [1]. Thus, we can divide the graph into different subsets and detect the SCCs independently and iteratively.
The problem of this algorithm is unbalanced load and slow convergence when there are a lot of small SCCs, which is often the case in real-world use cases [3]. We added two trimming stages to improve the performance: size-1 SCC trimming[2] and weakly connected components[3].
The implementation of this algorithm requires reverse edges for all directed edges considered in the graph.
[1] Fleischer, Lisa K., Bruce Hendrickson, and Ali Pınar. "On identifying strongly connected components in parallel." International Parallel and Distributed Processing Symposium. Springer, Berlin, Heidelberg, 2000.
[2] Mclendon Iii, William, et al. "Finding strongly connected components in distributed graphs." Journal of Parallel and Distributed Computing 65.8 (2005): 901-910.
[3] Hong, Sungpack, Nicole C. Rodia, and Kunle Olukotun. "On fast parallel detection of strongly connected components (SCC) in small-world graphs." Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013.
We ran scc
on the social26 graph. A portion of the JSON result is shown below.
The first element "i"=1
means the whole graph is processed in just one iteration. The 5 "trim_set.size()"
elements mean there were 5 rounds of size-1 SCC trimming. The final "@@.cluster_dist_heap" object"
reports on the size distribution of SCCs.There is one SCC with 9 vertices, and 17 SCCs with only 1 vertex in the graph.
Characteristic
Value
Result
Assigns a component id (INT) to each vertex, such that members of the same component have the same id value.
Input Parameters
SET<STRING> v_type
: Names of vertex types to use
SET<STRING> e_type
: Names of edge types to use
SET<STRING> rev_e_type
: Names of reverse direction edge types to use
INT top_k_dist
: top k result in SCC distribution
INT output_limit
: If >=0, max number of vertices to output to JSON.
INT max_iter
: number of maximum iteration of the algorithm
INT iter_wcc
: find weakly connected components for the active vertices in this iteration, since the largest SCCs are already found after several iterations; usually a small number(3 to 10)
BOOL print_accum
: If True, output JSON to standard output
STRING result_attr
: If not empty, store community ID values (INT) to this attribute
STRING file_path
: If not empty, write output to this file.
Result Size
V = number of vertices
Time Complexity
O(iter*d), d = max(diameter of components)
Graph Types
Directed edges with reverse direction edges as well
In addition to the regular strongly connected component algorithm, we also provide a version that is optimized for small-world graphs. A small world graph in this context means the graph has a hub community, where a vast majority of the vertices of the graph are weakly connected.
This version improves upon the performance of the original algorithm when dealing with small-world graphs by combining several different methods used to find connected components in a multi-step process proposed by Slota et al. in BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
The algorithm starts by trimming the graph, which removes all vertices whose indegree or outdegree is 0. Then in the second phase, the algorithm selects an initial pivot vertex v with a high product of indegree and outdegree. From the initial pivot vertex , the algorithm uses one iteration of the Forward-Backward method to identify all vertices reachable by v (descendants) and all vertices that can reach v (predecessors). The intersection of the descendants and the predecessors form a strongly connected component (SCC). The vertices that are not included in this SCC are passed off to the next step.
After identifying the first SCC, the algorithm uses the coloring method and Tarjan's serial algorithm to identify the SCCs in the remaining vertices.
For more details, see Slota et al., BFS and Coloring-based Parallel Algorithms for Strongly Connected Components and Related Problems.
The algorithm has a time complexity of O(V + E), where V is the number of vertices and E is the number of edges in the graph.
When to_show_cc_count
is set to true, the algorithm will return the number of strongly connected components in the graph.
Suppose we have the following graph. We can see there are 7 strongly connected components. With two of them containing more than 1 vertices. The five vertices on the left are each a strongly connected component individually.
Running the algorithm on the graph will return a result of 7:
Name
Description
Data type
v_type
The vertex type to count as part of a strongly connected component
STRING
e_type
The edge type to traverse
STRING
re_type
The reverse edge type to traverse. If the graph is undirected, fill in the name of the undirected edge here as well as for e_type.
STRING
threshold
The threshold used to choose initial pivot vertices. Only vertices whose product of indegree and outdegree exceed this threshold will be considered candidates for the pivot vertex. This is an attempt to increase the chances that the initial pivot is contained within the largest SCC.
The default value for this parameter is 100000. It is suggested that you keep this default value when running the algorithm.
UINT
to_show_cc_count
If set to TRUE
, the algorithm will return the number of vertices in each strongly connected component.
BOOL