
Centrality Algorithms

PageRank

The PageRank algorithm measures the influence of each vertex on every other vertex. PageRank influence is defined recursively: a vertex's influence is based on the influence of the vertices which refer to it. A vertex's influence tends to increase if (1) it has more referring vertices or if (2) its referring vertices have higher influence. The analogy to social influence is clear.

A common way of interpreting a PageRank score is through the Random Network Surfer model. A vertex's PageRank score is proportional to the probability that a random network surfer will be at that vertex at any given time. A vertex with a high PageRank score is a vertex that is frequently visited, assuming that vertices are visited according to the following Random Surfer scheme:

Assume a person travels or surfs across a network's structure, moving from vertex to vertex in a long series of rounds.
The surfer can start anywhere. This start-anywhere property is part of the magic of PageRank, meaning the score is a truly fundamental property of the graph structure itself.
Each round, the surfer randomly picks one of the outward connections from the surfer's current location. The surfer repeats this random walk for a long time.
But wait. The surfer doesn't always follow the network's connection structure. There is a probability (1 - damping, to be precise) that the surfer will ignore the structure and will magically teleport to a random vertex.
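
The random-surfer scheme above can be sketched as an iterative score update. The Python below is a minimal sketch and not the tg_pageRank query itself: the function name `pagerank`, the toy edge list, and the unnormalized update rule (each score starts at 1 and is updated to (1 - damping) plus damping times the sum of each in-neighbor's score divided by that neighbor's out-degree) are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, max_change=0.001, max_iter=25):
    """Iterative PageRank on a directed edge list [(src, dst), ...]."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    score = {v: 1.0 for v in vertices}  # assumption: every vertex starts at 1
    for _ in range(max_iter):
        received = {v: 0.0 for v in vertices}
        for src, dst in edges:
            # A vertex splits its current score evenly over its out-edges.
            received[dst] += score[src] / out_degree[src]
        # (1 - damping) models the teleport step of the random surfer.
        new_score = {v: (1 - damping) + damping * received[v] for v in vertices}
        biggest_change = max(abs(new_score[v] - score[v]) for v in vertices)
        score = new_score
        if biggest_change < max_change:
            break  # scores have stabilized
    return score


# Hypothetical toy graph: A has no in-edges, B and C point at each other.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "B")]
scores = pagerank(toy_edges)
```

Under this update rule a vertex with no in-edges settles at 1 - damping, because it only ever receives the teleport share.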

For more information, see the Google paper on PageRank.

Specifications

tg_pageRank (STRING v_type, STRING e_type,
  FLOAT max_change=0.001, INT max_iter=25, FLOAT damping=0.85, INT top_k = 100,
  BOOL print_accum = TRUE, STRING result_attr = "", STRING file_path = "",
  BOOL display_edges = FALSE)

Example

# Use _ for default values
RUN QUERY tg_pageRank("Person", "Friend", 0.001, 25, 0.85, 100, _, _, _, _)

We ran PageRank on our test10 graph (using Friend edges) with the following parameter values: damping=0.85, max_change=0.001, and max_iter=25. We see that Ivy (center bottom) has the highest PageRank score (1.12). This makes sense since there are 3 neighboring persons who point to Ivy, more than for any other person. Eddie and Justin have scores of exactly 1 because they do not have any out-edges. This is an artifact of our particular version of PageRank. Likewise, Alex has a score of 0.15, which is (1 - damping), because Alex has no in-edges.

Article Rank (Beta)

ArticleRank is an algorithm that has been derived from the PageRank algorithm to measure the influence of journal articles.

PageRank assumes that relationships originating from low-degree nodes have a higher influence than relationships from high-degree nodes. Article Rank modifies the formula in such a way that it retains the basic PageRank methodology but lowers the influence of low-degree nodes.

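The Article Rank of a node v at iteration i is defined as:

$ArticleRank_i(v) = (1 - d) + d \sum_{w \in N_{in}(v)} \frac{ArticleRank_{i-1}(w)}{|N_{out}(w)| + \overline{N_{out}}}$

Within the formula:

$N_{in}(v)$ are the incoming neighbors and $N_{out}(v)$ are the outgoing neighbors of node v.
d is a damping factor in [0, 1], usually set to 0.85.
$\overline{N_{out}}$ is the average out-degree.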
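
The effect of adding the average out-degree to the denominator can be seen in a short Python sketch. This is an illustration under stated assumptions, not the tg_article_rank query: the function name `article_rank`, the starting score of 1, and the toy citation list are hypothetical. Each in-neighbor w contributes its previous score divided by (out-degree of w + average out-degree), so a low-degree neighbor passes on less of its score than it would under plain PageRank.

```python
def article_rank(edges, damping=0.85, max_iter=25):
    """ArticleRank-style iteration on a directed edge list [(src, dst), ...]."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1
    avg_out_degree = sum(out_degree.values()) / len(vertices)

    score = {v: 1.0 for v in vertices}  # assumption: every vertex starts at 1
    for _ in range(max_iter):
        received = {v: 0.0 for v in vertices}
        for src, dst in edges:
            # The average out-degree in the denominator damps the
            # contribution of vertices with few out-edges.
            received[dst] += score[src] / (out_degree[src] + avg_out_degree)
        score = {v: (1 - damping) + damping * received[v] for v in vertices}
    return score


# Hypothetical citation graph: P1 and P2 are never cited.
citations = [("P1", "P3"), ("P2", "P3"), ("P3", "P4"), ("P1", "P4")]
ranks = article_rank(citations)
```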

For more information, see ArticleRank: a PageRank‐based alternative to numbers of citations for analysing citation networks.

Specifications

CREATE QUERY tg_article_rank (STRING v_type, STRING e_type,
 FLOAT max_change=0.001, INT max_iter=25, FLOAT damping=0.85, INT top_k = 100,
 BOOL print_accum = TRUE, STRING result_attr =  "", STRING file_path = "")

Parameters

Return value

The article rank score for each vertex.

Example

Suppose we have the following graph:

By running Article Rank on the graph, we will see that the vertex with the highest score is Dan:

RUN QUERY tg_article_rank ("person", "friendship", _, _, _, _, _)

{
  "error": false,
  "message": "",
  "version": {
    "schema": 2,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"@@topScores": [
    {
      "score": 2348294.75,
      "Vertex_ID": "Dan"
    },
    {
      "score": 1863160.625,
      "Vertex_ID": "Jenny"
    },
    {
      "score": 1442890.5,
      "Vertex_ID": "Tom"
    },
    {
      "score": 1053484.625,
      "Vertex_ID": "Nancy"
    },
    {
      "score": 739327.9375,
      "Vertex_ID": "Kevin"
    },
    {
      "score": 703562.75,
      "Vertex_ID": "Amily"
    },
    {
      "score": 498013.25,
      "Vertex_ID": "Jack"
    }
  ]}]
}

Weighted PageRank

The only difference between weighted PageRank and standard PageRank is that edges have weights, and the influence that a vertex receives from an in-neighbor is multiplied by the weight of the in-edge.
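
A minimal Python sketch of this idea follows. It is not the tg_pageRank_wt query: the function name `weighted_pagerank` and the toy edges are hypothetical, and dividing each weight by the source vertex's total outgoing weight is an assumption made here so that a vertex distributes exactly its own score.

```python
def weighted_pagerank(edges, damping=0.85, max_iter=25):
    """PageRank on weighted directed edges [(src, dst, weight), ...]."""
    vertices = sorted({v for src, dst, _ in edges for v in (src, dst)})
    total_out_weight = {v: 0.0 for v in vertices}
    for src, _, weight in edges:
        total_out_weight[src] += weight

    score = {v: 1.0 for v in vertices}  # assumption: every vertex starts at 1
    for _ in range(max_iter):
        received = {v: 0.0 for v in vertices}
        for src, dst, weight in edges:
            # The in-edge weight scales the influence passed along the edge
            # (normalized by the source's total out-weight, an assumption).
            received[dst] += score[src] * weight / total_out_weight[src]
        score = {v: (1 - damping) + damping * received[v] for v in vertices}
    return score


# Hypothetical graph: A favors B over C by a 3:1 weight ratio.
weighted_edges = [("A", "B", 3.0), ("A", "C", 1.0), ("B", "A", 1.0), ("C", "A", 1.0)]
result = weighted_pagerank(weighted_edges)
```

In this toy graph B and C have the same single in-neighbor, so the difference in their scores comes only from the edge weights.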

Specifications

tg_pageRank_wt (SET<STRING> v_type, SET<STRING> e_type, STRING wt_attr,
  FLOAT max_change=0.001, INT max_iter=25, FLOAT damping=0.85, INT top_k=100,
   BOOL print_accum = TRUE, STRING result_attr =  "", STRING file_path = "",
   BOOL display_edges = FALSE)

Personalized PageRank

In the original PageRank, the damping factor is the probability that the surfer continues browsing at each step. The surfer may also stop browsing and start again from a random vertex. In personalized PageRank, the surfer can only start browsing from a given set of source vertices, both at the beginning and after stopping.
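
The restriction on where the surfer may restart can be sketched in Python as follows. This is an illustration, not the tg_pageRank_pers query: the function name `personalized_pagerank`, the initial scores (1 for source vertices, 0 elsewhere), and the toy graph are assumptions. The only change from a plain PageRank iteration is that the (1 - damping) restart share is given to source vertices only.

```python
def personalized_pagerank(edges, sources, damping=0.85, max_iter=25):
    """PageRank where the surfer can only (re)start at the given sources."""
    sources = set(sources)
    vertices = sorted({v for edge in edges for v in edge} | sources)
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    # Assumption: all initial score sits on the source vertices.
    score = {v: (1.0 if v in sources else 0.0) for v in vertices}
    for _ in range(max_iter):
        received = {v: 0.0 for v in vertices}
        for src, dst in edges:
            received[dst] += score[src] / out_degree[src]
        score = {
            # Only source vertices get the restart share.
            v: ((1 - damping) if v in sources else 0.0) + damping * received[v]
            for v in vertices
        }
    return score


# Hypothetical graph: D points into the cycle but cannot be reached from A.
toy_edges = [("A", "B"), ("A", "C"), ("B", "A"), ("C", "B"), ("D", "A")]
pers_scores = personalized_pagerank(toy_edges, sources=["A"])
```

A vertex that cannot be reached from any source never receives score, so it stays at 0.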

Specifications

tg_pageRank_pers(SET<VERTEX> source, STRING e_type,
FLOAT max_change=0.001, INT max_iter=25, FLOAT damping = 0.85, INT top_k = 100,
BOOL print_accum = TRUE, STRING result_attr = "", STRING file_path = "")

Example

We ran Personalized PageRank on the graph social10 using Friend edges with the following parameter values:

# Using "_" to use default values
RUN QUERY tg_pageRank_pers([("Fiona","Person")], "Friend", _, _, _, _, _, _,
_)

In this case, the random walker can only start or restart walking from Fiona. In the figure below, we see that Fiona has the highest PageRank score in the result. Ivy and George have the next highest scores because they are direct out-neighbors of Fiona and there are looping paths that lead back to them again. Half of the vertices have a score of 0 since they cannot be reached from Fiona.

Betweenness Centrality

The Betweenness Centrality of a vertex is defined as the number of shortest paths that pass through this vertex, divided by the total number of shortest paths. That is,

$BC(v) =\sum_{s \ne v \ne t}PD_{st}(v)= \sum_{s \ne v \ne t} SP_{st}(v)/SP_{st} ,$

where $PD$ is called the pair dependency, $SP_{st}$ is the total number of shortest paths from node s to node t and $SP_{st}(v)$ is the number of those paths that pass through v.

The TigerGraph implementation is based on A Faster Algorithm for Betweenness Centrality by Ulrik Brandes, Journal of Mathematical Sociology 25(2):163-177, (2001). For every vertex s in the graph, the pair dependency starting from vertex s to all other vertices t via all other vertices v is computed first,

$PD_{s*}(v) = \sum_{t:t \in V} PD_{st}(v)$ .

Then betweenness centrality is computed as

$BC(v) =\sum_{s:s \in V}PD_{s*}(v)/2$ .

According to Brandes, the accumulated pair dependency can be calculated as

$PD_{s*}(v) =\sum_{w:v \in P_s(w)} SP_{sv}/SP_{sw} \cdot (1+PD_{s*}(w)) ,$

where $P_s(w)$ , the set of predecessors of vertex w on shortest paths from s, is defined as

$P_s(w) = \{u \in V: \{u, w\} \in E, dist(s,w) = dist(s,u)+dist(u,w) \} .$

For every source vertex s, the algorithm works in two phases. The first phase counts the shortest paths from s to each other vertex. The second phase initializes every pair dependency to 0, starts from the vertices in the outermost layer (those farthest from s), and traverses back to s in order of non-increasing distance, accumulating the pair dependencies along the way.
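
The two phases can be sketched in Python for an unweighted, undirected graph. This is a minimal illustration of the Brandes-style accumulation described above, not the tg_betweenness_cent query or its bc_subquery; the function name `betweenness` and the adjacency-dict input format are assumptions.

```python
from collections import deque


def betweenness(adjacency):
    """Betweenness centrality for an undirected graph {vertex: [neighbors]}."""
    bc = {v: 0.0 for v in adjacency}
    for s in adjacency:
        # Phase 1: BFS from s, counting shortest paths and recording predecessors.
        dist = {s: 0}
        num_paths = {v: 0 for v in adjacency}
        num_paths[s] = 1
        preds = {v: [] for v in adjacency}
        order = []  # vertices in order of non-decreasing distance from s
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    num_paths[w] += num_paths[v]
                    preds[w].append(v)

        # Phase 2: walk back from the farthest vertices, accumulating
        # PD_{s*}(v) += SP_sv / SP_sw * (1 + PD_{s*}(w)).
        dependency = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += num_paths[v] / num_paths[w] * (1 + dependency[w])
            if w != s:
                bc[w] += dependency[w]

    # Each undirected pair (s, t) was counted from both ends, so halve.
    return {v: total / 2 for v, total in bc.items()}


# Hypothetical path graph A - B - C - D.
path_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
bc_scores = betweenness(path_graph)
```

On the path graph, the shortest paths A-C and A-D pass through B, and A-D and B-D pass through C, so both inner vertices score 2 and the endpoints score 0.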

This algorithm query employs a subquery called bc_subquery. Both queries are needed to run the algorithm.

Specifications

CREATE QUERY tg_betweenness_cent(SET<STRING> v_type, SET<STRING> e_type,
STRING re_type, INT max_hops=10, INT top_k=100, BOOL print_accum = True,
STRING result_attr = "", STRING file_path = "", BOOL display_edges = FALSE)

Parameters

Example

In the example below, Claire is in the very center of the graph and has the highest betweenness centrality. Six shortest paths pass through Sam (i.e. paths from Victor to all other 6 people except for Sam and Victor), so the score of Sam is 6. David also has a score of 6, since Brian has 6 paths to other people that pass through David.

# Use _ for default values
RUN QUERY tg_betweenness_cent(["Person"], ["Friend"], _, _, _, _, _, _)

[
  {
    "@@BC": {
      "Alice": 0,
      "Frank": 0,
      "Claire": 17,
      "Sam": 6,
      "Brian": 0,
      "David": 6,
      "Richard": 0,
      "Victor": 0
    }
  }
]

In the following example, both Charles and David have 9 shortest paths passing through them. Ellen is in a similar position to Charles, but her centrality is weakened due to the path between Frank and Jack.

[
  {
    "@@BC": {
      "Alice": 0,
      "Frank": 0,
      "Charles": 9,
      "Ellen": 8,
      "Brian": 0,
      "David": 9,
      "Jack": 0
    }
  }
]

Eigenvector Centrality (Beta)

Eigenvector centrality (also called eigencentrality or prestige score) is a measure of the influence of a vertex in a network. Relative scores are assigned to all vertices in the network based on the concept that connections to high-scoring vertices contribute more to the score of the vertex in question than equal connections to low-scoring vertices. A high eigenvector score means that a vertex is connected to many vertices that themselves have high scores.
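
One common way to compute such scores is power iteration: repeatedly replace each vertex's score with the sum of its neighbors' scores, then rescale. The Python below is a sketch under assumptions, not the library query: the function name `eigenvector_centrality`, the adjacency-dict input, and the toy graph are hypothetical, and the plain iteration shown is only expected to converge on a connected, non-bipartite undirected graph with at least one edge.

```python
import math


def eigenvector_centrality(adjacency, max_iter=100, tolerance=1e-9):
    """Power iteration on an undirected graph {vertex: [neighbors]}."""
    score = {v: 1.0 for v in adjacency}
    for _ in range(max_iter):
        # A vertex's new score is the sum of its neighbors' current scores.
        new_score = {v: sum(score[u] for u in adjacency[v]) for v in adjacency}
        # Rescale to unit Euclidean length so the values stay bounded.
        norm = math.sqrt(sum(x * x for x in new_score.values()))
        new_score = {v: x / norm for v, x in new_score.items()}
        change = max(abs(new_score[v] - score[v]) for v in adjacency)
        score = new_score
        if change < tolerance:
            break
    return score


# Hypothetical graph: triangle A-B-C with an extra vertex D attached to A.
toy_graph = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B"], "D": ["A"]}
ev_scores = eigenvector_centrality(toy_graph)
```

In this toy graph D has only one connection, but it is to the best-connected vertex, which illustrates that a score depends on who the neighbors are and not only on how many there are.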

Specification

Parameters

Return value

The vertices with the highest Eigenvector centrality scores along with their score.

Example

Suppose we have the following graph:

Running the algorithm on the graph will show that Dan has the highest centrality score.

Degree Centrality (Beta)

Degree centrality is defined as the number of edges incident upon a node (i.e., the number of ties that a node has). The degree can be interpreted in terms of the immediate risk of a node for catching whatever is flowing through the network (such as a virus, or some information).
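
Counting incident edges is simple enough to sketch in a few lines of Python. The function name `degree_centrality` and the toy edge list are hypothetical; the two boolean flags are modeled loosely on the in_degree and out_degree parameters of the query signature, and this is not the query's actual implementation.

```python
def degree_centrality(edges, in_degree=True, out_degree=True):
    """Count incident edges per vertex on a directed edge list [(src, dst), ...]."""
    score = {}
    for src, dst in edges:
        score.setdefault(src, 0)
        score.setdefault(dst, 0)
        if out_degree:
            score[src] += 1  # edge leaves src
        if in_degree:
            score[dst] += 1  # edge arrives at dst
    return score


# Hypothetical directed edges.
toy_edges = [("A", "B"), ("A", "C"), ("B", "A")]
both = degree_centrality(toy_edges)
incoming_only = degree_centrality(toy_edges, out_degree=False)
```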

Specification

CREATE QUERY tg_degree_cent(SET<STRING> v_type, SET<STRING> e_type, 
  SET<STRING> re_type, BOOL in_degree = TRUE, BOOL out_degree = TRUE,
  INT top_k=100, BOOL print_accum = True, STRING result_attr = "",
  STRING file_path = "")

Parameters

Return value

The vertices with the highest degree centrality scores along with their scores.

Example

Suppose we have the following graph:

Running the query on the graph will show that Dan has the highest degree centrality:

RUN QUERY tg_degree_cent(["person"], ["friendship"],["friendship"])

{
  "error": false,
  "message": "",
  "version": {
    "schema": 2,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"top_scores": [
    {
      "score": 8,
      "Vertex_ID": "Dan"
    },
    {
      "score": 6,
      "Vertex_ID": "Jenny"
    },
    {
      "score": 4,
      "Vertex_ID": "Nancy"
    },
    {
      "score": 2,
      "Vertex_ID": "Kevin"
    },
    {
      "score": 2,
      "Vertex_ID": "Amily"
    },
    {
      "score": 2,
      "Vertex_ID": "Jack"
    }
  ]}]
}

Closeness Centrality

We all have an intuitive understanding when we say a home, an office, or a store is "centrally located." Closeness Centrality provides a precise measure of how "centrally located" a vertex is. The steps below show the calculation for one vertex v:

TigerGraph's closeness centrality algorithm uses multi-source breadth-first search (MS-BFS) to traverse the graph and calculate the sum of a vertex's distance to every other vertex in the graph, which vastly improves the performance of the algorithm. The algorithm's implementation of MS-BFS is based on the paper The More the Merrier: Efficient Multi-source Graph Traversal by Then et al.
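
For a single vertex, the calculation can be sketched in Python with a plain single-source BFS (not the MS-BFS used by the library). The sketch assumes a common definition of closeness, the reciprocal of the average shortest-path distance from v to the vertices it can reach; the function name `closeness_centrality` and the toy graph are hypothetical.

```python
from collections import deque


def closeness_centrality(adjacency, v):
    """Closeness of v in an unweighted graph {vertex: [neighbors]}."""
    # BFS from v gives the hop distance to every reachable vertex.
    dist = {v: 0}
    queue = deque([v])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)

    reached = len(dist) - 1  # exclude v itself
    if reached == 0:
        return 0.0  # isolated vertex: no distances to average
    average_distance = sum(dist.values()) / reached
    return 1.0 / average_distance


# Hypothetical path graph A - B - C.
path_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
closeness_b = closeness_centrality(path_graph, "B")
closeness_a = closeness_centrality(path_graph, "A")
```

The middle vertex B is one hop from everything, so it has the highest closeness; A is farther away on average and scores lower.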

This algorithm query employs a subquery called cc_subquery. Both queries are needed to run the algorithm.

Specifications

tg_closeness_cent (SET<STRING> v_type, SET<STRING> e_type, INT max_hops=10,
  INT top_k=100, BOOL wf = TRUE, BOOL print_accum = True, STRING result_attr = "",
  STRING file_path = "", BOOL display_edges = FALSE)

Parameters

Example

Closeness centrality can be measured for either directed edges (from v to others) or for undirected edges. Directed graphs may seem less intuitive, however, because if the distance from Alex to Bob is 1, it does not mean the distance from Bob to Alex is also 1.

For our example, we wanted to use the topology of the Likes graph, but to have undirected edges. We emulated an undirected graph by using both Friend and Also_Friend (reverse-direction) edges.

# Use _ for default values
RUN QUERY tg_closeness_cent(["Person"], ["Friend", "Also_Friend"], _, _, 
_, _, _, _, _)

Approximate Closeness Centrality

In the Closeness Centrality algorithm, to obtain the closeness centrality score for a vertex, we measure the distance from the source vertex to every single vertex in the graph. In large graphs, running this calculation for every vertex can be highly time-consuming.

The Approximate Closeness Centrality algorithm (based on Cohen et al. 2014) calculates the approximate closeness centrality score for each vertex by combining two estimation approaches - sampling and pivoting. This hybrid estimation approach offers near-linear time processing and linear space overhead within a small relative error. It runs on graphs with unweighted edges (directed or undirected).
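
The sampling half of this idea can be illustrated with a short Python sketch. It is deliberately simplified and is not the hybrid sampling-plus-pivoting estimator from the paper or the library query: it runs BFS from k randomly chosen vertices only, scales the sampled distance sum up to all n vertices, and inverts the result. The function names, the `seed` parameter, and the assumption of a connected, undirected graph are all hypothetical choices for this example.

```python
import random
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist


def approx_closeness(adjacency, k, seed=0):
    """Estimate closeness from BFS runs started at k sampled vertices."""
    rng = random.Random(seed)
    vertices = sorted(adjacency)
    n = len(vertices)
    samples = rng.sample(vertices, k)  # requires k <= n

    # On an undirected graph dist(sample, v) == dist(v, sample), so one BFS
    # per sample contributes one distance term to every vertex.
    sampled_sum = {v: 0 for v in vertices}
    for s in samples:
        for v, d in bfs_distances(adjacency, s).items():
            sampled_sum[v] += d

    estimates = {}
    for v in vertices:
        estimated_total = sampled_sum[v] * n / k  # scale up to all n vertices
        estimates[v] = (n - 1) / estimated_total if estimated_total > 0 else 0.0
    return estimates


# Hypothetical path graph A - B - C; sampling every vertex gives the exact value.
path_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
exact_when_all_sampled = approx_closeness(path_graph, k=3)
```

With k much smaller than n, only k BFS traversals are needed instead of n, which is where the time savings come from; the price is an estimation error that depends on which vertices were sampled.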

This query uses another subquery, closeness_cent_approx_sub, which needs to be installed before closeness_approx can be installed.

Specifications

tg_closeness_approx (
    SET<STRING> v_type,
    SET<STRING> e_type,
    INT k = 100,  # sample num
    INT max_hops = 10,  # max BFS explore steps
    DOUBLE epsilon = 0.1,  # error parameter
    BOOL print_accum = true,  # output to console
    STRING file_path = "",  # output file
    INT debug = 0,  # debug flag -- 0: No LOG; 1: LOG without the sample-node bfs loop; 2: ALL LOG.
    INT sample_index = 0,  # random sample group
    INT maxsize = 1000,  # max size of connected components using exact closeness algorithm
    BOOL wf = True  # Wasserman and Faust formula
)

Parameters

Result

The result is a list of all vertices in the graph with their approximate closeness centrality score. It is available both in JSON and CSV format.

Example

Below is an example of running the algorithm on the social10 test graph and an excerpt of the response.

RUN QUERY tg_closeness_approx(["Person"], ["Friend", "Coworker"], 6, 3,
0.1, true, "", 0, 0, 100, false)

[
  {
    "Start": [
      {
        "attributes": {
          "Start.@closeness": 0.58333
        },
        "v_id": "Fiona",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Start.@closeness": 0.44444
        },
        "v_id": "Justin",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Start.@closeness": 0.53333
        },
        "v_id": "Bob",
        "v_type": "Person"
      }
]

Harmonic Centrality

The Harmonic Centrality algorithm calculates the harmonic centrality of each vertex in the graph. Harmonic Centrality is a variant of Closeness Centrality. In a (not necessarily connected) graph, the harmonic centrality reverses the sum and reciprocal operations in the definition of closeness centrality:

If your graph has many unconnected clusters, the harmonic centrality could be a better indicator of centrality than closeness centrality.
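
The difference from closeness centrality shows up clearly on a disconnected graph. The Python sketch below assumes the usual unnormalized definition, the sum of 1/dist(v, u) over every other vertex u, with unreachable vertices contributing 0. The function name `harmonic_centrality` and the toy graph are hypothetical, and because no normalization is applied the values will not match the scale of the query output.

```python
from collections import deque


def harmonic_centrality(adjacency, v):
    """Sum of reciprocal hop distances from v in {vertex: [neighbors]}."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    # Unreachable vertices never enter dist, so they add nothing to the sum
    # instead of making the result undefined.
    return sum((1.0 / d for u, d in dist.items() if u != v), 0.0)


# Hypothetical graph: path A - B - C plus an isolated vertex D.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
harmonic = {v: harmonic_centrality(toy_graph, v) for v in toy_graph}
```

Because the reciprocal is taken before summing, an unreachable vertex simply contributes nothing, and the isolated vertex D gets a well-defined score of 0.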

For more information, see Harmonic Centrality.

Specifications

CREATE QUERY harmonic_cent(SET<STRING> v_type, SET<STRING> e_type, 
SET<STRING> re_type,INT max_hops=10, INT top_k=100, BOOL wf = TRUE, 
BOOL print_accum = True, STRING result_attr = "", STRING file_path = "", 
BOOL display_edges = FALSE)

Parameters

Example

If we have the following graph, we can see that Ivy is the most central of the five vertices. Running the algorithm on the graph shows that Ivy has the highest centrality score:

RUN QUERY harmonic_cent(["Person"], ["Coworker"], ["Coworker"], 4, 5, 
true, true, _, _, _)

{
  "error": false,
  "message": "",
  "version": {
    "schema": 0,
    "edition": "enterprise",
    "api": "v2"
  },
  "results": [{"top_scores": [
    {
      "score": 0.04167,
      "Vertex_ID": "Ivy"
    },
    {
      "score": 0.03571,
      "Vertex_ID": "Damon"
    },
    {
      "score": 0.03571,
      "Vertex_ID": "George"
    },
    {
      "score": 0.025,
      "Vertex_ID": "Steven"
    },
    {
      "score": 0.025,
      "Vertex_ID": "Glinda"
    }
  ]}]
}

Visualized results

Influence Maximization (Beta)

Influence maximization is the problem of finding a small subset of vertices in a social network that could maximize the spread of influence.

There are two versions of the Influence Maximization algorithm. Both versions find k vertices that maximize the expected spread of influence in the network. The CELF version improves upon the efficiency of the greedy version and should be preferred in analyzing large networks.

The two versions of the algorithm are based on the following papers: