Jaccard Similarity of Neighborhoods (Single Source)

Supported Graph Characteristics

Unweighted edges

Homogeneous vertex types

The Jaccard index measures the relative overlap between two sets. To compare two vertices by Jaccard similarity, first select a set of values for each vertex. For example, a set of values for a Person could be the cities the Person has lived in. Then the Jaccard index is computed for the two vectors.

The Jaccard index of two sets A and B is defined as follows:

\[Jaccard(A,B)=\frac{|A \cap B|}{|A \cup B|}\]

The value ranges from 0 to 1. If A and B are identical, then \(Jaccard(A, B) = 1\). If both A and B are empty, we define the value to be 0.

Specifications

tg_jaccard_nbor_ss (VERTEX source, STRING e_type, STRING rev_e_type,  INT top_k = 100,
  BOOL print_accum = TRUE, STRING similarity_edge_type = "",STRING file_path = "")

Time complexity

This algorithm has a time complexity of \(O(D^2)\), where \(D\) is equal to the outdegree of the vertex.

Characteristic Value

Result

The top k vertices in the graph that have the highest similarity scores, along with their scores.

The result is available in three forms:

  • streamed out in JSON format

  • written to a file in tabular format, or

  • stored as a vertex attribute value.

Input Parameters

  • SET<STRING> v_type: Vertex type to calculate similarity score for

  • SET<STRING> e_type: Edge type to traverse

  • SET<STRING> re_type: Reverse edge type to traverse

  • INT top_k: the number of vertex pairs with the highest similarity scores to return

  • BOOL print_accum: Boolean value that decides whether to output to console

  • STRING similarity_edge: If provided, the similarity score will be saved to this edge

  • `STRING filepath:`If provided, the algorithm will output to the file path in CSV format

Result Size

top_k

Graph Types

Undirected or directed edges, unweighted edges

The algorithm will not output more than K vertices, so the algorithm may arbitrarily choose to output one vertex over another if there are tied similarity scores.

Example

Using the movie graph, we run jaccard_nbor_ss("Neil", 5):

[
  {
    "@@result_topK": [
      {
        "vertex1": "Neil",
        "vertex2": "Kat",
        "score": 0.5
      },
      {
        "vertex1": "Neil",
        "vertex2": "Kevin",
        "score": 0.4
      },
      {
        "vertex1": "Neil",
        "vertex2": "Jing",
        "score": 0.2
      }
    ]
  }
]

If the source vertex (person) doesn’t have any common neighbors (movies) with any other vertex (person), such as Elena in our example, the result will be an empty list:

[
  {
    "@@result_topK": []
  }
]