Jaccard Similarity of Neighborhoods (Single Source)

The Jaccard index measures the relative overlap between two sets. To compare two vertices by Jaccard similarity, first select a set of values for each vertex. For example, a set of values for a Person could be the cities the Person has lived in. Then the Jaccard index is computed for the two vectors.

The Jaccard index of two sets A and B is defined as follows:

\[Jaccard(A,B)=\frac{|A \cap B|}{|A \cup B|}\]

The value ranges from 0 to 1. If A and B are identical, then Jaccard(A, B) = 1. If both A and B are empty, we define the value to be 0.

Specifications

tg_jaccard_nbor_ss (VERTEX source, STRING e_type, STRING rev_e_type,  INT top_k = 100,
  BOOL print_accum = TRUE, STRING similarity_edge_type = "",STRING file_path = "")

Time complexity

This algorithm has a time complexity of \(O(D^2)\), where \(D\) is equal to the outdegree of the vertex.

Characteristic Value

Result

The top k vertices in the graph that have the highest similarity scores, along with their scores.

The result is available in three forms:

  • streamed out in JSON format

  • written to a file in tabular format, or

  • stored as a vertex attribute value.

Input Parameters

  • SET<STRING> v_type: Vertex type to calculate similarity score for

  • SET<STRING> e_type: Edge type to traverse

  • SET<STRING> re_type: Reverse edge type to traverse

  • INT top_k: the number of vertex pairs with the highest similarity scores to return

  • BOOL print_accum: Boolean value that decides whether to output to console

  • STRING similarity_edge: If provided, the similarity score will be saved to this edge

  • `STRING filepath:`If provided, the algorithm will output to the file path in CSV format

Result Size

top_k

Graph Types

Undirected or directed edges, unweighted edges

The algorithm will not output more than K vertices, so the algorithm may arbitrarily choose to output one vertex over another if there are tied similarity scores.

Example

Using the movie graph, we run jaccard_nbor_ss("Neil", 5):

[
  {
    "@@result_topK": [
      {
        "vertex1": "Neil",
        "vertex2": "Kat",
        "score": 0.5
      },
      {
        "vertex1": "Neil",
        "vertex2": "Kevin",
        "score": 0.4
      },
      {
        "vertex1": "Neil",
        "vertex2": "Jing",
        "score": 0.2
      }
    ]
  }
]

If the source vertex (person) doesn’t have any common neighbors (movies) with any other vertex (person), such as Elena in our example, the result will be an empty list:

[
  {
    "@@result_topK": []
  }
]