Jaccard Similarity of Neighborhoods (Single Source)

Supported Graph Characteristics

Unweighted edges

Homogeneous vertex types

The Jaccard index measures the relative overlap between two sets.

To compare two vertices by Jaccard similarity, first select a set of attribute values for each vertex.

For example, a set of values for a Person vertex could be the list of cities the person has lived in. The Jaccard index is computed for the two value sets.

The Jaccard index of two sets A and B is defined as follows:

\[Jaccard(A,B)=\frac{|A \cap B|}{|A \cup B|}\]

The value ranges from 0 to 1. If A and B are identical, then \(Jaccard(A, B) = 1\). If both A and B are empty, we define the value to be 0.

Specifications

tg_jaccard_nbor_ss (VERTEX source, STRING e_type, STRING rev_e_type,  INT top_k = 100,
  BOOL print_accum = TRUE, STRING similarity_edge_type = "",STRING file_path = "")

Parameters

Parameter Description Default value

SET<STRING> v_type

Vertex type used to calculate similarity score

(empty set of strings)

SET<STRING> e_type

Edge type to traverse

(empty set of strings)

SET<STRING> re_type

Reverse edge type to traverse

(empty set of strings)

INT top_k

The number of vertex pairs with the highest similarity scores to return. Limits number of returned items.

100

BOOL print_accum

Whether to output the final results to the console in JSON format

True

STRING similarity_edge

If provided, the similarity score will be saved to this edge

(empty string)

STRING filepath

If provided, the algorithm will save the output in CSV format to this file

(empty string)

Output

The top k vertices in the graph that have the highest similarity scores, along with their scores.

Vertices with a similarity score of 0 to the target vertex are not included.

The result is available in three forms:

  • streamed out in JSON format

  • written to a file in tabular format, or

  • stored as a vertex attribute value.

Result size

If there are several vertices with identical similarity scores and returning them all would exceed the limit imposed by top_k, the algorithm arbitrarily chooses which results to display.

Time complexity

This algorithm has a time complexity of \(O(D^2)\), where \(D\) is equal to the outdegree of the vertex.

Example

This example uses a Movie Rating graph consisting of Person and Movie vertices. There are Likes edges that are weighted according to how much the person liked the movie. Each person in the dataset liked at least one movie, but not all movies were liked by all people.

View of a graph illustrating Person vertices connected to Movie vertices by Likes edges.

Here we run tg_jaccard_nbor_ss("Neil", "Likes", "Reverse_Likes", 5):

[
  {
    "Others": [
      {
        "attributes": {
          "Others.@sum_similarity": 0.5
        },
        "v_id": "Kat",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Others.@sum_similarity": 0.4
        },
        "v_id": "Kevin",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Others.@sum_similarity": 0.2
        },
        "v_id": "Jing",
        "v_type": "Person"
      }
    ]
  }
]

When comparing similarity to Neil, Kat is ranked higher than Kevin. This makes intuitive sense, because Kat likes two movies, both of which were also liked by Neil. Kevin also likes two movies that Neil likes. However, Kevin also likes a third movie that Neil doesn’t like, and is therefore less similar than Kat was.

Although we set top_k to 5, only three vertices were returned because neither Alex nor Elena likes any movies that Kevin likes.

If the source vertex (Person) doesn’t have any common neighbors (Movies) with any other vertex (Person), such as Elena in our example, the result is an empty list:

Here we run tg_jaccard_nbor_ss("Elena", "Likes", "Reverse_Likes", 5):

[
  {
    "Others": []
  }
]