1 of 10

Similarity Algorithms

Euclidean Distance (Beta)

Euclidean distance measures the straight line distance between two points in n-dimensional space. The algorithm takes two vectors denoted by ListAccum and return the Euclidean distance between them.

This algorithm is implemented as a user-defined function. You need to follow the steps in Add a User-Defined Function to add the function to GSQL. After adding the function, you can call it in any GSQL query in the same way as a built-in GSQL function.

Specifications

tg_euclidean_distance_accum(A, B)

Parameters

Return value

The Euclidean distance between the two vectors.

Example

CREATE QUERY euclidean_example() FOR GRAPH social {   
  ListAccum<INT> @@a = [1, 2, 3];
  ListAccum<INT> @@b = [4, 5, 6];
  double distance = tg_euclidean_distance_accum(@@a, @@b);
  PRINT distance; 
}

  {
    "distance": 5.19615
  }

Overlap Similarity (Beta)

The overlap coefficient, or Szymkiewicz–Simpson coefficient, is a similarity measure that measures the overlap between two finite sets.

The algorithm takes two vectors denoted by ListAccum and returns the overlap coefficient between them.

Specification

tg_overlap_similarity_accum( A, B )

Parameters

Return value

The overlap coefficient between the two vectors.

Example

CREATE QUERY overlap_example(/* Parameters here */) FOR GRAPH social { 
  ListAccum<INT> @@a = [1, 2, 3];
  ListAccum<INT> @@b = [2, 2, 3];
  double overlap_similarity = tg_overlap_similarity_accum(@@a, @@b);
  PRINT overlap_similarity; 
}

[
  {
    "overlap_similarity": 0.66667
  }
]

Pearson Similarity (Beta)

The formula for calculating the Pearson correlation coefficient is as follows:

Specifications

Parameters

Return value

The Pearson correlation coefficient between the two vectors.

Example

Cosine Similarity of Neighborhoods (Single-Source)

The cosine similarity of two vectors A and B is defined as follows:

If A and B are identical, then cos(A, B) = 1. As expected for a cosine function, the value can also be negative or zero. In fact, cosine similarity is closely related to the Pearson correlation coefficient.

For this library function, the feature vector is the set of edge weights between the two vertices and their neighbors.

In the movie graph shown in the figure below, there are Person vertices and Movie vertices. Every person may give a rating to some of the movies. The rating score is stored on the Likes edge using the weight attribute. For example, in the graph below, Alex gives a rating of 10 to the movie "Free Solo".

Specifications

The output size is always K (if K <= N), so the algorithm may arbitrarily choose to output one vertex over another if there are tied similarity scores.

Example

Given one person's name, this algorithm calculates the cosine similarity between this person and each other person where there is at one movie they have both rated.

In the previous example, if the source vertex is Alex, and top_k is set to 5, then we calculate the cosine similarity between him and two other persons, Jing and Kevin. The JSON output shows the top 5 similar vertices and their similarity score in descending order. The output limit is 5 persons, but we have only 2 qualified persons:

The FILE version output is not necessarily in descending order. It looks like the following:

The ATTR version inserts an edge into the graph with the similarity score as an edge attribute whenever the score is larger than zero. The result looks like this:

Cosine Similarity of Neighborhoods (All Pairs)

Specifications

Example

Using the movie graph, calculate the cosine similarity between all pairs and show the top 5 pairs. This is the JSON result:

The FILE output is similar to the output of cosine_nbor_file.

The ATTR version will create k edges:

Cosine Similarity of Neighborhoods (Batch)

This algorithm has a time complexity of O(E), where E is the number of edges, and runs on graphs with weighted edges (directed or undirected).

Specifications

Parameters

Result

The result of this algorithm is the top k cosine similarity scores and their corresponding pair for each vertex. The score is only included if it is greater than 0.

The result can be output in JSON format, in CSV to a file, or saved as a similarity edge in the graph itself.

Example

Using the social10 graph, we can calculate the cosine similarity of every person to every other person connected by the Friend edge, and print out the top k most similar pairs for each vertex.

Jaccard Similarity of Neighborhoods (Single Source)

The Jaccard index measures the relative overlap between two sets. To compare two vertices by Jaccard similarity, first select a set of values for each vertex. For example, a set of values for a Person could be the cities the Person has lived in. Then the Jaccard index is computed for the two vectors.

The Jaccard index of two sets A and B is defined as follows:

Jaccard(A,B)=\frac{|A \cap B|}{|A \cup B|}

The value ranges from 0 to 1. If A and B are identical, then Jaccard(A, B) = 1. If both A and B are empty, we define the value to be 0.

Specifications

In the current

tg_jaccard_nbor_ss (VERTEX source, STRING e_type, STRING rev_e_type,  INT top_k = 100,
  BOOL print_accum = TRUE, STRING similarity_edge_type = "",STRING file_path = "")

The algorithm will not output more than K vertices, so the algorithm may arbitrarily choose to output one vertex over another if there are tied similarity scores.

Example

Using the movie graph, we run jaccard_nbor_ss("Neil", 5):

[
  {
    "@@result_topK": [
      {
        "vertex1": "Neil",
        "vertex2": "Kat",
        "score": 0.5
      },
      {
        "vertex1": "Neil",
        "vertex2": "Kevin",
        "score": 0.4
      },
      {
        "vertex1": "Neil",
        "vertex2": "Jing",
        "score": 0.2
      }
    ]
  }
]

If the source vertex (person) doesn't have any common neighbors (movies) with any other vertex (person), such as Elena in our example, the result will be an empty list:

[
  {
    "@@result_topK": []
  }
]

Jaccard Similarity of Neighborhoods (All Pairs)

This algorithm computes the same similarity scores as the Jaccard similarity of neighborhoods, single-source algorithm, except that it considers ALL pairs of vertices in the graph (for the vertex and edge types selected by the user). Naturally, this algorithm will take longer to run. For very large and very dense graphs, this algorithm may not be a practical choice.

Specifications

tg_jaccard_nbor_ap(STRING v_type, STRING e_type, STRING re_type, INT top_k,
  BOOL print_accum = TRUE, STRING similarity_edge = "", STRING file_path = "")

The algorithm will not output more than K vertex pairs, so the algorithm may arbitrarily chose to output one vertex pair over another, if there are tied similarity scores.

Example

For the movie graph, calculate the Jaccard similarity between all pairs and show the 5 most similar pairs: jaccard_nbor_ap(5). This is the JSON output :

[
  {
    "@@total_result": [
      {
        "vertex1": "Kat",
        "vertex2": "Neil",
        "score": 0.5
      },
      {
        "vertex1": "Kevin",
        "vertex2": "Neil",
        "score": 0.4
      },
      {
        "vertex1": "Jing",
        "vertex2": "Alex",
        "score": 0.25
      },
      {
        "vertex1": "Kat",
        "vertex2": "Kevin",
        "score": 0.25
      },
      {
        "vertex1": "Jing",
        "vertex2": "Neil",
        "score": 0.2
      }
    ]
  }
]

Jaccard Similarity of Neighborhoods (Batch)

This algorithm has a time complexity of O(E), where E is the number of edges, and runs on graphs with unweighted edges (directed or undirected).

Specifications

Parameters

Result

The result contains the top k Jaccard similarity scores for each vertex and its corresponding pair. A pair is only included if its similarity is greater than 0, meaning there is at least one common neighbor between the pair. The result is available in JSON format, or can be output to a file in CSV, or it can be saved as an edge on the graph itself. A JSON formatted result could look like this:

Cosine Similarity of Neighborhoods (Single-Source)

To compare two vertices by , the selected properties of each vertex are first represented as a vector. For example, a property vector for a Person vertex could have the elements age, height, and weight. Then the cosine function is applied to the two vectors.

The cosine similarity of two vectors A and B is defined as follows:

cos(A,B)=\frac{A\cdot B}{||A||\cdot ||B||} = \frac{\sum_iA_i B_i}{\sqrt{\sum_iA^2_i}\sqrt{\sum_iB^2_i}}

For this library function, the feature vector is the set of edge weights between the two vertices and their neighbors.

Specifications

tg_cosine_nbor_ss (VERTEX source, SET<STRING> e_type, SET<STRING> re_type,
  STRING weight, INT top_k, INT output_limit, BOOL print_accum = TRUE,
  STRING file_path = "", STRING similarity_edge = "")
RETURNS (MapAccum<VERTEX, FLOAT>)

The output size is always K (if K <= N), so the algorithm may arbitrarily choose to output one vertex over another if there are tied similarity scores.

Example

Given one person's name, this algorithm calculates the cosine similarity between this person and each other person where there is at one movie they have both rated.

[
  {
    "@@result_topk": [
      {
        "vertex1": "Alex",
        "vertex2": "Jing",
        "score": 0.42173
      },
      {
        "vertex1": "Alex",
        "vertex2": "Kevin",
        "score": 0.14248
      }
    ]
  }
]

The FILE version output is not necessarily in descending order. It looks like the following:

Vertex1,Vertex2,Similarity
Alex,Kevin,0.142484
Alex,Jing,0.421731

The ATTR version inserts an edge into the graph with the similarity score as an edge attribute whenever the score is larger than zero. The result looks like this:

Cosine Similarity of Neighborhoods (Batch)

This algorithm computes the same similarity scores as the except that it starts from all of the vertices as the source vertex and computes its similarity scores with its neighbors for all the vertices in parallel. Since this is a memory-intensive operation, it is split into batches to reduce peak memory usage. The user can specify how many batches it is to be split into. Compared with the , this algorithm allows you to split the workload into multiple batches and reduces the burden on memory.

This algorithm has a time complexity of O(E), where E is the number of edges, and runs on graphs with weighted edges (directed or undirected).

Specifications

tg_cosine_batch(STRING vertex_type, STRING edge_type, STRING edge_attribute, 
INT topK, BOOL print_accum = true, STRING file_path, 
STRING similarity_edge, INT num_of_batches=1)

Parameters

Result

The result of this algorithm is the top k cosine similarity scores and their corresponding pair for each vertex. The score is only included if it is greater than 0.

The result can be output in JSON format, in CSV to a file, or saved as a similarity edge in the graph itself.

Example

Using the social10 graph, we can calculate the cosine similarity of every person to every other person connected by the Friend edge, and print out the top k most similar pairs for each vertex.

GSQL > RUN QUERY tg_cosine_batch("Person", "Friend", "weight", 5, true, "", "", 1)

// Every vertex and their most similar pairs ranked by their Cosine
// Similarity score.
[
  {
    "start": [
      {
        "attributes": {
          "start.@heap": [
            {
              "val": 0.49903,
              "ver": "Howard"
            },
            {
              "val": 0.43938,
              "ver": "George"
            },
            {
              "val": 0.05918,
              "ver": "Alex"
            },
            {
              "val": 0.05579,
              "ver": "Ivy"
            }
          ]
        },
        "v_id": "Fiona",
        "v_type": "Person"
      },
      {
        "attributes": {
          "start.@heap": []
        },
        "v_id": "Justin",
        "v_type": "Person"
      },
      {
        "attributes": {
          "start.@heap": []
        },
        "v_id": "Bob",
        "v_type": "Person"
      },
      {
        "attributes": {
          "start.@heap": [
            {
              "val": 0.22361,
              "ver": "Bob"
            },
            {
              "val": 0.21213,
              "ver": "Alex"
            }
          ]
        },
        "v_id": "Chase",
        "v_type": "Person"
      },
      {
        "attributes": {
          "start.@heap": [
            {
              "val": 0.57143,
              "ver": "Bob"
            },
            {
              "val": 0.12778,
              "ver": "Chase"
            }
          ]
        },
        "v_id": "Damon",
        "v_type": "Person"
      },
      {
        "attributes": {
          "start.@heap": []
        },
        "v_id": "Alex",
        "v_type": "Person"
      },
      {
        "attributes": {
          "start.@heap": [
            {
              "val": 0.64253,
              "ver": "Alex"
            },
            {
              "val": 0.63607,
              "ver": "Ivy"
            },
            {
              "val": 0.27091,
              "ver": "Howard"
            },
            {
              "val": 0.14364,
              "ver": "Fiona"
            }
          ]
        },
        "v_id": "George",
        "v_type": "Person"
      },
      {
        "attributes": {
          "start.@heap": []
        },
        "v_id": "Eddie",
        "v_type": "Person"
      },
      {
        "attributes": {
          "start.@heap": [
            {
              "val": 0.94848,
              "ver": "Fiona"
            },
            {
              "val": 0.6364,
              "ver": "Alex"
            },
            {
              "val": 0.31046,
              "ver": "George"
            },
            {
              "val": 0.1118,
              "ver": "Howard"
            }
          ]
        },
        "v_id": "Ivy",
        "v_type": "Person"
      },
      {
        "attributes": {
          "start.@heap": [
            {
              "val": 1.09162,
              "ver": "Fiona"
            },
            {
              "val": 0.78262,
              "ver": "Ivy"
            },
            {
              "val": 0.11852,
              "ver": "George"
            }
          ]
        },
        "v_id": "Howard",
        "v_type": "Person"
      }
    ]
  }
]

Jaccard Similarity of Neighborhoods (Batch)

This algorithm has a time complexity of O(E), where E is the number of edges, and runs on graphs with unweighted edges (directed or undirected).

Specifications

tg_jaccard_batch (STRING v_type, STRING e_type, STRING re_type, INT topK, 
BOOL print_accum = true, STRING similarity_edge, STRING file_path, 
INT num_of_batches = 1)

Parameters

Result

// Run jaccard_batch on social10 graph traversing through Friend edges
[
  {
    "Start": [
      {
        "attributes": {
          "Start.@heap": [
            {
              "val": 0.33333,
              "ver": "Howard"
            },
            {
              "val": 0.25,
              "ver": "Ivy"
            },
            {
              "val": 0.25,
              "ver": "George"
            }
          ]
        },
        "v_id": "Fiona",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Start.@heap": []
        },
        "v_id": "Justin",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Start.@heap": []
        },
        "v_id": "Bob",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Start.@heap": [
            {
              "val": 0.5,
              "ver": "Damon"
            }
          ]
        },
        "v_id": "Chase",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Start.@heap": [
            {
              "val": 0.5,
              "ver": "Chase"
            }
          ]
        },
        "v_id": "Damon",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Start.@heap": [
            {
              "val": 0.33333,
              "ver": "Ivy"
            }
          ]
        },
        "v_id": "Alex",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Start.@heap": [
            {
              "val": 0.5,
              "ver": "Howard"
            },
            {
              "val": 0.25,
              "ver": "Fiona"
            }
          ]
        },
        "v_id": "George",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Start.@heap": []
        },
        "v_id": "Eddie",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Start.@heap": [
            {
              "val": 0.33333,
              "ver": "Alex"
            },
            {
              "val": 0.25,
              "ver": "Fiona"
            }
          ]
        },
        "v_id": "Ivy",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Start.@heap": [
            {
              "val": 0.5,
              "ver": "George"
            },
            {
              "val": 0.33333,
              "ver": "Fiona"
            }
          ]
        },
        "v_id": "Howard",
        "v_type": "Person"
      }
    ]
  }
]