Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
The overlap coefficient, or Szymkiewicz–Simpson coefficient, is a similarity measure that measures the overlap between two finite sets.
The algorithm takes two vectors denoted by ListAccum
and returns the overlap coefficient between them.
This algorithm is implemented as a user-defined function. You need to follow the steps in Add a User-Defined Function to add the function to GSQL. After adding the function, you can call it in any GSQL query in the same way as a built-in GSQL function.
The overlap coefficient between the two vectors.
Name
Description
Data type
A
An n-dimensional vector denoted by a ListAccum
of length n
ListAccum<INT/UINT/FLOAT/DOUBLE>
B
An n-dimensional vector denoted by a ListAccum
of length n
ListAccum<INT/UINT/FLOAT/DOUBLE>
Euclidean distance measures the straight line distance between two points in n-dimensional space. The algorithm takes two vectors denoted by ListAccum
and return the Euclidean distance between them.
This algorithm is implemented as a user-defined function. You need to follow the steps in Add a User-Defined Function to add the function to GSQL. After adding the function, you can call it in any GSQL query in the same way as a built-in GSQL function.
The Euclidean distance between the two vectors.
The cosine similarity of two vectors A and B is defined as follows:
If A and B are identical, then cos(A, B) = 1. As expected for a cosine function, the value can also be negative or zero. In fact, cosine similarity is closely related to the Pearson correlation coefficient.
For this library function, the feature vector is the set of edge weights between the two vertices and their neighbors.
In the movie graph shown in the figure below, there are Person vertices and Movie vertices. Every person may give a rating to some of the movies. The rating score is stored on the Likes edge using the weight attribute. For example, in the graph below, Alex gives a rating of 10 to the movie "Free Solo".
The output size is always K (if K <= N), so the algorithm may arbitrarily choose to output one vertex over another if there are tied similarity scores.
Given one person's name, this algorithm calculates the cosine similarity between this person and each other person where there is at one movie they have both rated.
In the previous example, if the source vertex is Alex, and top_k
is set to 5, then we calculate the cosine similarity between him and two other persons, Jing and Kevin. The JSON output shows the top 5 similar vertices and their similarity score in descending order. The output limit is 5 persons, but we have only 2 qualified persons:
The FILE version output is not necessarily in descending order. It looks like the following:
The ATTR version inserts an edge into the graph with the similarity score as an edge attribute whenever the score is larger than zero. The result looks like this:
This algorithm has a time complexity of O(E), where E is the number of edges, and runs on graphs with weighted edges (directed or undirected).
The result of this algorithm is the top k cosine similarity scores and their corresponding pair for each vertex. The score is only included if it is greater than 0.
The result can be output in JSON format, in CSV to a file, or saved as a similarity edge in the graph itself.
Using the social10
graph, we can calculate the cosine similarity of every person to every other person connected by the Friend
edge, and print out the top k most similar pairs for each vertex.
To compare two vertices by , the selected properties of each vertex are first represented as a vector. For example, a property vector for a Person vertex could have the elements age, height, and weight. Then the cosine function is applied to the two vectors.
This algorithm computes the same similarity scores as the except that it starts from all of the vertices as the source vertex and computes its similarity scores with its neighbors for all the vertices in parallel. Since this is a memory-intensive operation, it is split into batches to reduce peak memory usage. The user can specify how many batches it is to be split into. Compared with the , this algorithm allows you to split the workload into multiple batches and reduces the burden on memory.
Name
Description
Data type
A
An n-dimensional vector denoted by a ListAccum
of length n
ListAccum<INT/UINT/FLOAT/DOUBLE>
B
An n-dimensional vector denoted by a ListAccum
of length n
ListAccum<INT/UINT/FLOAT/DOUBLE>
Characteristic | Value |
Result | The top K vertices in the graph that have the highest similarity scores, along with their scores. The result is available in three forms:
|
Input Parameters |
|
Result Size |
|
Time Complexity | O(D^2), D = outdegree of vertex v |
Graph Types | Undirected or directed edges, weighted edges |
Name | Description |
| Vertex type to calculate similarity for |
| Directed edge type to traverse |
| Name of the attribute on the edge type to use as the weight |
| Number of top scores to report for each vertex |
| If |
| If provided, the similarity score will be saved to this edge. |
| If not empty, write output to this file in CSV. |
| Number of batches to divide the query into |
The Pearson correlation coefficient is a measure of between two sets of data. It is the ratio between the of two variables and the product of their .
This algorithm is implemented as a user-defined function. You need to follow the steps in to add the function to GSQL. After adding the function, you can call it in any GSQL query in the same way as a built-in GSQL function.
Name | Description | Data type |
| An n-dimensional vector denoted by a |
|
| An n-dimensional vector denoted by a |
|
This algorithm has a time complexity of O(E), where E is the number of edges, and runs on graphs with unweighted edges (directed or undirected).
The result contains the top k Jaccard similarity scores for each vertex and its corresponding pair. A pair is only included if its similarity is greater than 0, meaning there is at least one common neighbor between the pair. The result is available in JSON format, or can be output to a file in CSV, or it can be saved as an edge on the graph itself. A JSON formatted result could look like this:
This algorithm computes the same similarity scores as the (cosine_nbor_ss
), except that it considers ALL pairs of vertices in the graph (for the vertex and edge types selected by the user). Naturally, this algorithm will take longer to run. For very large and very dense graphs, this may not be a practical choice.
This algorithm computes the same similarity scores as the except that it starts from all of the vertices as the source vertex and computes its similarity scores with its neighbors for all the vertices in parallel. Since this is a memory-intensive operation, it is split into batches to reduce peak memory usage. The user can specify how many batches it is to be split into. Compared with the , this algorithm allows you to split the workload into multiple batches and reduces the burden on memory.
Characteristic | Value |
Result | The top k vertex pairs in the graph which have the highest similarity scores, along with their scores. The result is available in three forms:
|
Input Parameters |
|
Result Size |
|
Time Complexity | O(E), E = number of edges |
Graph Types | Undirected or directed edges, weighted edges |
Name | Description |
| Vertex type to calculate similarity for |
| Directed edge type to traverse |
| Reverse edge type to traverse |
| Number of top scores to report for each vertex |
| If |
| If provided, the similarity scores will be saved to this edge type. |
| If a file path is provided, the algorithm will output to a file specified by the file path in CSV format |
| Number of batches to divide the query into |
The Jaccard index measures the relative overlap between two sets. To compare two vertices by Jaccard similarity, first select a set of values for each vertex. For example, a set of values for a Person could be the cities the Person has lived in. Then the Jaccard index is computed for the two vectors.
The Jaccard index of two sets A and B is defined as follows:
The value ranges from 0 to 1. If A and B are identical, then Jaccard(A, B) = 1. If both A and B are empty, we define the value to be 0.
In the current
The algorithm will not output more than K vertices, so the algorithm may arbitrarily choose to output one vertex over another if there are tied similarity scores.
Using the movie graph, we run jaccard_nbor_ss("Neil", 5)
:
If the source vertex (person) doesn't have any common neighbors (movies) with any other vertex (person), such as Elena in our example, the result will be an empty list:
This algorithm computes the same similarity scores as the Jaccard similarity of neighborhoods, single-source algorithm, except that it considers ALL pairs of vertices in the graph (for the vertex and edge types selected by the user). Naturally, this algorithm will take longer to run. For very large and very dense graphs, this algorithm may not be a practical choice.
The algorithm will not output more than K vertex pairs, so the algorithm may arbitrarily chose to output one vertex pair over another, if there are tied similarity scores.
For the movie graph, calculate the Jaccard similarity between all pairs and show the 5 most similar pairs: jaccard_nbor_ap(5). This is the JSON output :
Characteristic
Value
Result
The top k vertices in the graph that have the highest similarity scores, along with their scores.
The result is available in three forms:
streamed out in JSON format
written to a file in tabular format, or
stored as a vertex attribute value.
Input Parameters
SET<STRING> v_type
: Vertex type to calculate similarity score for
SET<STRING> e_type
: Edge type to traverse
SET<STRING> re_type
: Reverse edge type to traverse
INT top_k
: the number of vertex pairs with the highest similarity scores to return
BOOL print_accum
: Boolean value that decides whether to output to console
STRING similarity_edge
: If provided, the similarity score will be saved to this edge
STRING filepath:
If provided, the algorithm will output to the file path in CSV format
Result Size
top_k
Time Complexity
O(D^2), D = outdegree of vertex v
Graph Types
Undirected or directed edges, unweighted edges
Characteristic
Value
Result
The top k vertex pairs in the graph that have the highest similarity scores, along with their scores.
The result is available in three forms:
streamed out in JSON format
written to a file in tabular format, or
stored as a vertex attribute value.
Input Parameters
SET<STRING> v_type
: Vertex types to calculate similarity score for
SET<STRING> e_type
: Edge types to traverse
SET<STRING> re_type
: Reverse edge types to traverse
INT top_k
: the number of vertex pairs with the highest similarity scores to return
BOOL print_accum
: Boolean value that decides whether to output to console
STRING similarity_edge
: If provided, the similarity score will be saved to this edge
STRING file_path
: If provided, the algorithm will output to the file path in CSV format
Result Size
top_k
Time Complexity
O(E^2 / V), V = number of vertices, E = number of edges
Graph Types
Undirected or directed edges, unweighted edges