Fast Random Projection (FastRP) is a scalable and performant node-embedding algorithm. It generates node embeddings (vectors) of low dimensionality through random projections from the graph's adjacency matrix (a high-dimensional matrix) to a low-dimensional matrix, significantly reducing the computing power required to process the data. The algorithm is theoretically backed by the Johnson-Lindenstrauss lemma, which states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.
This allows us to aggressively reduce the dimensionality of a dataset while preserving the distance information. By running the algorithm on a graph, we can preserve the similarity between nodes and their neighbors. Nodes that have similar neighborhoods are mapped to similar embedding vectors, while nodes that are not similar are mapped to dissimilar vectors.
The algorithm starts by assigning each node a random vector generated with very sparse random projection. It then iteratively constructs intermediate embeddings by averaging, for each node, the neighboring vectors from the previous iteration. In each iteration, the algorithm also normalizes the intermediate embeddings using the standard Euclidean norm.
In the end, the embedding for each node is a weighted sum of the intermediate embeddings, where the weights are provided as a parameter of the algorithm.
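The iteration scheme described above can be sketched in Python with NumPy. This is an illustrative simplification, not the library's implementation: the graph is a plain adjacency list, the function name `fastrp_embed` is invented, and the initial vectors are Gaussian rather than the very sparse random projection the real algorithm uses.

```python
import numpy as np

def fastrp_embed(adj, d, k, input_weights, seed=0):
    """Sketch of the FastRP iteration: average neighbor vectors k times,
    normalize after each round, and return the weighted sum.
    (The real algorithm initializes with very sparse random projection;
    Gaussian vectors are used here to keep the sketch short.)"""
    rng = np.random.default_rng(seed)
    n = len(adj)
    cur = rng.standard_normal((n, d))  # initial random vectors
    emb = np.zeros((n, d))
    for i in range(k):
        # Average each node's neighbors' vectors from the previous iteration.
        nxt = np.stack([cur[nbrs].mean(axis=0) if nbrs else np.zeros(d)
                        for nbrs in adj])
        # Normalize with the standard Euclidean norm (guard zero rows).
        norms = np.linalg.norm(nxt, axis=1, keepdims=True)
        nxt = nxt / np.where(norms == 0.0, 1.0, norms)
        # The final embedding is a weighted sum of intermediate embeddings.
        emb += input_weights[i] * nxt
        cur = nxt
    return emb

# Toy triangle graph given as adjacency lists.
adj = [[1, 2], [0, 2], [0, 1]]
vecs = fastrp_embed(adj, d=8, k=2, input_weights=[1.0, 2.0])
print(vecs.shape)  # (3, 8)
```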
This algorithm is especially useful in the following situations:
Your data is in such high dimension that it is too expensive to perform calculations.
The dimensionality is very high and the number of samples is too sparse to calculate covariances.
You don't have access to the entire dataset, such as when working with real-time data.
This algorithm requires an index attribute on the schema. Before running the algorithm, create an attribute on all vertex types to hold the index. You can use our provided schema change job script in the algorithm library.
After the attribute has been created, run the tg_fastRP_preprocessing.gsql query to index all vertices on the graph.
The optimal values for the following parameters depend on your dataset. To obtain the best-quality embeddings for your graph, tune these parameters.
reduced_dimension
The reduced dimension (reduced_dimension) is the length of the produced vectors. A greater dimension offers greater precision, but is more costly to operate over.
input_weights and k
The algorithm iteratively constructs intermediate embeddings by averaging either the neighboring intermediate embeddings from the previous iteration or, in the first iteration, the generated random vectors. The final embeddings are weighted sums of the intermediate embeddings from each iteration.
k is the number of iterations, and the input_weights parameter determines how much the set of intermediate embeddings from each iteration contributes to the final sum. Make sure the value of k you supply matches the number of weights you provide.
With each iteration, the algorithm will consider neighbors that are one step farther away.
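A minimal sketch of the consistency check between k and input_weights, assuming the weights are passed as a comma-separated string (the exact string format expected by the query may differ; `parse_weights` is an illustrative name, not part of the library):

```python
def parse_weights(input_weights, k):
    """Parse a comma-separated weight string and check it matches k."""
    weights = [float(w) for w in input_weights.split(",")]
    if len(weights) != k:
        raise ValueError(f"k={k} but {len(weights)} weights were supplied")
    return weights

print(parse_weights("1.0,2.0,4.0", 3))  # [1.0, 2.0, 4.0]
```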
normalization_strength
The normalization strength determines how node degrees influence the embedding. Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance.
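A quick numerical illustration of this behavior, assuming a neighbor's contribution scales roughly as its degree raised to the normalization strength (the exact scaling inside the algorithm follows the FastRP paper and may differ in detail):

```python
degrees = [1, 10, 100]
for beta in (-0.5, 0.0, 0.5):
    # Relative weight of a neighbor as a function of its degree:
    # a negative exponent downplays high-degree nodes, a positive one
    # amplifies them, and 0 treats all degrees equally.
    print(beta, [round(d ** beta, 3) for d in degrees])
```

With beta = -0.5 the weights are [1.0, 0.316, 0.1], so a degree-100 neighbor counts for a tenth of a degree-1 neighbor; with beta = 0.5 it counts ten times as much.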
sampling_constant
FastRP uses very sparse random projection to reduce the dimensionality of the data from an n*m matrix to an n*d matrix, where d <= m, by multiplying the original matrix with an m*d projection matrix. The entries of the m*d matrix are independently and identically sampled from the following distribution: each entry equals +sqrt(s) with probability 1/(2s), 0 with probability 1 - 1/s, and -sqrt(s) with probability 1/(2s).
Here s is the sampling constant (sampling_constant). The higher the constant, the more zeros the projection matrix contains; this makes the matrix sparser and speeds up the algorithm.
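The sampling rule can be reproduced directly in a few lines, assuming the standard very sparse random projection distribution (entries are +sqrt(s) or -sqrt(s) with probability 1/(2s) each, and 0 otherwise); `sparse_projection_matrix` is an illustrative name:

```python
import numpy as np

def sparse_projection_matrix(m, d, s, seed=0):
    """Build an m x d very sparse random projection matrix: each entry is
    +sqrt(s) or -sqrt(s) with probability 1/(2s) each, 0 otherwise."""
    rng = np.random.default_rng(seed)
    return rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)],
                      size=(m, d),
                      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

R = sparse_projection_matrix(m=1000, d=64, s=10)
# With s = 10, roughly 90% of entries are zero.
print(round((R == 0).mean(), 2))
```

Raising s makes the matrix emptier, which is why a high sampling_constant trades a little precision for speed.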
| Parameter | Description | Data type |
|---|---|---|
| num_edges | The total number of edges in the graph. | INT |
| num_nodes | The total number of vertices in the graph. | INT |
| sampling_constant | The sampling constant. A high sampling_constant increases the speed of the algorithm. | INT |
| reduced_dimension | The length of the produced vectors. A greater dimension offers greater precision, but is more costly to operate over. | INT |
| k | The number of iterations for the algorithm to construct intermediate embeddings. | INT |
| normalization_strength | Determines how node degrees influence the embedding. A negative value downplays the importance of high-degree neighbors, while a positive value increases their importance. | DOUBLE |
| input_weights | A list of floats that determines the weight of the intermediate embeddings constructed in each iteration. | STRING |
| index_attr | The name of the index attribute created during preprocessing. | STRING |
Node2Vec is a node embedding algorithm that uses random walks in the graph to create a vector representation of a node.
A random walk starts at a node; the algorithm then iteratively selects a neighboring node to visit, with each neighbor assigned a transition probability. This transforms the graph structure into a collection of linear sequences of nodes, leaving each node with a list of other nodes from its local or extended neighborhood.
Once the above step is complete, the algorithm uses a variation of the word2vec model from the language modeling community to turn each node into a vector of probabilities. The probabilities represent the likelihood of visiting a given node in a random walk from each starting node.
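The walk-generation step can be sketched as follows. For brevity this uses uniform transition probabilities; Node2Vec proper biases the neighbor choice with its return and in-out parameters, and `random_walks` is an illustrative name:

```python
import random

def random_walks(adj, walks_per_node, walk_length, seed=0):
    """Generate fixed-length random walks from every node; the resulting
    node sequences are fed to a word2vec-style model as 'sentences'."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            for _ in range(walk_length):
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

adj = {0: [1, 2], 1: [0], 2: [0, 1]}
walks = random_walks(adj, walks_per_node=2, walk_length=3)
print(len(walks))  # 6 walks: 2 per node, each starting at its source
```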
Installing this query requires installing a UDF, which can be found in the GitHub repository of the query. If you are running the query on a cluster, you need to manually install the UDF on every node of the cluster.
| Parameter | Description | Data type |
|---|---|---|
| step | Number of random walks per node | INT |
| path_size | Number of hops per walk | INT |
| filepath | File path to output results to | STRING |
| edge_types | Edge types to traverse | SET<STRING> |
| sample_num | Number of nodes to be used in the random sample | INT |