Fast Random Projection (FastRP)

Fast Random Projection (FastRP) is a scalable and performant node-embedding algorithm. It generates node embeddings (vectors) of low dimensionality through random projections from the graph's adjacency matrix (a high-dimensional matrix) to a low-dimensional matrix, significantly reducing the computing power required to process the data. The algorithm is theoretically backed by the Johnsson-Lindenstrauss lemma, which states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.

This allows us to aggressively reduce the dimensionality of a dataset while preserving the distance information. By running the algorithm on a graph, we can preserve the similarity between nodes and their neighbors. Nodes that have similar neighborhoods are mapped to similar embedding vectors, while nodes that are not similar are mapped to dissimilar vectors.

The algorithm starts by assigning random vectors using very sparse random projection. The algorithm continues to average the random vectors, and then iteratively average the neighboring vectors from the previous iteration to construct intermediate embeddings. In each iteration, the algorithm also normalizes the intermediate embedding using a standard Euclidean norm.

In the end, the embedding for each node is a weighted sum of the intermediate embeddings, where the weights are provided as a parameter of the algorithm.

This algorithm is especially useful in the following situations:

  • Your data is in such high dimension that it is too expensive to perform calculations.

  • The dimensionality is very high and the number of samples is too sparse to calculate covariances.

  • You don't have access to the entire dataset, such as when working with real-time data.‌

Specifications

CREATE QUERY tg_fastRP_query(
INT num_edges, INT num_nodes, INT k, INT sampling_constant, 
INT reduced_dimension, DOUBLE normalization_strength, STRING input_weights, 
STRING index_attr)

This algorithm requires an index attribute on the schema. Before running the algorithm, create an attribute on all vertex types to hold the index. You can use our provided schema change job script in the algorithm library.

After the attribute has been created, run the tg_fastRP_preprocessing.gsql query to index all vertices on the graph.

Parameters

Parameter

Description

Data type

num_edges

The total number of edges in the graph.

INT

num_nodes

The total number of vertices in the graph.

INT

sampling_constant

The sampling constant. A high sampling_constant increases the speed of the algorithm.

INT

reduced_dimension

The length of the produced vectors. A greater dimension offers greater precision, but is more costly to operate over.

INT

k

The number of iterations for the algorithm to construct intermediate embeddings.

INT

normalization_strength

The normalization strength determines how node degrees influence the embedding. Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance.

DOUBLE

input_weights

A list of floats that determines the weight of the intermediate embeddings constructed in each iteration.

STRING

index_attr

The name of the index attribute created during preprocessing.

STRING

Parameter tuning‌

The optimal values for the following parameters are dependent on your dataset. In order to obtain the best quality embeddings for your graph, it is a good idea to tune these parameters. ‌

reduced_dimension

The reduced dimension (reduced_dimension) is the length of the produced vectors. A greater dimension offers greater precision, but is more costly to operate over.

input_weights and k

The algorithm interactively constructs intermediate embeddings by averaging either neighboring intermediate embeddings from the previous iteration, or the generated random vectors during the first iteration. The final embeddings are weighted sums of the intermediate embeddings from each iteration.

k is the number of iterations and the input_weights parameter determine how much each set of intermediate embeddings from each iteration weighs. You should make sure that you supply a value of k that is in accordance with the number of weights you provide.

With each iteration, the algorithm will consider neighbors that are one step farther away.

normalization_strength

The normalization strength determines how node degrees influence the embedding. Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance.

sampling_constant

FastRP uses very sparse random projection to reduce the dimensionality of the data from an nm matrix to an n*d matrix where d <= m by multiplying the original matrix with an m*d matrix. The m*d matrix is made up of independently and identically distributed data sampled from:

Where s is the sampling constant (sampling_constant). The higher the constant, the higher the number of zeros in the resulting matrix, making it more sparse, but also speeds up the algorithm.

Last updated