Fast Random Projection (FastRP) is a scalable and performant node-embedding algorithm. It generates node embeddings (vectors) of low dimensionality through random projections from the graph's adjacency matrix (a high-dimensional matrix) to a low-dimensional matrix, significantly reducing the computing power required to process the data. The algorithm is theoretically backed by the Johnson-Lindenstrauss lemma, which states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.
This allows us to aggressively reduce the dimensionality of a dataset while preserving the distance information. By running the algorithm on a graph, we can preserve the similarity between nodes and their neighbors. Nodes that have similar neighborhoods are mapped to similar embedding vectors, while nodes that are not similar are mapped to dissimilar vectors.
The algorithm starts by assigning each node a random vector generated with very sparse random projection. It then iteratively constructs intermediate embeddings by averaging, for each node, the neighboring vectors from the previous iteration. In each iteration, the algorithm also normalizes the intermediate embeddings using the standard Euclidean norm.
In the end, the embedding for each node is a weighted sum of the intermediate embeddings, where the weights are provided as a parameter of the algorithm.
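The iteration scheme described above can be sketched in Python with NumPy. This is an illustrative simplification, not the library's implementation: the graph is a plain adjacency list, the function name `fastrp_embed` is invented, and the initial vectors are Gaussian rather than the very sparse random projection the real algorithm uses.

```python
import numpy as np

def fastrp_embed(adj, d, k, input_weights, seed=0):
    """Sketch of the FastRP iteration: average neighbor vectors k times,
    normalize after each round, and return the weighted sum.
    (The real algorithm initializes with very sparse random projection;
    Gaussian vectors are used here to keep the sketch short.)"""
    rng = np.random.default_rng(seed)
    n = len(adj)
    cur = rng.standard_normal((n, d))  # initial random vectors
    emb = np.zeros((n, d))
    for i in range(k):
        # Average each node's neighbors' vectors from the previous iteration.
        nxt = np.stack([cur[nbrs].mean(axis=0) if nbrs else np.zeros(d)
                        for nbrs in adj])
        # Normalize with the standard Euclidean norm (guard zero rows).
        norms = np.linalg.norm(nxt, axis=1, keepdims=True)
        nxt = nxt / np.where(norms == 0.0, 1.0, norms)
        # The final embedding is a weighted sum of intermediate embeddings.
        emb += input_weights[i] * nxt
        cur = nxt
    return emb

# Toy triangle graph given as adjacency lists.
adj = [[1, 2], [0, 2], [0, 1]]
vecs = fastrp_embed(adj, d=8, k=2, input_weights=[1.0, 2.0])
print(vecs.shape)  # (3, 8)
```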
This algorithm is especially useful in the following situations:
Your data is in such high dimension that it is too expensive to perform calculations.
The dimensionality is very high and the number of samples is too sparse to calculate covariances.
You don't have access to the entire dataset, such as when working with real-time data.
This algorithm requires an index attribute on the schema. Before running the algorithm, create an attribute on all vertex types to hold the index. You can use our provided schema change job script in the algorithm library.
After the attribute has been created, run the tg_fastRP_preprocessing.gsql query to index all vertices on the graph.
The optimal values for the following parameters depend on your dataset. To obtain the best-quality embeddings for your graph, tune these parameters.
reduced_dimension
The reduced dimension (reduced_dimension) is the length of the produced vectors. A greater dimension offers greater precision, but is more costly to operate over.
input_weights and k
The algorithm iteratively constructs intermediate embeddings by averaging either the neighboring intermediate embeddings from the previous iteration or, in the first iteration, the generated random vectors. The final embeddings are weighted sums of the intermediate embeddings from each iteration.
k is the number of iterations, and the input_weights parameter determines how much the set of intermediate embeddings from each iteration contributes to the final sum. Make sure the value of k you supply matches the number of weights you provide.
With each iteration, the algorithm will consider neighbors that are one step farther away.
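A minimal sketch of the consistency check between k and input_weights, assuming the weights are passed as a comma-separated string (the exact string format expected by the query may differ; `parse_weights` is an illustrative name, not part of the library):

```python
def parse_weights(input_weights, k):
    """Parse a comma-separated weight string and check it matches k."""
    weights = [float(w) for w in input_weights.split(",")]
    if len(weights) != k:
        raise ValueError(f"k={k} but {len(weights)} weights were supplied")
    return weights

print(parse_weights("1.0,2.0,4.0", 3))  # [1.0, 2.0, 4.0]
```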
normalization_strength
The normalization strength determines how node degrees influence the embedding. Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance.
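A quick numerical illustration of this behavior, assuming a neighbor's contribution scales roughly as its degree raised to the normalization strength (the exact scaling inside the algorithm follows the FastRP paper and may differ in detail):

```python
degrees = [1, 10, 100]
for beta in (-0.5, 0.0, 0.5):
    # Relative weight of a neighbor as a function of its degree:
    # a negative exponent downplays high-degree nodes, a positive one
    # amplifies them, and 0 treats all degrees equally.
    print(beta, [round(d ** beta, 3) for d in degrees])
```

With beta = -0.5 the weights are [1.0, 0.316, 0.1], so a degree-100 neighbor counts for a tenth of a degree-1 neighbor; with beta = 0.5 it counts ten times as much.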
sampling_constant
FastRP uses very sparse random projection to reduce the dimensionality of the data from an n*m matrix to an n*d matrix, where d <= m, by multiplying the original matrix with an m*d projection matrix. The entries of the m*d matrix are independently and identically sampled from the following distribution: each entry equals +sqrt(s) with probability 1/(2s), 0 with probability 1 - 1/s, and -sqrt(s) with probability 1/(2s).
Here s is the sampling constant (sampling_constant). The higher the constant, the more zeros the projection matrix contains; this makes the matrix sparser and speeds up the algorithm.
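The sampling rule can be reproduced directly in a few lines, assuming the standard very sparse random projection distribution (entries are +sqrt(s) or -sqrt(s) with probability 1/(2s) each, and 0 otherwise); `sparse_projection_matrix` is an illustrative name:

```python
import numpy as np

def sparse_projection_matrix(m, d, s, seed=0):
    """Build an m x d very sparse random projection matrix: each entry is
    +sqrt(s) or -sqrt(s) with probability 1/(2s) each, 0 otherwise."""
    rng = np.random.default_rng(seed)
    return rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)],
                      size=(m, d),
                      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

R = sparse_projection_matrix(m=1000, d=64, s=10)
# With s = 10, roughly 90% of entries are zero.
print(round((R == 0).mean(), 2))
```

Raising s makes the matrix emptier, which is why a high sampling_constant trades a little precision for speed.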
| Parameter | Description | Data type |
|---|---|---|
| num_edges | The total number of edges in the graph. | INT |
| num_nodes | The total number of vertices in the graph. | INT |
| sampling_constant | The sampling constant. A high sampling_constant increases the speed of the algorithm. | INT |
| reduced_dimension | The length of the produced vectors. A greater dimension offers greater precision, but is more costly to operate over. | INT |
| k | The number of iterations for the algorithm to construct intermediate embeddings. | INT |
| normalization_strength | Determines how node degrees influence the embedding. A negative value downplays the importance of high-degree neighbors, while a positive value increases their importance. | DOUBLE |
| input_weights | A list of floats that determines the weight of the intermediate embeddings constructed in each iteration. | STRING |
| index_attr | The name of the index attribute created during preprocessing. | STRING |
Node2Vec is a node embedding algorithm that uses random walks in the graph to create a vector representation of a node.
A random walk starts at a node; the algorithm then iteratively selects a neighboring node to visit, with each neighbor assigned a transition probability. This transforms the graph structure into a collection of linear sequences of nodes, leaving each node with a list of other nodes from its local or extended neighborhood.
Once the above step is complete, the algorithm uses a variation of the word2vec model from the language modeling community to turn each node into a vector of probabilities. The probabilities represent the likelihood of visiting a given node in a random walk from each starting node.
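The walk-generation step can be sketched as follows. For brevity this uses uniform transition probabilities; Node2Vec proper biases the neighbor choice with its return and in-out parameters, and `random_walks` is an illustrative name:

```python
import random

def random_walks(adj, walks_per_node, walk_length, seed=0):
    """Generate fixed-length random walks from every node; the resulting
    node sequences are fed to a word2vec-style model as 'sentences'."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            for _ in range(walk_length):
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

adj = {0: [1, 2], 1: [0], 2: [0, 1]}
walks = random_walks(adj, walks_per_node=2, walk_length=3)
print(len(walks))  # 6 walks: 2 per node, each starting at its source
```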
Installing this query requires installing a UDF, which can be found in the GitHub repository of the query. If you are running the query on a cluster, you need to manually install the UDF on every node of the cluster.
| Parameter | Description | Data type |
|---|---|---|
| step | Number of random walks per node | INT |
| path_size | Number of hops per walk | INT |
| filepath | File path to output results to | STRING |
| edge_types | Edge types to traverse | SET<STRING> |
| sample_num | Number of nodes to be used in the random sample | INT |