Fast Random Projection
Fast Random Projection (FastRP) is a scalable and performant node-embedding algorithm. It generates node embeddings (vectors) of low dimensionality through random projections from the graph’s adjacency matrix (a high-dimensional matrix) to a low-dimensional matrix, significantly reducing the computing power required to process the data.
The algorithm is theoretically backed by the Johnson-Lindenstrauss lemma, which states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.
This algorithm is especially useful in the following situations:
-
Your data is of such high dimension that it is too expensive to perform calculations.
-
The dimensionality is very high and the number of samples are too sparse to calculate covariances.
-
You don’t have access to the entire dataset, such as when working with real-time data.
Notes
FastRP was originally created to work on undirected graphs. It can work on directed graphs, although we recommend having reverse edges in your schema so the embeddings can be passed in both directions. If using a directed graph, there is a modification to allow it to run with sink vertices, but this may affect the embeddings.
This query needs to be modified based upon your schema to set the embedding
attribute accordingly if you wish to store the embeddings in your graph.
On line 92, change fastrp_embedding
to the name of the attribute where you want to store the embedding.
The attribute must be of type LIST<DOUBLE>
.
If you do not wish to store the embeddings in the graph, delete the lines that use the attribute fastrp_embedding
.
Specifications
CREATE QUERY tg_fastRP(SET<STRING> v_type, SET<STRING> e_type, SET<STRING> output_v_type, STRING weights, FLOAT beta, INT total_dim, INT default_idx = 0, INT default_length, FLOAT default_weight, SET<STRING> embedding_dim_map, SET<STRING> feature_dim_map, INT sampling_constant, INT random_seed, STRING result_attr="", SET<VERTEX> filter_v_set, SET<STRING> stop_set, STRING component_attr="", INT batch_iter=0, BOOL print_to_file=FALSE, STRING filepath="", BOOL print_all=FALSE, INT print_sample_size=5)
Parameters
Parameter | Description | Default value |
---|---|---|
|
Set of all vertex types to traverse in the graph. |
(empty set of strings) |
|
Set of all edge types to traverse in the graph. |
(empty set of strings) |
|
Set of vertex types to produce output embeddings for. |
(empty set of strings) |
|
A comma-separated string of numbers to weight each iteration of the embedding process. |
(empty string) |
|
A float value that normalizes high-degree vertices in the graph. Usually between -1 and 0, where -1 penalizes high-degree vertices heavily. If desired, one can use a positive value to weight high-degree vertices more than low-degree vertices. |
N/A |
|
The total dimensions of the desired embedding. |
N/A |
|
The default_index parameter sets the index in the set of default edges in the embedding vector where the edges start to be incorporated in the vector. |
0 |
|
Length of the embedding region that incorporates default edge type information. |
N/A |
|
Influence that the default edge types have in the region of the embedding they are incorporated. |
1.0 |
|
Set of comma-separated tuples of |
(empty set of strings) |
|
Parameter defined from the FastRP paper. Controls the sparsity of the embedding. Usually an integer between 1 (fully dense embedding) and 3 works well. |
3 |
|
Seed for the random number generator. |
42 |
|
Attribute where the result embedding is stored.
Must be of type |
(empty string) |
|
If batching the embedding process by connected components, specify the vertex attribute that contains the integer ID of the component. |
(empty string) |
|
If batching by connected component, specify which batch to compute the embeddings for. |
N/A |
|
If specified, what file to write the computed embeddings to. |
(empty string) |
|
If |
False |
|
Prints a sample of this many vertices and their embeddings. |
5 |
Parameter tuning
The optimal values for the following parameters depend on your dataset. Tune these parameters to obtain the best quality embeddings for your graph.
reduced_dimension
The reduced dimension (reduced_dimension
) is the length of the produced vectors. A greater dimension offers a greater precision, but is more costly to operate over.
input_weights
The algorithm interactively constructs intermediate embeddings by averaging either neighboring intermediate embeddings from the previous iteration, or the generated random vectors during the first iteration.
The final embeddings is a weighted sum of the intermediate embeddings from each iteration, and the input_weights
parameter determine how much each set of intermediate embeddings weigh.
beta
The parameter beta
determines how node degrees influence the embedding.
Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance.
Usually, beta
is set to be between -1 and 0.
sampling_constant
FastRP uses very sparse random projection to reduce the dimensionality of the data from an \$n*m\$ matrix to an \$n*d\$ matrix where \$d <= m\$ by multiplying the original matrix with an \$m*d\$ matrix. The \$m*d\$ matrix is made up of independently and identically distributed data sampled from:
s is the sampling constant (sampling_constant
). The higher the constant, the higher the number of zeros in the resulting matrix, which speeds up the algorithm.