Fast Random Projection

Algorithm link: Fast Random Projection

Fast Random Projection (FastRP) is a scalable and performant node-embedding algorithm. It generates node embeddings (vectors) of low dimensionality through random projections from the graph’s adjacency matrix (a high-dimensional matrix) to a low-dimensional matrix, significantly reducing the computing power required to process the data.

The algorithm is theoretically backed by the Johnson-Lindenstrauss lemma, which states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.

This algorithm is especially useful in the following situations:

  • Your data is in such high dimension that it is too expensive to perform calculations.

  • The dimensionality is very high and the number of samples are too sparse to calculate covariances.

  • You don’t have access to the entire dataset, such as when working with real-time data.

FastRP was originally created to work on undirected graphs. It can work on directed graphs, although we recommend having reverse edges in your schema so the embeddings can be passed in both directions. If using a directed graph, there is a modification to allow it to run with sink vertices, but this may affect the embeddings.

Specifications

CREATE QUERY tg_fastRP(SET<STRING> v_type, SET<STRING> e_type,
    STRING weights, FLOAT beta, INT k, INT reduced_dim,
    INT sampling_constant, INT random_seed, BOOL print_accum=FALSE,
    STRING result_attr="", STRING file_path="")

This query needs to be modified based upon your schema to set the embedding attribute accordingly if you wish to store the embeddings in your graph. On line 92, change fastrp_embedding to the name of the attribute where you want to store the embedding. The attribute must type LIST<DOUBLE>. If you do not wish to store the embeddings in graph, delete the lines that use the attribute fastrp_embedding.

Parameters

Parameter Description Data type

v_type

The total number of edges in the graph.

SET<STRING>

e_type

The total number of vertices in the graph.

SET<STRING>

k

"Depth" of embedding. k=2 means that the resulting embedding would take vertices within 2 hops into account.

INT

sampling_constant

The sampling constant that controls the sparsity of the resulting embedding. A high sampling_constant increases the speed of the algorithm. See more details in Parameter Tuning.

INT

random_seed

Seed number of the random number generator used to generate the random vectors. It can be set to any positive integer. If the seed number is kept the same, the numbers used to generate random vectors stay the same.

INT

reduced_dim

The length of the produced vectors. A greater dimension offers a greater precision, but is more costly to operate over.

INT

beta

Hyperparameter that is typically between -1 and 0.

DOUBLE

weights

Comma-separated string of weights for each hop in the graph. The number of weights must be equal the value of the parameter k. when k=3 gives the nearest neighbor a weight of 1, neighbors 2 hops away a weight of 2 and neighbors 3 hops away a weight of 4.

STRING

print_accum

If true, print results to standard JSON output.

BOOL

result_attr

If provided, the algorithm stores the produced embeddings in this attribute. The attribute must be of type LIST<DOUBLE>.

STRING

file_path

If provided, print results to this filepath in CSV format.

STRING

Parameter tuning

The optimal values for the following parameters depend on your dataset. In order to obtain the best quality embeddings for your graph, it is a good idea to tune these parameters.

reduced_dimension

The reduced dimension (reduced_dimension) is the length of the produced vectors. A greater dimension offers a greater precision, but is more costly to operate over.

input_weights

The algorithm interactively constructs intermediate embeddings by averaging either neighboring intermediate embeddings from the previous iteration, or the generated random vectors during the first iteration. The final embeddings is a weighted sum of the intermediate embeddings from each iteration, and the input_weights parameter determine how much each set of intermediate embeddings weigh.

beta

The parameter beta determines how node degrees influence the embedding. Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance.

Usually, beta is set to be between -1 and 0.

sampling_constant

FastRP uses very sparse random projection to reduce the dimensionality of the data from an \$n*m\$ matrix to an \$n*d\$ matrix where \$d <= m\$ by multiplying the original matrix with an \$m*d\$ matrix. The \$m*d\$ matrix is made up of independently and identically distributed data sampled from:

image (38)

Where s is the sampling constant (sampling_constant). The higher the constant, the higher the number of zeros in the resulting matrix, which speeds up the algorithm.