Fast Random Projection

Supported Graph Characteristics
 Unweighted edges Directed edges Undirected edges Homogeneous vertex types Heterogeneous vertex types

Fast Random Projection (FastRP) is a scalable and performant node-embedding algorithm. It generates node embeddings (vectors) of low dimensionality through random projections from the graph’s adjacency matrix (a high-dimensional matrix) to a low-dimensional matrix, significantly reducing the computing power required to process the data.

The algorithm is theoretically backed by the Johnson-Lindenstrauss lemma, which states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.

This algorithm is especially useful in the following situations:

• Your data is of such high dimension that it is too expensive to perform calculations.

• The dimensionality is very high and the number of samples are too sparse to calculate covariances.

• You don’t have access to the entire dataset, such as when working with real-time data.

Notes

FastRP was originally created to work on undirected graphs. It can work on directed graphs, although we recommend having reverse edges in your schema so the embeddings can be passed in both directions. If using a directed graph, there is a modification to allow it to run with sink vertices, but this may affect the embeddings.

This query needs to be modified based upon your schema to set the embedding attribute accordingly if you wish to store the embeddings in your graph. On line 92, change `fastrp_embedding` to the name of the attribute where you want to store the embedding. The attribute must be of type `LIST<DOUBLE>`.

If you do not wish to store the embeddings in the graph, delete the lines that use the attribute `fastrp_embedding`.

Specifications

``CREATE QUERY tg_fastRP(SET<STRING> v_type, SET<STRING> e_type, SET<STRING> output_v_type, STRING weights, FLOAT beta, INT total_dim, INT default_idx = 0, INT default_length, FLOAT default_weight, SET<STRING> embedding_dim_map, SET<STRING> feature_dim_map, INT sampling_constant, INT random_seed, STRING result_attr="", SET<VERTEX> filter_v_set, SET<STRING> stop_set, STRING component_attr="", INT batch_iter=0, BOOL print_to_file=FALSE, STRING filepath="", BOOL print_all=FALSE, INT print_sample_size=5)``

Parameters

Parameter Description Default value

`SET<STRING> v_type`

Set of all vertex types to traverse in the graph.

(empty set of strings)

`SET<STRING> e_type`

Set of all edge types to traverse in the graph.

(empty set of strings)

`SET<STRING> output_v_type`

Set of vertex types to produce output embeddings for.

(empty set of strings)

`STRING iteration_weights`

A comma-separated string of numbers to weight each iteration of the embedding process.

(empty string)

`FLOAT beta`

A float value that normalizes high-degree vertices in the graph. Usually between -1 and 0, where -1 penalizes high-degree vertices heavily. If desired, one can use a positive value to weight high-degree vertices more than low-degree vertices.

N/A

`INT embedding_dimension`

The total dimensions of the desired embedding.

N/A

`INT default_index`

The default_index parameter sets the index in the set of default edges in the embedding vector where the edges start to be incorporated in the vector.

0

`INT default_length`

Length of the embedding region that incorporates default edge type information.

N/A

`FLOAT default_weight`

Influence that the default edge types have in the region of the embedding they are incorporated.

1.0

`SET<STRING> embedding_dim_map`

Set of comma-separated tuples of `<edge type>,<length>,<weight>,<starting index>`. Use if incorporating specific edge types into certain regions of the embedding is desired. If the empty set is used, all edge types will be considered "default" edge types.

(empty set of strings)

`INT sampling_constant`

Parameter defined from the FastRP paper. Controls the sparsity of the embedding. Usually an integer between 1 (fully dense embedding) and 3 works well.

3

`INT random_seed`

Seed for the random number generator.

42

`STRING result_attribute`

Attribute where the result embedding is stored. Must be of type `LIST<FLOAT>`. If empty, the algorithm does not write the embedding to an attribute.

(empty string)

`STRING component_attribute`

If batching the embedding process by connected components, specify the vertex attribute that contains the integer ID of the component.

(empty string)

`INT batch_number`

If batching by connected component, specify which batch to compute the embeddings for.

N/A

`STRING filepath`

If specified, what file to write the computed embeddings to.

(empty string)

`BOOL print_results`

If `TRUE`, all vertex embeddings will be printed.

False

`INT choose_k`

Prints a sample of this many vertices and their embeddings.

5

Parameter tuning

The optimal values for the following parameters depend on your dataset. Tune these parameters to obtain the best quality embeddings for your graph.

`reduced_dimension`

The reduced dimension (`reduced_dimension`) is the length of the produced vectors. A greater dimension offers a greater precision, but is more costly to operate over.

`input_weights`

The algorithm interactively constructs intermediate embeddings by averaging either neighboring intermediate embeddings from the previous iteration, or the generated random vectors during the first iteration. The final embeddings is a weighted sum of the intermediate embeddings from each iteration, and the `input_weights` parameter determine how much each set of intermediate embeddings weigh.

`beta`

The parameter `beta` determines how node degrees influence the embedding. Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance.

Usually, `beta` is set to be between -1 and 0.

`sampling_constant`

FastRP uses very sparse random projection to reduce the dimensionality of the data from an n*m matrix to an n*d matrix where d <= m by multiplying the original matrix with an m*d matrix. The m*d matrix is made up of independently and identically distributed data sampled from:

s is the sampling constant (`sampling_constant`). The higher the constant, the higher the number of zeros in the resulting matrix, which speeds up the algorithm.