Fast Random Projection

Supported Graph Characteristics

Unweighted edges

Directed edges

Undirected edges

Homogeneous vertex types

Heterogeneous vertex types

Algorithm link: Fast Random Projection

Fast Random Projection (FastRP) is a scalable and performant node-embedding algorithm. It generates node embeddings (vectors) of low dimensionality through random projections from the graph’s adjacency matrix (a high-dimensional matrix) to a low-dimensional matrix, significantly reducing the computing power required to process the data.

The algorithm is theoretically backed by the Johnson-Lindenstrauss lemma, which states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.

This algorithm is especially useful in the following situations:

  • Your data is of such high dimension that it is too expensive to perform calculations.

  • The dimensionality is very high and the number of samples are too sparse to calculate covariances.

  • You don’t have access to the entire dataset, such as when working with real-time data.

Notes

FastRP was originally created to work on undirected graphs. It can work on directed graphs, although we recommend having reverse edges in your schema so the embeddings can be passed in both directions. If using a directed graph, there is a modification to allow it to run with sink vertices, but this may affect the embeddings.

This query needs to be modified based upon your schema to set the embedding attribute accordingly if you wish to store the embeddings in your graph. On line 92, change fastrp_embedding to the name of the attribute where you want to store the embedding. The attribute must be of type LIST<DOUBLE>.

If you do not wish to store the embeddings in the graph, delete the lines that use the attribute fastrp_embedding.

Specifications

CREATE QUERY tg_fastRP(SET<STRING> v_type, SET<STRING> e_type, SET<STRING> output_v_type, STRING weights, FLOAT beta, INT total_dim, INT default_idx = 0, INT default_length, FLOAT default_weight, SET<STRING> embedding_dim_map, SET<STRING> feature_dim_map, INT sampling_constant, INT random_seed, STRING result_attr="", SET<VERTEX> filter_v_set, SET<STRING> stop_set, STRING component_attr="", INT batch_iter=0, BOOL print_to_file=FALSE, STRING filepath="", BOOL print_all=FALSE, INT print_sample_size=5)

Parameters

Parameter Description Default value

SET<STRING> v_type

Set of all vertex types to traverse in the graph.

(empty set of strings)

SET<STRING> e_type

Set of all edge types to traverse in the graph.

(empty set of strings)

SET<STRING> output_v_type

Set of vertex types to produce output embeddings for.

(empty set of strings)

STRING iteration_weights

A comma-separated string of numbers to weight each iteration of the embedding process.

(empty string)

FLOAT beta

A float value that normalizes high-degree vertices in the graph. Usually between -1 and 0, where -1 penalizes high-degree vertices heavily. If desired, one can use a positive value to weight high-degree vertices more than low-degree vertices.

N/A

INT embedding_dimension

The total dimensions of the desired embedding.

N/A

INT default_index

The default_index parameter sets the index in the set of default edges in the embedding vector where the edges start to be incorporated in the vector.

0

INT default_length

Length of the embedding region that incorporates default edge type information.

N/A

FLOAT default_weight

Influence that the default edge types have in the region of the embedding they are incorporated.

1.0

SET<STRING> embedding_dim_map

Set of comma-separated tuples of <edge type>,<length>,<weight>,<starting index>. Use if incorporating specific edge types into certain regions of the embedding is desired. If the empty set is used, all edge types will be considered "default" edge types.

(empty set of strings)

INT sampling_constant

Parameter defined from the FastRP paper. Controls the sparsity of the embedding. Usually an integer between 1 (fully dense embedding) and 3 works well.

3

INT random_seed

Seed for the random number generator.

42

STRING result_attribute

Attribute where the result embedding is stored. Must be of type LIST<FLOAT>. If empty, the algorithm does not write the embedding to an attribute.

(empty string)

STRING component_attribute

If batching the embedding process by connected components, specify the vertex attribute that contains the integer ID of the component.

(empty string)

INT batch_number

If batching by connected component, specify which batch to compute the embeddings for.

N/A

STRING filepath

If specified, what file to write the computed embeddings to.

(empty string)

BOOL print_results

If TRUE, all vertex embeddings will be printed.

False

INT choose_k

Prints a sample of this many vertices and their embeddings.

5

Parameter tuning

The optimal values for the following parameters depend on your dataset. Tune these parameters to obtain the best quality embeddings for your graph.

reduced_dimension

The reduced dimension (reduced_dimension) is the length of the produced vectors. A greater dimension offers a greater precision, but is more costly to operate over.

input_weights

The algorithm interactively constructs intermediate embeddings by averaging either neighboring intermediate embeddings from the previous iteration, or the generated random vectors during the first iteration. The final embeddings is a weighted sum of the intermediate embeddings from each iteration, and the input_weights parameter determine how much each set of intermediate embeddings weigh.

beta

The parameter beta determines how node degrees influence the embedding. Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance.

Usually, beta is set to be between -1 and 0.

sampling_constant

FastRP uses very sparse random projection to reduce the dimensionality of the data from an \$n*m\$ matrix to an \$n*d\$ matrix where \$d <= m\$ by multiplying the original matrix with an \$m*d\$ matrix. The \$m*d\$ matrix is made up of independently and identically distributed data sampled from:

image (38)

s is the sampling constant (sampling_constant). The higher the constant, the higher the number of zeros in the resulting matrix, which speeds up the algorithm.