k-Means Clustering

Supported Graph Characteristics

Unweighted edges

Weighted edges

Directed edges

Undirected edges

Homogeneous vertex types

In the k-Means Clustering algorithm, vertices are gradually sorted into an arbitrary number k of clusters based on averaging their properties.

First, k vertices are selected at random to act as centroids, or the center of their clusters. The first clusters are defined as consisting of the vertices closest to those centroids. The borders of these clusters are refined iteratively by calculating the mean position of all points in the clusters, then by finding the closest vertices to those mean position points to act as the new centroids.

The last two steps are repeated many more times until each iteration no longer changes the location of the centroids or the vertices belonging to the clusters.

Vertices in the graph need to have an attribute in their schemas that is a list of doubles or floats (an embedding) for this algorithm to work. This is because the distances are calculated based on the vertex embeddings rather than any other factor.

In some use cases, an embeddings algorithm like FastRP can be used to create these embeddings. If you have vertices with separate latitude and longitude attributes, you must first prepare the data by combining the two coordinates into a list of FLOAT values in a single attribute.

Notes

This algorithm ignores edges in the graph and looks only at vertex attributes.

This query uses another subquery tg_kmeans_sub, which needs to be installed before tg_kmeans can be installed.

Before using this query, it is necessary to change the placeholder attribute embeddings on lines 42, 93, and 107 in the code to the name of the attribute where the vertex embeddings are stored.

Specifications

tg_kmeans(INT k = 2, INT max_k = 5, FLOAT max_change = 1.0, STRING v_type, STRING e_type, BOOL random = FALSE, BOOL print_accum = TRUE, STRING file_path="")

Parameters

Name Description Default Value

Name	Description	Default Value
`INT k`	The starting value of the K range	`2`
`INT max_k`	The ending value of the K range	`5`
`FLOAT max_change`	The maximum centroid vector change condition	`1`
`STRING v_type`	Vertex type to start from	(empty string)
`STRING e_type`	Edge type to traverse	(empty string)
`BOOL random`	Whether to start from either random or non-random centroid positions	`FALSE`
`BOOL print_accum`	Whether to print JSON output to the console	`FALSE`
`STRING file_path`	File to write CSV output to	(empty string)

INT k

The starting value of the K range

2

INT max_k

The ending value of the K range

5

FLOAT max_change

The maximum centroid vector change condition

1

STRING v_type

Vertex type to start from

(empty string)

STRING e_type

Edge type to traverse

(empty string)

BOOL random

Whether to start from either random or non-random centroid positions

FALSE

BOOL print_accum

Whether to print JSON output to the console

FALSE

STRING file_path

File to write CSV output to

(empty string)