k-Means Clustering
In the k-Means Clustering algorithm, vertices are gradually sorted into an arbitrary number k of clusters based on averaging their properties.
First, k vertices are selected at random to act as centroids, or the center of their clusters. The first clusters are defined as consisting of the vertices closest to those centroids. The borders of these clusters are refined iteratively by calculating the mean position of all points in the clusters, then by finding the closest vertices to those mean position points to act as the new centroids.
The last two steps are repeated many more times until each iteration no longer changes the location of the centroids or the vertices belonging to the clusters.
Vertices in the graph need to have an attribute in their schemas that is a list of doubles or floats (an embedding) for this algorithm to work. This is because the distances are calculated based on the vertex embeddings rather than any other factor.
In some use cases, an embeddings algorithm like FastRP can be used to create these embeddings.
If you have vertices with separate latitude and longitude attributes, you must first prepare the data by combining the two coordinates into a list of FLOAT
values in a single attribute.
Notes
This algorithm ignores edges in the graph and looks only at vertex attributes.
This query uses another subquery tg_kmeans_sub
, which needs to be installed before tg_kmeans
can be installed.
Before using this query, it is necessary to change the placeholder attribute embeddings
on lines 42, 93, and 107 in the code to the name of the attribute where the vertex embeddings are stored.
Specifications
tg_kmeans(INT k = 2, INT max_k = 5, FLOAT max_change = 1.0, STRING v_type, STRING e_type, BOOL random = FALSE, BOOL print_accum = TRUE, STRING file_path="")
Parameters
Name | Description | Default Value |
---|---|---|
|
The starting value of the K range |
|
|
The ending value of the K range |
|
|
The maximum centroid vector change condition |
|
|
Vertex type to start from |
(empty string) |
|
Edge type to traverse |
(empty string) |
|
Whether to start from either random or non-random centroid positions |
|
|
Whether to print JSON output to the console |
|
|
File to write CSV output to |
(empty string) |