# k-Means Clustering

Supported Graph Characteristics
 Unweighted edges Weighted edges Directed edges Undirected edges Homogeneous vertex types

In the k-Means Clustering algorithm, vertices are gradually sorted into an arbitrary number k of clusters based on averaging their properties.

First, k vertices are selected at random to act as centroids, or the center of their clusters. The first clusters are defined as consisting of the vertices closest to those centroids. The borders of these clusters are refined iteratively by calculating the mean position of all points in the clusters, then by finding the closest vertices to those mean position points to act as the new centroids.

The last two steps are repeated many more times until each iteration no longer changes the location of the centroids or the vertices belonging to the clusters.

Vertices in the graph need to have an attribute in their schemas that is a list of doubles or floats (an embedding) for this algorithm to work. This is because the distances are calculated based on the vertex embeddings rather than any other factor.

In some use cases, an embeddings algorithm like FastRP can be used to create these embeddings. If you have vertices with separate latitude and longitude attributes, you must first prepare the data by combining the two coordinates into a list of `FLOAT` values in a single attribute.

## Notes

This algorithm ignores edges in the graph and looks only at vertex attributes.

This query uses another subquery `tg_kmeans_sub`, which needs to be installed before `tg_kmeans` can be installed.

Before using this query, it is necessary to change the placeholder attribute `embeddings` on lines 42, 93, and 107 in the code to the name of the attribute where the vertex embeddings are stored.

## Specifications

``tg_kmeans(INT k = 2, INT max_k = 5, FLOAT max_change = 1.0, STRING v_type, STRING e_type, BOOL random = FALSE, BOOL print_accum = TRUE, STRING file_path="")``

## Parameters

Name Description Default Value

`INT k`

The starting value of the K range

`2`

`INT max_k`

The ending value of the K range

`5`

`FLOAT max_change`

The maximum centroid vector change condition

`1`

`STRING v_type`

Vertex type to start from

(empty string)

`STRING e_type`

Edge type to traverse

(empty string)

`BOOL random`

Whether to start from either random or non-random centroid positions

`FALSE`

`BOOL print_accum`

Whether to print JSON output to the console

`FALSE`

`STRING file_path`

File to write CSV output to

(empty string)