k-Nearest Neighbors (Cross-Validation Version)

k-Nearest Neighbors (kNN) is often used for machine learning. You can choose the value for topK based on your experience, or using cross-validation to optimize the hyperparameters.

In our library, Leave-one-out cross-validation for selecting optimal k is provided. Given a k value, we run the algorithm repeatedly using every vertex with a known label as the source vertex and predict its label. We assess the accuracy of the predictions for each value of k, and then repeat for different values of k in the given range.

The goal is to find the value of k with highest predicting accuracy in the given range, for that dataset.

Specifications

tg_knn_cosine_cv( SET<STRING> v_type_set,  SET<STRING> e_type_set, SET<STRING> reverse_e_type_set,
STRING weight_attribute, STRING label, INT min_k, INT max_k) RETURNS (INT)

Parameters

Parameter Description Default Value

Parameter	Description	Default Value
`SET<STRING> v_type_set`	The vertex types to calculate the distance to the source vertex for.	(empty set of strings)
`SET<STRING> e_type_set`	The edge types to use	(empty set of strings)
`SET<STRING> reverse_e_type_set`	The reverse edge types to use	(empty set of strings)
`STRING weight_attribute`	If not empty, use this edge attribute as the edge weight.	(empty string)
`STRING label`	If not empty, read an existing label from this attribute.	(empty string)
`INT min_k`	The lower bound of k (inclusive)	N/A
`INT max_k`	The upper bound of k (inclusive)	N/A

SET<STRING> v_type_set

The vertex types to calculate the distance to the source vertex for.

(empty set of strings)

SET<STRING> e_type_set

The edge types to use

(empty set of strings)

SET<STRING> reverse_e_type_set

The reverse edge types to use

(empty set of strings)

STRING weight_attribute

If not empty, use this edge attribute as the edge weight.

(empty string)

STRING label

If not empty, read an existing label from this attribute.

(empty string)

INT min_k

The lower bound of k (inclusive)

N/A

INT max_k

The upper bound of k (inclusive)

N/A

Output

A list of prediction accuracy for every k value in the given range, and the value of k with the highest predicting accuracy in the given range.

The result is available in JSON format.

The result size is equal to max_k - min_k + 1.

Time complexity

This algorithm has a complexity of \(O(max\_k*\frac{E^2}{V})\), where \(E\) is the number of edges and \(V\) is the number of vertices.

Example

Run knn_cosine_cv with min_k=2, max_k = 5. The JSON result:

[
  {
    "@@correct_rate_list": [
      0.33333,
      0.33333,
      0.33333,
      0.33333
    ]
  },
  {
    "best_k": 2
  }
]