k-Nearest Neighbors (Cross-Validation Version)

k-Nearest Neighbors (kNN) is often used for machine learning. You can choose the value for topK based on your experience, or using cross-validation to optimize the hyperparameters.

In our library, Leave-one-out cross-validation for selecting optimal k is provided. Given a k value, we run the algorithm repeatedly using every vertex with a known label as the source vertex and predict its label. We assess the accuracy of the predictions for each value of k, and then repeat for different values of k in the given range.

The goal is to find the value of k with highest predicting accuracy in the given range, for that dataset.

Specifications

tg_knn_cosine_cv(SET<STRING> v_type, SET<STRING> e_type, SET<STRING> re_type,
STRING weight, STRING label, INT min_k, INT max_k) RETURNS (INT)

Parameters

Parameter Description Default Value

Parameter	Description	Default Value
`SET<STRING> v_type`	The vertex types to calculate the distance to the source vertex for.	(empty set of strings)
`SET<STRING> e_type`	The edge types to use	(empty set of strings)
`SET<STRING> re_type`	The reverse edge types to use	(empty set of strings)
`STRING weight`	If not empty, use this edge attribute as the edge weight.	(empty string)
`STRING label`	If not empty, read an existing label from this attribute.	(empty string)
`INT min_k`	The lower bound of k (inclusive)	N/A
`INT max_k`	The upper bound of k (inclusive)	N/A

SET<STRING> v_type

The vertex types to calculate the distance to the source vertex for.

(empty set of strings)

SET<STRING> e_type

The edge types to use

(empty set of strings)

SET<STRING> re_type

The reverse edge types to use

(empty set of strings)

STRING weight

If not empty, use this edge attribute as the edge weight.

(empty string)

STRING label

If not empty, read an existing label from this attribute.

(empty string)

INT min_k

The lower bound of k (inclusive)

N/A

INT max_k

The upper bound of k (inclusive)

N/A

Output

A list of prediction accuracy for every k value in the given range, and the value of k with the highest predicting accuracy in the given range.

The result is available in JSON format.

The result size is equal to max_k - min_k + 1.

Time complexity

This algorithm has a complexity of \(O(max\_k*\frac{E^2}{V})\), where \(E\) is the number of edges and \(V\) is the number of vertices.

Example

Run knn_cosine_cv with min_k=2, max_k = 5. The JSON result:

[
  {
    "@@correct_rate_list": [
      0.33333,
      0.33333,
      0.33333,
      0.33333
    ]
  },
  {
    "best_k": 2
  }
]