The KD-Tree and Ball-Tree algorithm are giving me unexpected results in a new pyspark notebook running on databricks.
I have created a new databricks notebook (pyspark) and copied the following code in:
import numpy as np
from sklearn.neighbors import BallTree
matrix = np.array([
[0, 1, 4, 2, 1],
[1, 0, 2, 1, 1],
[4, 2, 0, 3, 1],
[2, 1, 3, 0, 1],
[1, 1, 1, 1, 0]
])
tree = BallTree(matrix)
dist, ind = tree.query(matrix, k=5)
print(ind)
And the ind output is:
[[0 1 3 4 2]
[1 4 3 0 2]
[2 4 1 3 0]
[3 1 4 0 2]
[4 1 3 0 2]]
My expectation would be that the first row should read [[0 1 4 3 2] or it should read [[0 4 1 3 2], rather than [[0 1 3 4 2]. Have I missed something / misunderstood? The same also occurs when I use a KDTree instead of a ball tree.
You should take into account that
BallTreeandKDTreetook as their inputs the coordianates of points in N-dim space (not the matrix of pairwise distances). The following calculation proves that the result returned byBallTreeis correct: