Hopefully someone can help me out with the following:
I have a mixed-type dataset in Python that contains both numerical (dtypes: int, float, bool) and categorical (dtype: categorical) variables. I want to fit a nearest-neighbors model on this dataset using the NearestNeighbors class from sklearn.neighbors. To handle the different data types, I want to set the metric parameter to a heterogeneous distance metric. The sklearn.neighbors documentation states that "any metric from scikit-learn or scipy.spatial.distance can be used" for this parameter. Because (as far as I know) these do not include a heterogeneous distance metric, I decided to use distython, which provides a user-defined distance metric class (HEOM) that can compute heterogeneous distances for datasets with both numerical and categorical variables.
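For context, this is roughly what I mean by a heterogeneous metric. HEOM (Heterogeneous Euclidean-Overlap Metric) is usually described as: an overlap distance (0 if equal, 1 otherwise) on categorical attributes, a range-normalized absolute difference on numerical attributes, a distance of 1 whenever a value is missing, and the square root of the sum of squares over all attributes. The sketch below is my own illustration and not distython's actual implementation; the name `heom_distance` and its parameters are hypothetical, and it assumes categorical columns are already stored as numeric codes.

```python
import numpy as np


def heom_distance(x, y, cat_idx, ranges):
    """HEOM-style distance between two 1-D rows (illustrative sketch only).

    cat_idx: indices of categorical columns (assumed stored as numeric codes).
    ranges:  per-column (max - min), used to scale numerical columns.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ranges = np.asarray(ranges, dtype=float)

    is_cat = np.zeros(x.shape[0], dtype=bool)
    is_cat[list(cat_idx)] = True

    # Numerical columns: absolute difference scaled by the column range
    # (a zero range is replaced by 1 to avoid division by zero).
    safe_ranges = np.where(ranges > 0, ranges, 1.0)
    per_attr = np.abs(x - y) / safe_ranges

    # Categorical columns: overlap distance (0 if equal, 1 otherwise).
    per_attr[is_cat] = (x[is_cat] != y[is_cat]).astype(float)

    # Missing values (NaN) in either row: maximal per-attribute distance of 1.
    missing = np.isnan(x) | np.isnan(y)
    per_attr[missing] = 1.0

    return float(np.sqrt(np.sum(per_attr ** 2)))


# Column 1 is categorical; columns 0 and 2 are numerical.
d = heom_distance([1.0, 0.0, 10.0], [1.0, 1.0, 20.0],
                  cat_idx=[1], ranges=[1.0, 1.0, 20.0])
print(d)  # sqrt(0**2 + 1**2 + 0.5**2)
```

The point of the sketch is only that such a metric takes two mixed-type rows and returns one number, which is what I expected NearestNeighbors to need.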
My code is the following:
from sklearn.neighbors import NearestNeighbors
from distython import HEOM
# X_train: dataset with numerical & categorical variables (defined earlier)
# catIndices: column indices of the categorical variables (defined earlier)
# Initialize the heterogeneous distance metric
heom_metric = HEOM(X_train, catIndices)
# Construct & train the nearest neighbor algorithm
neigh = NearestNeighbors(n_neighbors=5, metric=heom_metric.heom)
neigh.fit(X_train)
However, I get the following error:
Cell In[17], line 12
11 neigh = NearestNeighbors(n_neighbors=5, metric = heom_metric.heom)
---> 12 neigh.fit(X_train)
File ~\anaconda3\Lib\site-packages\sklearn\base.py:1152, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1145 estimator._validate_params()
1147 with config_context(
1148 skip_parameter_validation=(
1149 prefer_skip_nested_validation or global_skip_validation
1150 )
1151 ):
-> 1152 return fit_method(estimator, *args, **kwargs)
File ~\anaconda3\Lib\site-packages\sklearn\neighbors\_unsupervised.py:175, in NearestNeighbors.fit(self, X, y)
154 @_fit_context(
155 # NearestNeighbors.metric is not validated yet
156 prefer_skip_nested_validation=False
157 )
158 def fit(self, X, y=None):
159 """Fit the nearest neighbors estimator from the training dataset.
160
161 Parameters
(...)
173 The fitted nearest neighbors estimator.
174 """
--> 175 return self._fit(X)
File ~\anaconda3\Lib\site-packages\sklearn\neighbors\_base.py:498, in NeighborsBase._fit(self, X, y)
496 else:
497 if not isinstance(X, (KDTree, BallTree, NeighborsBase)):
--> 498 X = self._validate_data(X, accept_sparse="csr", order="C")
500 self._check_algorithm_metric()
501 if self.metric_params is None:
File ~\anaconda3\Lib\site-packages\sklearn\base.py:605, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
603 out = X, y
604 elif not no_val_X and no_val_y:
--> 605 out = check_array(X, input_name="X", **check_params)
606 elif no_val_X and not no_val_y:
607 out = _check_y(y, **check_params)
File ~\anaconda3\Lib\site-packages\sklearn\utils\validation.py:915, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
913 array = xp.astype(array, dtype, copy=False)
914 else:
--> 915 array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
916 except ComplexWarning as complex_warning:
917 raise ValueError(
918 "Complex data not supported\n{}\n".format(array)
919 ) from complex_warning
File ~\anaconda3\Lib\site-packages\sklearn\utils\_array_api.py:380, in _asarray_with_order(array, dtype, order, copy, xp)
378 array = numpy.array(array, order=order, dtype=dtype)
379 else:
--> 380 array = numpy.asarray(array, order=order, dtype=dtype)
382 # At this point array is a NumPy ndarray. We convert it to an array
383 # container that is consistent with the input's namespace.
384 return xp.asarray(array)
**ValueError: could not convert string to float: [element from categorical column]**
I understand that the fit() method of the NearestNeighbors class does not handle categorical objects, because the kNN algorithm cannot compute distances between string elements (obviously). However, I don't understand why this error also occurs in my situation, where I explicitly pass a distance metric that accepts a combination of numerical and categorical values as input and returns a single numerical distance as output.
My hypothesis is that my user-defined distance metric is, for some reason, not properly 'recognized' by the NearestNeighbors class as being heterogeneous, so fit() raises this ValueError before it even gets the chance to calculate any distances. The weird thing is that this error does not occur in the example supplied on the GitHub page of the user-defined distance metric.
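To check this hypothesis, here is a minimal sketch that does not depend on distython. Everything in it is made up for illustration: the toy data, the `overlap_metric` function and the call counter. It assumes that an object array containing strings fails validation the same way my X_train does, and it uses algorithm="brute" only to keep the sketch independent of tree building.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

calls = {"n": 0}


def overlap_metric(a, b):
    """Toy metric: number of positions where two rows differ."""
    calls["n"] += 1
    return float(np.sum(a != b))


# Hypothetical toy data: one numerical column and one string-valued column.
X_raw = np.array(
    [[25.0, "red"], [32.0, "blue"], [47.0, "red"], [51.0, "green"]],
    dtype=object,
)

raised = False
try:
    NearestNeighbors(n_neighbors=2, algorithm="brute",
                     metric=overlap_metric).fit(X_raw)
except ValueError as exc:
    raised = True
    print("fit() failed:", exc)

calls_during_failed_fit = calls["n"]
print("metric calls before the error:", calls_during_failed_fit)

# Same data with the string column replaced by integer codes.
codes = np.unique(X_raw[:, 1].astype(str), return_inverse=True)[1]
X_encoded = np.column_stack([X_raw[:, 0].astype(float), codes.astype(float)])

nn = NearestNeighbors(n_neighbors=2, algorithm="brute",
                      metric=overlap_metric).fit(X_encoded)
dist, idx = nn.kneighbors(X_encoded[:1])
print("neighbors of row 0:", idx[0], "distances:", dist[0])
print("metric calls after kneighbors:", calls["n"])
```

If this behaves as I expect, the first fit() raises the ValueError while the metric has not been called a single time, and the integer-coded copy gets past validation. That would point at the input validation in fit() rather than at the metric itself, although I have not verified whether this is why the GitHub example runs without the error.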
My question is: how do I fix this issue and make sure that my NearestNeighbors class properly accepts my user-defined distance metric?
Thanks in advance.