Error "could not convert string to float" when setting user-defined heterogeneous distance metric for mixed-type dataset in sklearn.neighbors

Asked by Tomas H. on 27 October 2023 at 11:42 (60 views)

Hopefully someone can help me out with the following:

I have a mixed-type dataset in Python that contains both numerical (dtypes: int, float, bool) and categorical (dtype: categorical) variables. I want to fit a nearest-neighbor model on this dataset using the NearestNeighbors class from sklearn.neighbors. To handle the different data types, I want to pass a heterogeneous distance metric as the metric parameter. The sklearn.neighbors description states that "any metric from scikit-learn or scipy.spatial.distance can be used" for this parameter. Because (as far as I know) these do not include a heterogeneous distance metric, I decided to use distython, whose HEOM class is a user-defined distance metric that can compute heterogeneous distances for datasets with both numerical and categorical variables.

My code is the following:

from sklearn.neighbors import NearestNeighbors
from distython import HEOM

X_train # dataset with numerical & categorical variables
catIndices # column indices of categorical variables

# initialize heterogeneous distance metric
heom_metric = HEOM(X_train, catIndices)
    
# Construct & train nearest neighbor algorithm
neigh = NearestNeighbors(n_neighbors=5, metric = heom_metric.heom)
neigh.fit(X_train)

However, I get the following error:

Cell In[17], line 12
     11 neigh = NearestNeighbors(n_neighbors=5, metric = heom_metric.heom)
---> 12 neigh.fit(X_train)

File ~\anaconda3\Lib\site-packages\sklearn\base.py:1152, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1145     estimator._validate_params()
   1147 with config_context(
   1148     skip_parameter_validation=(
   1149         prefer_skip_nested_validation or global_skip_validation
   1150     )
   1151 ):
-> 1152     return fit_method(estimator, *args, **kwargs)

File ~\anaconda3\Lib\site-packages\sklearn\neighbors\_unsupervised.py:175, in NearestNeighbors.fit(self, X, y)
    154 @_fit_context(
    155     # NearestNeighbors.metric is not validated yet
    156     prefer_skip_nested_validation=False
    157 )
    158 def fit(self, X, y=None):
    159     """Fit the nearest neighbors estimator from the training dataset.
    160 
    161     Parameters
   (...)
    173         The fitted nearest neighbors estimator.
    174     """
--> 175     return self._fit(X)

File ~\anaconda3\Lib\site-packages\sklearn\neighbors\_base.py:498, in NeighborsBase._fit(self, X, y)
    496 else:
    497     if not isinstance(X, (KDTree, BallTree, NeighborsBase)):
--> 498         X = self._validate_data(X, accept_sparse="csr", order="C")
    500 self._check_algorithm_metric()
    501 if self.metric_params is None:

File ~\anaconda3\Lib\site-packages\sklearn\base.py:605, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
    603         out = X, y
    604 elif not no_val_X and no_val_y:
--> 605     out = check_array(X, input_name="X", **check_params)
    606 elif no_val_X and not no_val_y:
    607     out = _check_y(y, **check_params)

File ~\anaconda3\Lib\site-packages\sklearn\utils\validation.py:915, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    913         array = xp.astype(array, dtype, copy=False)
    914     else:
--> 915         array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
    916 except ComplexWarning as complex_warning:
    917     raise ValueError(
    918         "Complex data not supported\n{}\n".format(array)
    919     ) from complex_warning

File ~\anaconda3\Lib\site-packages\sklearn\utils\_array_api.py:380, in _asarray_with_order(array, dtype, order, copy, xp)
    378     array = numpy.array(array, order=order, dtype=dtype)
    379 else:
--> 380     array = numpy.asarray(array, order=order, dtype=dtype)
    382 # At this point array is a NumPy ndarray. We convert it to an array
    383 # container that is consistent with the input's namespace.
    384 return xp.asarray(array)

ValueError: could not convert string to float: [element from categorical column]

I understand that the fit() method of the NearestNeighbors class does not handle categorical objects, because the kNN algorithm cannot compute distances between string elements (obviously). However, I don't understand why this error also occurs in my situation, where I explicitly pass a distance metric that takes a combination of numerical and categorical values as input and returns a single numerical distance as output.

My hypothesis is that my user-defined distance metric is, for some reason, not properly 'recognized' by the NearestNeighbors class as being heterogeneous, so fit() already raises this ValueError before it even gets the chance to calculate any distances. The weird thing is that this error does not occur in the example supplied on the GitHub page of the user-defined distance metric.
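
The traceback suggests the error comes from input validation (check_array) rather than from the metric itself. A minimal sketch to check this, assuming a toy object-dtype array in place of X_train and a hypothetical dummy_metric that ignores its inputs (neither is from my real code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for X_train: one numerical column, one string-valued column.
X_mixed = np.array([[1.0, "red"], [2.0, "blue"], [3.0, "red"]], dtype=object)


def dummy_metric(a, b):
    """Hypothetical metric that never looks at its inputs."""
    return 0.0


error_message = None
try:
    # fit() validates X (casting it to float) regardless of the metric passed.
    NearestNeighbors(n_neighbors=1, metric=dummy_metric).fit(X_mixed)
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If the same ValueError shows up here, the failure does not depend on which metric is passed, only on the strings in X.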

My question is: how do I fix this issue and make sure that my NearestNeighbors class properly accepts my user-defined distance metric?

Thanks in advance.
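
One possible workaround, sketched under assumptions: encode the categorical columns as numeric codes before calling fit(), so that validation can cast X to float, and let the metric compare those codes for equality only. The heom_like function below is a simplified, hypothetical stand-in written for illustration, not distython's implementation, and the toy data and names (X_raw, cat_indices, num_indices) are made up.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OrdinalEncoder

# Toy mixed-type data: column 0 is numerical, column 1 is categorical.
X_raw = np.array(
    [[1.0, "red"], [2.0, "blue"], [3.0, "red"], [10.0, "blue"]],
    dtype=object,
)
num_indices = [0]
cat_indices = [1]

# Replace category labels with numeric codes so X can be cast to float.
X_encoded = X_raw.copy()
X_encoded[:, cat_indices] = OrdinalEncoder().fit_transform(X_raw[:, cat_indices])
X_encoded = X_encoded.astype(float)

# Range of each numerical column, used to scale numerical differences.
num_values = X_encoded[:, num_indices]
col_range = num_values.max(axis=0) - num_values.min(axis=0)


def heom_like(a, b):
    """Simplified HEOM-style distance (hypothetical, not distython's code)."""
    # Categorical part: 0 if the codes match, 1 otherwise.
    cat_part = (a[cat_indices] != b[cat_indices]).astype(float)
    # Numerical part: absolute difference scaled by the column range.
    num_part = np.abs(a[num_indices] - b[num_indices]) / col_range
    return np.sqrt(np.sum(cat_part ** 2) + np.sum(num_part ** 2))


neigh = NearestNeighbors(n_neighbors=2, metric=heom_like).fit(X_encoded)
distances, indices = neigh.kneighbors(X_encoded[:1])
print(indices, distances)
```

The codes carry no ordering information here because the metric only tests them for equality; whether distython's HEOM treats encoded columns the same way would need to be checked against its documentation.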
