I am working on a regression model to predict housing sale prices, and I have split the data into X and y. To preprocess the data I created two pipelines: one that imputes and scales the numeric variables, and one that imputes and encodes the categorical variables. Both pipelines work as expected when I use them to transform the DataFrame directly, but something changes when I pass them to a ColumnTransformer: the dataset comes out looking different, and the DataFrame built from the ColumnTransformer output raises an error when passed to my mutual information function:

ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.

It works if I do not specify the discrete features, and any DataFrame preprocessed another way lets me specify them without a problem. I need mutual_info_regression to recognize the discrete features so it produces good results.
Here is the preprocessing code that produces the problem:
# Pipeline to impute missing values and scale numerical variables
numerical_processes = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value=0)),
                                      ('scaler', StandardScaler())])

# Pipeline to impute missing values and encode categorical variables
categorical_processes = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='None')),
                                        ('encoder', ce.TargetEncoder())])

# Preprocessor that wraps up the processes for both numerical and categorical variables
Preprocessor = ColumnTransformer(
    transformers=[('num', numerical_processes, numerical),
                  ('categorical', categorical_processes, categorical)])
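To show what I mean by "the dataset looks different", here is a minimal toy sketch of the same setup (made-up data, and sklearn's OrdinalEncoder standing in for ce.TargetEncoder so it runs without category_encoders). The ColumnTransformer returns a plain NumPy array, so the column labels are gone, and the columns come back in transformer order (numeric first), not in the original DataFrame order:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy frame with the categorical column deliberately listed first
X_toy = pd.DataFrame({'Neighborhood': ['A', 'B', 'A', 'B'],
                      'LotArea': [8450.0, 9600.0, 11250.0, 9550.0]})
numerical = ['LotArea']
categorical = ['Neighborhood']

num_pipe = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value=0)),
                           ('scaler', StandardScaler())])
cat_pipe = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='None')),
                           ('encoder', OrdinalEncoder())])

pre = ColumnTransformer(transformers=[('num', num_pipe, numerical),
                                      ('cat', cat_pipe, categorical)])

out = pre.fit_transform(X_toy)
print(type(out))   # plain ndarray: the column labels are gone
print(out[:2])     # numeric column now comes FIRST, then the encoded categorical

# One way to restore labels: transformer order, not the original order
X_pp = pd.DataFrame(out, columns=numerical + categorical, index=X_toy.index)
print(X_pp.columns.tolist())
```

So any column indices computed against the original X no longer line up with the preprocessed array unless they are rebuilt in transformer order.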
Below is code that does the exact same preprocessing but works (I understand I could just use this, but it's for a portfolio, so I want to use a preprocessor for neatness):
X_df = X.copy()
X_df[numerical] = numerical_processes.fit_transform(X[numerical])
X_df[categorical] = categorical_processes.fit_transform(X[categorical], y)
X_df.head()
Here is the code for the mutual_info_regression helper and the output when I call it:
from sklearn.feature_selection import mutual_info_regression

def MI(X, y, categorical):
    mi_scores = mutual_info_regression(X, y, discrete_features=X.columns.get_indexer(categorical),
                                       random_state=4)
    mi_scores = pd.Series(mi_scores, name='Mutual Info', index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

print(MI(X_pp, y, categorical))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[21], line 1
----> 1 print(MI(X_pp, y, categorical))
2 #Mutual_Information.head()
Cell In[18], line 4, in MI(X, y, categorical)
3 def MI(X, y, categorical):
----> 4 mi_scores = mutual_info_regression(X, y, discrete_features = X.columns.get_indexer(categorical),
5 random_state = 4)
6 mi_scores = pd.Series(mi_scores, name = 'Mutual Info', index = X.columns)
7 mi_scores = mi_scores.sort_values(ascending = False)
File /opt/conda/lib/python3.10/site-packages/sklearn/feature_selection/_mutual_info.py:388, in mutual_info_regression(X, y, discrete_features, n_neighbors, copy, random_state)
312 def mutual_info_regression(
313 X, y, *, discrete_features="auto", n_neighbors=3, copy=True, random_state=None
314 ):
315 """Estimate mutual information for a continuous target variable.
316
317 Mutual information (MI) [1]_ between two random variables is a non-negative
(...)
386 of a Random Vector", Probl. Peredachi Inf., 23:2 (1987), 9-16
387 """
--> 388 return _estimate_mi(X, y, discrete_features, False, n_neighbors, copy, random_state)
File /opt/conda/lib/python3.10/site-packages/sklearn/feature_selection/_mutual_info.py:304, in _estimate_mi(X, y, discrete_features, discrete_target, n_neighbors, copy, random_state)
297 y = scale(y, with_mean=False)
298 y += (
299 1e-10
300 * np.maximum(1, np.mean(np.abs(y)))
301 * rng.standard_normal(size=n_samples)
302 )
--> 304 mi = [
305 _compute_mi(x, y, discrete_feature, discrete_target, n_neighbors)
306 for x, discrete_feature in zip(_iterate_columns(X), discrete_mask)
307 ]
309 return np.array(mi)
File /opt/conda/lib/python3.10/site-packages/sklearn/feature_selection/_mutual_info.py:305, in <listcomp>(.0)
297 y = scale(y, with_mean=False)
298 y += (
299 1e-10
300 * np.maximum(1, np.mean(np.abs(y)))
301 * rng.standard_normal(size=n_samples)
302 )
304 mi = [
--> 305 _compute_mi(x, y, discrete_feature, discrete_target, n_neighbors)
306 for x, discrete_feature in zip(_iterate_columns(X), discrete_mask)
307 ]
309 return np.array(mi)
File /opt/conda/lib/python3.10/site-packages/sklearn/feature_selection/_mutual_info.py:161, in _compute_mi(x, y, x_discrete, y_discrete, n_neighbors)
159 return mutual_info_score(x, y)
160 elif x_discrete and not y_discrete:
--> 161 return _compute_mi_cd(y, x, n_neighbors)
162 elif not x_discrete and y_discrete:
163 return _compute_mi_cd(x, y, n_neighbors)
File /opt/conda/lib/python3.10/site-packages/sklearn/feature_selection/_mutual_info.py:138, in _compute_mi_cd(c, d, n_neighbors)
135 c = c[mask]
136 radius = radius[mask]
--> 138 kd = KDTree(c)
139 m_all = kd.query_radius(c, radius, count_only=True, return_distance=False)
140 m_all = np.array(m_all)
File sklearn/neighbors/_binary_tree.pxi:833, in sklearn.neighbors._kd_tree.BinaryTree.__init__()
File /opt/conda/lib/python3.10/site-packages/sklearn/utils/validation.py:931, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
929 n_samples = _num_samples(array)
930 if n_samples < ensure_min_samples:
--> 931 raise ValueError(
932 "Found array with %d sample(s) (shape=%s) while a"
933 " minimum of %d is required%s."
934 % (n_samples, array.shape, ensure_min_samples, context)
935 )
937 if ensure_min_features > 0 and array.ndim == 2:
938 n_features = array.shape[1]
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.
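While debugging I was able to reproduce the exact same ValueError in isolation, which may help: if a continuous (scaled) column gets flagged as discrete — which is what I suspect happens when the ColumnTransformer reorders the columns out from under my positional indices — then every "category" occurs exactly once, all rows get masked out, and KDTree receives an empty array. A minimal sketch with random toy data (nothing from my real dataset):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(4)
X_cont = rng.normal(size=(50, 1))   # continuous feature: every value is unique
y = rng.normal(size=50)

# Correctly treated as continuous: no problem
print(mutual_info_regression(X_cont, y, discrete_features=[False], random_state=4))

# Wrongly flagged as discrete: no value repeats, so every row is masked out
# and KDTree is handed an empty array -> the same ValueError as above
try:
    mutual_info_regression(X_cont, y, discrete_features=[True], random_state=4)
except ValueError as err:
    print(err)
```

So it looks like the error itself is a symptom of continuous columns being marked discrete, rather than anything wrong with the discrete columns themselves.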