I get a
Feature names seen at fit time, yet now missing
error when predicting on X_test using only the subset of features selected by sklearn's SequentialFeatureSelector (SFS):
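from sklearn.feature_selection import SequentialFeatureSelector as SFS
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline

model_for_sfs = LogisticRegression(solver="saga")
model = LogisticRegression(solver="saga")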
pipeline_for_fs = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy="median")),
    ("model", model_for_sfs)])
n_splits = 2
cv_fs = StratifiedKFold(n_splits, shuffle=True, random_state=0)
cv_perf = StratifiedKFold(n_splits, shuffle=True, random_state=0)
# Feature selection
fs = SFS(
    estimator=pipeline_for_fs,
    n_features_to_select=2,
    cv=cv_fs,
    scoring='accuracy',
    n_jobs=-1
)
pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy="median")),
    ('selector', fs),
    ("model", model)])
pipeline.fit(X_train, y_train)
sfs = pipeline.named_steps["selector"]
features = sfs.get_support(indices=True)
# Predict using only the columns picked by the selector (this is the call that raises the error)
y_pred = pipeline.predict_proba(X_test.iloc[:, list(features)])[:, 1]
I thought sklearn's SFS would transform the dataset to keep only the features it has chosen. Is that not the case? Is there a way to make it do that?
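
For reference, here is a minimal sketch of the same setup that can be run end to end. Several things in it are assumptions and not part of my real code: the data is synthetic (from make_classification), the column names f0 to f5 are made up, and LogisticRegression uses the default solver with max_iter=1000 to keep the run short. As far as I can tell, the fitted pipeline applies the selector step itself inside predict_proba, so it expects X_test with all of the original columns; the reduced dataset can be obtained by running every step except the final model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector as SFS
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with made-up column names (an assumption, not my real data).
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

cv_fs = StratifiedKFold(2, shuffle=True, random_state=0)
fs = SFS(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    cv=cv_fs,
    scoring="accuracy",
)
pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("selector", fs),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# The selector is a pipeline step, so predict_proba applies it internally:
# pass the full X_test here, not a pre-selected subset of columns.
y_pred = pipeline.predict_proba(X_test)[:, 1]

# Names of the columns the selector kept, taken from its boolean support mask.
mask = pipeline.named_steps["selector"].get_support()
selected_names = list(X_train.columns[mask])

# Run every step except the final model to get the reduced test set.
X_test_selected = pipeline[:-1].transform(X_test)
```

In this sketch X_test_selected has one column per selected feature, and selected_names gives the matching column names from X_train.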