Good morning,
I am trying to select features with RFECV from sklearn.feature_selection, and I am puzzled by the plot of number of features vs. CV score. The curve goes up and down at almost every step, which makes the results hard to trust. The reported optimal number of features is 5, out of 92 candidates backed by the scientific literature on my topic.
The RFECV code is below. Note that I use an XGBoost classifier, the score to optimize is neg_log_loss, and randomCV_clf is a custom CV splitter yielding train/validation indices for 5 folds (tested elsewhere and working fine; the interface it implements is sketched after the classifier definition).
import xgboost as xgb

xgboost_clf = xgb.XGBClassifier(
    random_state=57,
    grow_policy="depthwise",
    booster="gbtree",
    tree_method="auto",
)
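For context, randomCV_clf follows the standard sklearn splitter protocol: RFECV accepts any object exposing get_n_splits() and a split() method that yields (train_indices, validation_indices) pairs. A hypothetical minimal version, not my actual implementation, might look like this:

import numpy as np

class RandomCVSplitter:
    # Hypothetical stand-in for randomCV_clf: any object with this
    # interface works as the cv= argument of RFECV.
    def __init__(self, n_splits=5, random_state=57):
        self.n_splits = n_splits
        self.random_state = random_state

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        # Shuffle all row indices once, then cut them into n_splits folds.
        rng = np.random.default_rng(self.random_state)
        indices = rng.permutation(len(X))
        for fold in np.array_split(indices, self.n_splits):
            train = np.setdiff1d(indices, fold)  # everything not in the fold
            yield train, fold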
from sklearn.feature_selection import RFECV

step = 1
min_features_to_select = 1
rfecv = RFECV(
    estimator=xgboost_clf,
    step=step,  # drop 1 feature per elimination round
    cv=randomCV_clf,
    scoring="neg_log_loss",
    min_features_to_select=min_features_to_select,
    n_jobs=-1,
)
rfecv.fit(preprocessed_X_train_full, y_train_full)
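After fitting, it is easy to sanity-check what was selected and how much the folds disagree at each feature count; a short sketch using the standard RFECV attributes (n_features_, support_, ranking_, cv_results_):

print("Optimal number of features:", rfecv.n_features_)

selected_mask = rfecv.support_   # boolean mask over the original columns
feature_ranks = rfecv.ranking_   # 1 = selected; higher = eliminated earlier

# Mean and spread of the CV score at each feature count (step=1 here,
# so entry k corresponds to min_features_to_select + k features).
mean = rfecv.cv_results_["mean_test_score"]
std = rfecv.cv_results_["std_test_score"]
for k, (m, s) in enumerate(zip(mean, std)):
    print(f"{min_features_to_select + k:3d} features: {m:.4f} +/- {s:.4f}")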
The plot code is:
import matplotlib.pyplot as plt

n_scores = len(rfecv.cv_results_["mean_test_score"])
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Mean test neg_log_loss")
plt.errorbar(
    # With step=1, score k corresponds to min_features_to_select + k features.
    range(min_features_to_select, n_scores + min_features_to_select),
    rfecv.cv_results_["mean_test_score"],
    # yerr=rfecv.cv_results_["std_test_score"],  # error bars
)
plt.title("Recursive Feature Elimination\nwith correlated features")
plt.show()
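One way to decide whether the up-and-down pattern matters is to uncomment the yerr line above and compare the wiggles against the fold-to-fold spread, for example with a one-standard-error style rule: take the smallest feature count whose mean score is within one standard deviation of the best. A minimal sketch, assuming step=1 as above:

import numpy as np

mean = np.asarray(rfecv.cv_results_["mean_test_score"])
std = np.asarray(rfecv.cv_results_["std_test_score"])

best = int(mean.argmax())            # neg_log_loss: closer to 0 is better
threshold = mean[best] - std[best]

# Smallest feature count whose mean score clears the threshold.
candidates = np.nonzero(mean >= threshold)[0]
parsimonious_n = min_features_to_select + int(candidates.min())
print(f"Best mean score at {min_features_to_select + best} features; "
      f"smallest count within 1 std: {parsimonious_n}")

If many feature counts land within one standard deviation of the best, the step-to-step wiggles are probably fold noise rather than real structure.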
