Recursive Feature Elimination: plotting number of features vs. CV score


Good morning,

I am trying to select features with RFECV from sklearn.feature_selection, and I am puzzled by the plot of number of features vs. CV score. The curve goes up and down at almost every step, which makes the result hard to trust. The reported optimal number of features is 5 (out of 92), which is backed by the scientific literature on my topic.

[Plot: mean CV score vs. number of features selected]

RFECV code below. Note that I use an XGB classifier, the score to optimize is neg_log_loss, and randomCV_clf is a custom CV object returning train/validation indexes for 5 folds (tested elsewhere and working fine).
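For context, here is a minimal sketch of what a custom CV object like randomCV_clf could look like (the class name, split sizes, and seed below are illustrative assumptions, not my actual implementation). Any object exposing split() and get_n_splits() can be passed as RFECV's cv argument:

```python
import numpy as np

class RandomCV:
    """Illustrative custom CV: yields n_splits random train/validation index pairs."""
    def __init__(self, n_splits=5, val_fraction=0.2, random_state=57):
        self.n_splits = n_splits
        self.val_fraction = val_fraction
        self.random_state = random_state

    def split(self, X, y=None, groups=None):
        rng = np.random.default_rng(self.random_state)
        n = len(X)
        n_val = int(n * self.val_fraction)
        for _ in range(self.n_splits):
            idx = rng.permutation(n)          # shuffle all row indices
            yield idx[n_val:], idx[:n_val]    # (train indices, validation indices)

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits
```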

xgboost_clf = xgb.XGBClassifier(random_state = 57, 
                                grow_policy = "depthwise", 
                                booster = "gbtree",
                                tree_method = "auto",
                                )
step = 1
rfecv = RFECV(
    estimator=xgboost_clf,
    step=step,
    cv=randomCV_clf,
    scoring="neg_log_loss",
    min_features_to_select= 1,
    n_jobs=-1, 
)
rfecv.fit(preprocessed_X_train_full, y_train_full)
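After fitting, the selection can be inspected through RFECV's standard attributes (n_features_, support_, ranking_). A self-contained sketch on synthetic data, using LogisticRegression instead of XGBoost only to keep the example runnable here:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for preprocessed_X_train_full / y_train_full
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=5,
    scoring="neg_log_loss",
    min_features_to_select=1,
)
rfecv.fit(X, y)

print(rfecv.n_features_)   # optimal number of features found
print(rfecv.support_)      # boolean mask of selected features
print(rfecv.ranking_)      # 1 = selected; higher = eliminated earlier
```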

The plot code is:

import matplotlib.pyplot as plt

n_scores = len(rfecv.cv_results_["mean_test_score"])
min_features = rfecv.min_features_to_select  # 1 here
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Mean test neg_log_loss")
plt.errorbar(
    # x-axis starts at min_features_to_select, not 0, so scores line up
    # with the actual number of features kept at each step
    range(min_features, min_features + n_scores * step, step),
    rfecv.cv_results_["mean_test_score"],
    # yerr=rfecv.cv_results_["std_test_score"],  # error bars
)
plt.title("Recursive Feature Elimination")
plt.show()
