Why does fitting an RFECV instance fail on the *second* attempt?

I'm using RFECV in a pipeline to reduce the number of features. It all works, so to streamline the process I've started iterating over a list of different pipelines to evaluate various approaches. The problem arises on the SECOND attempt to fit a model. Each individual model fits fine when I start a new session, but when I try to fit a second model in the same session (or the same model a second time), I get the following error:

----> 1 rfe_logit.fit(X_rfe, y_rfe)

~/anaconda3/lib/python3.9/site-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, groups)
    730         scores = np.array(scores)
    731         scores_sum = np.sum(scores, axis=0)
--> 732         scores_sum_rev = scores_sum[::-1]
    733         argmax_idx = len(scores_sum) - np.argmax(scores_sum_rev) - 1
    734         n_features_to_select = max(

IndexError: invalid index to scalar variable.

I've inspected the source code and it looks like it's something to do with how the scores are stored in the instance, but I've no idea how to take this further. I would be very grateful for any insight the community could share.

Please find below my MWE (other models use identical construction, and as mentioned above, they all work individually if they are called first):


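import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import RFECV, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, StandardScaler

# X, y, ex_vivo_metadata and med_log_transform are defined elsewhere.
kbest = SelectKBest(k=50)
X_rfe = pd.DataFrame(
    kbest.fit_transform(X, y),
    columns=kbest.get_feature_names_out()
    )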
binarise = LabelEncoder()

y_rfe = binarise.fit(y)
y_rfe.classes_ = np.array(['No tumour', 'Tumour'])
# Note: fit_transform() refits the encoder, so the classes_ set above are overwritten.
y_rfe = binarise.fit_transform(y)

class RFEpipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

lda = RFEpipeline([
    ('med_log_transformer', FunctionTransformer(
        med_log_transform,
        feature_names_out='one-to-one'
        )),
    ('standardscaler', StandardScaler()),
    ('lineardiscriminantanalysis', LinearDiscriminantAnalysis())
    ]
)

logit = RFEpipeline([
    ('med_log_transformer', FunctionTransformer(
        med_log_transform,
        feature_names_out='one-to-one'
        )),
    ('standardscaler', StandardScaler()),
    ('logisticregression', LogisticRegression())
    ]
)


logocv = LeaveOneGroupOut() 
groups = ex_vivo_metadata.patient_number
cv = logocv.split(X_rfe, y_rfe, groups=groups)


common_params = {
    'n_jobs': 8,
    'cv': cv,
    'verbose': 1,
    'step': 3,
    'min_features_to_select': 10,
    'scoring': 'f1',
}



rfe_lda = RFECV(lda, **common_params)
rfe_logit = RFECV(logit, **common_params)

rfe_lda.fit(X_rfe, y_rfe)
rfe_logit.fit(X_rfe, y_rfe)

To reiterate, the second fit fails, but the first works no matter which it is. Many thanks for your thoughts.

(NB, the custom RFEpipeline class comes from this SO question, and works great. Thanks to Vivek Kumar.)
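
One possible explanation, offered as an assumption rather than a confirmed diagnosis: `logocv.split(...)` returns a generator, and both RFECV instances receive that same object through `common_params`. A generator can only be iterated once, so whichever fit runs second may see no folds at all, which would leave the scores empty and could produce the scalar-indexing error above. The sketch below illustrates the single-use behaviour in plain Python. `leave_one_group_out` and `count_folds` are hypothetical stand-ins, not scikit-learn functions, and the group values are made up.

```python
from typing import Iterator, List, Sequence, Tuple

Split = Tuple[List[int], List[int]]


def leave_one_group_out(groups: Sequence[int]) -> Iterator[Split]:
    """Hypothetical stand-in for LeaveOneGroupOut().split(): yields
    (train_indices, test_indices) once per distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield train, test


def count_folds(cv) -> int:
    """Hypothetical stand-in for a fit() call: consumes every split it is given."""
    return sum(1 for _ in cv)


groups = [1, 1, 2, 2, 3, 3]  # made-up patient numbers

# A generator can be iterated only once.
cv_generator = leave_one_group_out(groups)
first_fit_folds = count_folds(cv_generator)   # sees all three folds
second_fit_folds = count_folds(cv_generator)  # sees none: generator is exhausted

# Materialising the splits as a list makes them reusable.
cv_list = list(leave_one_group_out(groups))
first_list_folds = count_folds(cv_list)
second_list_folds = count_folds(cv_list)

print(first_fit_folds, second_fit_folds)    # 3 0
print(first_list_folds, second_list_folds)  # 3 3
```

If this is indeed the cause, two options that should avoid it are storing the splits as a list, `cv = list(logocv.split(X_rfe, y_rfe, groups=groups))`, or passing the splitter itself (`'cv': logocv`) and supplying the groups at fit time, e.g. `rfe_lda.fit(X_rfe, y_rfe, groups=groups)`. The traceback above shows that `fit` accepts a `groups` argument.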
