I'm using RFECV in a pipeline to reduce the number of features. It all works, so to streamline the process I've started iterating over a list of different pipelines to evaluate various approaches. The problem arises on the SECOND attempt to fit a model. Each individual model fits fine when I start a new session, but when I try to fit a second model in the same session (or the same model a second time in the same session), I get the following error:
----> 1 rfe_logit.fit(X_rfe, y_rfe)
~/anaconda3/lib/python3.9/site-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, groups)
730 scores = np.array(scores)
731 scores_sum = np.sum(scores, axis=0)
--> 732 scores_sum_rev = scores_sum[::-1]
733 argmax_idx = len(scores_sum) - np.argmax(scores_sum_rev) - 1
734 n_features_to_select = max(
IndexError: invalid index to scalar variable.
I've inspected the source code and it looks like it's something to do with how the scores are stored in the instance, but I've no idea how to take this further. I would be very grateful for any insight the community could share.
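For reference, the same kind of error can be reproduced with NumPy alone. This is only a sketch of one possible mechanism: it assumes (and I have not confirmed) that the list of per-fold scores is empty by the time it reaches the lines in the traceback, so that the sum collapses to a scalar that cannot be sliced.

```python
import numpy as np

# Assumption: the per-fold score list arrives empty.
scores = np.array([])                 # shape (0,)
scores_sum = np.sum(scores, axis=0)   # collapses to a zero-dimensional NumPy scalar

try:
    scores_sum[::-1]                  # same slicing as line 732 in the traceback
    message = None
except IndexError as err:
    message = str(err)

print(np.ndim(scores_sum), message)
```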
Please find below my MWE (other models use identical construction, and as mentioned above, they all work individually if they are called first):
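import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import RFECV, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, StandardScaler

# X, y, med_log_transform and ex_vivo_metadata are not shown here.
kbest = SelectKBest(k=50)
X_rfe = pd.DataFrame(
    kbest.fit_transform(X, y),
    columns=kbest.get_feature_names_out()
)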
binarise = LabelEncoder()
y_rfe = binarise.fit(y)
y_rfe.classes_ = np.array(['No tumour', 'Tumour'])
y_rfe = binarise.fit_transform(y)
class RFEpipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_
lda = RFEpipeline([
    ('med_log_transformer', FunctionTransformer(
        med_log_transform,
        feature_names_out='one-to-one'
    )),
    ('standardscaler', StandardScaler()),
    ('lineardiscriminantanalysis', LinearDiscriminantAnalysis())
])
logit = RFEpipeline([
    ('med_log_transformer', FunctionTransformer(
        med_log_transform,
        feature_names_out='one-to-one'
    )),
    ('standardscaler', StandardScaler()),
    ('logisticregression', LogisticRegression())
])
logocv = LeaveOneGroupOut()
groups = ex_vivo_metadata.patient_number
cv = logocv.split(X_rfe, y_rfe, groups=groups)
common_params = {
    'n_jobs': 8,
    'cv': cv,
    'verbose': 1,
    'step': 3,
    'min_features_to_select': 10,
    'scoring': 'f1',
}
rfe_lda = RFECV(lda, **common_params)
rfe_logit = RFECV(logit, **common_params)
rfe_lda.fit(X_rfe, y_rfe)
rfe_logit.fit(X_rfe, y_rfe)
To reiterate, the second fit fails, but the first works no matter which it is. Many thanks for your thoughts.
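One thing that might be relevant to the "first fit works, second fails" pattern: `LeaveOneGroupOut.split` returns a generator, and a generator can only be iterated once. The sketch below uses small made-up arrays (`X_toy`, `y_toy` and `groups_toy` are hypothetical stand-ins for my real data) to show that behaviour on its own. Whether this is what actually happens inside RFECV when the same `cv` object is shared between two instances is an assumption on my part.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical toy stand-ins for X_rfe, y_rfe and groups.
X_toy = np.arange(12).reshape(6, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1])
groups_toy = np.array([1, 1, 2, 2, 3, 3])

logocv = LeaveOneGroupOut()

# split() returns a generator, which can only be consumed once.
cv_gen = logocv.split(X_toy, y_toy, groups=groups_toy)
first_pass = list(cv_gen)    # one (train, test) pair per group
second_pass = list(cv_gen)   # empty: the generator is already exhausted

# Materialising the splits as a list makes them reusable.
cv_list = list(logocv.split(X_toy, y_toy, groups=groups_toy))
reuse_a = list(cv_list)
reuse_b = list(cv_list)

print(len(first_pass), len(second_pass), len(reuse_a), len(reuse_b))
# → 3 0 3 3
```

If this does turn out to be the cause, two things that might be worth trying (neither of which I have verified against my real data) are building `cv` as a list as above, or passing the splitter itself as `cv=logocv` and supplying `groups` to `fit`, whose signature in the traceback is `fit(self, X, y, groups)`.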
(NB, the custom RFEpipeline class comes from this SO question, and works great. Thanks to Vivek Kumar.)