I am attempting to write a custom classifier for use in an sklearn GridSearchCV pipeline.
I've stripped the class back to the bare minimum, which currently looks like this:
from sklearn.base import BaseEstimator, ClassifierMixin
import pandas as pd

class DifferentialMethylation(BaseEstimator, ClassifierMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return self
In my main code, I have this:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(df, cancerType, test_size=0.2, random_state=42)
differentialMethylation = DifferentialMethylation()
featureSelection = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
randomForest = RandomForestClassifier(random_state=42)
# Create the pipeline with feature selection and model refinement
pipeline = Pipeline([
    ('differentialMethylation', differentialMethylation),
    ('featureSelection', featureSelection),
    ('modelRefinement', randomForest)
])
search = GridSearchCV(pipeline,
                      param_grid=parameterGrid,
                      scoring='accuracy',
                      cv=5,
                      verbose=0,
                      n_jobs=-1,
                      pre_dispatch='2*n_jobs')
search.fit(X_train, y_train)
If I remove the custom classifier from the pipeline, so that the pipeline looks like this:
pipeline = Pipeline([
    ('featureSelection', featureSelection),
    ('modelRefinement', randomForest)
])
it runs happily. If I add that line back in, I get:
ValueError: Expected 2D array, got scalar array instead:
array=DifferentialMethylation().
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
X_train is a two-dimensional DataFrame: X_train.shape is (679, 369) and y_train.shape is (679,). As best I can tell, the stripped-back classifier's .fit() method should act as a pass-through, leaving the data unchanged, so I have no idea why the output of train_test_split is accepted by the RFE in the featureSelection step but not by differentialMethylation.
Unless there's some obscure piece of lore in the sklearn documentation about transforming input data for custom classifiers that I've missed.
Thoughts as to what's going on would be appreciated.
In the documentation, in the very obvious section:
Developer API for set_output
I see that the transform method shouldn't return self; it should return X.
So it wasn't that the custom classifier wouldn't accept the DataFrame; it's that when the pipeline tried to feed its output into the next stage, returning self meant the next step received the estimator object instead of the data, which produced the ValueError above.
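To make that concrete, Pipeline.fit effectively does something like the following between steps (a simplified sketch of the behaviour, not the actual sklearn source):

Xt = X_train
for name, step in pipeline.steps[:-1]:
    # Whatever transform returns becomes the input to the next step
    Xt = step.fit(Xt, y_train).transform(Xt)
# The final estimator is fitted on the output of the last transform
pipeline.steps[-1][1].fit(Xt, y_train)

With transform returning self, Xt became the DifferentialMethylation instance itself, which is exactly what RFE then complained about (hence array=DifferentialMethylation() in the traceback).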
In retrospect, I can see how that error makes sense, but it was not an easy path to understanding.
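For anyone else who lands here, the fix itself is a one-line change. A minimal sketch of the corrected class, keeping the same skeleton as above:

from sklearn.base import BaseEstimator, ClassifierMixin

class DifferentialMethylation(BaseEstimator, ClassifierMixin):
    def fit(self, X, y=None):
        # Nothing is learned in this stripped-back version,
        # but fit is still expected to return self.
        return self

    def transform(self, X):
        # Return the (currently unchanged) data, not self, so the next
        # pipeline step receives a 2D array/DataFrame.
        return X

Since this step is used as a transformer rather than as the final estimator, subclassing TransformerMixin instead of ClassifierMixin would also give it fit_transform for free, but that isn't what was causing the error here.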