I'm trying to use sklearn to build a custom Pipeline for a school project that uses ML to analyze text. I added logging to my custom Transformers and have been stuck for a week, if not more, on the following error:
ValueError: X has 7930 features, but SelectKBest is expecting 25050 features as input.
Essentially, my process is the following:
Gather features before preprocessing (when I need the words and punctuation to be present and unchanged for some feature extraction)
Apply preprocessing
Gather features after preprocessing
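In sklearn terms, the layout looks roughly like the sketch below. The step names match my pipeline, but everything inside them is a stand-in chosen for illustration only (a text-length feature, lowercasing, CountVectorizer, k=5, LinearSVC and the toy texts are assumptions, not my real transformers or data):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC


def text_length(texts):
    # Stand-in "before preprocessing" feature: one column with the raw text length.
    return np.array([[len(text)] for text in texts])


def lowercase(texts):
    # Stand-in preprocessing step.
    return [text.lower() for text in texts]


pipeline = Pipeline([
    ("featureExtractionUnion", FeatureUnion([
        # Branch 1: features computed on the raw, untouched text.
        ("featureExtractionBeforePreprocessing", FunctionTransformer(text_length)),
        # Branch 2: preprocess first, then extract features.
        ("afterPreprocessingPipeline", Pipeline([
            ("preprocessing", FunctionTransformer(lowercase)),
            ("featureExtractionAfterPreprocessing", CountVectorizer()),
        ])),
    ])),
    ("featureSelection", SelectKBest(chi2, k=5)),
    ("classifier", LinearSVC()),
])

train_texts = ["The cat sat.", "The dog barked!", "A cat and a dog.", "Birds sing loudly!"]
train_labels = [0, 1, 0, 1]
validation_texts = ["The bird sat.", "A dog!"]

pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(validation_texts)
print(predictions.shape)
```

With these stand-ins, fitting on the training texts and predicting on the validation texts goes through without a feature-count mismatch, which is the behaviour I am trying to get from my own transformers.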
Example transformers:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class FeatureExtractorBeforePreprocessing(BaseEstimator, TransformerMixin):
    """
    Extracts all features combined before preprocessing.
    """

    def __init__(self, stopWords=False, errorDetector=False, punctuationFrequency=False, sentenceLength=False):
        """
        Initialize the FeatureExtractorBeforePreprocessing.

        Args:
            stopWords (bool): Whether to include stop words as a feature.
            errorDetector (bool): Whether to include error detection as a feature.
            punctuationFrequency (bool): Whether to include punctuation frequency as a feature.
            sentenceLength (bool): Whether to include sentence length as a feature.
        """
        self.stopWords = stopWords
        self.errorDetector = errorDetector
        self.punctuationFrequency = punctuationFrequency
        self.sentenceLength = sentenceLength
        self.feature_union = []
        self.combined_transformers = None

        # Add transformers based on selected options
        if self.stopWords:
            self.feature_union.append(("stopWords", StopWords()))
        if self.errorDetector:
            self.feature_union.append(("errorDetector", ErrorDetector()))
        if self.punctuationFrequency:
            self.feature_union.append(("punctuationFrequency", PunctuationFrequency()))
        if self.sentenceLength:
            self.feature_union.append(("sentenceLength", SentenceLength()))

        if self.feature_union:
            self.combined_transformers = FeatureUnion(self.feature_union)

    def fit(self, X, y=None):
        # Fit the combined transformers if they exist
        if self.combined_transformers:
            self.combined_transformers.fit(X)
        return self

    def transform(self, X):
        if self.feature_union:
            logger.info("Extracting features before preprocessing..")
            combined_features = self.combined_transformers.transform(X)
            logger.info(f"Shape of combined features before preprocessing: {combined_features.shape}")
            return combined_features
        else:
            # No feature selected: return an array with one row per text and zero columns
            X_array = np.empty((len(X), 0))
            logger.info("Nothing to extract before preprocessing..")
            logger.info(f"Shape of combined features before preprocessing: {X_array.shape}")
            return X_array

class FeatureExtractorAfterPreprocessing(BaseEstimator, TransformerMixin):
    def __init__(self, config=config, textWordCounter=False, wordLength=False, vocabularySize=False):
        self.config = config
        self.textWordCounter = textWordCounter
        self.wordLength = wordLength
        self.vocabularySize = vocabularySize
        self.feature_union = []
        self.combined_transformers = None

        # NOTE: wordLength is accepted, but no transformer is registered for it here.
        if self.textWordCounter:
            self.feature_union.append((
                "textWordCounter",
                TextWordCounter(
                    self.config.getboolean("TextWordCounter", "freqDist"),
                    self.config.getboolean("TextWordCounter", "bigrams"),
                ),
            ))
        if self.vocabularySize:
            self.feature_union.append(("vocabularySize", VocabularySize()))

        # Only build the FeatureUnion when at least one transformer is selected
        if self.feature_union:
            self.combined_transformers = FeatureUnion(self.feature_union)

    def fit(self, X, y=None):
        # Fit the combined transformers if they exist
        if self.combined_transformers:
            self.combined_transformers.fit(X)
        return self

    def transform(self, X):
        if self.feature_union:
            logger.info("Extracting features after preprocessing..")
            combined_features = self.combined_transformers.transform(X)
            logger.info(f"Shape of combined features after preprocessing: {combined_features.shape}")
            return combined_features
        else:
            # No feature selected: return an array with one row per text and zero columns
            X_array = np.empty((len(X), 0))
            logger.info("Nothing to extract after preprocessing..")
            logger.info(f"Shape of combined features after preprocessing: {X_array.shape}")
            return X_array

Here is the log output:
2024-03-21 20:31:55,868 [INFO] Creating custom pipeline...
2024-03-21 20:31:55,868 [INFO] pipeline config: {'stopWords': False, 'errorDetector': False, 'punctuationFrequency': False, 'sentenceLength': False, 'textWordCounter': True, 'wordLength': False, 'vocabularySize': False, 'featureSelector': SelectKBest(k=10000, score_func=<function chi2 at 0x000001ACCFDC4040>)}
2024-03-21 20:31:55,868 [INFO] Creating pipeline...
2024-03-21 20:31:55,870 [INFO] Pipeline config: [('featureExtractionUnion', FeatureUnion(transformer_list=[('featureExtractionBeforePreprocessing',
FeatureExtractorBeforePreprocessing()),
('afterPreprocessingPipeline',
Pipeline(steps=[('preprocessing',
Preprocessing()),
('featureExtractionAfterPreprocessing',
FeatureExtractorAfterPreprocessing(textWordCounter=True))]))])), ('featureSelection', SelectKBest(k=10000, score_func=<function chi2 at 0x000001ACCFDC4040>))]
2024-03-21 20:31:55,870 [INFO] Pipeline created.
2024-03-21 20:31:55,870 [INFO] Selected classifier: svm
2024-03-21 20:31:55,870 [INFO] Custom pipeline created.
2024-03-21 20:31:55,870 [INFO] UserConfigPipeline initialized.
2024-03-21 20:31:55,873 [INFO] Training custom model...
2024-03-21 20:31:55,873 [INFO] Nothing to extract before preprocessing..
2024-03-21 20:31:55,873 [INFO] Shape of combined features before preprocessing: (334, 0)
2024-03-21 20:31:55,873 [INFO] Preprocessing..
2024-03-21 20:31:55,874 [INFO] No preprocessing applied..
2024-03-21 20:31:55,874 [INFO] Returned list X of 334 texts.
2024-03-21 20:31:55,874 [INFO] Extracting features after preprocessing..
2024-03-21 20:31:55,960 [INFO] Extracting freqDict features..
2024-03-21 20:31:55,990 [INFO] Shape of freqDict features: (334, 25050)
2024-03-21 20:31:55,991 [INFO] Shape of combined features after preprocessing: (334, 25050)
2024-03-21 20:31:56,166 [INFO] Successfully trained custom model...
2024-03-21 20:31:56,166 [INFO] Validating model...
2024-03-21 20:31:56,166 [INFO] Predicting evaluation set...
2024-03-21 20:31:56,166 [INFO] Nothing to extract before preprocessing..
2024-03-21 20:31:56,166 [INFO] Shape of combined features before preprocessing: (72, 0)
2024-03-21 20:31:56,166 [INFO] Preprocessing..
2024-03-21 20:31:56,166 [INFO] No preprocessing applied..
2024-03-21 20:31:56,166 [INFO] Returned list X of 72 texts.
2024-03-21 20:31:56,166 [INFO] Extracting features after preprocessing..
2024-03-21 20:31:56,185 [INFO] Extracting freqDict features..
2024-03-21 20:31:56,195 [INFO] Shape of freqDict features: (72, 7930)
2024-03-21 20:31:56,196 [INFO] Shape of combined features after preprocessing: (72, 7930)
As I understand it, it is perfectly normal for the validation set to contain fewer distinct features than the training set, since it is a smaller portion of the whole dataset (I split the data 70% training, 15% validation, 15% testing). I apply the same pipeline to both sets: I fit it during training and then call .predict during validation. Does anyone have a hint about what could be causing this issue?
My understanding was that the pipeline would handle this automatically: when it receives data that does not contain every feature seen during training, it should set the missing ones to 0 or some "absent" value. Am I missing something?
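To make the expectation concrete, here is a minimal sketch of the behaviour I had in mind. FixedVocabularyCounter is a hypothetical stand-in written only for this example (it is not my actual TextWordCounter, and the toy texts are made up): the vocabulary is learned once in fit, and transform reuses it, so the number of columns stays the same whichever set is passed in.

```python
from collections import Counter


class FixedVocabularyCounter:
    """Hypothetical stand-in for a word-count transformer (not my TextWordCounter)."""

    def fit(self, texts, y=None):
        # Learn the vocabulary once, from the training texts only.
        words = sorted({word for text in texts for word in text.split()})
        self.vocabulary_ = {word: column for column, word in enumerate(words)}
        return self

    def transform(self, texts):
        # Reuse the vocabulary learned in fit(): one column per known word.
        rows = []
        for text in texts:
            counts = Counter(text.split())
            row = [0] * len(self.vocabulary_)
            for word, count in counts.items():
                column = self.vocabulary_.get(word)
                if column is not None:  # words unseen during fit are ignored
                    row[column] = count
            rows.append(row)
        return rows


train_texts = ["the cat sat", "the dog barked", "a cat and a dog"]
validation_texts = ["the bird sat", "a dog"]

counter = FixedVocabularyCounter().fit(train_texts)
train_rows = counter.transform(train_texts)
validation_rows = counter.transform(validation_texts)

# Both sets end up with the same number of columns.
print(len(train_rows[0]), len(validation_rows[0]))  # prints: 7 7
```

In this sketch, words that only appear in the validation texts ("bird") are dropped, and training words that are missing from a validation text get a count of 0.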
Thanks for your time and hopefully someone can lend me a hand with this :)