I am trying to implement TF-IDF and use KNN to predict a class from text. I have a dataset of 500 rows, split 450/50 for training and testing.
During training, I fit the vectorizer on the training data, transformed the training data with it, and saved the vectorizer to reuse at test time:
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

word_vectorizer = TfidfVectorizer(
    stop_words='english',
    strip_accents='unicode',
    token_pattern=r'\w{1,}',
    analyzer='word',
    ngram_range=(1, 1))
word_vectorizer.fit(X_train)
vfilename = 'vectorizer.joblib'
joblib.dump(word_vectorizer, vfilename)
_X_train = word_vectorizer.transform(X_train)
Similarly, I saved the fitted classifier:
classifier_algo.fit(_X_train, y_train)
filename = f'updated_{classifier_algo}_{class_name}_model.joblib'
model_hello = joblib.dump(classifier_algo, filename)
But at test time, when I load both and run them:
word_vectorizer = joblib.load(vfilename)
_X_test = word_vectorizer.transform(X_test)
model_hello = joblib.load(filename)
y_pred = model_hello.predict(_X_test)
I end up getting this error:
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 289402 while Y.shape[1] == 303846
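The two numbers in the error are column counts, so one way to narrow this down might be to compare how many features the loaded vectorizer produces with how many the loaded model was fitted on. The following is only a sketch on toy data: `feature_counts` is a hypothetical helper, the documents and labels are made up, and it assumes scikit-learn 0.24 or later, where fitted estimators expose `n_features_in_`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier


def feature_counts(vectorizer, model):
    """Return (columns the vectorizer produces, columns the model was fitted on)."""
    return len(vectorizer.vocabulary_), model.n_features_in_


# Toy data standing in for the real training sets (hypothetical).
docs_a = ["red apple", "green apple", "blue sky", "grey sky"]
docs_b = docs_a + ["yellow banana", "purple grape"]
labels_a = ["fruit", "fruit", "sky", "sky"]

vec_a = TfidfVectorizer().fit(docs_a)
vec_b = TfidfVectorizer().fit(docs_b)  # fitted on different text than the model saw

# The model is fitted on features produced by vec_a only.
knn = KNeighborsClassifier(n_neighbors=1).fit(vec_a.transform(docs_a), labels_a)

matched = feature_counts(vec_a, knn)      # vectorizer and model from the same fit
mismatched = feature_counts(vec_b, knn)   # vectorizer from a different fit
print(matched, mismatched)
```

If the two counts differ for the objects loaded from `vfilename` and `filename`, the vectorizer and the model probably did not come from the same training run, for example because one of the files was overwritten by a later fit.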
I have looked everywhere and found nothing except people advising against calling fit on the test data, but I have not done that.
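For reference, one arrangement that keeps the vectorizer and the classifier from drifting apart is to fit and save them as a single scikit-learn `Pipeline`. This is a minimal sketch, not my actual code: the toy data, `n_neighbors=1`, and the file name `tfidf_knn_pipeline.joblib` are assumptions made for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy data standing in for the real 450/50 split (hypothetical).
X_train = ["red apple", "green apple", "blue sky", "grey sky"]
y_train = ["fruit", "fruit", "sky", "sky"]
X_test = ["ripe apple", "cloudy sky"]

# One object holds both steps, so they are always fitted and saved together.
pipeline = make_pipeline(
    TfidfVectorizer(
        stop_words='english',
        strip_accents='unicode',
        token_pattern=r'\w{1,}'),
    KNeighborsClassifier(n_neighbors=1),
)
pipeline.fit(X_train, y_train)

# Save to a temporary directory and load it back, as would happen at test time.
path = os.path.join(tempfile.mkdtemp(), 'tfidf_knn_pipeline.joblib')
joblib.dump(pipeline, path)

restored = joblib.load(path)
y_pred = restored.predict(X_test)  # raw text in, labels out
print(y_pred)
```

Because the restored pipeline applies its own fitted vectorizer before the classifier, the feature count seen by `predict` always matches the one seen by `fit`.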