Incompatible dimension error when passing a single data row to TfidfVectorizer


I am trying to implement tf-idf and use KNN to predict a class based on text. I have a dataset of 500 rows, split 450/50 for training and testing.

During training, I fit the vectorizer on the training data, transform that data, and save the fitted vectorizer to reuse at test time.

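import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

word_vectorizer = TfidfVectorizer(
    stop_words='english',
    strip_accents='unicode',
    token_pattern=r'\w{1,}',
    analyzer='word',
    ngram_range=(1, 1))

# Learn the vocabulary and IDF weights from the training text only, then save the fitted vectorizer.
word_vectorizer.fit(X_train)
vfilename = 'vectorizer.joblib'
joblib.dump(word_vectorizer, vfilename)

# Vectorize the training text with the fitted vocabulary.
_X_train = word_vectorizer.transform(X_train)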

Similarly, I fit the classifier and saved it:


classifier_algo.fit(_X_train, y_train)
filename = f'updated_{classifier_algo}_{class_name}_model.joblib'
joblib.dump(classifier_algo, filename)  # returns the written file names, not the model

But during testing, when I load and run the two,

word_vectorizer = joblib.load(vfilename)
_X_test = word_vectorizer.transform(X_test)

model_hello = joblib.load(filename)
y_pred = model_hello.predict(_X_test)

I end up getting this error:

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 289402 while Y.shape[1] == 303846
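
The two numbers in the message are the column counts of the two matrices being compared. One way to narrow this down is to compare the loaded vectorizer's vocabulary size with the feature count the loaded model was fitted on, before calling predict. This is only a minimal sketch under assumptions: the toy corpus, the KNeighborsClassifier, and the helper name feature_counts are made up for illustration and are not the real data or code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier


def feature_counts(vectorizer, model):
    """Return (columns the vectorizer produces, columns the model was fitted on)."""
    return len(vectorizer.vocabulary_), model.n_features_in_


# Toy stand-ins for the real training data (assumption).
train_texts = ["red apple pie", "green apple tart",
               "blue cheese plate", "aged cheese board"]
train_labels = ["fruit", "fruit", "dairy", "dairy"]

vectorizer = TfidfVectorizer().fit(train_texts)
model = KNeighborsClassifier(n_neighbors=1).fit(
    vectorizer.transform(train_texts), train_labels)

# A vectorizer fitted on different text, standing in for a vectorizer
# file that does not belong to the loaded model (hypothetical scenario).
other_vectorizer = TfidfVectorizer().fit(["completely different words here"])

matching = feature_counts(vectorizer, model)
mismatched = feature_counts(other_vectorizer, model)
print(matching, mismatched)
```

If the two counts differ for the real files, the vectorizer and the model on disk come from different fits, for example because one of the files was overwritten by a later run.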

I have looked everywhere and found nothing except advice not to call fit on the test data, but I have not done that.
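
To illustrate the single-row case from the title, here is a minimal sketch on toy sentences (an assumption, not the real dataset). It shows that calling transform on one row with an already fitted vectorizer keeps the training column count, while fitting a fresh vectorizer on that one row does not.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the real X_train (assumption).
train_texts = ["the cat sat on the mat",
               "the dog chased the cat",
               "birds fly south in winter"]
single_row = ["the cat chased birds"]

vectorizer = TfidfVectorizer().fit(train_texts)
train_matrix = vectorizer.transform(train_texts)

# transform() reuses the fitted vocabulary, so one row still has
# every training column.
row_ok = vectorizer.transform(single_row)

# Fitting a new vectorizer on the single row builds a new, smaller
# vocabulary, which is the mistake people usually warn about.
row_refit = TfidfVectorizer().fit_transform(single_row)

print(train_matrix.shape, row_ok.shape, row_refit.shape)
```

So a single input row is fine on its own, as long as the vectorizer that transforms it is the same fitted object that produced the training matrix for the model.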
