I'm trying to train a MultinomialNB classifier, but I get ValueError: dimension mismatch.
I'm using DictVectorizer for the training data and LabelEncoder for the class labels.
This is my code:
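from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB as mnb
from sklearn.preprocessing import LabelEncoder

def create_token(inpt):
    # Split the text into tokens on single spaces
    return inpt.split(' ')

def tok_freq(inpt):
    # Count how often each token occurs in the text
    tok = {}
    for i in create_token(inpt):
        if i not in tok:
            tok[i] = 1
        else:
            tok[i] += 1
    return tok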
training_data = []
for i in range(len(raw_data)):
    training_data.append((tok_freq(raw_data.iloc[i].text), raw_data.iloc[i].category))
# vectorization
X, y = list(zip(*training_data))
label = LabelEncoder()
vector = DictVectorizer(dtype=float, sparse=True)
X = vector.fit_transform(X)
y = label.fit_transform(y)
multinb = mnb()
multinb.fit(X, y)

# vectorization for testing set
Xz = tok_freq(sms)
testX = vector.fit_transform(Xz)
multinb.predict(testX)
Which part of my code is wrong? Thanks.
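Change
testX = vector.fit_transform(Xz)
to:
testX = vector.transform(Xz)
When you call fit() or fit_transform(), you are training the vectorizer on the new data, which is not what you want. It rebuilds the feature vocabulary from the test sample alone, so the resulting matrix has a different number of columns than the one the classifier was fitted on, hence the dimension mismatch. You only want to convert the test set in the same way as the training set, so call only transform().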
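To make the difference concrete, here is a minimal sketch of the fit-on-train, transform-on-test pattern. The toy texts, labels, and the sms string are made up for illustration and stand in for raw_data and sms from the question; the test sample is also wrapped in a list, which is an assumption made here so that the vectorizer always receives a sequence of dicts.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for raw_data (not from the question).
train_texts = ["free prize now", "call me later", "win a free prize", "see you later"]
train_labels = ["spam", "ham", "spam", "ham"]


def tok_freq(text):
    """Return a dict mapping each space-separated token to its count."""
    counts = {}
    for tok in text.split(' '):
        counts[tok] = counts.get(tok, 0) + 1
    return counts


# Fit the vectorizer and the label encoder on the training data only.
vector = DictVectorizer(dtype=float, sparse=True)
label = LabelEncoder()
X = vector.fit_transform([tok_freq(t) for t in train_texts])
y = label.fit_transform(train_labels)

multinb = MultinomialNB()
multinb.fit(X, y)

# Hypothetical test message; "inside" never appears in the training data.
sms = "free prize inside"

# transform() reuses the training vocabulary, so the column count matches X.
testX = vector.transform([tok_freq(sms)])
predicted = label.inverse_transform(multinb.predict(testX))

# For comparison: refitting on the test sample alone yields a different width,
# which is the kind of matrix that triggers the dimension mismatch.
refit_width = DictVectorizer().fit_transform([tok_freq(sms)]).shape[1]

print("train columns:", X.shape[1])
print("test columns (transform):", testX.shape[1])
print("test columns (refit):", refit_width)
print("prediction:", predicted[0])
```

Tokens that were not seen during fitting, such as "inside" here, are simply dropped by transform(), which is why the test matrix keeps the same columns as the training matrix.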