Python sklearn MultinomialNB: Dimension mismatch using DictVectorizer

216 Views Asked by jted95 At 24 April 2018 at 00:23

I'm trying to fit a MultinomialNB classifier, but I get ValueError: dimension mismatch.

I'm using DictVectorizer for the training data and LabelEncoder for the class labels.

This is my code:

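from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB as mnb
from sklearn.preprocessing import LabelEncoder

def create_token(inpt):
    return inpt.split(' ')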

def tok_freq(inpt):
    tok = {}
    for i in create_token(inpt):
        if i not in tok:
            tok[i] = 1
        else:
            tok[i] += 1
    return tok

# build (token-frequency dict, category) pairs
training_data = []
for i in range(len(raw_data)):
    training_data.append((tok_freq(raw_data.iloc[i].text), raw_data.iloc[i].category))

#vectorization
X, y = list(zip(*training_data))
label = LabelEncoder()
vector = DictVectorizer(dtype=float, sparse=True)
X = vector.fit_transform(X)
y = label.fit_transform(y)
multinb = mnb()
multinb.fit(X,y)

#vectorization for testing set
Xz = tok_freq(sms)
testX = vector.fit_transform(Xz)

multinb.predict(testX)

Which part of my code is wrong? Thanks.


Vivek Kumar On 24 April 2018 at 04:33 BEST ANSWER

Change

testX = vector.fit_transform(Xz)

to:

testX = vector.transform(Xz)

When you call fit() or fit_transform(), you retrain the vectorizer on the new data, so it learns a new vocabulary and produces a different number of columns than the model was trained on. That is not what you want: the test set should be converted with the vocabulary learned from the training set, so call only transform().
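As a rough illustration of the difference, here is a minimal sketch of the same pipeline on made-up data. The texts, labels, and the `sms` string are hypothetical stand-ins for `raw_data` and the real message, and `tok_freq` is rewritten with `collections.Counter` for brevity. It fits the vectorizer once on the training dicts, then only transforms the test dict, and also computes what the column count would have been if a vectorizer had been refit on the test message alone.

```python
from collections import Counter

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for raw_data and sms from the question.
train_texts = ["win cash now", "call me later", "free cash prize", "see you later"]
train_labels = ["spam", "ham", "spam", "ham"]
sms = "free prize call now"


def tok_freq(text):
    """Return a {token: count} dict for a space-separated string."""
    return dict(Counter(text.split(' ')))


vector = DictVectorizer(dtype=float, sparse=True)
label = LabelEncoder()

# Fit the vectorizer and the label encoder on the training data only.
X = vector.fit_transform([tok_freq(t) for t in train_texts])
y = label.fit_transform(train_labels)

multinb = MultinomialNB()
multinb.fit(X, y)

# Reuse the fitted vectorizer: transform() keeps the training vocabulary,
# so the test matrix has the same number of columns as X.
testX = vector.transform([tok_freq(sms)])

# For comparison only: a vectorizer refit on the test message alone
# learns just that message's tokens, hence the dimension mismatch.
refit_width = DictVectorizer().fit_transform([tok_freq(sms)]).shape[1]

prediction = label.inverse_transform(multinb.predict(testX))
print(X.shape, testX.shape, refit_width, prediction)
```

Tokens in the test message that never appeared in the training data are silently dropped by transform(), which is usually the desired behaviour for a bag-of-words model.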
