I am trying to save all the vocab words and the tfidf vectorizer from the train/test set so that I can use it on a new set of text at a later time. I got the vocab and idf dictionary using this code:
import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

cvec_tfidf = TfidfVectorizer(analyzer="word", tokenizer=nltk.word_tokenize, strip_accents='unicode', min_df=.01, max_df=.99, ngram_range=(1, 3))
cvec_tfidf.fit(X_train['answer'])
vocab_tfidf = cvec_tfidf.get_feature_names()
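def tfidf(tokens, vocab, cvec):
    cvec_counts = cvec.transform(tokens)
    cvec_matrix = cvec_counts.toarray()
    # Label columns with vocab (already in column order); the key order of
    # cvec.vocabulary_ is not guaranteed to match the column indices.
    tfidf_model = pd.DataFrame(cvec_matrix, columns=vocab)
    idf = dict(zip(vocab, cvec.idf_))
    return tfidf_model, idf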
X_train, X_train_idf = tfidf(X_train['answer'], vocab_tfidf, cvec_tfidf)
X_test, X_test_idf = tfidf(X_test['answer'], vocab_tfidf, cvec_tfidf)
I think I have saved and loaded the vocab with
import pickle
pickle.dump(cvec_tfidf.vocabulary_, open("feature.pkl", "wb"))
## LOAD TFIDF
savedtfidf = pickle.load(open("feature.pkl", 'rb'))
I tried to run it on new text but got an error:
## USE TFIDF ON NEW DATA
newtext = savedtfidf.fit_transform(text['newtext'])
File "<ipython-input-573-4d2aef685725>", line 1, in <module>
newtext = savedtfidf.fit_transform(text['PSW_Attention_3_cl'])
AttributeError: 'dict' object has no attribute 'fit_transform'
Any idea what I am doing wrong?
The issue is that you are serializing and deserializing only the model's vocabulary. As the error says, the vocabulary is simply a dictionary, and a dictionary has no fit_transform method.
What you want to do is to initialize a new TF-IDF model with your serialized vocabulary:
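A minimal sketch of that idea follows. The toy documents, the in-memory buffer used in place of feature.pkl, and the names saved_vocab, new_cvec and loaded_cvec are assumptions made for illustration, and the default tokenizer is used instead of nltk.word_tokenize so the snippet runs without extra downloads. Note that a vocabulary alone does not carry the idf weights: fit_transform on the new vectorizer re-estimates them from the new text. If the training idf weights should be kept, one option is to pickle the whole fitted vectorizer and call transform, as in the last few lines.

```python
import io
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for X_train['answer'] and text['newtext'].
train_docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats play"]
new_docs = ["the cat chased the dog", "a bird sat on the mat"]

# Fit on the training text, as in the question (default tokenizer assumed here).
cvec_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
cvec_tfidf.fit(train_docs)

# Save only the vocabulary: a plain dict mapping each term to a column index.
buffer = io.BytesIO()  # in-memory stand-in for open("feature.pkl", "wb")
pickle.dump(cvec_tfidf.vocabulary_, buffer)

# Load the dict and pass it to a NEW vectorizer through the vocabulary parameter.
buffer.seek(0)
saved_vocab = pickle.load(buffer)
new_cvec = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), vocabulary=saved_vocab)

# The columns now match the training vocabulary, but the idf weights are
# re-estimated from new_docs because only the vocabulary was saved.
new_matrix = new_cvec.fit_transform(new_docs)

# Alternative: pickle the whole fitted vectorizer to keep the training idf
# weights, then call transform (not fit_transform) on the new text.
loaded_cvec = pickle.loads(pickle.dumps(cvec_tfidf))
new_matrix_train_idf = loaded_cvec.transform(new_docs)
```

Either way, the resulting matrix has one column per term in the training vocabulary, so it lines up with the features the model was trained on.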