Save TFIDF vocab and transformation and use on new dataset

I am trying to save all the vocab words and the tfidf vectorizer from the train/test set so that I can use it on a new set of text at a later time. I got the vocab and idf dictionary using this code:

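import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

cvec_tfidf = TfidfVectorizer(analyzer="word", tokenizer=nltk.word_tokenize, strip_accents='unicode', min_df = .01, max_df = .99, ngram_range=(1,3))
cvec_tfidf.fit(X_train['answer'])
vocab_tfidf = cvec_tfidf.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0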

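def tfidf(tokens, vocab, cvec):
    # Transform raw text into a dense TF-IDF matrix using the already fitted vectorizer
    cvec_counts = cvec.transform(tokens)
    cvec_matrix = cvec_counts.toarray()
    # Label columns with vocab (ordered by column index); vocabulary_ is a dict whose key order need not match the columns
    tfidf_model = pd.DataFrame(cvec_matrix, columns=vocab)
    idf = dict(zip(vocab, cvec.idf_))
    return tfidf_model, idf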

X_train, X_train_idf = tfidf(X_train['answer'], vocab_tfidf, cvec_tfidf)
X_test, X_test_idf = tfidf(X_test['answer'], vocab_tfidf, cvec_tfidf)

I think I have saved and loaded the vocab with

import pickle
with open("feature.pkl", "wb") as f:
    pickle.dump(cvec_tfidf.vocabulary_, f)

## LOAD TFIDF
with open("feature.pkl", "rb") as f:
    savedtfidf = pickle.load(f)

I tried to run it on new text but got an error

## USE TFIDF ON NEW DATA
newtext = savedtfidf.fit_transform(text['newtext'])


  File "<ipython-input-573-4d2aef685725>", line 1, in <module>
    newtext = savedtfidf.fit_transform(text['PSW_Attention_3_cl'])

AttributeError: 'dict' object has no attribute 'fit_transform'

Any idea what I am doing wrong?

There is 1 best solution below

Amir On

The issue is that you are pickling and unpickling only the model's vocabulary, and, as the error says, the vocabulary is just a dictionary, which has no fit_transform method.
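
To see this concretely, here is a minimal sketch on a toy corpus. The two documents are made up, and the default tokenizer stands in for nltk.word_tokenize. It shows that vocabulary_ is a plain dict mapping each term to a column index, while fit_transform lives on the vectorizer itself:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only
docs = ["the cat sat", "the dog sat down"]

vec = TfidfVectorizer()
vec.fit(docs)

vocab = vec.vocabulary_
# vocabulary_ is a plain dict: term -> column index
is_dict = isinstance(vocab, dict)
# A dict has no fit_transform, which is what the AttributeError reports
has_method = hasattr(vocab, "fit_transform")
print(type(vocab).__name__, has_method)
```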

What you want to do is to initialize a new TF-IDF model with your serialized vocabulary:

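# Rebuild a vectorizer around the saved vocabulary (a dict of term -> column index)
with open("feature.pkl", "rb") as f:
    saved_vocabulary = pickle.load(f)
cvec_tfidf = TfidfVectorizer(analyzer="word", tokenizer=nltk.word_tokenize, strip_accents='unicode', min_df = .01, max_df = .99, ngram_range=(1,3), vocabulary=saved_vocabulary)
# Note: fit_transform re-estimates the IDF weights from the new text; only the vocabulary is reused
newtext = cvec_tfidf.fit_transform(text['newtext'])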
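
If the goal is to reuse the IDF weights learned on the training set as well, one alternative is to pickle the whole fitted vectorizer rather than only its vocabulary, and then call transform (not fit_transform) on the new text. The following is only a sketch under assumptions: the toy documents and the file name tfidf_vectorizer.pkl are made up, and the default tokenizer stands in for nltk.word_tokenize.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training and new text, invented for illustration only
train_docs = ["the cat sat", "the dog sat down", "a bird flew away"]
new_docs = ["the bird sat", "an unseen word"]

# Fit once on the training text
vec = TfidfVectorizer(ngram_range=(1, 2))
vec.fit(train_docs)

# Save the fitted vectorizer itself (vocabulary and IDF weights together)
path = os.path.join(tempfile.mkdtemp(), "tfidf_vectorizer.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(vec, f)

# Later: load it and transform new text without refitting
with open(path, "rb") as f:
    loaded = pickle.load(f)

new_matrix = loaded.transform(new_docs)

# The loaded vectorizer carries the same IDF weights as the one fitted on the training text
same_idf = np.allclose(loaded.idf_, vec.idf_)
print(new_matrix.shape, same_idf)
```

With a custom tokenizer such as nltk.word_tokenize, nltk has to be importable when the file is loaded, because pickle stores module-level functions by reference. Pickles are also not guaranteed to load across different scikit-learn versions, so it is safest to load with the same version that saved the file.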