I am working on a project to detect out-of-domain text input using IsolationForest and tf-idf features. Here is a summary of what I have done:
TRAINING
On tfidf:
- Fit and transform the in-domain dataset using CountVectorizer().
- Fit a TfidfTransformer() on the output of this CountVectorizer() and save the transformer (to use it at test time).
- Then transform the training data using the TfidfTransformer().
- Save both the CountVectorizer()'s vocabulary_ and the TfidfTransformer() object using pickle for test-time usage.
On IsolationForest:
- Collect the transformed in-domain dataset and train an IsolationForest() novelty detector.
- Save the model using joblib.
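For reference, the training steps above can be sketched roughly as follows. This is a minimal sketch, not my exact code: the example texts, the temporary output directory, the file names and random_state=0 are placeholders chosen for illustration.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Placeholder in-domain texts; the real dataset is much larger.
in_domain_texts = [
    "book a flight to london",
    "cancel my flight booking",
    "change the date of my flight",
    "what is the baggage allowance",
]

# Placeholder output location for the saved artifacts.
out_dir = tempfile.mkdtemp()
vocab_path = os.path.join(out_dir, "vocabulary.pkl")
tfidf_path = os.path.join(out_dir, "tfidf_transformer.pkl")
forest_path = os.path.join(out_dir, "isolation_forest.joblib")

# Fit the vocabulary and get raw term counts for the training data.
count_vectorizer = CountVectorizer()
train_counts = count_vectorizer.fit_transform(in_domain_texts)

# Fit the tf-idf weighting on those counts and transform them.
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)

# Save the vocabulary and the fitted transformer for test time.
with open(vocab_path, "wb") as f:
    pickle.dump(count_vectorizer.vocabulary_, f)
with open(tfidf_path, "wb") as f:
    pickle.dump(tfidf_transformer, f)

# Train the detector on the sparse tf-idf matrix and save it.
isolation_forest = IsolationForest(random_state=0)
isolation_forest.fit(train_tfidf)
joblib.dump(isolation_forest, forest_path)
```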
TESTING:
- Load all of the saved models.
- Compute the tf-idf features of the current out-of-domain input text by repeating the training steps, using transformations only (no fitting).
- Predict whether it is out-of-domain or not, using the saved IsolationForest model.
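The test-time steps look roughly like the sketch below. The setup part only recreates the saved artifacts so that the snippet runs on its own; the texts, paths and random_state=0 are placeholders, not my real data. In scikit-learn, predict returns 1 for inliers and -1 for outliers, so I also print decision_function, which gives the signed score behind that label.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# --- Setup only: recreate the saved artifacts (placeholder data and paths) ---
workdir = tempfile.mkdtemp()
vocab_path = os.path.join(workdir, "vocabulary.pkl")
tfidf_path = os.path.join(workdir, "tfidf_transformer.pkl")
forest_path = os.path.join(workdir, "isolation_forest.joblib")

train_texts = [
    "book a flight to london",
    "cancel my flight booking",
    "change the date of my flight",
    "what is the baggage allowance",
]
cv = CountVectorizer()
tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(cv.fit_transform(train_texts))
forest = IsolationForest(random_state=0).fit(X_train)

with open(vocab_path, "wb") as f:
    pickle.dump(cv.vocabulary_, f)
with open(tfidf_path, "wb") as f:
    pickle.dump(tfidf, f)
joblib.dump(forest, forest_path)

# --- Test time: load everything that was saved ---
with open(vocab_path, "rb") as f:
    vocabulary = pickle.load(f)
with open(tfidf_path, "rb") as f:
    tfidf_loaded = pickle.load(f)
forest_loaded = joblib.load(forest_path)

# Rebuild the vectorizer from the fixed training vocabulary (no fitting).
cv_loaded = CountVectorizer(vocabulary=vocabulary)

# Placeholder out-of-domain input. Tokens that are not in the training
# vocabulary are ignored, so this row can end up entirely zero.
test_texts = ["how do I bake sourdough bread"]
test_counts = cv_loaded.transform(test_texts)
test_tfidf = tfidf_loaded.transform(test_counts)

labels = forest_loaded.predict(test_tfidf)            # 1 = inlier, -1 = outlier
scores = forest_loaded.decision_function(test_tfidf)  # negative means outlier
print(labels, scores)
```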
But I have found that even though the tf-idf features are quite different for each of my test inputs, the IsolationForest always predicts 1.
What could be going wrong?
NB: To check whether the tf-idf module is responsible, I also tried feeding dummy vectors that mimic the tf-idf transformer's output to the IsolationForest model, but no matter which random vector I provide, I always get 1 as the output. Also note that the tf-idf representation has a lot of features (tokens); in my case the count is 48015.
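The dummy-vector check can be reproduced with something like the sketch below. It is only an illustration under assumptions: the training matrix is random sparse-looking data standing in for tf-idf rows, and the feature count is reduced from 48015 to 200 so it runs quickly. Besides the label from predict, it prints score_samples and decision_function, since the label alone hides how close each vector is to the threshold. With the default contamination="auto", scikit-learn sets offset_ to -0.5 and decision_function equals score_samples minus offset_.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
n_features = 200  # reduced from the 48015 in my real setup

# Placeholder training data: mostly zeros with a few non-negative entries,
# loosely imitating tf-idf rows.
values = rng.random((100, n_features))
mask = rng.random((100, n_features)) < 0.02
X_train = values * mask

forest = IsolationForest(random_state=0).fit(X_train)

# Dummy dense random vectors, as in the experiment described above.
dummy = rng.random((5, n_features))

labels = forest.predict(dummy)             # 1 = inlier, -1 = outlier
raw_scores = forest.score_samples(dummy)   # lower means more abnormal
margins = forest.decision_function(dummy)  # raw_scores - forest.offset_

print("offset_:", forest.offset_)
for label, raw, margin in zip(labels, raw_scores, margins):
    print(label, round(float(raw), 4), round(float(margin), 4))
```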