I am trying to predict positive-sentiment reviews from the Stanford IMDB reviews dataset using a simple multi-layer perceptron. However, my model's prediction accuracy drops sharply when my training data contains one word that is absent from my test data, so that I have to add that word to my test set.
More specifically, I have been using a multi-hot (count) encoding of the words in each review, and I have been increasing the number of training features/words by including words in order of how often they appear in the entire corpus. I then filter my test data so that it contains only words that are present in my training data. With this approach I had been achieving a prediction accuracy of ~80%.
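For reference, a minimal sketch of the kind of encoding I mean, in R/data.table (the names train_reviews, test_reviews and encode_reviews are hypothetical placeholders, not taken from my actual code):

library(data.table)

# Count-encode each review against a fixed vocabulary;
# words that are not in vocab are simply ignored.
encode_reviews <- function(reviews, vocab) {
  mat <- t(vapply(reviews, function(txt) {
    tokens <- unlist(strsplit(tolower(txt), "[^a-z']+"))
    tabulate(factor(tokens, levels = vocab), nbins = length(vocab))
  }, integer(length(vocab)), USE.NAMES = FALSE))
  colnames(mat) <- vocab
  as.data.table(mat)
}

# Vocabulary built from the TRAINING corpus only,
# keeping the 1000 most frequent words.
all_tokens <- unlist(strsplit(tolower(train_reviews), "[^a-z']+"))
all_tokens <- all_tokens[all_tokens != ""]
vocab <- names(sort(table(all_tokens), decreasing = TRUE))[1:1000]

train_dt <- encode_reviews(train_reviews, vocab)
test_dt  <- encode_reviews(test_reviews, vocab)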
This worked until my training data included a word that was not present in my test data. I included this word by adding it to the test data as a column with all entries set to zero. This caused my prediction accuracy to drop to 60%. So purely adding one all-zero column to my test data reduced my accuracy by 20%. Can anyone help shed some light on this?
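In R/data.table terms, the column-alignment step described above amounts to something like this (a sketch, using the hypothetical train_dt/test_dt names from above):

# Add an all-zero column to the test table for every training word it lacks,
# then put the columns in the same order as the training table.
missing_cols <- setdiff(names(train_dt), names(test_dt))
for (col in missing_cols) test_dt[, (col) := 0L]
setcolorder(test_dt, names(train_dt))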
This question is similar, but it does not specifically address free-text input and it has no model results to compare against: Machine Learning - test set with fewer features than the train set
Full code can be found here: https://codeshare.io/aJNMJZ
data source:
download.file("http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", "C:/Users/Jono - Desktop/Documents/ML_stuff/aclImdb_v1.tar.gz")
Example training data encoding:
imdb_train_words_dt_1000[army>0,c(1:2,150:160)]
doc_id target_pos army art arthur artist artistic artists arts asian asleep aspect aspects
1: 10130_2.txt 0 1 0 0 0 0 0 0 0 0 0 0
2: 10142_2.txt 0 1 0 0 0 0 0 0 0 0 0 0
3: 10173_8.txt 1 2 0 0 0 0 0 0 0 0 0 0
4: 10220_3.txt 0 1 0 0 0 0 0 0 0 0 0 0
5: 10231_10.txt 1 1 0 0 0 0 0 0 0 0 0 0
---
346: 975_9.txt 1 2 0 0 0 0 0 0 0 0 0 0
347: 9778_2.txt 0 1 0 0 0 0 0 0 0 0 0 0
348: 97_1.txt 0 2 0 0 0 0 0 0 0 0 0 0
349: 9830_7.txt 1 1 0 0 0 0 0 0 0 0 0 0
350: 9903_2.txt 0 1 0 0 0 0 0 0 0 0 0 0
Model results:
Before adding the word:

After adding the 1 word:

Update: After looking at the results more closely, it appears that the model is significantly overfitting once the extra word is included. I am currently applying L2 regularisation with some minor success.
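To illustrate the kind of L2 penalty I mean (this sketch assumes the keras R interface; the layer sizes, the 0.01 penalty weight and the train_x placeholder are illustrative only, not my actual settings):

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu",
              input_shape = ncol(train_x),
              # L2 penalty on the layer weights
              kernel_regularizer = regularizer_l2(0.01)) %>%
  layer_dense(units = 16, activation = "relu",
              kernel_regularizer = regularizer_l2(0.01)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(optimizer = "rmsprop",
                  loss = "binary_crossentropy",
                  metrics = "accuracy")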