Machine Learning - Difference between test and train features, IMDB movie review sentiment prediction


I am trying to predict positive-sentiment reviews from the Stanford IMDB dataset using a simple multi-layer perceptron. However, my model's prediction accuracy drops sharply when my training data contains one word that is absent from my test data, forcing me to add a corresponding column to the test set.

More specifically, I have been multi-hot encoding the words in each review, selecting training features/words in order of how often each word appears in the entire corpus, and then filtering my test data to include only words present in my training data. With this setup I was able to achieve a prediction accuracy of ~80%.
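For reference, here is a minimal Python sketch of the scheme just described (my actual code is in R at the link below; all names here are illustrative): build the vocabulary from the training corpus only, then encode both splits against that same vocabulary.

```python
from collections import Counter

def build_vocab(train_docs, n_features):
    # rank words by how often they appear in the whole training corpus
    counts = Counter(w for doc in train_docs for w in doc.split())
    return [w for w, _ in counts.most_common(n_features)]

def encode(docs, vocab):
    # one column per vocabulary word; test-only words are simply dropped
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for w in doc.split():
            if w in index:
                row[index[w]] += 1  # per-review counts, as in the example encoding
        rows.append(row)
    return rows

train_docs = ["the army film was great", "army army plot was boring"]
test_docs  = ["great film", "plot twist unseen"]
vocab = build_vocab(train_docs, 5)
X_train = encode(train_docs, vocab)
X_test  = encode(test_docs, vocab)  # same width as X_train by construction
```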

This worked until I hit the point where my training data included a word not present in my test data. I then added that word to the test set as a column with all entries set to zero. This dropped my prediction accuracy to 60%. Hence, purely adding one all-zero column to my test data reduced my accuracy by 20 percentage points. Can anyone help shed some light on this?
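One thing worth noting: if the trained model itself were held fixed, an all-zero test column could not change its predictions, since the corresponding weight only ever multiplies zeros. A tiny sketch with a hypothetical linear scorer (the weights here are made-up numbers):

```python
def score(x, w, b):
    # simple linear score: w . x + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = [0.7, -1.2, 0.4], 0.1  # hypothetical trained weights and bias
x = [1, 0, 1]                 # a test review over 3 features

w_ext = w + [5.0]  # any weight for the new word: it never fires in test
x_ext = x + [0]    # the train-only word has count zero in every test row

assert score(x, w, b) == score(x_ext, w_ext, b)  # prediction unchanged
```

So the accuracy drop presumably comes from retraining the model with the extra feature, rather than from the zero column itself.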

This question is similar, but it does not specifically address free-text input and has no model results to compare: Machine Learning - test set with fewer features than the train set

Full code can be found here: https://codeshare.io/aJNMJZ

data source:

download.file("http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", "C:/Users/Jono - Desktop/Documents/ML_stuff/aclImdb_v1.tar.gz")

Example training data encoding:

imdb_train_words_dt_1000[army>0,c(1:2,150:160)]
           doc_id target_pos army art arthur artist artistic artists arts asian asleep aspect aspects
  1:  10130_2.txt          0    1   0      0      0        0       0    0     0      0      0       0
  2:  10142_2.txt          0    1   0      0      0        0       0    0     0      0      0       0
  3:  10173_8.txt          1    2   0      0      0        0       0    0     0      0      0       0
  4:  10220_3.txt          0    1   0      0      0        0       0    0     0      0      0       0
  5: 10231_10.txt          1    1   0      0      0        0       0    0     0      0      0       0
 ---                                                                                                 
346:    975_9.txt          1    2   0      0      0        0       0    0     0      0      0       0
347:   9778_2.txt          0    1   0      0      0        0       0    0     0      0      0       0
348:     97_1.txt          0    2   0      0      0        0       0    0     0      0      0       0
349:   9830_7.txt          1    1   0      0      0        0       0    0     0      0      0       0
350:   9903_2.txt          0    1   0      0      0        0       0    0     0      0      0       0

Model results:

Before adding the word: [screenshot of model results]

After adding the word: [screenshot of model results]

Update: after looking at the results more closely, there appears to be significant overfitting of the model after the inclusion of the extra word. I am currently applying L2 regularisation with some minor success.
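For anyone curious what the L2 penalty does mechanically, here is a minimal sketch (the learning rate and lambda are placeholder values, not my actual settings): the penalty lam * sum(w_i**2) contributes 2 * lam * w_i to each weight's gradient, shrinking weights toward zero and damping the kind of overfitting described above.

```python
def l2_step(w, grad, lr=0.1, lam=0.01):
    # one gradient step with the L2 penalty term 2*lam*w folded in
    return [wi - lr * (gi + 2 * lam * wi) for wi, gi in zip(w, grad)]

w = [1.0, -2.0]
w = l2_step(w, grad=[0.0, 0.0])  # zero data gradient: only the decay acts
# each weight shrinks slightly toward zero (approx. 0.998 and -1.996)
```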
