I am currently training a FastText classifier and I'm facing the issue of overfitting.
```python
model = fasttext.train_supervised(
    f'train{runID}.txt',
    lr=1.0,
    epoch=10,
    wordNgrams=2,
    dim=300,
    thread=2,
    verbose=100,
)
```
The model seems to be fitting the training data too well, resulting in poor generalization on unseen data. I would like to know how I can set a regularization parameter to address this problem and improve the model's performance.
Overfitting here is likely due to the model being oversized for the data/task: it has enough internal state/complexity to memorize your training set, including idiosyncratic, meaningless details about individual examples that help it look up (rather than generalizably deduce) the 'right answer' for those.
(An interesting comparison to apply: if you save the model to disk, is it larger – perhaps much larger – than your training data? In a very real sense, a lot of machine learning is compression, and any time your model is close to, or larger than, the size of the training data, overfitting is likely.)
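For example, a quick way to run that comparison – a minimal sketch reusing the `model` and `runID` names from your snippet, with `model{runID}.bin` as an assumed output filename:

```python
import os

# Save the trained model, then compare its on-disk size to the training file.
model.save_model(f'model{runID}.bin')

model_bytes = os.path.getsize(f'model{runID}.bin')
train_bytes = os.path.getsize(f'train{runID}.txt')

# A model close to, or larger than, the training data has enough raw
# capacity to memorize individual examples rather than learn general rules.
print(f'model: {model_bytes:,} bytes  train: {train_bytes:,} bytes  '
      f'ratio: {model_bytes / train_bytes:.1f}x')
```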
In such cases, the two major things to try are getting more data or shrinking the model, so that it's forced to learn rules rather than becoming a big 1:1 lookup table.
The main ways to shrink the model:
- a lower `dim` (the dimensionality of the learned vectors; your 300 is well above the default of 100)
- `wordNgrams 1` (turning off word-bigrams)
- a `minCount` higher than the default – keeping rare words often weakens word-vector models, and in a classification task, any singleton words always associated in training with a single label are highly likely to overwhelm other influences, if they're not truly reliable signals
- a lower `bucket` value (the number of hash buckets used for subword & word-ngram features; the default is 2,000,000)

Separately, `lr=1.0` is way, way higher than typical values of 0.025 to 0.1 (the `supervised`-mode default), so that might be worth changing to a more typical range, too – see the combined sketch below.
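Pulling those knobs together, a minimal sketch of a reduced-capacity re-run, assuming the same data files as your snippet; the specific values are illustrative starting points to tune against held-out data, not known-good settings, and `valid{runID}.txt` is a hypothetical validation file:

```python
import fasttext

# Illustrative reduced-capacity settings - tune these against held-out data.
model = fasttext.train_supervised(
    f'train{runID}.txt',
    lr=0.1,          # the supervised-mode default, down from 1.0
    epoch=10,
    wordNgrams=1,    # no word-bigrams
    dim=50,          # well under both your 300 and the default 100
    minCount=5,      # drop the rarest words (the default of 1 keeps everything)
    bucket=200_000,  # a tenth of the 2,000,000-bucket default
    thread=2,
)

# Check generalization on held-out examples ('valid{runID}.txt' is an
# assumed file in the same label-prefixed format as the training data).
n, precision_at_1, recall_at_1 = model.test(f'valid{runID}.txt')
print(f'{n} examples, P@1={precision_at_1:.3f}, R@1={recall_at_1:.3f}')
```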
With regard to the idea that proper regularization might remedy any amount of model oversizing, as suggested in your comment: the FastText algorithm & its common implementations don't specify any standard or proven regularizations that can fix an oversized model. Choosing one or more approaches, adding them to the operation of a major FastText implementation, & evaluating their success would involve your own customizations/extensions.
Further, I've not noticed any work demonstrating a regularization that can remedy an oversized shallow-neural-network word-vector model (like word2vec or FastText). I may have overlooked something – & would love pointers, if so! – but that absence suggests it may not be a preferred approach, compared to the usual "shrink the model" or "find more data" tactics.
Looking up the context of the Ng quote you cited: he's talking about the circumstances of "modern deep learning", with the additional caveat of "so long as you can keep getting more data".
Further, word-vector algorithms like word2vec or FastText aren't really 'deep' learning – they use only a single 'hidden layer'. While definitions vary a bit, & these algorithms are definitely a stepping-stone to deeper neural networks, I believe most practitioners would call them "shallow learning" using only a "shallow neural network".
(Ng's quote is attributed to a Coursera lecture; read in its fuller context, it carries both of the qualifications above.)
So, whether the usual regularization techniques can cure overfitting in shallow word-vector models – including in extreme cases where the model remains larger than the training data – would likely make an interesting experiment or write-up.
But such a hypothesized fix isn't available as an off-the-shelf option.