Word2Vec - to be trained on train data or whole data

I wish to create a word2vec model and train it on my local data. So the question is: should I train the word2vec model on my whole dataset, or should I split the data into train and test sets and train word2vec only on the train data, to avoid data leakage? I intend to perform a classification task using ML algorithms, and I don't want to use pretrained embeddings.

I've trained word2vec on the whole dataset, but I feel this may lead to data leakage during ML model building.

Answer by gojomo:

The word2vec algorithm is an 'unsupervised' method: it learns from the full raw text available, without being told any desired results, and without peeking at your intended classes.

Thus in some senses, for many uses, it is appropriate to use it as a generic feature-enhancement step that can use all available data, even including what would normally be 'held-back' test examples for your separate supervised classification step. Just be sure no part of your texts associates known class labels directly with textual words.
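
For concreteness, here's a minimal sketch of that setup, assuming gensim 4.x and scikit-learn; train_texts and test_texts (lists of token lists) and train_labels are hypothetical placeholders for your own data:

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    # Unsupervised step: word2vec may see ALL raw text, test included,
    # because no class labels are involved here.
    w2v = Word2Vec(
        sentences=train_texts + test_texts,
        vector_size=100, window=5, min_count=2, workers=4,
    )

    def doc_vector(tokens, model):
        # Average the vectors of in-vocabulary tokens; zeros if none known.
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    X_train = np.array([doc_vector(t, w2v) for t in train_texts])
    X_test = np.array([doc_vector(t, w2v) for t in test_texts])

    # Supervised step: the classifier is fit on training labels ONLY.
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    predictions = clf.predict(X_test)

The key point is that the raw test text feeds only the unsupervised step; no labels from held-back examples ever reach the classifier.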

(Similarly: if you have lots of unlabeled examples from the same usage-domain, they can often help improve this unsupervised step, even though of course such unlabeled examples can't be supplied to a supervised-classification step.)

Note that the whole reason to "avoid data leakage" is to avoid fooling yourself in evaluations, when you're experimenting and testing multiple methods against each other. That is: you want your full training/learning pipeline to be a fair simulation, and thus a trustworthy estimation, of your chosen method's eventual success in real situations with truly unlabeled data (including new data not even available at experiment time).

But then, *after* you've chosen a method, when deploying a frozen method to production, you'll often re-train with all your labeled examples, because at that point you no longer need an unbiased re-estimate of expected performance from held-out data.

Your priority, at that later step, is to use everything you know to do the best on truly unknown items. The level of success on those new items will be proven in other ways, later. (Perhaps: further manual review/labeling of some subset, or implicit indicators of relative success, or reopening a more-rigorous experimental phase later.)
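
A rough sketch of that deployment refit, reusing doc_vector and the imports from the sketch above; all_texts and all_labels, standing for every labeled example you have, are hypothetical placeholders:

    # Refit everything on all labeled data for the frozen production model.
    final_w2v = Word2Vec(sentences=all_texts, vector_size=100,
                         window=5, min_count=2, workers=4)
    X_all = np.array([doc_vector(t, final_w2v) for t in all_texts])
    final_clf = LogisticRegression(max_iter=1000).fit(X_all, all_labels)
    # Any score computed on all_texts itself would be optimistic; trust
    # the earlier held-out evaluation & later production monitoring instead.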

Now, there could be some situations where the eventual use will regularly involve classifying new examples whose unlabeled text wasn't available at model-creation time, and you want a more accurate estimate of your technique's performance in such situations. You might want to simulate such cases by using unsupervised word2vec features that had no peek at the bulk unlabeled 'test' text.

But conversely, there could be (big-batch) deployment situations where, inherently, all the patterns of unlabeled/unknown texts are always available to enhance the unsupervised feature-enrichment modeling. Then, because such data will regularly be available before the essential classifier-training, including it is definitively more representative of your true usage scenarios.

So overall, training your unsupervised features on the bulk, label-free text of 'test' examples is…

  • often OK, & practically done without problems, because it's not contaminating the supervised step
  • but sometimes, peculiarities of what you really want to measure might suggest not to do it – or conversely, strongly suggest it should be done

You might as well try it both ways, to better develop your understanding of its effects, both generally & with your specific data.
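
A sketch of such a side-by-side check, reusing doc_vector and the earlier imports; test_labels is another hypothetical placeholder:

    from sklearn.metrics import accuracy_score

    # Same split, same classifier; only the word2vec training corpus varies.
    for name, corpus in [("train-only", train_texts),
                         ("train+test", train_texts + test_texts)]:
        w2v_variant = Word2Vec(sentences=corpus, vector_size=100,
                               window=5, min_count=2, workers=4)
        Xtr = np.array([doc_vector(t, w2v_variant) for t in train_texts])
        Xte = np.array([doc_vector(t, w2v_variant) for t in test_texts])
        variant_clf = LogisticRegression(max_iter=1000).fit(Xtr, train_labels)
        acc = accuracy_score(test_labels, variant_clf.predict(Xte))
        print(f"word2vec trained on {name}: test accuracy = {acc:.3f}")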

(And: if there are massive rather than incremental differences, dig deeper to see why - your data/needs might be a unique situation.)