I want to create a word2vec model and train it on my local data. The question is: should I train the word2vec model on my whole dataset, or should I split the data into train and test sets and train word2vec only on the train data to avoid data leakage? I intend to perform a classification task using ML algorithms, and I don't want to use pretrained embeddings.
I've trained word2vec on the whole dataset, but I feel like it will lead to data leakage during ML model building.
The word2vec algorithm is an 'unsupervised' method: it works from the full raw text available, without you directly telling it the desired results, and without peeking at your intended classes.
Thus in some senses, for many uses, it is appropriate to treat it as a generic feature-enhancement step that can use all available data, even including what would normally be 'held-back' test examples for your separate supervised-classification step. Just be sure no part of your texts encodes the known class labels directly as textual words.
(Similarly: if you have lots of unlabeled examples from the same usage-domain, they can often help improve this unsupervised step, even though of course such unlabeled examples can't be supplied to a supervised-classification step.)
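For example, here's a minimal sketch of that pattern using gensim; the toy texts, the mean-of-word-vectors featurization, and all parameter choices are just illustrative assumptions, not a prescribed pipeline:

```python
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
import numpy as np

# Hypothetical data: lists of token lists, with labels only for the train split.
train_texts = [["fast", "delivery", "great"], ["broken", "on", "arrival"]]
train_labels = [1, 0]
test_texts = [["arrived", "quickly"], ["item", "was", "broken"]]

# Unsupervised step: word2vec may see ALL raw text, labeled or not,
# because it never sees the class labels themselves.
w2v = Word2Vec(
    sentences=train_texts + test_texts,
    vector_size=100, window=5, min_count=1, epochs=20, seed=1,
)

def doc_vector(tokens, model):
    """Average the vectors of in-vocabulary tokens (zero vector if none)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Supervised step: the classifier is fit ONLY on labeled training examples.
X_train = np.vstack([doc_vector(t, w2v) for t in train_texts])
clf = LogisticRegression().fit(X_train, train_labels)

# Test texts contributed raw words to word2vec, but their labels were never used.
X_test = np.vstack([doc_vector(t, w2v) for t in test_texts])
print(clf.predict(X_test))
```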
Note that the whole reason to "avoid data leakage" is to avoid fooling yourself in evaluations, when you're experimenting & testing multiple methods against each other. That is: you want your full training/learning pipeline to be a fair simulation, and thus a trustworthy estimate, of your chosen methods' eventual success in real situations with truly unlabeled data (including new data not even available at experiment time).
But then, *after* you've chosen a method, when deploying a frozen method to production, you'll often re-train with all your labeled examples, because at that point you no longer need an unbiased estimate of expected performance from held-out data.
Your priority, at that later step, is to use everything you know to do the best on truly unknown items. The level of success on those new items will be proven in other ways, later. (Perhaps: further manual review/labeling of some subset, or implicit indicators of relative success, or reopening a more rigorous experimental phase.)
Now, there could be some situations where the eventual use will regularly involve classifying new examples whose unlabeled text wasn't available at model-creation time, and you want a more accurate estimate of your technique's performance in such situations. You might want to simulate such cases by using unsupervised word2vec features that had no peek at the bulk unlabeled text of the test examples.
But conversely, there could be (big-batch) deployment situations where, inherently, all the unlabeled/unknown texts are always available to enhance the unsupervised feature-enrichment step. Then, because such data will regularly be available before the essential classifier training, including it is actually more representative of your true usage scenarios.
So overall, training your unsupervised features on the bulk label-free text of 'test' examples is often reasonable and defensible; whether it's the right simulation depends on which of the above deployment scenarios matches your real usage.
You might as well try it both ways, to better develop your understanding of its effects, both generally & with your specific data.
(And: if there are massive rather than incremental differences, dig deeper to see why - your data/needs might be a unique situation.)
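A rough sketch of such a comparison, under the same toy assumptions as above (the helper name, the split variables, and the use of a single train/test split rather than cross-validation are all illustrative simplifications):

```python
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

def evaluate(train_texts, train_labels, test_texts, test_labels, include_test_text):
    """Train word2vec on train texts only, or on train + test texts, then
    fit a classifier on the train split and score it on the test split."""
    corpus = train_texts + (test_texts if include_test_text else [])
    w2v = Word2Vec(sentences=corpus, vector_size=100, min_count=1, epochs=20, seed=1)

    def doc_vector(tokens):
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

    clf = LogisticRegression().fit(
        np.vstack([doc_vector(t) for t in train_texts]), train_labels)
    preds = clf.predict(np.vstack([doc_vector(t) for t in test_texts]))
    return accuracy_score(test_labels, preds)

# Compare the two regimes on the same split; large gaps deserve a closer look.
# acc_strict = evaluate(tr_x, tr_y, te_x, te_y, include_test_text=False)
# acc_full   = evaluate(tr_x, tr_y, te_x, te_y, include_test_text=True)
```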