Train my own FastText model, or use an existing pre-trained one?


I am planning to build a classification model. Instead of using traditional models, I decided to use a new technique: create word embeddings, cluster them using k-means, then use the mean of each cluster for comparison with the input(s). I decided to use FastText because it supports subwords. I also have a large amount of unsupervised text data. I would like to know whether I should train the FastText model on my own data, or whether I can go with a pre-trained model. If I should train, what are the benefits? Can someone please explain?

Answer by gojomo:

You should try them both and see which scores better on whatever repeatable quality evaluation you'll be using to make your other tuning choices.

There's a fair chance, but no guarantee, that with enough of your own domain text data, your own trained model will better capture the words/subwords of your domain.

But there's no firm rule of thumb on how much is "enough", either for projects in general or, more importantly, for your specific project, nor on how far the text and word meanings in your domain may differ from the more generic word meanings in others' pretrained models. So you have to test them against each other, which shouldn't be hard. (Run once with your best self-trained model, or several variants of it, then with one or more external pretrained models from others. Compare the results. Choose the best.)
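For concreteness, here's a rough sketch of setting up both candidates with the official `fasttext` Python package, so they can be dropped into the same evaluation; the file names are placeholders for your own corpus and whatever pretrained `.bin` you download:

```python
import fasttext

# Train your own model on domain text (plain text, roughly one sentence per line).
# "domain_corpus.txt" is a placeholder for your unsupervised data.
own_model = fasttext.train_unsupervised("domain_corpus.txt", model="skipgram", dim=300)

# Load someone else's pretrained model, e.g. a cc.en.300.bin downloaded from fasttext.cc.
pretrained_model = fasttext.load_model("cc.en.300.bin")

# Both expose the same lookup, including subword handling for rare/OOV words,
# so either can be swapped into the same downstream scoring code.
for name, model in [("self-trained", own_model), ("pretrained", pretrained_model)]:
    print(name, model.get_word_vector("anticoagulant")[:5])
```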

Note that your "new technique" sounds like a pretty common naive-but-intuitively-attractive classification approach: compute one "average" vector to represent each known class, compute a vector for each candidate text, and predict the class whose vector is nearest, or report the relative distances as ranked possibilities.
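In code, that approach looks roughly like the following sketch, assuming `model` is any loaded FastText model (as above) and `texts`/`labels` are a hypothetical labeled training set:

```python
import numpy as np

def text_vector(model, text):
    # Average the word vectors of a text's tokens into one summary vector.
    return np.mean([model.get_word_vector(w) for w in text.split()], axis=0)

def train_centroids(model, texts, labels):
    # One "average of averages" vector per known class.
    centroids = {}
    for label in set(labels):
        vecs = [text_vector(model, t) for t, l in zip(texts, labels) if l == label]
        centroids[label] = np.mean(vecs, axis=0)
    return centroids

def predict(model, centroids, text):
    # Cosine similarity against each class centroid; the nearest centroid wins.
    v = text_vector(model, text)
    sims = {label: np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c))
            for label, c in centroids.items()}
    return max(sims, key=sims.get)
```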

It is likely to perform poorly compared to traditional approaches, even very quick & simple ones, because it squeezes all of the available training data into a simplistic model where each class sits exactly "around" a single summary point. Real categories often have diverse, irregular shapes in the training data, and the usual techniques that can learn that 'lumpiness', rather than reducing each class to a single centroid, will do better.
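For comparison, a quick traditional baseline takes only a few lines with scikit-learn (again using the hypothetical `texts`/`labels` from above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Bag-of-words TF-IDF features into a linear classifier: fast to train,
# and a fair benchmark for any embedding-based scheme to beat.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, texts, labels, cv=5)
print("baseline accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```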

(If you are in fact computing distances to multiple unlabeled clusters, more clusters than your number of final labels, and then using those distances as input to a typical learned classifier, it may perform better than the one-centroid-per-class approach I describe above, since it retains a bit more of the learnable "shapes" and decision boundaries in the original data. But again, traditional classifiers, with adequate feature choices/enrichment, are likely to subsume and exceed any value from that style of model.)
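A rough sketch of that variant, reusing `text_vector`, `model`, and the hypothetical `texts`/`labels` from the earlier sketches; the cluster count of 50 is an arbitrary illustrative choice, not a recommendation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = np.array([text_vector(model, t) for t in texts])

# More clusters than final labels, so some of each class's irregular
# "shape" survives in the representation.
kmeans = KMeans(n_clusters=50, random_state=0).fit(X)

# transform() gives each text's distance to every cluster center;
# those distances then become features for an ordinary classifier.
distances = kmeans.transform(X)
clf = LogisticRegression(max_iter=1000).fit(distances, labels)
```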