How can I stop Doc2vec (gensim) from returning different results on each run?

Question

Asked by Nguyễn Thanh Huyền on 19 March 2024 at 03:20

I am trying to find out why Doc2vec returns different results each time it runs. I have seen many previous questions about this, and I know it happens because the vectors are randomly initialized. However, I am building a website that displays these results in the frontend, and the differences between runs make the system less reliable. I know my dataset is really small. Still, infer_vector() does not return the same vector for the same document, and the results of most_similar() differ on each run. How can I prevent this, or is there an alternative way to use a Doc2vec model in my application that avoids these differences?

This is some code:

model = gensim.models.doc2vec.Doc2Vec(vector_size=50, dm=1, window=5, min_count=2, epochs=100, negative=0, workers=5)

But I received this warning: You must set either 'hs' or 'negative' to be positive for proper training. When both 'hs=0' and 'negative=0', there will be no training. I tried setting negative=-1, but gensim's explanation says that negative must be an integer.

**gojomo** · Accepted Answer · 2024-03-19T12:50:10.263000

These are potentially two different issues.

With regard to the warning you're seeing:

You must set either 'hs' or 'negative' to be positive 
for proper training. When both 'hs=0' and 'negative=0', 
there will be no training.

The warning is complete and truthful: it already describes what you're doing wrong and how to solve it.

You must set either hs or negative to be positive or else no training will happen in your model.

negative=-1 is an illegal setting, and not positive.

If you want to use Doc2Vec, you need to either have the negative parameter as a positive integer (as with its default value negative=5), or if you want to set negative=0 then you need to enable the alternative "hierarchical softmax" mode with hs=1.
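As a minimal sketch of the two legal setups described above, the snippet below keeps the asker's other parameters and changes only `hs` and `negative`. The helper `can_train` is hypothetical (it is not part of gensim) and simply restates the rule from the warning; the dictionaries are meant to be passed to `Doc2Vec(**params)` when gensim is installed.

```python
def can_train(hs, negative):
    """Hypothetical helper: restates the rule in the warning.
    Training only happens if at least one of hs / negative is positive."""
    return hs > 0 or negative > 0

# The asker's other settings, unchanged.
base = dict(vector_size=50, dm=1, window=5, min_count=2, epochs=100, workers=5)

# Option 1: negative sampling (negative=5 is the default value).
negative_sampling = dict(base, negative=5, hs=0)

# Option 2: hierarchical softmax, with negative sampling switched off.
hierarchical_softmax = dict(base, negative=0, hs=1)

# The configuration from the question: neither mode is enabled.
broken = dict(base, negative=0, hs=0)

for name, params in [("negative_sampling", negative_sampling),
                     ("hierarchical_softmax", hierarchical_softmax),
                     ("broken", broken)]:
    print(name, "trains:", can_train(params["hs"], params["negative"]))

# With gensim installed, either valid dict can be used as:
#   model = gensim.models.doc2vec.Doc2Vec(**negative_sampling)
```

Only the first two dictionaries satisfy the rule; the third reproduces the configuration that triggers the warning.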

The algorithm will do nothing but raise errors or give nonsense, untrained results if you give it illegal configurations.

With regard to the differing results: as explained in Q12 of the Gensim Project FAQ and in other StackOverflow answers, the operation of the Doc2Vec algorithm naturally allows for variance in the vectors returned by infer_vector() from run to run.

And, if that "jitter" between inferences is making a big difference in results, there are probably other serious problems in your use of Doc2Vec, such as insufficient data or bad parameters, that you should fix, rather than trying to force a false determinism onto your calculations.
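One way to judge how large the jitter is would be to infer a vector for the same text twice and compare the two results by cosine similarity. The sketch below uses plain Python; the two vectors are made-up stand-ins for the output of two `infer_vector()` calls, not real model output, and no particular threshold is implied.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical values standing in for two inferences of the same document.
run_1 = [0.12, -0.40, 0.33, 0.08]
run_2 = [0.11, -0.42, 0.31, 0.10]

similarity = cosine(run_1, run_2)
print("similarity between repeated inferences:", round(similarity, 4))
```

If repeated inferences of the same text are far apart by this measure, that points back to the data or parameter problems mentioned above rather than to a need for forced determinism.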

In particular, if the model whose infer_vector() results keep changing was "trained" (not really) with the shown parameters (negative=0 without hs enabled), ignoring the warning that this won't work, that is the first big problem to solve. It will make every inferred vector random and meaningless (as opposed to just "a little noisy").

But, if after fixing the total failure of training you then insistently want to do the incorrect thing, you can force inference determinism as is described in another answer at:

removing randomization of vector initialization for doc2vec
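The general idea behind forcing determinism is to reset the random number generator to the same seed before every inference, so the same sequence of random draws is replayed each time. The toy sketch below illustrates only that principle with Python's standard `random` module; `toy_infer` is a hypothetical stand-in for a stochastic inference step and is not gensim code.

```python
import random

def toy_infer(rng, words, size=4):
    """Hypothetical stand-in for stochastic inference:
    the result depends on the draws taken from rng."""
    vec = [rng.uniform(-0.5, 0.5) / size for _ in range(size)]
    for _ in words:
        # Pretend each word nudges the vector by a sampled amount.
        vec = [v + rng.gauss(0.0, 0.01) for v in vec]
    return vec

rng = random.Random()  # plays the role of the model's internal RNG
doc = ["human", "interface", "computer"]

# Without reseeding, two calls consume different random draws.
free_a = toy_infer(rng, doc)
free_b = toy_infer(rng, doc)

# Reseeding before each call replays the same draws.
rng.seed(0)
fixed_a = toy_infer(rng, doc)
rng.seed(0)
fixed_b = toy_infer(rng, doc)

print("unseeded runs equal:", free_a == free_b)
print("reseeded runs equal:", fixed_a == fixed_b)
```

In gensim the equivalent step would be reseeding the model's own generator before each `infer_vector()` call, for example through a `model.random.seed(0)` call; treat that attribute name as an assumption to check against your gensim version and the linked answer.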
