I'm trying to find information about the problem of Doc2Vec returning different results each time it runs. I have seen many previous questions about this, and I know it happens because the vectors are randomly initialized. However, I am building a website that displays these results in the frontend, and the differences between runs make the system less reliable.
I know my dataset is really small. But infer_vector() doesn't return the same vector for the same document, and the results of most_similar() differ on each run. How can I prevent this, or is there an alternative way to apply a Doc2Vec model in my application that avoids these differences?
Here is some of my code:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, dm=1, window=5, min_count=2, epochs=100, negative=0, workers=5)
But I received this warning: You must set either 'hs' or 'negative' to be positive for proper training. When both 'hs=0' and 'negative=0', there will be no training.
I tried setting negative=-1, but according to gensim, negative must be an integer.
These are potentially two different issues.
With regard to the warning you're seeing:
The warning is complete and truthful: it already describes what you're doing wrong and how to solve it.
You must set either hs or negative to be positive, or else no training will happen in your model. negative=-1 is an illegal setting, and not positive.
If you want to use Doc2Vec, you need to either have the negative parameter as a positive integer (as with its default value negative=5), or, if you want to set negative=0, enable the alternative "hierarchical softmax" mode with hs=1.
The algorithm will do nothing but error or give nonsense untrained results if you give it illegal configurations.
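The rule in the warning can be sketched as a small check. This is only an illustration: doc2vec_kwargs is a hypothetical helper, not part of gensim, and the other parameter values are simply copied from the question.

```python
def doc2vec_kwargs(hs=0, negative=5):
    """Build keyword arguments for Doc2Vec, rejecting settings that disable training.

    Hypothetical helper for illustration; it is not a gensim function.
    """
    if hs not in (0, 1):
        raise ValueError("hs must be 0 or 1")
    if negative < 0:
        raise ValueError("negative must not be below zero")
    if hs == 0 and negative == 0:
        # This is the combination the warning complains about: no training mode at all.
        raise ValueError("set negative > 0 (negative sampling) or hs=1 (hierarchical softmax)")
    return {
        "vector_size": 50,
        "dm": 1,
        "window": 5,
        "min_count": 2,
        "epochs": 100,
        "hs": hs,
        "negative": negative,
        "workers": 5,
    }


# Option 1: negative sampling (negative=5 is the default value mentioned above).
negative_sampling = doc2vec_kwargs(hs=0, negative=5)

# Option 2: hierarchical softmax, with negative sampling switched off.
hierarchical_softmax = doc2vec_kwargs(hs=1, negative=0)
```

Either dict could then be passed along as gensim.models.doc2vec.Doc2Vec(**negative_sampling), in place of the constructor call from the question.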
As is explained in Q12 of the Gensim Project FAQ & other StackOverflow answers, the operation of the Doc2Vec algorithm naturally allows for variance in the vectors returned by infer_vector() from run to run.
And, if that "jitter" between inferences is making a big difference in results, there are probably other serious problems in your use of Doc2Vec, such as insufficient data or bad parameters, that you should fix, rather than trying to force a false determinism onto your calculations.
In particular, if the model whose infer_vector() results keep changing was "trained" (not really) with the shown parameters (negative=0 without hs enabled), ignoring the warning that this won't work, that is the first big problem to solve. It will make all inferred vectors random and meaningless (as opposed to just "a little noisy").
But, if after fixing the total failure of training you then insistently want to do the incorrect thing, you can force inference determinism as is described in another answer at:
removing randomization of vector initialization for doc2vec
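To tell "a little noisy" apart from "random and meaningless", one rough check is to infer the same document several times and compare the resulting vectors. The following is a minimal sketch, not a gensim feature: min_self_similarity and cosine are hypothetical helper names, and infer is assumed to be any callable that maps a list of tokens to a vector (for example model.infer_vector).

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def min_self_similarity(infer, words, runs=5):
    """Infer `words` `runs` times and return the lowest pairwise cosine similarity."""
    if runs < 2:
        raise ValueError("need at least two runs to compare")
    vectors = [infer(words) for _ in range(runs)]
    return min(cosine(a, b) for a, b in combinations(vectors, 2))
```

Called as min_self_similarity(model.infer_vector, tokens), a value close to 1.0 would suggest the repeated inferences agree apart from small jitter, while a value near 0 would point to the kind of untrained, effectively random vectors described above. What counts as "close enough" is a judgment call for your application.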