I'm working on analogies (the famous "king - man + woman = queen") using pretrained vectors from nlp.stanford.edu/projects/glove (glove.6B.50d.txt), but I get confusing results:
analogy (Thanks @gojomo)
The most similar word to "king" is "king"? The second most similar is "queen" (as I expected :), but why is it only 86%? I expected ~90-95%; it's only math (Euclidean distance), right? Maybe the similarity depends on the number of dimensions (more similarity to "queen" with 100/200 dimensions)?
Thanks so much.
Even though you're not getting an error in your saving, the actual problem might be there: writing something that's not properly encoded for the later read.
But you probably don't need to be writing your own reading & writing code at all. Gensim can already read GLoVe-format vectors directly, as their format is almost identical to what Gensim calls the 'word2vec_format' (because it was the save/load format of Google's original word2vec.c release). The GLoVe format just leaves out the (helpful) 1st-line declaration of the number of vectors to expect. But the no_header parameter to Gensim's load_word2vec_format() methods will tell them not to expect the count, & figure it out instead.
So, if all you need is the GLoVe vectors in a Gensim KeyedVectors, a single load_word2vec_format() call with no_header=True will do it. Voila, you're done.
Now, the figuring-out of the count of upcoming vectors does take one extra full pass over the file. In many cases, that'll be a minor cost compared to other things. But if you're in a situation where load-time is crucial, you could achieve the intent of your question's code by just re-saving that KeyedVectors object in your desired format.
I haven't noticed the .bin style of this particular format saving that much space/time overall, especially if you usually keep such files compressed on disk, which often saves more in IO time than it costs in compression/decompression CPU time. So I suspect you'd really want your re-save to go to a filename with a .gz suffix.
The .gz suffix is enough for these methods to automatically compress/decompress the file. That is, you can reload those compressed vectors with just the usual load_word2vec_format() call on the .gz filename.
Finally, if load time is especially critical, another option worth trying, which might offer an additional slight speedup, would be to let Gensim re-save the vectors in its own format, via the .save() method. Gensim's format is Python pickle-based (which is often a bit slower than more raw custom formats), but it offers an option to store the big internal vector array as one separate raw memory-mappable file.
Then, upon .load(filename, mmap='r'), the OS can just memory-map the whole file (nearly instantly, not doing any initial reading), and later when vectors are accessed, relevant ranges of the file are paged-in at the OS level, with no extra parsing overhead or excess buffer-copying.
The array load thus appears nearly instantaneous, but later initial accesses may be a bit slower, until the whole file winds up paged-in. (But, a single operation that touches every vector, like a .most_similar(), will act to page the whole thing in.)
To try this option, you'd re-save the vectors using .save(),
…then load them via .load(filename, mmap='r').
Note that when using this Gensim-native .save():
.gz suffix) was used when saving