Confusing results with most similar word?


I'm working on analogies (the famous "king - man + woman = queen") with vectors pretrained from nlp.stanford.edu/projects/glove (glove.6B.50d.txt), but I get confusing results:

analogy (Thanks @gojomo)

The most similar word to "king" is "king" itself?? The second most similar is "queen" (as I expected :) but why is it only 86%?? I expected ~90-95%; it's only math (Euclidean distance), right?? Maybe the similarity depends on the number of dimensions (more similarity to "queen" with 100/200 dimensions)??
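For reference, the arithmetic I mean can be sketched in plain Python with hand-made toy vectors (NOT real GloVe values, just numbers picked so the effect shows up):

```python
import math

# Hand-made 2-d toy vectors (NOT real GloVe values).
vectors = {
    "king":  [1.0, 0.1],
    "queen": [0.95, 0.35],
    "man":   [0.5, 0.45],
    "woman": [0.5, 0.55],
}

def cosine(a, b):
    """Cosine similarity: dot product over the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# The analogy query: king - man + woman
query = [k - m + w for k, m, w in
         zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank the whole vocabulary by similarity to the query vector.
ranked = sorted(vectors, key=lambda w: cosine(query, vectors[w]), reverse=True)

# The query vector is still mostly "king", so "king" itself ranks first;
# libraries like gensim exclude the input words from most_similar() results.
filtered = [w for w in ranked if w not in ("king", "man", "woman")]
```

With these toy numbers, ranked puts "king" first and "queen" second (mirroring what I see), while filtered, which is effectively what most_similar() returns, puts "queen" on top.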

Thanks so much.

Answer by gojomo:

Even though you're not getting an error in your saving, the actual problem might be there: writing something that's not properly encoded for the later read.

But you probably don't need to be writing your own reading & writing code at all. Gensim can already read GLoVe-format vectors directly, as their format is almost identical to what Gensim calls the 'word2vec_format' (because it was the save/load format of Google's original word2vec.c release). The GLoVe format just leaves out the (helpful) 1st-line declaration of the number of vectors to expect. But the no_header parameter to Gensim's load_word2vec_format() method tells it not to expect the count, and to figure it out instead.
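To see the difference concretely, here's a toy sketch of the two text layouts in plain Python (an illustration, not Gensim's actual parsing code):

```python
import io

# GLoVe-style text: one word per line followed by its values, with NO header.
glove_txt = "king 0.1 0.2\nqueen 0.3 0.4\n"

def read_glove(fileobj):
    """Parse headerless GLoVe text into a dict of word -> list of floats."""
    vecs = {}
    for line in fileobj:
        word, *values = line.split()
        vecs[word] = [float(v) for v in values]
    return vecs

vecs = read_glove(io.StringIO(glove_txt))

# The word2vec text format is identical except for a 1st line declaring
# "<vocab_size> <dimensions>" -- the count that no_header=True tells
# Gensim not to expect.
dims = len(next(iter(vecs.values())))
word2vec_txt = f"{len(vecs)} {dims}\n" + glove_txt
```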

So, if all you need is the GLoVe vectors loaded into a Gensim KeyedVectors object, you can just do:

from gensim.models import KeyedVectors
glove_vectors = KeyedVectors.load_word2vec_format('glove.6B/glove.6B.50d.txt', binary=False, no_header=True)

Voila, you're done.

Now, the figuring-out of the count of upcoming vectors does take one extra full pass over the file. In many cases, that'll be a minor cost compared to other things. But if you're in a situation where load-time is crucial, you could achieve the intent of your question's code by just re-saving that KeyedVectors object in your desired format:

glove_vectors.save_word2vec_format('ppl6B50d.bin', binary=True)

I haven't noticed the .bin style of this particular format saving that much space/time overall, especially if you usually keep such files compressed on disk – which often saves more in IO time than it costs in compression/decompression CPU time. So I suspect you'd really want your re-save to be:

glove_vectors.save_word2vec_format('ppl6B50d.bin.gz', binary=True)

The .gz suffix is enough for these methods to automatically compress/decompress the file. That is, you can reload those compressed vectors with just:

reloaded_vectors = KeyedVectors.load_word2vec_format('ppl6B50d.bin.gz', binary=True)
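Under the hood, Gensim delegates this file handling to the smart_open library; the suffix-based dispatch can be sketched with a toy helper like:

```python
import gzip
import os
import tempfile

def open_maybe_gz(path):
    """Toy sketch of suffix-based dispatch: .gz paths are decompressed
    transparently, anything else is opened as a plain binary file."""
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")

# Write a compressed line, then read it back without caring it was gzipped.
path = os.path.join(tempfile.mkdtemp(), "vectors.txt.gz")
with gzip.open(path, "wb") as f:
    f.write(b"king 0.1 0.2\n")

with open_maybe_gz(path) as f:
    line = f.read()
```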

Finally, if load time is especially critical, another option worth trying, which might offer an additional slight speedup, would be to let Gensim re-save the vectors in its own format, via the .save() method. Gensim's format is Python pickle-based – which is often a bit slower than more raw custom formats – but it offers an option to store the big internal vector array as one separate raw memory-mappable file.

Then, upon .load(filename, mmap='r'), the OS can just memory-map the whole file (nearly instantly, not doing any initial reading), and later when vectors are accessed, relevant ranges of the file are paged-in at the OS level, with no extra parsing overhead or excess buffer-copying.

The array load thus appears nearly instantaneous, but later initial accesses may be a bit slower, until the whole file winds up paged-in. (But, a single operation that touches every vector – like a .most_similar() – will act to page the whole thing in.)
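The OS-level behavior can be illustrated with Python's built-in mmap module; a small raw float32 file stands in for the big vector array here (a toy sketch, not Gensim's internals):

```python
import mmap
import os
import struct
import tempfile

# Write a small raw array of little-endian float32 values to disk.
path = os.path.join(tempfile.mkdtemp(), "vectors.raw")
with open(path, "wb") as f:
    f.write(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))

# "Loading" is just mapping the file: nothing is read up front, and the OS
# pages ranges in lazily the first time they're touched.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Touching a slice faults in just that page; here we decode two floats.
    first_two = struct.unpack("<2f", mm[:8])
    mm.close()
```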

To try this option, you'd re-save the vectors using:

glove_vectors.save('ppl6B50d.model')

…then load them via:

KeyedVectors.load('ppl6B50d.model', mmap='r')

Note that when using this Gensim-native .save():

  • the model may be (& at typical sizes usually is) saved as more than one file on disk, all with the same prefix, and those files must be kept/moved together for the main model file to re-load
  • the memory-mapping on read will only work if no compression (no .gz suffix) was used when saving
  • if you're loading the same read-only model into multiple separate system processes, asking the OS to map the file to addressable-memory like this will let those processes share the same memory - potentially avoiding a lot of duplicate loading/memory use (as for example if some web server has many processes all consulting the same read-only set of vectors)