Word2Vec pretrained embedding KeyError with Gensim 4


I'm trying to use a pretrained word2vec model for the Arabic language. The code is supposed to be written as follows:

unknownArray = []

# load the whole embedding into memory
w2v_embeddings_index = {}
w2v_model = KeyedVectors.load('/content/drive/MyDrive/PythonProjectsUtilities/full_grams_cbow_100_twitter/full_grams_cbow_100_twitter.mdl')
for word in w2v_model.wv:
    try:
        w2v_embeddings_index[word] = w2v_model.wv[word]
    except KeyError:
        unknownArray.append(word)
print('Loaded %s word vectors.' % len(w2v_embeddings_index))

# create a weight matrix for words in training docs
embedding_matrix = np.zeros((len(word_index) + 1 , embedding_vecor_length))
for word, i in t.word_index.items():
    embedding_vector = w2v_embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
print('Embedding Matrix shape:', embedding_matrix.shape)

but it keeps raising a KeyError even though I added the exception handler, so I tried to solve the problem by editing the code to add only words already found in the dataset's word_index to w2v_embeddings_index. The code after editing is:

w2v_embeddings_index={}
embedding_matrix = np.zeros((len(word_index) + 1 , embedding_vecor_length))

w2v_model = KeyedVectors.load("/content/drive/MyDrive/PythonProjectsUtilities/full_grams_sg_100_twitter.mdl")
print(len(w2v_model.wv))
for word, i in word_index.items():
    if word in w2v_model.wv:
        w2v_embeddings_index[word] = w2v_model.wv[word]

print('Loaded %s word vectors.' % len(w2v_embeddings_index))


# create a weight matrix for words in training docs
for word, i in word_index.items():
    embedding_vector = w2v_embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
print('Embedding Matrix shape:', embedding_matrix.shape)

but I don't know whether that is correct. Can any expert confirm? I have to deliver this task tomorrow.

I tried to utilize a pretrained word2vec model for the Arabic language for a text classification task.

2 Answers

Answer from gojomo:

If you're facing an exception, it's best to include the full exception information in your question, including any lines of 'traceback' shown, to clearly show the lines of code involved, in both what you've written and any libraries being used, without readers having to guess what's happening.

I suspect, but can't be sure, that your t.word_index object, whose creation/value is not shown, included words that weren't in your loaded KeyedVectors set of word-vectors. Using its list of words to look up vectors that aren't present would, by design, raise a KeyError for each missing word.

Your revision only asks for words that are already known to be available, so it avoids the error.

Whether that's 'correct' depends on your goal/requirements, which you haven't described. Does the final embedding_matrix.shape printed make sense for your needs? Are you able to perform whatever next-steps are expected, and get sensible results, from the work this code completed? Those are the real determinants of whether the code is 'correct'.

I can say the code is inefficient, needlessly copying information from the compact KeyedVectors object into a plain Python dict (w2v_embeddings_index) where it will take up more memory, and not offer the same convenience functions for common operations that are available from the KeyedVectors object.

You can already request individual word-vectors, by their word key, from a KeyedVectors object – or do many other common operations, like iterate over all known words. So there's no need to create the w2v_embedding_index dictionary.

And, the KeyedVectors object already contains an internal dense numpy array, of shape (count_of_words, vector_dimensions) in its .vectors variable. So there's generally no need to reconstruct another copy of that array.

As another note: if your 'full_grams_cbow_100_twitter.mdl' file is truly a saved Gensim KeyedVectors object, then you don't need to be using .wv to reach the set-of-word vectors. The object loaded is already what you need.

On the other hand, if 'full_grams_cbow_100_twitter.mdl' is really a saved full Word2Vec model object, then loading it using KeyedVectors.load() is not reliable. You should load a Word2Vec object via Word2Vec.load() instead - and in that case, then yes, you would need to use .wv to reach the KeyedVectors part of that model.

So this would be fine if the file is really a saved KeyedVectors:

kv_model = KeyedVectors.load("full_grams_sg_100_twitter.mdl")
print(kv_model[some_word])

...but if it's really a Word2Vec model, you'd instead want to do something more like:

w2v_model = Word2Vec.load("full_grams_sg_100_twitter.mdl")
kv_model = w2v_model.wv
print(kv_model[some_word])
Answer from Arwa Ahmed:

It has been solved using this code:

embedding_vecor_length = 100
embedding_matrix = np.zeros((len(word_index) + 1, embedding_vecor_length))
w2v_model = KeyedVectors.load(".../full_grams_sg_100_twitter.mdl")
for word, token in word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[token] = w2v_model.wv[word]
print('Embedding Matrix shape:', embedding_matrix.shape)