I'm trying to use a pretrained word2vec model for the Arabic language. The code is supposed to be written as follows:
unknownArray = []
# load the whole embedding into memory
w2v_embeddings_index = {}
w2v_model = KeyedVectors.load('/content/drive/MyDrive/PythonProjectsUtilities/full_grams_cbow_100_twitter/full_grams_cbow_100_twitter.mdl')
for word in w2v_model.wv:
    try:
        w2v_embeddings_index[word] = w2v_model.wv[word]
    except KeyError:
        unknownArray.append(word)
print('Loaded %s word vectors.' % len(w2v_embeddings_index))

# create a weight matrix for words in training docs
embedding_matrix = np.zeros((len(word_index) + 1, embedding_vecor_length))
for word, i in t.word_index.items():
    embedding_vector = w2v_embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
print('Embedding Matrix shape:', embedding_matrix.shape)
but it keeps raising a KeyError even though I added the exception handler, so I tried to solve the problem by editing the code to only add words already found in the dataset's word_index to w2v_embeddings_index. The code after editing is:
w2v_embeddings_index = {}
embedding_matrix = np.zeros((len(word_index) + 1, embedding_vecor_length))
w2v_model = KeyedVectors.load("/content/drive/MyDrive/PythonProjectsUtilities/full_grams_sg_100_twitter.mdl")
print(len(w2v_model.wv))
for word, i in word_index.items():
    if word in w2v_model.wv:
        w2v_embeddings_index[word] = w2v_model.wv[word]
print('Loaded %s word vectors.' % len(w2v_embeddings_index))

# create a weight matrix for words in training docs
for word, i in word_index.items():
    embedding_vector = w2v_embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
print('Embedding Matrix shape:', embedding_matrix.shape)
but I don't know if that is correct or not. Can any expert confirm? I have to deliver this task tomorrow.

I tried to utilize a pretrained word2vec model for the Arabic language for a text classification task.
If you're facing an exception, it's best to include the full exception information, including any lines of traceback shown, in your question, to clearly show the involved lines of code, in what you've written, and any libraries being used, without having to guess what's happening.
I suspect, but can't be sure, that your t.word_index object, whose creation/value is not shown, included words that weren't in your loaded KeyedVectors set of word-vectors. Trying to use its list of words to look up words that aren't present would, by design, raise a KeyError when the vectors for missing words aren't present. Your revision only asks for words that are already known to be available, so it avoids the error.
Whether that's 'correct' depends on your goal/requirements, which you haven't described. Does the final embedding_matrix.shape printed make sense for your needs? Are you able to perform whatever next-steps are expected, and get sensible results, from the work this code completed? Those are the real determinants of whether the code is 'correct'.

I can say the code is inefficient, needlessly copying information from the compact KeyedVectors object into a plain Python dict (w2v_embeddings_index) where it will take up more memory, and not offer the same convenience functions for common operations that are available from the KeyedVectors object.

You can already request individual word-vectors, by their word key, from a KeyedVectors object - or do many other common operations, like iterate over all known words. So there's no need to create the w2v_embeddings_index dictionary.

And, the KeyedVectors object already contains an internal dense numpy array, of shape (count_of_words, vector_dimensions), in its .vectors variable. So there's generally no need to reconstruct another copy of that array.

As another note: if your 'full_grams_cbow_100_twitter.mdl' file is truly a saved Gensim KeyedVectors object, then you don't need to be using .wv to reach the set of word-vectors. The object loaded is already what you need.

On the other hand, if 'full_grams_cbow_100_twitter.mdl' is really a saved full Word2Vec model object, then loading it using KeyedVectors.load() is not reliable. You should load a Word2Vec object via Word2Vec.load() instead - and in that case, then yes, you would need to use .wv to reach the KeyedVectors part of that model.

So loading with KeyedVectors.load(), and using the loaded object directly, would be fine if the file is really a saved KeyedVectors - but if it's really a full Word2Vec model, you'd instead want to do something more like Word2Vec.load(), then use its .wv property: