1. I want to write a Python program that first preprocesses the segmented texts (129 txt files in total, in the jieba2 folder).
2. Train a word embedding for each individual word in the texts, then average each word's embedding over all 129 models.
3. Save the averaged per-word embeddings to the word2vec2 folder. Currently my code does not average the embeddings word by word; it only averages all the vectors of each model into a single vector.
The following is my program code. Where is the error?
from gensim.models import Word2Vec
import os
import numpy as np
import pickle

input_folder = "jieba2"
output_folder = "word2vec2"
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Must exist before the loop; without it the assignment below raises NameError.
# Maps each file name to ONE vector: the mean over all words in that file's model.
model_average_vectors = {}

for txt_filename in os.listdir(input_folder):
    if txt_filename.endswith(".txt"):
        txt_path = os.path.join(input_folder, txt_filename)
        with open(txt_path, "r", encoding="utf-8") as txt_file:
            lines = txt_file.readlines()
        # The texts are already segmented, so splitting on whitespace yields the tokens
        sentences = [line.split() for line in lines]
        model = Word2Vec(vector_size=100, window=5, min_count=1, workers=4)
        model.build_vocab(sentences)
        # Skip files that produce an empty vocabulary
        if not model.wv.key_to_index:
            continue
        model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
        # axis=0 averages over the vocabulary, so this is a per-model mean,
        # not a per-word average across models
        model_vectors = model.wv.vectors
        model_average_vector = np.mean(model_vectors, axis=0)
        model_average_vectors[txt_filename] = model_average_vector
        model_filename = txt_filename.replace(".txt", ".model")
        model_path = os.path.join(output_folder, model_filename)
        model.save(model_path)

# Written once, after all files have been processed
average_vectors_output_path = os.path.join(output_folder, "average_vectors.pkl")
with open(average_vectors_output_path, "wb") as output_file:
    pickle.dump(model_average_vectors, output_file)
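To make step 2 concrete, here is a minimal sketch of the per-word averaging I am after. The function name `average_word_vectors`, its input shape (one word-to-vector dict per model), and the toy data are all hypothetical and only there for illustration. It covers just the arithmetic. Independently trained Word2Vec models are not guaranteed to share a coordinate system, so averaging their vectors directly may not be meaningful without some alignment step.

```python
from collections import defaultdict


def average_word_vectors(per_model_vectors):
    """Average each word's vector over the models that contain that word.

    per_model_vectors: iterable of dicts mapping word -> sequence of floats
    (hypothetical input shape, one dict per trained model).
    """
    sums = {}
    counts = defaultdict(int)
    for vectors in per_model_vectors:
        for word, vec in vectors.items():
            if word in sums:
                # Element-wise running sum for this word
                sums[word] = [s + v for s, v in zip(sums[word], vec)]
            else:
                sums[word] = [float(v) for v in vec]
            counts[word] += 1
    # Divide each running sum by the number of models that had the word
    return {
        word: [s / counts[word] for s in total]
        for word, total in sums.items()
    }


# Toy data standing in for two models' vocabularies (made up for illustration)
model_a = {"cat": [1.0, 2.0], "dog": [0.0, 4.0]}
model_b = {"cat": [3.0, 6.0]}

averaged = average_word_vectors([model_a, model_b])
print(averaged)  # {'cat': [2.0, 4.0], 'dog': [0.0, 4.0]}
```

With gensim, one dict per model could presumably be built as `{word: model.wv[word] for word in model.wv.key_to_index}` inside the loop above, and the returned dict pickled in place of `model_average_vectors`.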
Please help me find the error in the code or fix it. Thank you.