How to calculate cosine similarity with BERT over 1000 random examples


I'm trying to compute cosine similarity between 1000 randomly sampled questions and 1000 randomly sampled answers using bert-base-uncased. For a selected question I then want to find the 5 most similar answers and compute top-1 and top-5 accuracy against the real answer. But I always get 0.0 accuracy, and the retrieved answers are not similar.

import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from torch.nn.functional import cosine_similarity

sample_1000_quest = train_ds['questions'].sample(1000)
sample_1000_answer = train_ds['answers'].sample(1000)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
# Move the model to the same device the inputs will be sent to.
model_bert = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True).to(device).eval()


selected_question = sample_1000_quest.iloc[1]

# Positional index of the sampled question whose original label is 30574.
selected_question_idx = sample_1000_quest.index.get_loc(30574)


encoded_question = tokenizer_bert(selected_question, return_tensors='pt',
                                  padding=True, truncation=True).to(device)


with torch.no_grad():
    outputs = model_bert(**encoded_question)
    # Mean-pool the token embeddings into a single sentence vector.
    question_embedding = outputs.last_hidden_state.mean(dim=1)


answer_embeddings = []
for answer in sample_1000_answer:
    encoded_answer = tokenizer_bert(answer, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model_bert(**encoded_answer.to(device))
        # Mean-pool each answer the same way as the question.
        answer_embedding = outputs.last_hidden_state.mean(dim=1)
        answer_embeddings.append(answer_embedding)


# Cosine similarity between the question vector and every answer vector.
similarities = []
for answer_embedding in answer_embeddings:
    similarity = cosine_similarity(question_embedding, answer_embedding)
    similarities.append(similarity.item())


# Positional indices of the 5 most similar answers, best first.
most_similar_indices = np.argsort(similarities)[-5:][::-1]


ground_truth_idx = train_ds['answers'].iloc[selected_question_idx]


top1_accuracies = []
top5_accuracies = []

top1_idx = most_similar_indices[0]
top1_accuracy = 1 if top1_idx == ground_truth_idx else 0
top5_accuracy = 1 if ground_truth_idx in most_similar_indices else 0

top1_accuracies.append(top1_accuracy)
top5_accuracies.append(top5_accuracy)

print("Selected Question:", selected_question)
print("Most similar 5 asnwer:")
for i, idx in enumerate(most_similar_indices):
    print(f"{i+1}. {sample_1000_answer.iloc[idx]}")

print("Top-1 Accuracy:", top1_accuracy)
print("Top-5 Accuracy:", top5_accuracy)

Output:

Selected Question:  bir sunum oluşturmak için beş adım yazın. (write five steps for creating a presentation.)
Most similar 5 answers:
1.  doğum günü gülüm bütün yaz aldığım en güzel hediyeydi. (my birthday rose was the nicest gift I got all summer.)
2.  bu deneyin amacı ilkeleri anlamaktır. (the aim of this experiment is to understand the principles.)
3.  bir satış elemanı sunum yapıyor. (a salesperson is giving a presentation.)
4.  hangi konuda yardıma ihtiyacın olduğunu söyle. (tell me what you need help with.)
5.  konuşmanın içeriği, projede bir sonraki adım için onay almakla ilgilidir. (the content of the talk is about getting approval for the next step in the project.)
Top-1 Accuracy: 0
Top-5 Accuracy: 0

1 Answer

Answered by ewz93:

bert-base-uncased is a model that was mainly pre-trained on English text, while your data is Turkish. You could try a model pre-trained on Turkish instead, such as dbmdz/bert-base-turkish-cased.
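A minimal sketch of swapping in the Turkish checkpoint (note that this model is cased, so you would also skip any lowercasing of your text):

from transformers import AutoTokenizer, AutoModel
import torch

# Load a BERT model pre-trained on Turkish in place of bert-base-uncased.
tokenizer_bert = AutoTokenizer.from_pretrained('dbmdz/bert-base-turkish-cased')
model_bert = AutoModel.from_pretrained('dbmdz/bert-base-turkish-cased').to(device).eval()

# The rest of the pipeline is unchanged: encode, then mean-pool the last hidden state.
encoded = tokenizer_bert(selected_question, return_tensors='pt',
                         padding=True, truncation=True).to(device)
with torch.no_grad():
    question_embedding = model_bert(**encoded).last_hidden_state.mean(dim=1)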

Also, you use a relatively large candidate set but only give any credit when the ground truth lands among the top 5 indices, which seems like too harsh a cut-off. It would be better to rate how far the predicted answer is from the expected one instead of scoring it as either 0 or 1, since with a binary score almost all questions will get 0. Alternatively, you could add more lenient measures such as Top-10 or Top-25 accuracy to see whether those give higher values.
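A minimal sketch of such a rank-based evaluation, assuming the true answer for the selected question is actually present in sample_1000_answer at position gt_position (a hypothetical variable; this only holds if questions and answers are sampled with the same indices so the pairs stay aligned):

import numpy as np

# Sort candidate answers by similarity, best match first.
ranked = np.argsort(similarities)[::-1]

# 1-based rank of the ground-truth answer among all 1000 candidates.
# gt_position is assumed to be the position of the true answer in
# sample_1000_answer (hypothetical; requires aligned sampling).
rank = int(np.where(ranked == gt_position)[0][0]) + 1

reciprocal_rank = 1.0 / rank                      # ingredient for mean reciprocal rank
top_k_hits = {k: int(rank <= k) for k in (1, 5, 10, 25)}

print("Rank of ground truth:", rank)
print("Reciprocal rank:", reciprocal_rank)
print("Top-k hits:", top_k_hits)

Averaged over all 1000 questions, the reciprocal ranks give you mean reciprocal rank (MRR), a much smoother signal than a hard top-5 cut-off.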