How to calculate cosine similarity with BERT over 1000 random examples


I'm trying to compute cosine similarity between 1000 randomly sampled questions and 1000 randomly sampled answers using bert-base-uncased. For a selected question I then want to find the 5 most similar answers and compute top-1 and top-5 accuracy against the real answer. But I always get 0.0 accuracy, and the retrieved answers are not similar.

import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from torch.nn.functional import cosine_similarity

sample_1000_quest = train_ds['questions'].sample(1000)
sample_1000_answer = train_ds['answers'].sample(1000)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
# Move the model to the same device the inputs will be sent to.
model_bert = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True).to(device).eval()


selected_question = sample_1000_quest.iloc[1]

# Positional index of the sampled question whose original label is 30574.
selected_question_idx = sample_1000_quest.index.get_loc(30574)


encoded_question = tokenizer_bert(selected_question, return_tensors='pt',
                                  padding=True, truncation=True).to(device)


with torch.no_grad():
    outputs = model_bert(**encoded_question)
    # Mean-pool the token embeddings into a single sentence vector.
    question_embedding = outputs.last_hidden_state.mean(dim=1)


answer_embeddings = []
for answer in sample_1000_answer:
    encoded_answer = tokenizer_bert(answer, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model_bert(**encoded_answer.to(device))
        # Mean-pool each answer the same way as the question.
        answer_embedding = outputs.last_hidden_state.mean(dim=1)
        answer_embeddings.append(answer_embedding)


# Cosine similarity between the question vector and every answer vector.
similarities = []
for answer_embedding in answer_embeddings:
    similarity = cosine_similarity(question_embedding, answer_embedding)
    similarities.append(similarity.item())


# Positional indices of the 5 most similar answers, best first.
most_similar_indices = np.argsort(similarities)[-5:][::-1]


ground_truth_idx = train_ds['answers'].iloc[selected_question_idx]


top1_accuracies = []
top5_accuracies = []

top1_idx = most_similar_indices[0]
top1_accuracy = 1 if top1_idx == ground_truth_idx else 0
top5_accuracy = 1 if ground_truth_idx in most_similar_indices else 0

top1_accuracies.append(top1_accuracy)
top5_accuracies.append(top5_accuracy)

print("Selected Question:", selected_question)
print("Most similar 5 asnwer:")
for i, idx in enumerate(most_similar_indices):
    print(f"{i+1}. {sample_1000_answer.iloc[idx]}")

print("Top-1 Accuracy:", top1_accuracy)
print("Top-5 Accuracy:", top5_accuracy)

Output:

Selected Question:  bir sunum oluşturmak için beş adım yazın. (write five steps for creating a presentation.)
Most similar 5 answers:
1.  doğum günü gülüm bütün yaz aldığım en güzel hediyeydi. (my birthday rose was the nicest gift I got all summer.)
2.  bu deneyin amacı ilkeleri anlamaktır. (the aim of this experiment is to understand the principles.)
3.  bir satış elemanı sunum yapıyor. (a salesperson is giving a presentation.)
4.  hangi konuda yardıma ihtiyacın olduğunu söyle. (tell me what you need help with.)
5.  konuşmanın içeriği, projede bir sonraki adım için onay almakla ilgilidir. (the content of the talk is about getting approval for the next step in the project.)
Top-1 Accuracy: 0
Top-5 Accuracy: 0

1 Answer

Answered by ewz93:

bert-base-uncased is a model that was mainly pre-trained on English text, while your data is Turkish. You could try a model pre-trained on Turkish instead, such as dbmdz/bert-base-turkish-cased.
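A minimal sketch of swapping in the Turkish checkpoint (note that this model is cased, so you would also skip any lowercasing of your text):

from transformers import AutoTokenizer, AutoModel
import torch

# Load a BERT model pre-trained on Turkish in place of bert-base-uncased.
tokenizer_bert = AutoTokenizer.from_pretrained('dbmdz/bert-base-turkish-cased')
model_bert = AutoModel.from_pretrained('dbmdz/bert-base-turkish-cased').to(device).eval()

# The rest of the pipeline is unchanged: encode, then mean-pool the last hidden state.
encoded = tokenizer_bert(selected_question, return_tensors='pt',
                         padding=True, truncation=True).to(device)
with torch.no_grad():
    question_embedding = model_bert(**encoded).last_hidden_state.mean(dim=1)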

Also, you use a relatively large candidate set but only give any credit when the ground truth lands among the top 5 indices, which seems like too harsh a cut-off. It would be better to rate how far the predicted answer is from the expected one instead of scoring it as either 0 or 1, since with a binary score almost all questions will get 0. Alternatively, you could add more lenient measures such as Top-10 or Top-25 accuracy to see whether those give higher values.
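A minimal sketch of such a rank-based evaluation, assuming the true answer for the selected question is actually present in sample_1000_answer at position gt_position (a hypothetical variable; this only holds if questions and answers are sampled with the same indices so the pairs stay aligned):

import numpy as np

# Sort candidate answers by similarity, best match first.
ranked = np.argsort(similarities)[::-1]

# 1-based rank of the ground-truth answer among all 1000 candidates.
# gt_position is assumed to be the position of the true answer in
# sample_1000_answer (hypothetical; requires aligned sampling).
rank = int(np.where(ranked == gt_position)[0][0]) + 1

reciprocal_rank = 1.0 / rank                      # ingredient for mean reciprocal rank
top_k_hits = {k: int(rank <= k) for k in (1, 5, 10, 25)}

print("Rank of ground truth:", rank)
print("Reciprocal rank:", reciprocal_rank)
print("Top-k hits:", top_k_hits)

Averaged over all 1000 questions, the reciprocal ranks give you mean reciprocal rank (MRR), a much smoother signal than a hard top-5 cut-off.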