Suppose we have the following small dataset. We want to compute text embeddings for it and check whether the model can accurately match sentences that express a similar idea. The data is as follows:
data = [
"US tops 5 million confirmed virus cases",
"Canada's last fully intact ice shelf has suddenly collapsed, " +
"forming a Manhattan-sized iceberg",
"Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
"The National Park Service warns against sacrificing slower friends " +
"in a bear attack",
"Maine man wins $1M from $25 lottery ticket",
"Make huge profits without work, earn up to $100,000 a day"
]
I would like to compute the embeddings with the sentence-transformers/nli-mpnet-base-v2 model. For some queries it returns the correct text, but for others it fails. For instance, if I search with this query:
query = "climate change in world"
Then it returns the following result:
Climate change in world -> Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg.
Not an exact match, but it makes sense logically: an ice shelf collapsing is plausibly a result of global warming, right? However, if I search for "temperature is increasing around world", I get:
Temperature is increasing around world -> Beijing mobilises invasion craft along coast as Taiwan tensions escalate
That is nonsense, right? How can I improve the results? Here is the code:
from txtai.embeddings import Embeddings
embeddings = Embeddings(path='sentence-transformers/nli-mpnet-base-v2')
data = [
"US tops 5 million confirmed virus cases",
"Canada's last fully intact ice shelf has suddenly collapsed, " +
"forming a Manhattan-sized iceberg",
"Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
"The National Park Service warns against sacrificing slower friends " +
"in a bear attack",
"Maine man wins $1M from $25 lottery ticket",
"Make huge profits without work, earn up to $100,000 a day"
]
embeddings.index(data)
query = "temperature is increasing around world"
# for query in ("feel good story", "climate change", "public health story", "war",
# "wildlife", "asia", "lucky", "dishonest junk"):
# Take the id of the single best match for the query
uid = embeddings.search(query, 1)[0][0]
print(f'{query:20} -> {data[uid]}')
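To see how close the wrong match is to the expected one, it may help to print several hits with their scores rather than only the top one. Below is a minimal sketch of that, assuming search returns a list of (id, score) tuples, as the `[0][0]` indexing above implies. The helper name `format_ranked` is my own and is not part of txtai.

```python
def format_ranked(query, results, data):
    """Return one line per hit: the query, the similarity score, the matched text.

    `results` is assumed to be a list of (id, score) tuples, where each id
    is an index into `data`.
    """
    lines = []
    for uid, score in results:
        lines.append(f"{query} -> [{score:.4f}] {data[uid]}")
    return lines


# Hypothetical usage with the index built above (needs txtai and the model):
#   results = embeddings.search(query, 3)
#   print("\n".join(format_ranked(query, results, data)))
```

If the ice shelf headline shows up in second place with a score close to the first, the model is treating the two headlines as nearly equally relevant to the query.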