How to query embeddings for semantic search?


I have 1,000 descriptions for SKU merchandise, and I want to generate embeddings plus an inverse mapping (embedding → item) so that I can do semantic search.

For example, here is what I have:

item   description
item1  [word1, word2, word3, word4..........]
item2  [word1, word2_2, word3_3, word4_4..........]

As you can see, item1 and item2 share word1, but the two items use it in different contexts. By generating contextual embeddings, we should be able to capture the context of each word.

Here is how I generate the embeddings:

import pandas as pd

my_description = []
with open('/content/gdrive/My Drive/my.csv', 'r') as data:
    df = pd.read_csv(data, encoding='utf-8', nrows=100)
    for index, row in df.iterrows():
        my_str = row['description']
        my_description.append(my_str)



import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained(
    'bert-base-uncased',
    output_hidden_states=True,  # make the model return all hidden states
)
model.eval()


text2 = my_description[0]

# Add the special tokens.
marked_text2 = "[CLS] " + text2 + " [SEP]"

# Split the sentence into tokens.
tokenized_text2 = tokenizer.tokenize(marked_text2)

# Map the token strings to their vocabulary indices.
indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)

segments_ids2 = [1] * len(tokenized_text2)
tokens_tensor2 = torch.tensor([indexed_tokens2])
segments_tensors2 = torch.tensor([segments_ids2])

with torch.no_grad():
    outputs2 = model(tokens_tensor2, segments_tensors2)
    # With output_hidden_states=True, the third output is the tuple of
    # hidden states from the embedding layer plus all 12 encoder layers.
    hidden_states2 = outputs2[2]

# Stack the 13 hidden-state layers into one tensor: [layers, batch, tokens, hidden]
token_embeddings2 = torch.stack(hidden_states2, dim=0)
# Remove the batch dimension: [layers, tokens, hidden]
token_embeddings2 = torch.squeeze(token_embeddings2, dim=1)
# Reorder to [tokens, layers, hidden]
token_embeddings2 = token_embeddings2.permute(1, 0, 2)

token_vecs_cat2 = []

for token in token_embeddings2:
    # Concatenate the last four layers for this token (4 x 768 = 3072 dims).
    cat_vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
    token_vecs_cat2.append(cat_vec)
token_vecs_sum2 = []
import numpy as np
x_token = np.empty((0, 768))

for token in token_embeddings2:
    # Sum the last four layers for this token (768 dims).
    sum_vec = torch.sum(token[-4:], dim=0)
    token_vecs_sum2.append(sum_vec)
    x_token = np.concatenate((x_token, sum_vec.numpy().reshape((1, -1))), axis=0)

x_token holds the embeddings for every token in one description. For example, if item1 has 500 tokens and the embedding dimension is 768 (the sum of the last four 768-dimensional layers), the shape of x_token is (500, 768).
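
As a quick sanity check, the rows of x_token line up with the WordPiece tokens produced above, so each row really is one token's vector (this only prints what the code above already builds):

print(x_token.shape)  # (number of tokens in this description, 768)

for tok, vec in zip(tokenized_text2, x_token):
    print(tok, vec[:3])  # each token next to the first few values of its vector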

So for each item I would have something like this (a sketch that builds this table for all items follows below):

item        token          embeddings
item 1      token 1        [x1,x2,x3,.....] 
item 1      token 2        [x1,x2,x3,.....] 
....
item 2      token 1_2      [x1,x2,x3,.....] 
item 2      token 2_2      [x1,x2,x3,.....] 
....
item n      token 1_n      [x1,x2,x3,.....] 
item n      token 2_n      [x1,x2,x3,.....] 
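
For concreteness, here is a minimal sketch of how I would build that table for all items, reusing the tokenizer and model loaded above. embed_description, all_vectors, and row_to_item are names I am introducing purely for illustration:

import numpy as np

def embed_description(text, tokenizer, model):
    # Wraps the per-description pipeline shown above and returns the token
    # list plus an array of shape (num_tokens, 768).
    marked = "[CLS] " + text + " [SEP]"
    tokens = tokenizer.tokenize(marked)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    tokens_tensor = torch.tensor([ids])
    segments_tensor = torch.tensor([[1] * len(tokens)])

    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensor)
        hidden_states = outputs[2]

    # [layers, batch, tokens, hidden] -> [tokens, layers, hidden]
    token_embeddings = torch.stack(hidden_states, dim=0).squeeze(1).permute(1, 0, 2)
    # Sum the last four layers for each token, as above.
    summed = torch.stack([torch.sum(tok[-4:], dim=0) for tok in token_embeddings])
    return tokens, summed.numpy()

all_vectors = []   # one row per token, across all items
row_to_item = []   # parallel list: the item each row belongs to

for item_id, description in enumerate(my_description):
    tokens, vectors = embed_description(description, tokenizer, model)
    all_vectors.append(vectors)
    row_to_item.extend([item_id] * vectors.shape[0])

all_vectors = np.vstack(all_vectors)   # shape: (total tokens across all items, 768)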

Now my question is: how do I perform the search?

Suppose my search query is a sentence:

"word1 word2 word3.....wordn"

I can generate an embedding for each word in the sentence and run an approximate nearest neighbor (ANN) lookup to get the top 10 nearest stored tokens for each query token.

If my query has 10 tokens, I would get up to 100 item descriptions back (10 per token). In that case, how do I shortlist to the top 10 item descriptions? Which token should I use? (A sketch of this per-token lookup follows the diagram below.)

query = [token1, token2.......tokenN]

                   top 10 nearest neighbors' items
query_token1 ->    [itemx1_1, itemx1_2, ..., itemx1_10]
query_token2 ->    [itemx2_1, itemx2_2, ..., itemx2_10]
...
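
For reference, here is a minimal sketch of the per-token lookup and shortlisting I have in mind, using scikit-learn's NearestNeighbors as a stand-in for the ANN index and a simple vote count over the returned items (all_vectors, row_to_item, and embed_description come from the sketch above; whether counting votes is the right way to aggregate is exactly what I am unsure about):

from collections import Counter
from sklearn.neighbors import NearestNeighbors

# Index every token vector from every description. Exact search is fine at
# this scale; an ANN library could be swapped in for a larger catalogue.
nn = NearestNeighbors(n_neighbors=10, metric='cosine')
nn.fit(all_vectors)

# Embed the query the same way as the descriptions: one vector per token.
query_tokens, query_vectors = embed_description("word1 word2 word3", tokenizer, model)

# Top 10 nearest stored tokens for each query token, mapped back to items.
distances, indices = nn.kneighbors(query_vectors)

votes = Counter()
for row_ids in indices:
    for row_id in row_ids:
        votes[row_to_item[row_id]] += 1

# Shortlist: the 10 items whose tokens were retrieved most often.
top_items = [item for item, count in votes.most_common(10)]
print(top_items)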

Am I doing semantic search wrong?
