How to perform a dot product between lists of vectors that don't fit into memory to find the most similar vectors?

I'm trying to calculate the dot product between one test file with sentences:

from datasets import load_dataset
test_file = load_dataset('parquet', data_files='test_file.parquet')['train']
test_file
>>> features: ['sentences'] # length ~100k

And multiple files with sentences and IDs:

for idx in range(30):
    training_list_chunk = load_dataset('parquet', data_files=f'file_{idx}.parquet')['train']
>>> features: ['sentences', 'ids'] # length ~100k

Each of these training files has one column called sentences, containing sentence strings, and one column called ids, containing integers in the range [0, 30M].

To calculate the dot product I'm converting the sentences strings to vectors with a huggingface model:

import torch
from transformers import AutoTokenizer, AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained('facebook/contriever')
model = AutoModel.from_pretrained('facebook/contriever').to(device)

# Apply tokenizer (list_of_strings is a Python list of the sentences to encode)
sentences_inputs = tokenizer(list_of_strings, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
outputs = model(**sentences_inputs.to(device))

# Mean pooling
def mean_pooling(token_embeddings, mask):
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.)
    sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
    return sentence_embeddings

embeddings = mean_pooling(outputs[0], sentences_inputs['attention_mask'])

The similarity scores between the different sentences are obtained with a dot product between the embeddings.
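
Concretely, once a batch of test embeddings and a batch of target embeddings are both in memory, this is a single matrix product (the variable names below are placeholders):

# test_embeddings: (n_test, dim), target_embeddings: (n_target, dim)
scores = test_embeddings @ target_embeddings.T  # (n_test, n_target) matrix of dot products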

The test file has about 100k sentences, and the other files total about 30M (30 files with ~100k entries each). I'm trying to get the top_k (e.g., the top 100) closest ids, ordered by similarity, for each entry in the test file (i.e., the output should be a list/file/etc. of length len(test_file), composed of sublists of length top_k containing the closest ids).

The issue is that I cannot have in memory more than ~200 embeddings at the same time.
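
Because of that limit, the encoding step above has to run in small batches. Below is a minimal sketch of a batched helper (the name embed and the batch size of 64 are my own choices, not anything prescribed); it reuses the tokenizer, model, device and mean_pooling defined above, runs under torch.no_grad() so activations are freed, and moves the results to the CPU:

import torch

def embed(list_of_strings, batch_size=64):
    # Encode sentences in small batches so only batch_size embeddings
    # (plus their activations) live on the GPU at any one time.
    all_embeddings = []
    for start in range(0, len(list_of_strings), batch_size):
        batch = list_of_strings[start:start + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors='pt').to(device)
        with torch.no_grad():
            outputs = model(**inputs)
        all_embeddings.append(mean_pooling(outputs[0], inputs['attention_mask']).cpu())
    return torch.cat(all_embeddings)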

1 Answer

Answer by KonstantinosKokos:

For each source vector, split the comparison targets into batches that fit into memory, compute the maximum over the batch, compare it to the running maximum from the previous batches, and repeat.

In pseudocode:

for vector in sources:
    max_id, max_value = None, -float('inf')
    for offset in range(0, len(targets), batch_size):
        batch = targets[offset:offset + batch_size]     # load only this batch into memory
        scores = (vector * batch).sum(-1)                # dot product with every target in the batch
        batch_max_value, batch_max_id = scores.max(dim=0)
        if batch_max_value > max_value:
            max_id, max_value = offset + batch_max_id, batch_max_value

Adapt for top-k as needed, and read the targets dynamically from disk rather than keeping them all in memory.
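
For example, here is a minimal sketch of that idea extended to top-k, reusing the batched embed helper and the test_file / chunk files from the question. The chunk and batch sizes are assumptions; for each batch of test sentences it keeps only a running (scores, ids) buffer of width top_k and merges every new batch of target scores into it with torch.topk:

import torch
from datasets import load_dataset

top_k = 100
test_batch_size = 100    # assumption: keeps well under the ~200-embedding limit
target_batch_size = 100

results = []  # one list of top_k closest ids per test sentence

for t_start in range(0, len(test_file), test_batch_size):
    test_emb = embed(test_file[t_start:t_start + test_batch_size]['sentences'])  # (B, dim)

    # Running buffers of the best scores/ids seen so far for this test batch
    best_scores = torch.full((test_emb.size(0), top_k), float('-inf'))
    best_ids = torch.full((test_emb.size(0), top_k), -1, dtype=torch.long)

    for idx in range(30):  # iterate over the target chunks on disk
        chunk = load_dataset('parquet', data_files=f'file_{idx}.parquet')['train']
        for offset in range(0, len(chunk), target_batch_size):
            rows = chunk[offset:offset + target_batch_size]   # dict with 'sentences' and 'ids'
            target_emb = embed(rows['sentences'])              # (b, dim)
            ids = torch.tensor(rows['ids'])                    # (b,)

            scores = test_emb @ target_emb.T                   # (B, b) dot products
            # Merge this batch into the running top_k
            merged_scores = torch.cat([best_scores, scores], dim=1)
            merged_ids = torch.cat([best_ids, ids.expand(test_emb.size(0), -1)], dim=1)
            best_scores, positions = merged_scores.topk(top_k, dim=1)
            best_ids = merged_ids.gather(1, positions)

    results.extend(best_ids.tolist())

Note that, written this way, every target sentence is re-encoded once per test batch; in practice you would likely precompute the target embeddings once, store them on disk, and stream them back in batches. The merging loop stays the same.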