I'm trying to calculate the dot product between one test file with sentences:
from datasets import load_dataset
test_file = load_dataset('parquet', data_files='test_file.parquet')['train']
test_file
>>> features: ['sentences'] # length ~100k
And multiple files with sentences and IDs:
for idx in range(30):
    training_list_chunk = load_dataset('parquet', data_files=f'file_{idx}.parquet')['train']
>>> features: ['sentences', 'ids'] # length ~100k each
Each of these training files has one column called sentences, containing sentence strings, and one column called ids, containing integers in the range [0, 30M].
To calculate the dot product I'm converting the sentences strings to vectors with a huggingface model:
import torch
from transformers import AutoTokenizer, AutoModel
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained('facebook/contriever')
model = AutoModel.from_pretrained('facebook/contriever').to(device)
# Apply tokenizer
sentences_inputs = tokenizer(list_of_strings, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
outputs = model(**sentences_inputs.to(device))
# Mean pooling
def mean_pooling(token_embeddings, mask):
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.)
    sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
    return sentence_embeddings
embeddings = mean_pooling(outputs[0], sentences_inputs['attention_mask'])
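Since only a couple hundred embeddings fit in memory at once, it helps to wrap the tokenize, model and mean-pooling steps in a helper that encodes one small batch of sentences at a time under torch.no_grad(). This is a minimal sketch; the name embed is mine, not part of the question's code:
import torch

def embed(sentences):
    # Encode one batch of sentence strings (keep batches small, e.g. <= 100-200,
    # to stay within the memory budget) into mean-pooled sentence embeddings.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs.to(device))
    return mean_pooling(outputs[0], inputs['attention_mask'])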
The similarity scores between the different sentences are obtained with a dot product between the embeddings.
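Concretely, if the mean-pooled embeddings are stacked row-wise, the scores are a single matrix product (query_emb and target_emb below are hypothetical batches of shape (n_queries, dim) and (n_targets, dim)):
# scores[i, j] is the dot product between query sentence i and target sentence j
scores = query_emb @ target_emb.T  # shape (n_queries, n_targets)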
The test file has about 100k sentences, and the other files total about 30M (30 files with ~100k entries each). I'm trying to get the top_k (e.g., top 100) closest ids, ordered by similarity, for each entry in the test file; i.e., the output should be a list/file of length len(test_file), composed of sublists of length top_k containing the closest ids.
The issue is that I cannot have in memory more than ~200 embeddings at the same time.
For each query vector, split the comparison targets into batches that fit into memory, compute the scores for the current batch, keep the running best, and compare against the next batch, repeating until all targets have been seen. The same idea extends from a running maximum to a running top-k, and the targets can be read from disk dynamically as needed; a sketch of the approach follows.
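A minimal sketch, assuming the tokenizer, model, device and mean_pooling from the question, the embed() helper sketched above, and 768-dimensional contriever embeddings. Test embeddings are cached on disk as a numpy memmap (arbitrary filename) so only small batches ever sit in memory, and the running top-k is carried as plain score/id arrays rather than embeddings:
import numpy as np
import torch

TOP_K = 100
BATCH = 100        # query batch + target batch stays within the ~200-embedding budget
DIM = 768          # contriever hidden size
N_TEST = len(test_file)

# 1) Encode the test sentences once and cache them on disk as a memmap,
#    so they never have to be re-encoded or held in memory all at once.
test_emb = np.memmap('test_embeddings.dat', dtype='float32', mode='w+', shape=(N_TEST, DIM))
for start in range(0, N_TEST, BATCH):
    batch = test_file[start:start + BATCH]['sentences']
    test_emb[start:start + BATCH] = embed(batch).cpu().numpy()
test_emb.flush()

# Running top-k scores/ids per test sentence (scores and integer ids only,
# which easily fit in RAM: N_TEST x TOP_K values each).
best_scores = torch.full((N_TEST, TOP_K), float('-inf'))
best_ids = torch.full((N_TEST, TOP_K), -1, dtype=torch.long)

# 2) Stream the target files: encode each target batch once, score it against
#    every test batch, and fold the result into the running top-k.
for idx in range(30):
    chunk = load_dataset('parquet', data_files=f'file_{idx}.parquet')['train']
    for t_start in range(0, len(chunk), BATCH):
        rows = chunk[t_start:t_start + BATCH]
        target_emb = embed(rows['sentences'])          # (<=BATCH, DIM), on device
        target_ids = torch.tensor(rows['ids'])         # (<=BATCH,)
        for q_start in range(0, N_TEST, BATCH):
            query = torch.from_numpy(np.array(test_emb[q_start:q_start + BATCH])).to(device)
            scores = (query @ target_emb.T).cpu()      # (<=BATCH, <=BATCH) dot products
            rows_in_batch = scores.shape[0]
            q_slice = slice(q_start, q_start + rows_in_batch)
            # Merge this batch's scores with the running top-k and keep the best TOP_K.
            merged_scores = torch.cat([best_scores[q_slice], scores], dim=1)
            merged_ids = torch.cat([best_ids[q_slice],
                                    target_ids.expand(rows_in_batch, -1)], dim=1)
            top_scores, top_idx = merged_scores.topk(TOP_K, dim=1)
            best_scores[q_slice] = top_scores
            best_ids[q_slice] = torch.gather(merged_ids, 1, top_idx)

# best_ids[i] now holds the ids of the TOP_K most similar target sentences
# for test sentence i, sorted by decreasing dot-product score.
The loops are ordered so that each of the ~30M target sentences is encoded exactly once, while the cheaper test embeddings are streamed from disk repeatedly; increasing BATCH (within the memory budget) cuts the number of memmap reads and model calls accordingly.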