I'm trying to calculate the dot product between one test file with sentences:
from datasets import load_dataset
test_file = load_dataset('parquet', data_files='test_file.parquet')['train']
test_file
>>> features: ['sentences'] # length ~100k
And multiple files with sentences and IDs:
for idx in range(30):
    training_list_chunk = load_dataset('parquet', data_files=f'file_{idx}.parquet')['train']
>>> features: ['sentences', 'ids'] # length ~100k each
Each of these training files has one column called sentences, containing sentence strings, and one column called ids, containing integers in the range [0, 30M].
To calculate the dot product I'm converting the sentences strings to vectors with a huggingface model:
import torch
from transformers import AutoTokenizer, AutoModel
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained('facebook/contriever')
model = AutoModel.from_pretrained('facebook/contriever').to(device)
# Apply tokenizer
sentences_inputs = tokenizer(list_of_strings, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
outputs = model(**sentences_inputs.to(device))
# Mean pooling
def mean_pooling(token_embeddings, mask):
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.)
    sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
    return sentence_embeddings
embeddings = mean_pooling(outputs[0], sentences_inputs['attention_mask'])
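Since only a couple hundred embeddings fit in memory at once, it helps to wrap the tokenize, model and mean-pooling steps in a helper that encodes one small batch of sentences at a time under torch.no_grad(). This is a minimal sketch; the name embed is mine, not part of the question's code:
import torch

def embed(sentences):
    # Encode one batch of sentence strings (keep batches small, e.g. <= 100-200,
    # to stay within the memory budget) into mean-pooled sentence embeddings.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs.to(device))
    return mean_pooling(outputs[0], inputs['attention_mask'])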
The similarity scores between the different sentences are obtained with a dot product between the embeddings.
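Concretely, if the mean-pooled embeddings are stacked row-wise, the scores are a single matrix product (query_emb and target_emb below are hypothetical batches of shape (n_queries, dim) and (n_targets, dim)):
# scores[i, j] is the dot product between query sentence i and target sentence j
scores = query_emb @ target_emb.T  # shape (n_queries, n_targets)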
The test file has about 100k sentences, and the other files total about 30M (30 files with ~100k entries each). I'm trying to get the top_k (e.g., top 100) closest ids, ordered by similarity, for each entry in the test file; i.e., the output should be a list/file of length len(test_file), composed of sublists of length top_k containing the closest ids.
The issue is that I cannot have in memory more than ~200 embeddings at the same time.
For each query vector, split the comparison targets into batches that fit into memory, compute the scores for the current batch, keep the running best, and compare against the next batch, repeating until all targets have been seen. The same idea extends from a running maximum to a running top-k, and the targets can be read from disk dynamically as needed; a sketch of the approach follows.
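A minimal sketch, assuming the tokenizer, model, device and mean_pooling from the question, the embed() helper sketched above, and 768-dimensional contriever embeddings. Test embeddings are cached on disk as a numpy memmap (arbitrary filename) so only small batches ever sit in memory, and the running top-k is carried as plain score/id arrays rather than embeddings:
import numpy as np
import torch

TOP_K = 100
BATCH = 100        # query batch + target batch stays within the ~200-embedding budget
DIM = 768          # contriever hidden size
N_TEST = len(test_file)

# 1) Encode the test sentences once and cache them on disk as a memmap,
#    so they never have to be re-encoded or held in memory all at once.
test_emb = np.memmap('test_embeddings.dat', dtype='float32', mode='w+', shape=(N_TEST, DIM))
for start in range(0, N_TEST, BATCH):
    batch = test_file[start:start + BATCH]['sentences']
    test_emb[start:start + BATCH] = embed(batch).cpu().numpy()
test_emb.flush()

# Running top-k scores/ids per test sentence (scores and integer ids only,
# which easily fit in RAM: N_TEST x TOP_K values each).
best_scores = torch.full((N_TEST, TOP_K), float('-inf'))
best_ids = torch.full((N_TEST, TOP_K), -1, dtype=torch.long)

# 2) Stream the target files: encode each target batch once, score it against
#    every test batch, and fold the result into the running top-k.
for idx in range(30):
    chunk = load_dataset('parquet', data_files=f'file_{idx}.parquet')['train']
    for t_start in range(0, len(chunk), BATCH):
        rows = chunk[t_start:t_start + BATCH]
        target_emb = embed(rows['sentences'])          # (<=BATCH, DIM), on device
        target_ids = torch.tensor(rows['ids'])         # (<=BATCH,)
        for q_start in range(0, N_TEST, BATCH):
            query = torch.from_numpy(np.array(test_emb[q_start:q_start + BATCH])).to(device)
            scores = (query @ target_emb.T).cpu()      # (<=BATCH, <=BATCH) dot products
            rows_in_batch = scores.shape[0]
            q_slice = slice(q_start, q_start + rows_in_batch)
            # Merge this batch's scores with the running top-k and keep the best TOP_K.
            merged_scores = torch.cat([best_scores[q_slice], scores], dim=1)
            merged_ids = torch.cat([best_ids[q_slice],
                                    target_ids.expand(rows_in_batch, -1)], dim=1)
            top_scores, top_idx = merged_scores.topk(TOP_K, dim=1)
            best_scores[q_slice] = top_scores
            best_ids[q_slice] = torch.gather(merged_ids, 1, top_idx)

# best_ids[i] now holds the ids of the TOP_K most similar target sentences
# for test sentence i, sorted by decreasing dot-product score.
The loops are ordered so that each of the ~30M target sentences is encoded exactly once, while the cheaper test embeddings are streamed from disk repeatedly; increasing BATCH (within the memory budget) cuts the number of memmap reads and model calls accordingly.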