How to add millions of documents to ChromaDB efficiently


I have 2 million articles that are being chunked into roughly 12 million documents using langchain. I want to run a search over these documents, so ideally they would all live in a single Chroma database. Is the quickest way to insert millions of documents into a Chroma database to insert all of them when the database is created, or to use db.add_documents()? Right now I'm calling db.add_documents() in chunks of 100,000, but each call seems to take longer than the last. Should I just try inserting all 12 million chunks at creation time? I have a GPU and plenty of storage; it used to take 30 minutes per 100K documents, but now each 100K batch with add_documents() takes a little over an hour. Here is my current code:

from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings import SentenceTransformerEmbeddings
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter



# Local sentence-transformers model, run on the GPU
model_path = "./multi-qa-MiniLM-L6-cos-v1/"
model_kwargs = {"device": "cuda"}
embeddings = SentenceTransformerEmbeddings(model_name=model_path, model_kwargs=model_kwargs)

# `documents` is the full list of 2 million raw article texts, loaded elsewhere
documents_array = documents[0:100000]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)

docs = text_splitter.create_documents(documents_array)

persist_directory = "chroma_db"

# Build the collection from the first 100K articles and persist it to disk
vectordb = Chroma.from_documents(
    documents=docs, embedding=embeddings, persist_directory=persist_directory
)

vectordb.persist()
vectordb._collection.count()  # sanity check: number of stored chunks
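# For later runs, the persisted store can be reopened instead of rebuilt.
# A minimal sketch, assuming langchain's documented Chroma constructor
# (persist_directory and embedding_function are its standard parameters):
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embeddings,
)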


# Later run: chunk the next slice of 100K articles
docs = text_splitter.create_documents(documents[500000:600000])

def batch_process(documents_arr, batch_size, process_function):
    # Walk the document list in fixed-size slices and hand each slice off
    for i in range(0, len(documents_arr), batch_size):
        batch = documents_arr[i:i + batch_size]
        process_function(batch)

def add_to_chroma_database(batch):
    vectordb.add_documents(documents=batch)

batch_size = 41000  # chunks per add_documents() call

batch_process(docs, batch_size, add_to_chroma_database)
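For comparison, this is the alternative I've been sketching: precompute the embeddings in large batches on the GPU with sentence_transformers, then insert them through the native chromadb client so Chroma never has to call the embedding function per document. This is only a sketch, assuming chromadb's 0.4+ PersistentClient API; the collection name and batch sizes are placeholders:

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./multi-qa-MiniLM-L6-cos-v1/", device="cuda")

client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("articles")  # placeholder name

# `docs` is the list of langchain Document chunks from the splitter above
texts = [d.page_content for d in docs]

insert_batch = 5000  # kept well below Chroma's per-call batch limit
for i in range(0, len(texts), insert_batch):
    batch = texts[i:i + insert_batch]
    # Embed the whole slice in one GPU call instead of per document
    vectors = model.encode(batch, batch_size=256, show_progress_bar=False)
    collection.add(
        ids=[f"doc-{i + j}" for j in range(len(batch))],
        documents=batch,
        embeddings=vectors.tolist(),
    )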