I have a large database of documents (these “documents” are essentially web pages, all in HTML). They contain information about the business itself and often hold a lot of similar content. What I want to do is build a chatbot on top of this database that can answer any question about the content of these documents.
Now, if I pass the right information to GPT, it can answer the question easily. The main difficulty is retrieving that relevant information so it can be provided to GPT. Right now, I’m chunking the documents, embedding those chunks, storing the embeddings in a vector database and, when a question is asked, fetching the k nearest embeddings from that database (I'm using text-embedding-ada-002 to create the embeddings, by the way). A rough sketch of this pipeline is shown below.
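For concreteness, this is a minimal sketch of that pipeline, assuming the OpenAI Python SDK (>= 1.0) and FAISS as a stand-in for the vector database; the chunking is deliberately naive and the document strings are placeholders:

```python
# Minimal sketch of the current pipeline: chunk -> embed -> index -> k-NN query.
# Assumes the OpenAI Python SDK (>= 1.0) and FAISS as a stand-in vector store.
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    """Naive fixed-size chunking by words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with text-embedding-ada-002."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# Build the index from the document chunks.
documents = ["... plain text of document 1 ...", "... plain text of document 2 ..."]
chunks = [c for doc in documents for c in chunk_text(doc)]
vectors = embed(chunks)
faiss.normalize_L2(vectors)                  # so inner product == cosine similarity
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# Retrieve the k nearest chunks for a question.
question = "How can I make popcorn?"
q = embed([question])
faiss.normalize_L2(q)
scores, ids = index.search(q, 5)
context = "\n\n".join(chunks[i] for i in ids[0])
# `context` is what gets passed to GPT together with the question.
```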
So, my questions are:
- How can I create the embeddings in the best way possible, so that the information retrieval step performs well? (I assume OpenAI, Google, etc. did something like this when crawling and scraping the web to fetch relevant information, but I can’t seem to find anything of interest online.)
- This is a little off topic, but is there a rule of thumb for intuitively understanding why one embedding gets a higher score than another in the k-nearest-embeddings search? In my experience, embeddings of very short texts tend to be retrieved with higher scores. For example, if the question is “How can I make popcorn?”, an embedding of a 10-word sentence will score higher than an embedding of a 1000-word chunk (even if that chunk actually answers the question).
(I've also asked the same questions in this OpenAI Community Forum post.)
Since you already know the technical side of building a Retrieval-Augmented Generation (RAG) system, I'm going to share some experience I've gained.
RAG works best if your data is as clean as possible. This sucks, as it's a lot of work. If your chunks contain a lot of HTML tags, they will add noise to your embeddings. Likewise, if your documents contain a lot of similar data, your retriever will have a hard time, because everything looks alike.
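For example, a minimal cleaning pass might strip the markup before chunking and embedding. This sketch assumes BeautifulSoup and is only a starting point; site-specific boilerplate (menus, cookie banners, etc.) usually needs extra handling:

```python
# Minimal sketch of cleaning HTML before chunking/embedding,
# assuming BeautifulSoup (pip install beautifulsoup4).
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that are almost always noise for retrieval.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    # Collapse the remaining markup into plain text.
    text = soup.get_text(separator=" ")
    return " ".join(text.split())  # normalize whitespace

clean = html_to_text("<html><body><nav>Menu</nav><p>Actual content.</p></body></html>")
print(clean)  # -> "Actual content."
```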
There is a paper stating that RAG combined with an LLM with a large context window works best. That way you can send many more chunks and let the LLM do the rest (see the sketch after the links below). GPT-4 Turbo has a context window of 128k tokens, which is a lot compared to the 4k of GPT-3.5. There are also open-source models with a 200k-token context window.
Paper: https://arxiv.org/abs/2310.03025
GPT4-turbo: https://help.openai.com/en/articles/8555510-gpt-4-turbo
Open-Source model with 200k context window: https://huggingface.co/01-ai/Yi-34B
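In practice this just means retrieving a larger k and packing more chunks into the prompt. A rough sketch, assuming the OpenAI Python SDK, an `index`/`chunks` pair like in your pipeline above, and a hypothetical token budget enforced with tiktoken:

```python
# Rough sketch: retrieve many chunks and let a long-context model sort it out.
# Assumes an `index`/`chunks` pair like in the question's pipeline;
# tiktoken is used only to stay under an assumed token budget.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

def build_context(question_vec, index, chunks, k=50, token_budget=100_000):
    """Take many candidate chunks, but stop once the token budget is reached."""
    scores, ids = index.search(question_vec, k)
    picked, used = [], 0
    for i in ids[0]:
        n = len(enc.encode(chunks[i]))
        if used + n > token_budget:
            break
        picked.append(chunks[i])
        used += n
    return "\n\n".join(picked)

# Example usage with the `q`, `index`, `chunks`, `question` from the pipeline sketch:
# context = build_context(q, index, chunks)
# answer = client.chat.completions.create(
#     model="gpt-4-turbo",
#     messages=[
#         {"role": "system", "content": "Answer using only the provided context."},
#         {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
#     ],
# )
```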