I have a large database of documents (these “documents” are essentially web pages, all in HTML). They contain information about the business itself and often hold a lot of similar content. What I want to do is build a chatbot on top of this database that can answer any question about the content of these documents.
Now, if I pass the right information to GPT, it can answer the question easily. The main difficulty is retrieving that relevant information so it can be provided to GPT. Right now, I’m chunking the documents, embedding those chunks, storing the embeddings in a vector database and, when a question is asked, fetching the k nearest embeddings from that database (I'm using text-embedding-ada-002 to create the embeddings, by the way). A rough sketch of this pipeline is shown below.
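For concreteness, this is a minimal sketch of that pipeline, assuming the OpenAI Python SDK (>= 1.0) and FAISS as a stand-in for the vector database; the chunking is deliberately naive and the document strings are placeholders:

```python
# Minimal sketch of the current pipeline: chunk -> embed -> index -> k-NN query.
# Assumes the OpenAI Python SDK (>= 1.0) and FAISS as a stand-in vector store.
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    """Naive fixed-size chunking by words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with text-embedding-ada-002."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# Build the index from the document chunks.
documents = ["... plain text of document 1 ...", "... plain text of document 2 ..."]
chunks = [c for doc in documents for c in chunk_text(doc)]
vectors = embed(chunks)
faiss.normalize_L2(vectors)                  # so inner product == cosine similarity
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# Retrieve the k nearest chunks for a question.
question = "How can I make popcorn?"
q = embed([question])
faiss.normalize_L2(q)
scores, ids = index.search(q, 5)
context = "\n\n".join(chunks[i] for i in ids[0])
# `context` is what gets passed to GPT together with the question.
```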
So, my questions are:
- How can I create the embeddings in the best way possible, so that the information retrieval step performs well? (I assume OpenAI, Google, etc. did something like this when crawling and scraping the web to fetch relevant information, but I can’t seem to find anything of interest online.)
- This is a little off topic, but is there a rule of thumb for intuitively understanding why one embedding gets a higher score than another in the k-nearest-embeddings search? In my experience, embeddings of very short texts tend to be retrieved with higher scores. For example, if the question is “How can I make popcorn?”, an embedding of a 10-word sentence will score higher than an embedding of a 1000-word chunk (even if that chunk actually answers the question).
(I've also asked the same questions in this OpenAI Community Forum post.)
Since you already know the technical side of building a Retrieval-Augmented Generation (RAG) system, I'm going to share some experience I've gained.
RAG works best if your data is as clean as possible. This sucks, as it's a lot of work. If your chunks contain a lot of HTML tags, they will add noise to your embeddings. Likewise, if your documents contain a lot of similar data, your retriever will have a hard time, because everything looks alike.
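For example, a minimal cleaning pass might strip the markup before chunking and embedding. This sketch assumes BeautifulSoup and is only a starting point; site-specific boilerplate (menus, cookie banners, etc.) usually needs extra handling:

```python
# Minimal sketch of cleaning HTML before chunking/embedding,
# assuming BeautifulSoup (pip install beautifulsoup4).
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that are almost always noise for retrieval.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    # Collapse the remaining markup into plain text.
    text = soup.get_text(separator=" ")
    return " ".join(text.split())  # normalize whitespace

clean = html_to_text("<html><body><nav>Menu</nav><p>Actual content.</p></body></html>")
print(clean)  # -> "Actual content."
```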
There is a paper stating that RAG combined with an LLM with a large context window works best. That way you can send many more chunks and let the LLM do the rest (see the sketch after the links below). GPT-4 Turbo has a context window of 128k tokens, which is a lot compared to the 4k of GPT-3.5. There are also open-source models with a 200k-token context window.
Paper: https://arxiv.org/abs/2310.03025
GPT4-turbo: https://help.openai.com/en/articles/8555510-gpt-4-turbo
Open-Source model with 200k context window: https://huggingface.co/01-ai/Yi-34B
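In practice this just means retrieving a larger k and packing more chunks into the prompt. A rough sketch, assuming the OpenAI Python SDK, an `index`/`chunks` pair like in your pipeline above, and a hypothetical token budget enforced with tiktoken:

```python
# Rough sketch: retrieve many chunks and let a long-context model sort it out.
# Assumes an `index`/`chunks` pair like in the question's pipeline;
# tiktoken is used only to stay under an assumed token budget.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

def build_context(question_vec, index, chunks, k=50, token_budget=100_000):
    """Take many candidate chunks, but stop once the token budget is reached."""
    scores, ids = index.search(question_vec, k)
    picked, used = [], 0
    for i in ids[0]:
        n = len(enc.encode(chunks[i]))
        if used + n > token_budget:
            break
        picked.append(chunks[i])
        used += n
    return "\n\n".join(picked)

# Example usage with the `q`, `index`, `chunks`, `question` from the pipeline sketch:
# context = build_context(q, index, chunks)
# answer = client.chat.completions.create(
#     model="gpt-4-turbo",
#     messages=[
#         {"role": "system", "content": "Answer using only the provided context."},
#         {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
#     ],
# )
```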