I am very new to NLP and machine learning. I am trying to build a conversational chatbot that can answer user questions about my software application, as well as a wide variety of questions specific to my domain (bioinformatics). I am using the Llama-2-7B model with retrieval-augmented generation (RAG): I embed custom, domain-specific PDF documents into a vector store and pass the retrieved chunks as context so the model can answer questions.
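For reference, the PDF vector store is built roughly like this (a simplified sketch; the file path, chunk size, and embedding model below are placeholders rather than my exact settings):

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Load the domain-specific PDFs, split them into chunks, embed the chunks,
# and store them in the vector database the chatbot retrieves from.
docs = PyPDFLoader("docs/manual.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
pdf_vectordb = Chroma.from_documents(chunks, embeddings)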
Here are my concerns; I would appreciate any advice on what direction to take.
- I have used prompt engineering (a custom system prompt) to steer the model to fall back on its existing knowledge when it cannot find an answer directly in the retrieved context:
SYSTEM_PROMPT = """
Your name is 'Jo', you are an AI assistant, developed by organisation.
You are a knowledgeable, respectful and honest bioinformatics assistant who provides support to users of the software. Always answer as helpfully as possible, while being safe.
Use the given pieces of contexts from the PDF embeddings to answer questions. If a question does not make sense, or you don't know the answer, instead of assuming or giving incorrect answers tell the user that you can not answer the question.
""".strip()
template = generate_prompt(
"""
{context}
Question: {question}
""", system_prompt=SYSTEM_PROMPT
)
prompt = PromptTemplate(template=template, input_variables=["context", "question"])
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=pdf_vectordb.as_retriever(search_kwargs={"k":2}),
return_source_documents=True,
chain_type_kwargs={'prompt':prompt}
)
result = qa_chain("How to merge different VCF files?")
result = qa_chain("How is chatgpt different from other AI?")
Though the model elaborates well on questions whose answers appear directly in the PDFs, it gets confused by closely related questions (since the Llama model already has its own knowledge of bioinformatics). How can I improve the accuracy?
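To see whether the confusion comes from retrieval (wrong chunks being pulled in) or from generation, I can inspect the source documents that the chain returns (return_source_documents=True above), for example:

# Inspect the answer and the k=2 retrieved chunks for one of the queries.
result = qa_chain("How is chatgpt different from other AI?")
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata.get("source"), doc.page_content[:200])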
- I'm running this on an Nvidia A5000 machine with 126 GB of CPU RAM and 24 GB of GPU RAM. Answering a question uses roughly 5 GB of CPU RAM and 11 GB of GPU RAM, and generation is slow (3-4 minutes for the first question in the code above, about 5 minutes for the second). I want to reduce the latency and resource usage significantly so that the model can generate answers at a readable pace when multiple users query it. How can I achieve that?
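For reference, the llm object used above is loaded roughly like this (a simplified sketch; the model ID, dtype, and generation settings are assumptions, and my real setup may use quantization or CPU offloading):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision; the actual setup may quantize or offload layers to CPU
    device_map="auto",
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.1,
)
llm = HuggingFacePipeline(pipeline=pipe)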