I'm using StuffDocumentsChain in my LLM Q&A app; the model is Mistral 7B Instruct v0.2. I'm using load_qa_chain from langchain.chains.question_answering.
The chain starts off generating a correct response, but it stops far too late: after finishing the valid response, it keeps going and produces a lot of garbage.
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

docs = [.....]  # list of Documents to stuff into the prompt
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=-1,
    n_batch=n_batch,
    callback_manager=callback_manager,
    temperature=0.0,
    n_ctx=8192,
    top_p=0.001,
    f16_kv=True,
    verbose=True,
    n_threads=10,
    top_k=2,
    repeat_penalty=1.07,
    use_mlock=True,
    max_tokens=4096,
    stop=['</s>', '[INST]', '[/INST]']
)
template = """<s>[INST]{context}\n{question}\n[/INST]"""
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)
llm_chain = load_qa_chain(llm=llm, prompt=prompt)
llm_answer = llm_chain({"input_documents": docs, "question": question,
                        "context": docs}, return_only_outputs=True)['output_text']
Is there anything that I'm missing or doing wrong? How can I make the chain stop at the correct place?