I am building a RAG-based Q&A chat assistant in Python using LlamaIndex, LangChain, Anthropic Claude 2 (via AWS Bedrock), and Streamlit.

To create the context (data), I used some online HTML pages that were converted to Markdown (.md) files. From these .md files, I built an index store (knowledge base) that is persisted locally in a directory.

Below is the entire code to build the index. The embeddings for the vector index store are created with the Amazon Titan embedding model, while the Claude model is used for the actual Q&A.

import logging
import os
from dotenv import load_dotenv
import sys
from shutil import rmtree

import boto3

from llama_index.llms import LangChainLLM
from langchain.llms import Bedrock

from llama_index.embeddings import LangchainEmbedding
from langchain.embeddings import BedrockEmbeddings

from llama_index import ServiceContext, set_global_service_context, SimpleDirectoryReader, TreeIndex, VectorStoreIndex

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

#####################################################################
# Amazon Bedrock - boto3
#####################################################################
load_dotenv()
AWS_BEDROCK_REGION=os.environ.get("AWS_BEDROCK_REGION", None)

# Setup bedrock
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name=AWS_BEDROCK_REGION,
)

#####################################################################
# LLM - Amazon Bedrock LLM using LangChain
#####################################################################
model_id = "anthropic.claude-v2"
model_kwargs = {
    "max_tokens_to_sample": 4096,
    "temperature": 0.5,
    "top_k": 250,
    "top_p": 1,
    "stop_sequences": ["\n\nHuman:"],
}

llm = Bedrock(
    client=bedrock_runtime,
    model_id=model_id,
    model_kwargs=model_kwargs
)

#####################################################################
# Embedding Model - Amazon Titan Embeddings Model using LangChain
#####################################################################

# from llama_index import LangchainEmbedding -> from llama_index.embeddings import LangchainEmbedding
# Source code - https://github.com/run-llama/llama_index/blob/main/llama_index/embeddings/__init__.py

# create embeddings
bedrock_embedding = BedrockEmbeddings(
    client=bedrock_runtime,
    model_id="amazon.titan-embed-text-v1",
)

# load in Bedrock embedding model from langchain
embed_model = LangchainEmbedding(bedrock_embedding)

#####################################################################
# Service Context
#####################################################################
service_context = ServiceContext.from_defaults(
  llm=llm,
  embed_model=embed_model,
  system_prompt="You are an AI assistant answering questions."
)

set_global_service_context(service_context)

#####################################################################
# Build Index
#####################################################################
def build_index(data_dir: str, knowledge_base_dir: str) -> None:
    """Build the vector index from the markdown files in the directory."""
    print("Building vector index...")
    documents = SimpleDirectoryReader(data_dir).load_data()

    # index = TreeIndex.from_documents(documents, service_context=service_context)
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    index.storage_context.persist(persist_dir=knowledge_base_dir)
    print("Done.")
    

def main() -> None:
    """Build the vector index from the markdown files in the directory."""
    base_dir = os.path.dirname(os.path.abspath(__file__))
    knowledge_base_dir = os.path.join(base_dir, "kb")
    # Delete Storage Directory
    if os.path.exists(knowledge_base_dir):
        rmtree(knowledge_base_dir)
    data_dir = os.path.join(base_dir, "content", "blogs")
    build_index(data_dir, knowledge_base_dir)


if __name__ == "__main__":
    main()
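
As a sanity check that the persisted knowledge base is usable, I can reload it in a separate run and inspect it. The snippet below is illustrative (it is not part of the build script) and assumes the same global service context has already been set:

# Illustrative sanity check: reload the persisted index and confirm it contains nodes.
from llama_index import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="kb")
index = load_index_from_storage(storage_context)  # uses the global service context set above
print(f"Nodes in docstore: {len(index.docstore.docs)}")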

For the Q&A, I load the same index and create the query engine from it. See below.

import os
from typing import Any

import streamlit as st

from llama_index import StorageContext, load_index_from_storage

#####################################################################
# Load knowledge base
#####################################################################

@st.cache_resource()
def load_knowledge_base() -> Any:
    """Load the index from the storage directory."""
    print("Loading knowledge base index...")
    
    dir_path = os.path.join("components", "assets_Customer_Assist", "kb")

    # rebuild storage context
    storage_context = StorageContext.from_defaults(persist_dir=dir_path)
    # load index
    index = load_index_from_storage(storage_context)
    query_engine = index.as_query_engine(streaming=True, similarity_top_k=3, response_mode="tree_summarize")
    print("Done.")
    return query_engine

When returning the response from the above query engine, I use the same global service context, created with the Titan embedding model and the Claude LLM (see the sketch below).
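
In the Streamlit app that context is set up before the index is loaded, roughly like this (an abbreviated sketch: llm and embed_model are created exactly as in the build script above, and caching the query engine in st.session_state.query_engine is a sketch of how my app hands it to the chat loop):

# Abbreviated sketch of the setup in the Streamlit app.
# llm and embed_model are built exactly as in the build script above.
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    system_prompt="You are an AI assistant answering questions.",
)
set_global_service_context(service_context)

# Cached query engine built from the persisted knowledge base (function shown above).
if "query_engine" not in st.session_state:
    st.session_state.query_engine = load_knowledge_base()

Below is the Streamlit code that streams the responses back to the user.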

with st.chat_message("assistant"):
    message_placeholder = st.empty()
    full_response = ""

    print("Querying query engine API...")
    prompt_templated = prompt_template.format(Question=prompt)
    response_stream = st.session_state.query_engine.query(prompt_templated)
    if response_stream:
        for response in response_stream.response_gen:
            full_response += response
            message_placeholder.markdown(full_response + "▌")

    print(full_response)
    message_placeholder.markdown(full_response)

    st.session_state.messages.append({"role": "assistant", "content": full_response})
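
For reference, prompt_template is a plain string template with a {Question} placeholder. The wording below is illustrative rather than the exact template from my app:

# Illustrative prompt template; the real wording in my app differs.
prompt_template = (
    "Answer the following question using only the indexed content sources.\n\n"
    "Question: {Question}"
)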

For some reason, the responses for roughly 50% of the questions are truncated. I tried several values of response_mode on the query engine, but I still get truncated responses.
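
For example, I tried variants along these lines (illustrative; the exact combinations varied):

# Illustrative response_mode variants; the exact combinations I tried varied.
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3, response_mode="compact")
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3, response_mode="refine")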

See the example response below (it is cut off after the word "rather"):

Based on the information provided in the content sources, ABC Bank offers lending products to the following types of businesses:

Sole traders Partnerships Limited companies PLCs LLPs Trusts SIPPs SSASs Specifically, ABC Bank offers commercial mortgages and bridging loans to businesses looking to purchase or refinance property. The business must have been trading for at least 2 years and have a portfolio of at least 4 properties.

The business can be located in the UK, Isle of Man, or Channel Islands, though the properties must be located in England, Wales, or Scotland.

So in summary, ABC Bank lends to a range of business entities that are looking to finance property purchases or refinancing, have an established trading history, and have an existing property portfolio. The key criteria are around the business status, location, financial health, and property portfolio rather

What am I doing wrong? How do I fix this truncation issue in the response?
