Based on this repo (GitHub Link), I am trying to build a system that answers user queries.
I was able to run the model on a CPU with a response time of ~60 s. Now I want to improve the response time, so I am trying to load the model onto a GPU.
System specs
- Processor - Intel(R) Xeon(R) Gold 6238R CPU @ 2.20 GHz, 2195 MHz, 2 core(s), 2 logical processor(s), 24 GB RAM
- GPU - Nvidia A40-12Q with 12 GB
So here are my queries:
- How do I load Llama 2 (or any other model) onto the GPU?
- Can we improve the response time if we load the model onto a GPU?
- How to improve the answer quality?
- How can we make the model answer only questions related to the documents?
The CODE
from langchain.llms import CTransformers
from dotenv import find_dotenv, load_dotenv
import box
import yaml
from accelerate import Accelerator
import torch
from torch import cuda
from ctransformers import AutoModelForCausalLM
# Check if GPU is available and set device accordingly
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using Device: {device} in llm.py file")
# Load environment variables from .env file
load_dotenv(find_dotenv())
# Import config vars
with open('config/config.yml', 'r', encoding='utf8') as ymlfile:
cfg = box.Box(yaml.safe_load(ymlfile))
accelerator = Accelerator()
def build_llm():
    config = {'max_new_tokens': cfg.MAX_NEW_TOKENS,
              'temperature': cfg.TEMPERATURE,
              'gpu_layers': 150}

    llm = CTransformers(model=cfg.MODEL_BIN_PATH,
                        model_type=cfg.MODEL_TYPE,
                        config=config)

    llm, config = accelerator.prepare(llm, config)
    return llm
This is the part that loads the model, but while querying, the CPU utilization shoots up to 100% and the GPU utilization stays at ~2%.
Since you're using accelerate, the best way to do this is to check the accelerate docs. Note that standard Llama 2 is too large for your GPU, so you may need to use a quantized version.

It depends. Most speed gains from GPU inference come from batch inference; if you're running inference on a single item at a time, you might not see major speed improvements. Single-item inference tends to be bottlenecked by memory transfers rather than FLOPs, which is why codebases like llama.cpp get good performance on laptops.
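For what it's worth, since your code already goes through ctransformers, I believe GPU offload on that path also needs the CUDA-enabled build (pip install ctransformers[cuda]) in addition to the gpu_layers setting, plus a quantized GGML/GGUF file (a 4-bit 7B model fits comfortably in 12 GB). A minimal sketch, separate from accelerate, with a placeholder model path:

# Sketch: verify GPU offload with ctransformers directly (placeholder model path).
# Requires the CUDA build: pip install ctransformers[cuda]
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "models/llama-2-7b-chat.ggmlv3.q4_K_M.bin",  # quantized model file (placeholder)
    model_type="llama",
    gpu_layers=50,  # number of layers offloaded to the GPU; tune to fit 12 GB
)
print(llm("Hello"))  # quick check; watch nvidia-smi for GPU memory/utilization

The same gpu_layers value goes into the config dict you already pass to the LangChain CTransformers wrapper.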
You can try improving the prompt you give the model, or curate a dataset of proper question/answer pairs for fine-tuning.
This is an open research question.
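There is no guaranteed way to keep a model strictly on-topic, but in practice a restrictive prompt over the retrieved context is a common first step and helps with both of the last two points. A minimal sketch, assuming a LangChain RetrievalQA setup like the repo's; llm and vectorstore here stand in for whatever you already build:

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# Tell the model to answer only from the retrieved document context.
qa_template = """Use only the following context to answer the question.
If the answer is not in the context, just say you don't know; do not make one up.

Context: {context}
Question: {question}
Helpful answer:"""

prompt = PromptTemplate(template=qa_template,
                        input_variables=["context", "question"])

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,                           # your CTransformers LLM (placeholder)
    chain_type="stuff",                # stuff retrieved chunks into the prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),  # placeholder store
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,      # lets you check which documents were used
)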