I have a Mistral and ChromaDB question-answering application hosted on an AWS EC2 g5.2xlarge instance. I used to kill the Python application without deleting the llm variable, expecting the CUDA memory to be deallocated when the process exits. The GPU memory stays occupied, however, and even after rebooting the EC2 instance I still face the issue. I tried

torch.cuda.empty_cache()
gc.collect()

but it did not help. When I try a hard reset in the terminal with nvidia-smi --gpu-reset, I get an "Insufficient Permissions" error. The following code shows how I instantiate my LLM:
import torch
from transformers import pipeline

hf_pipeline = pipeline(
    task="text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.1",
    tokenizer=tokenizer,
    trust_remote_code=True,
    max_new_tokens=1000,
    model_kwargs={
        "device_map": "auto",
        "load_in_4bit": True,  # 4-bit quantization via bitsandbytes
        "max_length": 512,
        "temperature": 0.01,
        "do_sample": True,
        "torch_dtype": torch.bfloat16,
    },
)
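For reference, this is roughly how I try to release the model before exiting (a minimal sketch; the release_llm helper name is my own, and the torch import is guarded so the snippet also runs on a machine without a GPU):

```python
import gc


def release_llm(llm):
    # Drop the last Python reference to the pipeline/model object.
    del llm
    # Force a garbage-collection pass so the freed objects are reclaimed.
    gc.collect()
    # Return cached CUDA blocks to the driver, if torch is available.
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # torch not installed; nothing GPU-side to release
```

After this the process should exit cleanly; any memory still reported by nvidia-smi would then belong to another process, not this one.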
What is the solution for the CUDA out-of-memory error?
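For completeness, this is how I check whether any process is still holding GPU memory (assuming nvidia-smi is on the PATH; the guard just makes the snippet safe to run on a machine without an NVIDIA driver):

```shell
#!/bin/sh
# List every process that still owns a CUDA context, with its memory usage.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_procs=$(nvidia-smi --query-compute-apps=pid,used_memory --format=csv)
    printf '%s\n' "$gpu_procs"
    # A leftover PID from this list can then be terminated, e.g.:
    #   sudo kill -9 <pid>
else
    gpu_procs="nvidia-smi not found; no NVIDIA driver on this machine"
    echo "$gpu_procs"
fi
```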