Can't load "TheBloke/Mistral-7B-v0.1-GGUF" model on GPU


My code:

from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("TheBloke/Mistral-7B-v0.1-GGUF", model_file='mistral-7b-v0.1.Q4_K_M.gguf', model_type='mistral', hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)
device = 'cuda:0'

prompt = 'text'
model_inputs = tokenizer(prompt, return_tensors="pt")
model_inputs.to(device)
model.to(device)

But the model is still on the CPU.

I've also tried

model = model.to(device)

I've also tried creating the device with torch:

device = torch.device('cuda')

1 Answer

fucalost

It looks like you're using the ctransformers library, which makes GPU-based inference a little tricky. As noted here, you must specify the gpu_layers parameter.
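For example, a minimal sketch with ctransformers (the gpu_layers value of 50 is just an illustrative number of layers to offload; pick one that fits your VRAM, and make sure ctransformers was installed with CUDA support, e.g. pip install ctransformers[cuda]):

from ctransformers import AutoModelForCausalLM

# Offload layers to the GPU via gpu_layers (50 is an example value).
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.1-GGUF",
    model_file="mistral-7b-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
)

print(model("text"))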

The following snippet should work if you're using the regular transformers library.

from transformers import AutoModelForCausalLM, AutoTokenizer
from packaging import version
import transformers
import torch

# see: https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting
assert version.parse(transformers.__version__) >= version.parse("4.34.0")

MODEL_NAME = "mistralai/Mistral-7B-v0.1"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# [1.] Load model and move to GPU
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = model.to(DEVICE)

# [2.] Do inference
input_data = "..."
encoded_input = tokenizer(input_data, 
                          padding=True, 
                          truncation=True, 
                          return_tensors="pt").to(DEVICE)
resp = model(**encoded_input)
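
The forward pass above returns raw logits. If you want generated text instead, a sketch using model.generate (max_new_tokens=50 is just an example value):

# [3.] Generate and decode text
generated = model.generate(**encoded_input, max_new_tokens=50)
print(tokenizer.decode(generated[0], skip_special_tokens=True))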

Info on GPU sizing for this model is available here.