An ML model is running under Triton Inference Server on a GPU instance group, and after a certain number of successful inferences it starts throwing the exception:
CUDA error: device-side assert triggered
With `export CUDA_LAUNCH_BLOCKING=1`, the stack trace points to `{key: val.to(device=COMPUTE_DEVICE) for key, val in inputs.items()}`:
Traceback (most recent call last):
File "/opt/triton_models/feature_based_pwsh_classifier/1/script_embeddings.py", line 129, in compute_code_embeddings
inputs = {key: val.to(device=COMPUTE_DEVICE) for key, val in inputs.items()}
File "/opt/triton_models/feature_based_pwsh_classifier/1/script_embeddings.py", line 129, in <dictcomp>
inputs = {key: val.to(device=COMPUTE_DEVICE) for key, val in inputs.items()}
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Here is a simplified form of the problematic code:
max_length = llm.config.max_position_embeddings
# inputs is a dict with keys: [input_ids, attention_mask]
inputs = tokenizer(text, return_tensors='pt', max_length=max_length, truncation=True, padding=True)
# Move the inputs to the CUDA device
inputs = {key: val.to(device=COMPUTE_DEVICE) for key, val in inputs.items()}
with torch.no_grad():
    outputs = llm(**inputs)
Where:
- `COMPUTE_DEVICE` is `torch.device('cuda')`
- `llm` and `tokenizer` are loaded via the `transformers` library from Graph-CodeBERT
- once the exception occurs, all following inference requests yield the error, and the Triton Server needs to be restarted
- the `inputs` look valid, with: `dtype: torch.int64`, `size: (1, xxx)`, `device: cpu`, `has_NAN: False`, `has_inf: False` (a sketch of this check follows the list)
- GPU VRAM is usually under 20% when the exception occurs
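
For reference, a sanity check like the one described above can be done roughly along these lines (a hypothetical sketch; `describe_tensor` is not part of the model code):

```python
import torch

def describe_tensor(name: str, t: torch.Tensor) -> None:
    # Report the properties listed above: dtype, size, device, NaN/inf presence.
    t_float = t.float()  # isnan/isinf are defined for floating-point tensors
    print(
        f"{name}: dtype={t.dtype}, size={tuple(t.shape)}, device={t.device}, "
        f"has_NAN={torch.isnan(t_float).any().item()}, "
        f"has_inf={torch.isinf(t_float).any().item()}"
    )

# e.g. for key, val in inputs.items(): describe_tensor(key, val)
```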
Any help and recommendations are appreciated!
The issue was caused by using `max_position_embeddings` of size `514` from the GraphCodeBERT config as the tokenizer `max_length`, while in fact the correct limit is `512`, which is standard for BERT models and allowed the tokenizer to produce valid outputs (a sketch of the corrected call follows the notes below).

A few notes on the debugging process:

- `export CUDA_LAUNCH_BLOCKING=1` makes the stack trace point at the actual failing operation
- add `--log-verbose` to the Triton IS command: `tritonserver --model-repository /opt/triton_models/ --log-verbose=1`
- once the `CUDA error` is thrown, the model enters an undetermined state and any following CUDA tensor operation will produce `CUDA error: device-side assert triggered`, heavily polluting the logs
- capture the offending `text` and replicate the failing operations in a local Jupyter Notebook on a CPU (see the sketch below), which caused:

Helpful discussion:
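
A minimal sketch of the corrected call, assuming the same `tokenizer`, `llm` and `COMPUTE_DEVICE` as in the question; the hard-coded `512` could also be read from `tokenizer.model_max_length` if the tokenizer reports it correctly:

```python
import torch

MAX_LENGTH = 512  # usable sequence length for BERT-style models,
                  # NOT llm.config.max_position_embeddings (514 for GraphCodeBERT)

def compute_code_embeddings(text):
    # Tokenize with the safe limit instead of config.max_position_embeddings
    inputs = tokenizer(text, return_tensors='pt', max_length=MAX_LENGTH,
                       truncation=True, padding=True)
    # Move the inputs to the CUDA device
    inputs = {key: val.to(device=COMPUTE_DEVICE) for key, val in inputs.items()}
    with torch.no_grad():
        outputs = llm(**inputs)
    return outputs
```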
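
And a rough sketch of the CPU replication step from the notes above, assuming the `microsoft/graphcodebert-base` checkpoint and a `text` value captured from the logs; on CPU an over-long input fails with a regular Python exception instead of an asynchronous device-side assert:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
llm = AutoModel.from_pretrained("microsoft/graphcodebert-base")

text = "..."  # the input captured from the Triton logs

# Reproduce the original (buggy) length handling on CPU
max_length = llm.config.max_position_embeddings  # 514
inputs = tokenizer(text, return_tensors='pt', max_length=max_length,
                   truncation=True, padding=True)

with torch.no_grad():
    outputs = llm(**inputs)  # fails with a readable error when the input is too long
```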
export CUDA_LAUNCH_BLOCKING=1--log-verboseto the Triton IS command:tritonserver --model-repository /opt/triton_models/ --log-verbose=1CUDA erroris thrown, the model enters undetermined state and any following CUDA Tensor operation will produceCUDA error: device-side assert triggered, heavily polluting the logstextand replicate failing operations in local Jupyter Notebook on a CPU, which caused:Helpful discussion: