Deploying an LLM on a SageMaker Endpoint - CUDA out of memory

I am trying to deploy an LLM to a SageMaker Endpoint using custom inference scripts, and I am getting this error:

CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacty of 22.20 GiB of which 13.12 MiB is free. Process 13234 has 2.25 GiB memory in use. Process 13238 has 3.82 GiB memory in use. Process 13236 has 8.06 GiB memory in use. Process 13239 has 8.06 GiB memory in use. Of the allocated memory 6.93 GiB is allocated by PyTorch, and 49.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF : 400
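For context, I have not yet tried the max_split_size_mb hint from the message. My understanding is that it would be passed to the endpoint container as an environment variable at deploy time; a minimal sketch, assuming the SageMaker Python SDK's PyTorchModel (the S3 path, role, instance type, and split size below are placeholders, not my real values):

```python
from sagemaker.pytorch import PyTorchModel

# Sketch: forwarding the allocator hint from the error message to the
# endpoint container via an environment variable.
model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",  # placeholder
    role="my-sagemaker-execution-role",        # placeholder
    entry_point="inference.py",
    framework_version="2.0",
    py_version="py310",
    env={"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128"},
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # placeholder multi-GPU instance
)
```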

The thing is, I was able to get responses for a couple of pings, and then I started getting this error. I am clearing the cache in my predict function (a simplified sketch follows the list below), but there are three things I cannot understand:

  1. Why am I getting this error after a couple of successful responses from the endpoint?
  2. What are the four processes consuming GPU memory, and why?
  3. And obviously, how can I resolve this error?
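Here is a simplified sketch of the kind of predict function I mean; the payload handling and generation arguments are illustrative, not my exact code:

```python
import torch

def predict_fn(input_data, model):
    # Simplified sketch; my real payload parsing and generation
    # arguments differ.
    try:
        with torch.no_grad():  # don't retain activations or grad buffers
            input_ids = input_data["input_ids"].to("cuda")
            output_ids = model.generate(input_ids, max_new_tokens=256)
        return output_ids.cpu()
    finally:
        # Return cached, freed blocks to the driver after every request;
        # even so, the endpoint eventually runs out of memory.
        torch.cuda.empty_cache()
```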