I'm facing an issue when trying to run a Docker container with NVIDIA GPU support on my system. The NVIDIA drivers and GPUs are detected correctly by nvidia-smi, but running docker run --rm --gpus all ubuntu:18.04 nvidia-smi fails with the following error:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
Here's the output of nvidia-smi, showing that the NVIDIA drivers and GPUs are correctly detected and operational:
$ nvidia-smi
Thu Feb 22 02:39:45 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:18:00.0 Off |                    0 |
| 30%   37C    P8    14W / 230W |  11671MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:86:00.0 Off |                    0 |
| 55%   80C    P2   211W / 230W |  13119MiB / 23028MiB |     79%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
To troubleshoot, I ran nvidia-container-cli -k -d /dev/tty info, which confirmed that the NVIDIA libraries, including libnvidia-ml.so.525.85.12, are detected. However, the Docker error persists, suggesting an issue with locating libnvidia-ml.so.1.
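If it helps responders, these are the kinds of host-side checks I can run to see whether the libnvidia-ml.so.1 symlink is actually resolvable; the library directory below is just the standard Debian/Ubuntu location and is only an example:
$ ldconfig -p | grep libnvidia-ml                     # is libnvidia-ml.so.1 in the dynamic linker cache?
$ ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*    # example path; the .so.1 symlink should point at libnvidia-ml.so.525.85.12
$ sudo ldconfig                                       # rebuild the linker cache in case the symlink was added after the last update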
So far, I've attempted:
- Reinstalling the NVIDIA drivers and CUDA Toolkit.
- Reinstalling the NVIDIA Container Toolkit.
- Ensuring Docker and the NVIDIA Container Toolkit are correctly configured (sketched below).
- Setting LD_LIBRARY_PATH to include the path to the NVIDIA libraries.
Despite these efforts, the problem remains unresolved. I'm on a Linux system with NVIDIA driver version 525.85.12.
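For context, the reinstall and configuration steps above follow the standard NVIDIA Container Toolkit instructions; on an apt-based system they look roughly like this (a sketch rather than a verbatim record of my session, and the library path in the last line is only an example):
$ sudo apt-get install --reinstall nvidia-container-toolkit          # assumes an apt-based distribution
$ sudo nvidia-ctk runtime configure --runtime=docker                 # registers the nvidia runtime with Docker
$ sudo systemctl restart docker
$ export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH  # example path to the driver libraries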
Has anyone run into a similar issue, or can anyone offer insight into what might be causing this error and how to resolve it? I would greatly appreciate any suggestions or guidance.
What I Tried:
Running a Docker Container with NVIDIA GPU Support: I attempted to start a container with GPU access using docker run --rm --gpus all ubuntu:18.04 nvidia-smi.
Checking NVIDIA Driver and GPU Detection: I used nvidia-smi to confirm that the NVIDIA drivers and GPUs were correctly detected and operational on my system.
Diagnostics with the NVIDIA Container Toolkit: I ran nvidia-container-cli -k -d /dev/tty info, which confirmed that the NVIDIA libraries, including libnvidia-ml.so.525.85.12, were detected.
Attempted Solutions for Resolution:
- Reinstalling the NVIDIA Container Toolkit to ensure proper integration with Docker (the daemon.json entry it writes is sketched after this list).
- Setting LD_LIBRARY_PATH to include the path to the NVIDIA libraries, in case the error was caused by the library search path.
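For completeness, after the toolkit's configure step, /etc/docker/daemon.json normally contains an nvidia runtime entry along these lines (a sketch of the expected contents, not a verbatim copy of my file):
$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}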
What I Was Expecting:
Successful Container Initialization: I expected the Docker container to start with NVIDIA GPU support, allowing me to use GPU resources inside the container.
Resolution of the Library Detection Issue: I expected the steps above to resolve the failure to locate libnvidia-ml.so.1, so that Docker and the NVIDIA Container Toolkit could load the required NVIDIA libraries.
Operational GPU Support in Docker: Ultimately, I expected these troubleshooting steps to enable working GPU support inside Docker containers, so that GPU-accelerated applications run as intended.
The gap between the expected outcome and the actual result (a persistent error about libnvidia-ml.so.1 even though the NVIDIA drivers and libraries are clearly present) suggests an underlying issue with the Docker/NVIDIA integration setup, the library search paths, or possibly the specific versions of the tools and drivers involved.
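In case specific versions matter, these are the commands I can run to collect them; the last one assumes a Debian/Ubuntu-style package manager:
$ docker --version
$ nvidia-container-cli --version
$ nvidia-ctk --version
$ dpkg -l | grep -i nvidia-container   # assumes a Debian/Ubuntu system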