Docker Error with NVIDIA GPU: libnvidia-ml.so.1 Not Found Despite Successful nvidia-smi and Driver Detection


I'm facing an issue when trying to run a Docker container with NVIDIA GPU support on my system. Although nvidia-smi detects the NVIDIA drivers and GPUs successfully, running docker run --rm --gpus all ubuntu:18.04 nvidia-smi results in the following error:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
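
For context, the error is raised by the nvidia-container-cli hook that runs on the host before the container starts, not by anything inside the ubuntu:18.04 image. A quick way to confirm the NVIDIA runtime is at least registered with Docker (standard checks, not output I have pasted above) is:

$ docker info | grep -i runtime
$ cat /etc/docker/daemon.json

If the nvidia runtime does not show up in both, re-running the toolkit's configure step (mentioned further below) re-registers it.
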
Here's the output of nvidia-smi, showing that the NVIDIA drivers and GPUs are correctly detected and operational:
$ nvidia-smi
Thu Feb 22 02:39:45 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:18:00.0 Off |                    0 |
| 30%   37C    P8    14W / 230W |  11671MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:86:00.0 Off |                    0 |
| 55%   80C    P2   211W / 230W |  13119MiB / 23028MiB |     79%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

To troubleshoot, I ran nvidia-container-cli -k -d /dev/tty info, which confirmed that the NVIDIA libraries, including libnvidia-ml.so.525.85.12, are detected. However, the Docker error persists, suggesting an issue with locating libnvidia-ml.so.1.
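
Worth noting: as far as I understand, the hook resolves libnvidia-ml.so.1 (the unversioned SONAME symlink) through the dynamic linker's cache, so it can fail even when the versioned file libnvidia-ml.so.525.85.12 exists on disk, for example if the cache is stale or the symlink is missing. A rough way to check this (illustrative commands; library paths vary by distribution):

$ ldconfig -p | grep libnvidia-ml
$ find /usr -name 'libnvidia-ml.so*' 2>/dev/null
$ sudo ldconfig

If ldconfig -p does not list libnvidia-ml.so.1, rebuilding the cache with sudo ldconfig and rechecking the symlink would be the first thing to look at.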

So far, I've attempted:

  • Reinstalling the NVIDIA drivers and CUDA Toolkit.
  • Reinstalling the NVIDIA Container Toolkit.
  • Ensuring Docker and the NVIDIA Container Toolkit are correctly configured.
  • Setting LD_LIBRARY_PATH to include the path to the NVIDIA libraries.

Despite these efforts, the problem remains unresolved. I'm on a Linux system with NVIDIA driver version 525.85.12.
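
For completeness, the reinstall/reconfigure steps were roughly along these lines (sketch assuming an apt-based distribution; adjust for your package manager):

$ sudo apt-get install --reinstall nvidia-container-toolkit
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker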

Has anyone experienced a similar issue or can offer insights into what might be causing this error and how to resolve it? I would greatly appreciate any suggestions or guidance.

What I Tried:

  1. Running a Docker Container with NVIDIA GPU Support: I attempted to start a Docker container using the NVIDIA GPUs with the command docker run --rm --gpus all ubuntu:18.04 nvidia-smi.

  2. Checking NVIDIA Driver and GPU Detection: I used nvidia-smi to confirm that the NVIDIA drivers and GPUs were correctly detected and operational on my system.

  3. Diagnostics with the NVIDIA Container Toolkit: I ran nvidia-container-cli -k -d /dev/tty info, which confirmed that the NVIDIA libraries, including libnvidia-ml.so.525.85.12, were detected on my system.

  4. Attempted Fixes:

    • Reinstalling NVIDIA Container Toolkit to ensure proper integration with Docker.
    • Setting LD_LIBRARY_PATH to include the path to the NVIDIA libraries, in an attempt to resolve any library path issues (see the sketch after this list).
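
The LD_LIBRARY_PATH attempt was along the lines of the following (illustrative path only). My understanding is that the hook runs as root outside the user's shell, so it may ignore this variable and rely instead on the ldconfig setting in /etc/nvidia-container-runtime/config.toml:

$ export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
$ grep ldconfig /etc/nvidia-container-runtime/config.toml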

What I Was Expecting:

  • Successful Container Initialization: I expected the Docker container to initialize successfully with NVIDIA GPU support, allowing me to use GPU resources within the container.

  • Resolution of the Library Detection Issue: I anticipated that the steps taken would resolve the problem locating libnvidia-ml.so.1, so that Docker and the NVIDIA Container Toolkit could access the necessary NVIDIA libraries.

  • Operational GPU Support in Docker: Ultimately, I expected these troubleshooting steps to enable working GPU support within Docker containers, so that GPU-accelerated applications run as intended.

The discrepancy between the expected outcome and the actual result (a persistent error about a missing libnvidia-ml.so.1, despite the drivers and libraries being detected) suggests an underlying issue with the Docker and NVIDIA integration setup, the library paths, or possibly the specific versions of the tools and drivers involved.
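
One further check that could help narrow this down is comparing the versions of the components involved, since a mismatch between the driver and the container toolkit packages can also surface as a missing library (illustrative commands; dpkg applies to Debian/Ubuntu):

$ nvidia-container-cli --version
$ nvidia-ctk --version
$ dpkg -l | grep -E 'nvidia-container|libnvidia-container'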
