AMD ROCm with PyTorch on Navi10 (RX 5700 XT) and HSA_OVERRIDE_GFX_VERSION=10.3.0 fails


I saw the question "AMD ROCm with Pytorch on Navi10 (RX 5700 / RX 5700 XT)", which recommends setting HSA_OVERRIDE_GFX_VERSION=10.3.0 to run PyTorch with ROCm on a 5700 XT card, but I couldn't get it to work.

My steps:

$ sudo pacman -S python-pytorch-opt-rocm
$ git clone https://github.com/pytorch/examples.git
$ cd examples/mnist
$ HSA_OVERRIDE_GFX_VERSION=10.3.0 python3 main.py

Result: I get this output when trying to train MNIST. The GPU just runs hot and no training progress is shown.

rocminfo output

Note: I also tried the python-pytorch-rocm package, but python-pytorch-opt-rocm should be fine, since lscpu shows AVX2 support.

So the question is: does this workaround for Navi10 GPUs still work, or did I miss a step in setting it up?


There is 1 solution below

makesense On

Before I answer your question, I need to state that ROCm does not officially support Navi10, and I don't think it ever will. What the override does is make ROCm treat your card as a gfx1030 part, so your system "acts" as if you have a 6900 XT, which you don't.

This means there are, and will be, plenty of bugs. So far I have run into many, such as loss.backward() not working and to('cuda') hanging forever. I would assume you are running into a similar issue.

My first piece of advice is to use an older version of pytorch-rocm, because many newer versions crash outright with HIP errors. Personally, I find torch==1.12.1+rocm5.1.1 the most stable. So do:

pip uninstall torch

and

pip install torch==1.12.1+rocm5.1.1 torchvision==0.13.1+rocm5.1.1 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/rocm5.1.1

You shouldn't need to match the ROCm version installed on your system; in my experience the older wheel runs fine alongside a newer ROCm install.
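After reinstalling, it is worth confirming that Python really imports the pip wheel and not the pacman package. A minimal sketch, assuming the wheel reports its build in the version string (as in 1.12.1+rocm5.1.1); the helper names split_build_tag and is_rocm_build are made up for this example, and a distro package may not carry such a tag at all:

```python
import importlib.util


def split_build_tag(version):
    """Split '1.12.1+rocm5.1.1' into ('1.12.1', 'rocm5.1.1').

    The tag is an empty string when the version has no '+' suffix.
    """
    release, _, tag = version.partition("+")
    return release, tag


def is_rocm_build(version):
    """True if the local build tag starts with 'rocm' (hypothetical helper)."""
    return split_build_tag(version)[1].startswith("rocm")


if __name__ == "__main__":
    if importlib.util.find_spec("torch") is None:
        print("torch is not installed in this environment")
    else:
        import torch

        release, tag = split_build_tag(torch.__version__)
        print("torch release:", release)
        print("build tag:", tag or "(none)")
        print("looks like a ROCm wheel:", is_rocm_build(torch.__version__))
        # torch.version.hip is None on builds without HIP support.
        print("HIP version:", torch.version.hip)
```

If the release printed is not 1.12.1, the old package is probably still first on the import path.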

My second piece of advice is to step through your code with breakpoints, find out which PyTorch call triggers the compatibility problem, and report it so that others can benefit too.
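Since a hung call never returns to the debugger, one alternative to breakpoints is to run each suspect operation in its own Python process with a timeout. The following is only a rough sketch, not a definitive tool: the step list, the 120 second timeout, and the names run_step and diagnose are assumptions for illustration. It sets HSA_OVERRIDE_GFX_VERSION=10.3.0 for each child process, as in the question.

```python
import os
import subprocess
import sys

# (label, Python source) pairs, ordered from least to most demanding.
# Each step runs in a fresh interpreter, so it repeats its own setup.
STEPS = [
    ("import torch", "import torch"),
    ("detect GPU", "import torch; assert torch.cuda.is_available()"),
    ("to('cuda')", "import torch; torch.ones(4).to('cuda')"),
    ("matmul",
     "import torch; x = torch.ones(4, 4).to('cuda'); print((x @ x).sum().item())"),
    ("loss.backward()",
     "import torch; w = torch.ones(4, device='cuda', requires_grad=True); "
     "loss = (w * 2).sum(); loss.backward(); print(w.grad.cpu())"),
]


def run_step(source, timeout=120.0, gfx_override="10.3.0"):
    """Run `source` in a child interpreter; return 'ok', 'failed' or 'hung'."""
    env = dict(os.environ, HSA_OVERRIDE_GFX_VERSION=gfx_override)
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            env=env,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


def diagnose(steps=STEPS, timeout=120.0):
    """Run the steps in order and return the label of the first bad one."""
    for label, source in steps:
        status = run_step(source, timeout=timeout)
        print(f"{label}: {status}")
        if status != "ok":
            return label
    return None


if __name__ == "__main__":
    diagnose()
```

The first step that prints "hung" or "failed" is the one to put in a bug report. The timeout is a guess: the first GPU call can be slow, so raise it before concluding that a step hangs. If the driver leaves the child process unkillable, the script itself may stall as well.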

Hope this helps, good luck ;)