HuggingFace Trainer() does nothing - only on Vertex AI workbench, works on colab

451 Views Asked by At

I am having issues getting the Trainer() function in huggingface to actually do anything on Vertex AI workbench notebooks.

I'm totally stumped and have no idea how to even begin to try debug this.

I made this small notebook: https://github.com/andrewm4894/colabs/blob/master/huggingface_text_classification_quickstart.ipynb

If you set framework=pytorch and run it in colab it runs fine.

I wanted to move from colab to something more persistent so tried Vertex AI Workbench notebooks on GCP. I created a user managed notebook (PyTorch:1.11, 8 vCPUs, 30 GB RAM, NVIDIA Tesla T4 x 1) and if i try run the same example notebook in jupyterlab on the notebook it just seems to hang on the Trainer() call and do nothing.

It looks like the GPU is not doing anything either for some reason (it might not be supposed to since i think Trainer() is some pretraining step):

(base) jupyter@pytorch-1-11-20220819-104457:~$ nvidia-smi
Fri Aug 19 09:56:10 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I found this thread that maybe seems like a similar problem so i played with as many Trainer() args as i could but no luck.

So im kind of totally blocked here - i refactored the code to be able to use Tensorflow which does work for me (after i installed tensorflow on the notebook) but its much slower for some reason.

Basically this was all working great (in my actual real code im working on) on colab's but when i tried to move to Vertex AI Notebooks i seem to be now blocked by this strange issue.

Any help or advice much appreciated, i'm new to HuggingFace and Pytorch etc too so not even sure what things i might try or ways to try run in debug etc maybe.

Workaround

i noticed that if i make a new workbook NumPy/SciPy/scikit-learn 4 vCPUs, 15 GB RAM , NVIDIA Tesla T4 x (instead of the official pytorch one from the dropdown) and install pytorch myself with conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch it all works.

1

There are 1 best solutions below

0
ifedapo olarewaju On

We faced exactly the same issue on Vertex AI, when we selected the pre-built PyTorch=1.13 environment. The training would just freeze and do nothing for hours.

What worked for us was to select an older version of the PyTorch environment. So we went for the PyTorch=1.9 environment and the training worked just fine.

enter image description here

We could see the logs already as the training commenced. So my guess is that there's something specifically about the newer PyTorch environment on Vertex AI that prevents things from working as expected.