Use multiple GPUs to train a model, then use a single GPU to load it


When using two GPUs to train my model, I save the weights with torch.save(model.module.state_dict(), model_path). But when I use a single GPU to test the saved model's performance, I get errors like

"RuntimeError: [enforce fail at inline_container.cc:222] . file not found: archive/data/94095921816640"

and

"RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/94773592765216: invalid header or archive is corrupted".

Even if I use two GPUs to load the weights, I get the same error, so I suppose the problem isn't about the number of GPUs I use.

All the saved weight files appear complete, because they have the same size as the ones that work. What's stranger is that some weights can be saved and loaded correctly and some cannot. The only difference between them is when they are saved (they are saved at different iterations).

This is how I save the model (screenshots of the saving code, not reproduced here):

self.save_model_interval equals 50000, which means I save the weights every 50000 iterations.
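Since the screenshot is not visible, here is a minimal sketch of what the saving step presumably looks like; the function name save_checkpoint, the save_dir argument, and the file-name pattern are assumptions, while the model.module.state_dict() call and the 50000-iteration interval follow the question.

```python
import os
import torch
import torch.nn as nn

def save_checkpoint(model, save_dir, iter_num, save_model_interval=50000):
    """Save the weights every `save_model_interval` iterations (sketch, not the original code)."""
    if iter_num % save_model_interval != 0:
        return
    model_path = os.path.join(save_dir, f"model_iter_{iter_num}.pth")
    # The model is wrapped in nn.DataParallel, so save the underlying module's
    # state_dict rather than the wrapper's (otherwise keys get a "module." prefix).
    state_dict = (model.module.state_dict()
                  if isinstance(model, nn.DataParallel)
                  else model.state_dict())
    torch.save(state_dict, model_path)
```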

This is how I load my weights on two GPUs (screenshot not reproduced here):
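For reference, a minimal sketch of the two-GPU load, assuming an nn.DataParallel wrapper; MyModel and model_path are placeholders for the actual model class and checkpoint path, not names from the original post.

```python
import torch
import torch.nn as nn

model = nn.DataParallel(MyModel(), device_ids=[0, 1]).cuda()  # MyModel is a placeholder
# The checkpoint was saved from model.module, so load it back into the module.
state_dict = torch.load(model_path)
model.module.load_state_dict(state_dict)
```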

This is how I load my weights on a single GPU (screenshot not reproduced here):
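And a sketch of how the single-GPU load is commonly written; again MyModel and model_path are placeholders. The map_location argument keeps torch.load from trying to restore tensors onto GPU devices that are not present.

```python
import torch

model = MyModel().cuda()  # MyModel is a placeholder; no DataParallel wrapper here
# map_location remaps storages saved on other devices onto the single visible GPU.
state_dict = torch.load(model_path, map_location="cuda:0")
model.load_state_dict(state_dict)
```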

This is the error description (screenshots of the full tracebacks, not reproduced here):

Out of the total 300000 iterations, only the weights saved at iteration 100000 can be loaded correctly; all the others give the error above. The environment is Python 3.9.0, torch 1.8.1+cu102, and torchvision 0.9.1+cu102. Thanks for your help!
