Problem with torch.nn.DataParallel - data is distributed, but not the model, it seems


I expected to train the model on 4 GPUs, but 3/4 of the data seems to disappear(?) somewhere.

I've looked through other questions on this but to no avail, so I'm asking here.

I am trying to train the model on 4 GPUs using torch.nn.DataParallel. The batch size is 64, so each GPU should receive a chunk of shape [16, ..., ...]. The strange thing is that the data does appear to be distributed: GPU-Util in nvidia-smi shows calculations being performed on each card. However, when I print the input inside the forward pass (before any calculations) with print(src_vid.get_device(), src_vid.shape), it only ever gives 0 torch.Size([16, 75, 256]), as if the data were only on the first GPU. The same happens with the output: it is supposed to be gathered back onto a single GPU (device 0 by default) as the full batch, but again it has shape [16, ..., ...].
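For reference, here is a minimal sketch of how I am using DataParallel. The model and variable names (SimpleModel, proj) are simplified stand-ins for my actual code; the shapes and the debug print match what I described above:

```python
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, src_vid):
        # Debug print: which GPU this replica's chunk lives on, and its shape.
        print(src_vid.get_device(), src_vid.shape)
        return self.proj(src_vid)

# Replicate the model across all visible GPUs (4 in my case).
model = nn.DataParallel(SimpleModel()).cuda()

src_vid = torch.randn(64, 75, 256).cuda()  # full batch of 64 on GPU 0
out = model(src_vid)                       # should be scattered to [16, 75, 256] per GPU
print(out.get_device(), out.shape)         # expected: gathered on GPU 0 with shape [64, 75, 256]
```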

I trained an almost identical model in the same virtual environment and everything was okay...

