What is the proper way to do multi-GPU inference with Pytorch?

I'm facing some issues with multi-GPU inference using PyTorch and PyTorch Lightning models. At inference time, I need to use two different models in an auto-regressive manner. Since the auto-regressive steps are computationally expensive, I want to split my dataset into smaller parts and send them to several GPUs so that inference runs in parallel and the total time decreases by a factor close to the number of GPUs used. The auto-regressive step is complex and uses two different models (see code snippet below), so I cannot use the standard PyTorch Lightning test_step. After computing the final output, I want to export the generated image batches as image files.
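To give an idea of the structure (a simplified sketch, not my actual implementation, which is more involved), each auto-regressive step combines the two models roughly like this:

def autoregressive_step(source_model, target_model, x, cond):
    # simplified sketch: the real step chains several calls to both models
    eps = source_model(x, cond)
    x = target_model(x, cond, eps)
    return x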

The issue I'm facing: when looking at the saved images, I noticed that some of them are missing, i.e. the generated images corresponding to some batches are never saved.

import os
import hydra
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from hydra.utils import instantiate
from omegaconf import DictConfig

# Note: Model, autoregressive_step and export_output are defined elsewhere in my project.


def ddp_setup(rank: int, world_size: int):
    # one process per GPU, all joining the same NCCL process group
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "47144"
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)


@hydra.main(version_base=None, config_path="../../configs", config_name="transfert")
def main(cfg: DictConfig) -> None:
    # spawn one inference process per available GPU
    world_size = torch.cuda.device_count()
    mp.spawn(run_da, args=(world_size, cfg), nprocs=world_size)


def run_da(rank: int, world_size: int, cfg: DictConfig) -> None:
    ddp_setup(rank, world_size)

    # remap checkpoint tensors from cuda:0 to this process's GPU
    map_location = {"cuda:0": f"cuda:{rank}"}

    # get eps model of source domain
    source_model = Model.load_from_checkpoint(
        cfg.source_model_path,
        map_location=map_location,
    )
    source_model.eval()
    source_model.to(rank)

    target_model = Model.load_from_checkpoint(
        cfg.target_model_path,
        map_location=map_location,
    )
    target_model.eval()
    target_model.to(rank)

    # load source and target datamodules
    source_datamodule = instantiate(cfg.source)
    source_datamodule.setup()

    # shard the dataset across ranks so each GPU processes a distinct subset
    test_data_sampler = torch.utils.data.distributed.DistributedSampler(
        source_datamodule.test,
        num_replicas=world_size,
        rank=rank,
    )
    dataloaders = {
        "val": source_datamodule.val_dataloader(sampler=test_data_sampler),
        "test": source_datamodule.test_dataloader(sampler=test_data_sampler),
    }

    for stage, dataloader in dataloaders.items():
        for batch in dataloader:
            x = batch["img"].to(rank)
            cond = batch["cond"].to(rank)

            with torch.no_grad():
                # compute the output through repeated calls to the source and target models
                for _ in range(100):
                    x = autoregressive_step(source_model, target_model, x, cond)

            # save output as images
            export_output(x, stage)


if __name__ == "__main__":
    main()
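
For completeness, export_output essentially writes every image of the batch to disk, along these lines (simplified; the real function derives unique filenames from sample metadata):

import os
from torchvision.utils import save_image

def export_output(x, stage):
    # simplified: unique filenames come from sample metadata in the real code
    os.makedirs("outputs", exist_ok=True)
    for i, img in enumerate(x):
        save_image(img, os.path.join("outputs", f"{stage}_{i}.png"))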

Hypothesis: batches sent to some GPUs are not processed or not saved. I have been monitoring how batches are distributed across the GPUs, and it turns out that some GPUs receive far more batches than others.
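
Concretely, I counted batches per rank with something along these lines (sketch):

# inside run_da, after building the dataloaders
for stage, dataloader in dataloaders.items():
    # len(dataloader) is the number of batches this rank will iterate over
    print(f"rank {rank}: {stage} -> {len(dataloader)} batches")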

I followed some tutorials about multi-GPU training, but inference seems to be a rather different case (DistributedDataParallel does not seem appropriate, as far as I understand). So I'm wondering whether my code contains an obvious bug that can be fixed, or whether there are good resources about multi-GPU inference with PyTorch / PyTorch Lightning models.

Thank you for your attention,

Erik
