SageMaker ml.p3.8xlarge instance with 4 GPUs quadruples inference output response


I have a custom model that works fine when deployed on single-GPU instances. On multi-GPU instances it runs the computation and produces the response once per GPU — in this case 4 times. After each call to the endpoint I therefore have to make 3 more calls, which end in a failed state, just to "clean" the queue.

Assuming I am forced to use this instance type, how can I prevent this? I already tried pinning the device to cuda:0, but the logs show the other GPUs still running.
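One likely cause is that SageMaker's PyTorch serving container starts one model-server worker per visible GPU by default, so model_fn (and hence the model) is loaded once per GPU. A hedged sketch of limiting the endpoint to a single worker at deploy time, using the SageMaker Python SDK's PyTorchModel (the S3 path, role ARN, entry-point name, and framework versions below are placeholders, not from the original post):

```python
from sagemaker.pytorch import PyTorchModel

# Placeholders: substitute your own artifact location, IAM role, and script.
model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    entry_point="inference.py",
    framework_version="2.1",
    py_version="py310",
    model_server_workers=1,  # override the one-worker-per-GPU default
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.8xlarge",
)
```

With a single worker, only one copy of the model is loaded and each invocation should produce exactly one response, at the cost of leaving the other three GPUs idle.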

def model_fn(model_dir, context=None):  # was "contex"; SageMaker passes a context object
    logger.info("in models")
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    logger.info(device)
    model = somemodel(checkpoint=os.path.join(model_dir, 'model.pt'))
    logger.info("in model")
    model.to(device=device)
    logger.info("todevice")
    model_run = model_runner(model)
    logger.info("out model")
    return model_run
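If the endpoint does keep one worker per GPU, a complementary step is to pin each worker to the GPU the model server assigns it instead of hard-coding cuda:0. The TorchServe-backed containers pass a context whose system_properties typically include a gpu_id. A minimal sketch of that lookup (pick_device, the cuda_available flag, and the duck-typed context are illustrative helpers, not part of the original handler):

```python
def pick_device(context=None, cuda_available=True):
    """Return the device string for this serving worker.

    Falls back to "cpu" without CUDA, and to "cuda:0" when no
    context (or no gpu_id property) is available.
    """
    if not cuda_available:
        return "cpu"
    gpu_id = 0
    if context is not None:
        props = getattr(context, "system_properties", None) or {}
        gpu_id = int(props.get("gpu_id", 0))
    return f"cuda:{gpu_id}"
```

Inside model_fn this would replace the hard-coded string: device = pick_device(context, torch.cuda.is_available()). Each worker then uses only its own GPU, though this alone does not reduce the number of workers answering the request.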

I have tried the things explained above.
