I have a custom model that works fine when deployed on single-GPU instances. On a multi-GPU instance it returns the response but runs the computation once per GPU, in this case 4 times, so after each call to the endpoint I have to make 3 more calls that end in a failed state just to "clean" the queue.
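To make that workaround concrete, each request from the client side currently looks roughly like this (a sketch only; the endpoint name, payloads, and the exact "failed state" body are placeholders, not my real values):

import json

import boto3

runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = "my-endpoint"  # placeholder
EXTRA_CALLS = 3                # one per additional GPU on the instance

def invoke(payload):
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return response["Body"].read()

result = invoke({"input": "real request"})  # the call whose response I actually use
for _ in range(EXTRA_CALLS):                # extra calls that end in a failed state,
    try:                                    # just to drain the duplicated work
        invoke({"input": "noop"})           # placeholder payload
    except Exception:
        pass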
Assuming I am forced to use this instance, how can I prevent this? I already tried using cuda:0 as the device, but the other workers still show up in the logs. This is my model_fn:
import logging
import os

import torch

logger = logging.getLogger(__name__)


def model_fn(model_dir, context=None):
    logger.info("in models")
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    logger.info(device)
    # somemodel and model_runner are my own model and wrapper classes
    model = somemodel(checkpoint=os.path.join(model_dir, "model.pt"))
    logger.info("in model")
    model.to(device=device)
    logger.info("todevice")
    model_run = model_runner(model)
    logger.info("out model")
    return model_run
I have tried the things explained above.
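For reference, the strongest form of device pinning I can think of is hiding the other GPUs from the process before torch is imported, roughly as below (untested on the endpoint; the placement at the top of the inference script is an assumption). My understanding is that this still would not stop the additional workers from being started, which is what I am trying to prevent.

# Top of the inference script, before torch is imported, so every CUDA call
# in this process only ever sees GPU 0.  (Sketch only; untested on the endpoint.)
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # noqa: E402  deliberately imported after setting the variable

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")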