Is there a way to share a PyTorch model across multiple processes without using multiple copies?


I have a custom PyTorch model that bottlenecks my application due to how it is currently used.

The application is a web server built in Flask that receives job submissions for the PyTorch model to process. Due to the processing time of each job, I use Celery to handle the computation, where Flask queues the tasks for Celery to execute.

Each job consists of loading the PyTorch model from disk, moving the model and data to a GPU, and making a prediction on the submitted data. However, loading the model takes around 6 seconds, which in many cases is an order of magnitude or two longer than the prediction itself.

Thus, is it possible to load the model and move it to a GPU on server startup (specifically when the Celery worker starts), avoiding the time needed to load the model and copy it to the GPU every job? Ideally, I'd want to load the model and copy it to every available GPU on server startup, leaving each Celery job to choose an available GPU and copy the data over. Currently, I only have one GPU, so a multi-GPU solution is not a requirement at the moment, but I'm planning ahead.

Further, the memory constraints of the model and data allow only one job per GPU at a time, so I have a single Celery worker that processes jobs sequentially. This should simplify the solution, since multiple jobs will never try to use the model in shared memory at the same time, so I figured I'd mention it.

At the moment, I am using PyTorch's multiprocessing package with the forkserver start method, but I've had trouble determining exactly how it works and whether it behaves the way I want; a simplified sketch of my current configuration is below. If you have any input on my configuration or a suggestion for a solution, please leave a comment! I'm open to efficiency suggestions, as I intend to scale this solution. Thank you!
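For reference, here is a simplified, self-contained sketch of roughly how I'm configuring it (ToyModel just stands in for my actual model, which I load from a checkpoint on disk):

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn

# Stand-in for my custom model; the real one is loaded from a checkpoint on disk.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x)

def run_job(model, data):
    # Each job moves its data to the model's device and predicts.
    device = next(model.parameters()).device
    with torch.no_grad():
        return model(data.to(device)).cpu()

if __name__ == "__main__":
    # Use forkserver so child processes don't inherit the full parent state.
    mp.set_start_method("forkserver", force=True)

    model = ToyModel()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device).eval()

    print(run_job(model, torch.randn(4, 16)))
```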


2 Answers

Yash Soni (Best Answer)

Yes, there are ways to share a PyTorch model across multiple processes without creating copies.

torch.multiprocessing and model.share_memory():

This method uses PyTorch's torch.multiprocessing module. Call model.share_memory() on your model (it calls share_memory_() on each parameter and buffer tensor) to move them into shared memory. All processes then access the same parameter storage rather than redundant copies. This approach is efficient for training a model in parallel across multiple CPU cores.
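A minimal sketch of that pattern, using a toy model in place of yours and two worker processes (illustrative only):

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn

# Toy model standing in for your custom model.
def make_model():
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

def worker(model, rank):
    # Each process sees the same parameter storage, not a copy.
    x = torch.randn(8, 16)
    with torch.no_grad():
        y = model(x)
    print(f"worker {rank}: output shape {tuple(y.shape)}")

if __name__ == "__main__":
    mp.set_start_method("forkserver", force=True)

    model = make_model()
    model.share_memory()  # moves parameters and buffers into shared memory

    processes = []
    for rank in range(2):
        p = mp.Process(target=worker, args=(model, rank))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```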

Some resources for further exploration: https://www.geeksforgeeks.org/getting-started-with-pytorch/

Karl

My two cents: trying to share the model across multiple processes that are all trying to use the same GPU will be a massive headache with minimal performance improvement.

You want a setup where each model you deploy has its own container image that loads that specific model on startup. Each model gets a separate inference queue, i.e. if you want to deploy Model_A and Model_B, you have Image_A running Model_A consuming from Queue_A and Image_B running Model_B consuming from Queue_B.

Trying to run multiple models from the same queue/container is easier at first but becomes a nightmare once you start to scale.

Regardless of your inference framework, there's always going to be some cold start delay getting the container going, but you should only need to load a model once for the lifetime of the inference container.
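A rough sketch of that load-once pattern with Celery, where the model is loaded in a worker_process_init handler instead of inside the task (the ToyModel, broker URL, and commented-out checkpoint path are placeholders for your own setup):

```python
import torch
import torch.nn as nn
from celery import Celery
from celery.signals import worker_process_init

app = Celery("inference", broker="redis://localhost:6379/0")  # broker URL is illustrative

# Stand-in for your real model; in practice you'd load your checkpoint from disk.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16, 1)

    def forward(self, x):
        return self.net(x)

_model = None
_device = None

@worker_process_init.connect
def load_model(**kwargs):
    # Runs once per worker process, so the load cost is paid at startup, not per job.
    global _model, _device
    _device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    _model = ToyModel()
    # _model.load_state_dict(torch.load("model.pt", map_location=_device))  # placeholder path
    _model.to(_device)
    _model.eval()

@app.task
def predict(payload):
    # Each task only moves its input to the GPU; the model is already resident.
    x = torch.tensor(payload, dtype=torch.float32, device=_device)
    with torch.no_grad():
        out = _model(x)
    return out.cpu().tolist()
```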

Since you're using GPU inference, you want to make sure your inference containers are doing batch inference whenever possible.
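For example, something along these lines, assuming all requests share the same input shape:

```python
import torch

def predict_batch(model, inputs, device="cuda"):
    # Stack same-shaped request tensors into one batch and run a single forward pass.
    batch = torch.stack(inputs).to(device)
    with torch.no_grad():
        outputs = model(batch)
    # Split the batched output back into one result per request.
    return [out.cpu() for out in outputs]
```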

You might also find this article useful.