I am a beginner in MLOps and I have a Python script that uses a PyTorch model (Whisper Tiny) for speech-to-text (STT). According to the model card, the model has about 39 million parameters, so it is tiny compared to my 24 GB of GPU memory.
I want to deploy multiple instances of this model on the same GPU and process requests in parallel, so that I can make better use of the GPU memory and improve throughput. However, when I try this, the requests are processed sequentially and GPU utilization stays low.
I am using FastAPI and Docker to build and run my app. My Dockerfile uses pytorch/pytorch:latest as the base image and starts the app with uvicorn. I run two containers from this image, one on port 8000 and one on port 8001. When I send two concurrent requests, one to each container, the first request takes 5 seconds and the second takes 10 seconds, which suggests that the second request waits for the first to complete.
Following is how I am running these containers:
docker run -d -p 8000:8000 eng_api
docker run -d -p 8001:8000 eng_api
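And this is roughly how I test them with concurrent requests (the audio file name below is just a placeholder for one of my test files):

import threading
import time

import requests

def send(port):
    start = time.time()
    with open("sample.wav", "rb") as f:  # placeholder test file
        response = requests.post(f"http://localhost:{port}/asr", files={"audio": f})
    print(f"port {port}: {response.status_code} in {time.time() - start:.1f}s")

# start both requests at (almost) the same time
threads = [threading.Thread(target=send, args=(port,)) for port in (8000, 8001)]
for t in threads:
    t.start()
for t in threads:
    t.join()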
Following is my Dockerfile:
FROM pytorch/pytorch:latest
RUN pip install fastapi uvicorn transformers ...
COPY main.py /main.py
WORKDIR /
CMD ["uvicorn", "main:app", "--host=0.0.0.0", "--port=8000"]
Following is my main.py file:
import ...
@app.post("/asr")
async def asr(audio: UploadFile = File(...)):
    audio_data = await audio.read()

    # decode the uploaded bytes and resample to 16 kHz via the datasets Audio feature
    dset = Dataset.from_dict({"audio": [audio_data]})
    dset = dset.cast_column("audio", Audio(sampling_rate=16000))
    audio_array = dset[0]["audio"]["array"]
    sampling_rate = dset[0]["audio"]["sampling_rate"]

    # run the model and decode the predicted token IDs back to text
    input_features = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt").input_features
    input_features = input_features.to(model.device)  # keep the inputs on the same device as the model
    output = model.generate(input_features)
    transcription = processor.batch_decode(output, skip_special_tokens=True)[0]
    return {"transcription": transcription}
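For context, app, processor, and model are created once at module level, before the endpoint is defined, roughly like this (the exact checkpoint ID and the device placement are not shown above, so treat this as a sketch):

from fastapi import FastAPI, File, UploadFile
from transformers import WhisperProcessor, WhisperForConditionalGeneration

app = FastAPI()

# assumed checkpoint: the tiny Whisper model from the Hugging Face Hub
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.to("cuda")  # assumed: the model is moved to the GPU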
How can I resolve this issue? How can I ensure that the containers run in parallel and use the GPU resources efficiently? Is there a way to specify the GPU memory allocation for each container? Or do I need to use a different framework or tool to manage the deployment?