Improve Flask application performance with batching


I have a Flask application running on a gunicorn server (2 workers, 8 threads) with one endpoint that takes a string and returns the embedding for that string using a SentenceTransformer model. The server is expected to receive a lot of concurrent requests.

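Simplified code:

from flask import Flask, jsonify, request
from sentence_transformers import SentenceTransformer

app = Flask(__name__)

# Shared model, loaded once per worker process
model = SentenceTransformer("my_model")

@app.route("/search", methods=["POST"])
def search():
    # The route only accepts POST, so no method check is needed
    query = request.form.get("query")
    # encode() returns a NumPy array; convert it to a list for the JSON response
    return jsonify(model.encode(query).tolist())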

Since SentenceTransformer's model.encode() function is CPU-bound, it causes a slowdown, and a lot of requests end up queued. However, when given a list of queries/strings, model.encode() processes them faster than one call per query, thanks to its parallel processing capabilities.

Is there a service queue for Flask I can use that processes requests normally while the load is manageable but, once too many requests start queueing up, batches some of the queued queries and returns each result to the thread/worker that is handling the corresponding request? It could potentially save a lot of time.

I tested model.encode() function with sequential processing and batch processing (list of queries) and batch processing is a lot faster.
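
To make the idea concrete, here is a minimal sketch of the kind of queue I have in mind, using only the standard library. The MicroBatcher class, its submit() method and the max_batch_size parameter are hypothetical names for illustration, not part of Flask or sentence-transformers. It assumes the encode function accepts a list of strings and returns one result per string in the same order. A single background thread takes whatever is waiting in the queue: one request at a time under light load, and everything that piled up (up to max_batch_size) when the previous encode call was slow.

```python
import queue
import threading
from concurrent.futures import Future


class MicroBatcher:
    """Hypothetical helper: funnels single queries from many threads into batched encode calls."""

    def __init__(self, encode_batch, max_batch_size=32):
        # encode_batch is assumed to map a list of strings to a sequence of results
        self._encode_batch = encode_batch
        self._max_batch_size = max_batch_size
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, query):
        """Called from a request thread; blocks until this query's result is ready."""
        future = Future()
        self._queue.put((query, future))
        return future.result()

    def _run(self):
        while True:
            # Block until at least one request arrives
            batch = [self._queue.get()]
            # Then drain whatever else queued up in the meantime, without waiting
            while len(batch) < self._max_batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            try:
                results = self._encode_batch([q for q, _ in batch])
                # Hand each result back to the thread that submitted the query
                for (_, future), result in zip(batch, results):
                    future.set_result(result)
            except Exception as exc:
                # Propagate the failure to every waiting request thread
                for _, future in batch:
                    future.set_exception(exc)
```

In the endpoint above this would presumably be used as batcher = MicroBatcher(model.encode), created once per worker process, with model.encode(query) replaced by batcher.submit(query). Each gunicorn worker would have its own batcher, so only the 8 threads inside one worker share a batch.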
