Sudden increase in 460 status codes and ClientDisconnected errors in Flask app running in ECS

My Setup

I have 1 ECS cluster with 1 service that runs 4 tasks, each with 1 app container and 1 CloudWatch agent container. An Application Load Balancer is set up to route traffic to the service. I use Fargate, and each task & app container has 0.5 vCPU and 2 GB memory. The app container runs Gunicorn with 4 gevent workers. The app makes HTTPS and gRPC requests to other services and uses pillow-simd to convert images that are uploaded as PNG to JPEG. HTTPS requests are monkey-patched to work with gevent by Gunicorn; gRPC I patched myself. There is one main endpoint where clients upload images. The app is implemented with Flask.
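
For reference, the Gunicorn/gevent setup looks roughly like this (a minimal sketch; the config file layout and the post_fork hook are just illustrative here, but grpc.experimental.gevent.init_gevent() is what I mean by patching gRPC myself):

    # gunicorn.conf.py -- sketch of the worker setup described above
    bind = "0.0.0.0:8080"
    workers = 4                  # 4 gevent workers per app container
    worker_class = "gevent"      # Gunicorn monkey-patches the stdlib for this worker class
    timeout = 61                 # app timeout slightly above the ALB's 60s

    def post_fork(server, worker):
        # gRPC is not covered by the stdlib monkey-patching, so patch it manually
        import grpc.experimental.gevent as grpc_gevent
        grpc_gevent.init_gevent()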

There are around 3.5 requests per second at peak. Most images are JPEG, so I don't have to convert them, but some are PNG.

Problem

  • A while ago ECS started to kill and restart tasks with the event: (service App, taskSet ecs-svc/XXXXXXXXXXX) port 8080 is unhealthy in target-group AppTG-2 due to (reason Request timed out).

  • I also got GreenletExit errors and worker graceful timeout warnings.

  • The count of 460 errors at the Application Load Balancer also increased. (I know that a lot of the clients have a bad internet connection, but I think the number is too high.) On the app side I get a 500, because at the point where I read the request body a ClientDisconnected error is raised (see the upload handler sketch after this list).

  • The timeout on the ALB is 60 seconds and in the app it is 61 seconds.

  • I use statsd to record the duration of all requests to other services, and for some of the requests with ClientDisconnected errors I see durations of multiple minutes up to hours (the timing wrapper looks roughly like the sketch after this list).

  • When I changed the health check to not call the DB but just return 200 (see the sketch after this list), the errors and the restarting stopped, but the number of 460s increased. My guess is that, before, replacing a stuck/overloaded task with a fresh one was what helped.

  • CPU and memory are always really low, which I don't understand, because I thought that with enough requests an event loop should more or less use up to 100% of a CPU.

  • I have a strong feeling that the gevent loop is blocked, but I'm not sure. I also could not replicate this behaviour in my staging environment.

  • I also tried it with more or fewer event loops per task and with more or fewer tasks, but the results were inconclusive.

  • I also put the image conversion into an extra greenlet (roughly as in the last sketch after this list), but this didn't help either.
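
To illustrate where the ClientDisconnected errors come from, the upload endpoint is essentially this (a minimal sketch; the route name and the 400 response are placeholders, not my real handler):

    from flask import Flask, request
    from werkzeug.exceptions import ClientDisconnected

    app = Flask(__name__)

    @app.route("/upload", methods=["POST"])   # placeholder route name
    def upload():
        try:
            # Reading the body pulls the remaining bytes from the socket; if the
            # client has already dropped the connection, Werkzeug raises
            # ClientDisconnected here, which currently surfaces as a 500.
            body = request.get_data()
        except ClientDisconnected:
            return "client disconnected", 400
        # ... conversion and calls to other services happen here ...
        return "", 200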
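The statsd timing around the outbound calls is essentially this (a sketch assuming the statsd PyPI package and the requests library; host/port and the metric name are placeholders for my setup, where the CloudWatch agent sidecar receives the metrics):

    import requests
    import statsd

    stats = statsd.StatsClient("localhost", 8125)   # placeholder for the agent sidecar

    def call_other_service(url, payload):
        # timer() measures wall-clock time of the whole call, so a starved or
        # blocked greenlet inflates the recorded duration as well.
        with stats.timer("app.outbound.other_service"):
            return requests.post(url, json=payload, timeout=30)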
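The health check change was essentially this (sketch; the path is whatever the target group points at, and before the change the handler also ran a DB query):

    from flask import Flask

    app = Flask(__name__)

    @app.route("/health")        # placeholder path
    def health():
        # Used to hit the DB; now it only confirms the worker can still answer.
        return "", 200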
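And what I mean by "an extra greenlet" for the conversion is roughly this (sketch; function names and JPEG quality are placeholders). As far as I understand, a greenlet still runs on the same OS thread, so the CPU-bound Pillow work blocks the event loop for its whole duration anyway, which would fit the fact that it didn't help:

    import gevent
    from io import BytesIO
    from PIL import Image   # pillow-simd installs under the PIL namespace

    def convert_png_to_jpeg(png_bytes):
        # CPU-bound: decode the PNG and re-encode it as JPEG.
        img = Image.open(BytesIO(png_bytes)).convert("RGB")
        out = BytesIO()
        img.save(out, format="JPEG", quality=85)
        return out.getvalue()

    def convert_in_greenlet(png_bytes):
        # Spawning a greenlet does not move the work off the event-loop thread;
        # the loop is still blocked while convert_png_to_jpeg() runs.
        g = gevent.spawn(convert_png_to_jpeg, png_bytes)
        return g.get()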

I thought about using gthread workers instead of gevent, but this increased the average request duration drastically.

Does anybody have an idea how I could solve this?
