I started hosting a Redis/Celery/Python (Dash) app on Heroku about a month ago. It worked flawlessly until my latest update, when my datastore credentials changed and a new bug appeared.
Every time I restart my dyno, I receive the error below for about 5 minutes.
The error is:
kombu.exceptions.OperationalError: Error 8 connecting to ec2-44-208-193-34.compute-1.amazonaws.com:19130. EOF occurred in violation of protocol (_ssl.c:1129).
After about 5 minutes, the error goes away on its own.
My code looks like this:
import ssl
from celery import Celery

celery_app = Celery(
    __name__,
    broker="rediss://:*@ec2-44-208-193-34.compute-1.amazonaws.com:19130/0",
    backend="rediss://:*@ec2-44-208-193-34.compute-1.amazonaws.com:19130/1",
    # Certificate verification is disabled for both broker and result backend
    broker_use_ssl={
        'ssl_cert_reqs': ssl.CERT_NONE
    },
    redis_backend_use_ssl={
        'ssl_cert_reqs': ssl.CERT_NONE
    }
)
Does anyone have insight into what might be causing it and how to prevent it?
Check which Redis version your Heroku add-on is running. There are changes in v6 where exhausting the maximum number of client connections does not result in a "max number of clients reached" kind of error but in a TLS one.
When a deploy happens, the existing dynos are still holding their baseline connections plus whatever the traffic requires, so when the replacement dynos try to open new connections, Redis rejects them.
The reason it fixes itself after a while is Redis' idle timeout setting, which on Heroku usually defaults to 300 seconds. Once 300 seconds have passed, the old dynos' connections are cleaned up, and the TLS errors caused by exceeding the max clients limit go away.
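To make the overlap concrete, here is a rough sketch of the arithmetic during a deploy. The limit of 20 clients and the 12 connections per dyno are made-up numbers for illustration only; they do not come from the question or from any specific Heroku plan.

# Hypothetical numbers, chosen only to illustrate the overlap.
MAX_CLIENTS = 20   # assumed connection limit of the Redis plan
PER_DYNO = 12      # assumed baseline + traffic connections held by one dyno


def connections_during_deploy(per_dyno, dynos):
    """Old dynos still hold their connections while replacements open new ones."""
    old = per_dyno * dynos
    new = per_dyno * dynos
    return old + new


steady_state = PER_DYNO * 1
during_deploy = connections_during_deploy(PER_DYNO, dynos=1)

print("steady state fits:", steady_state <= MAX_CLIENTS)
print("deploy overlap fits:", during_deploy <= MAX_CLIENTS)

With these assumed numbers the app fits comfortably in normal operation but briefly needs double the connections during a restart, which is the window in which the new dynos get rejected.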
You can lower the timeout to shorten the window in which the error occurs, but the better fix is either to increase the max connections by upgrading the Redis plan or to reduce the number Celery uses (which is a complicated topic). Hopefully this and this are helpful.
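As one possible starting point for reducing Celery's usage, the sketch below caps the broker connection pool and the result backend's connections. `broker_pool_limit` and `redis_max_connections` are Celery configuration settings; the values 2 and 4 and the helper `apply_connection_limits` are assumptions for illustration, and the right numbers depend on your plan's limit and your worker concurrency.

# Hypothetical values; tune them against your plan's connection limit.
CONNECTION_LIMITS = {
    "broker_pool_limit": 2,       # max broker connections kept open in the pool
    "redis_max_connections": 4,   # max connections for the Redis result backend
}


def apply_connection_limits(app_conf, limits=CONNECTION_LIMITS):
    """Merge the limits into a config mapping and return it."""
    app_conf.update(limits)
    return app_conf


# Demonstrated on a plain dict so the snippet runs without Celery installed.
settings = apply_connection_limits({})
print(settings)

# With the app from the question, the equivalent call would be:
# celery_app.conf.update(CONNECTION_LIMITS)

Note that each worker process keeps its own connections, so lowering worker concurrency also reduces the total.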
Frustratingly, if you look at the Redis stats, Heroku will not report that you went over the limit at all. Nothing indicates that you suddenly tried to double (or more) the client connections. This is misleading: the extra connections are never really opened but are rejected so quickly that they are not counted. Worse, the error is poorly disguised as an SSL issue when it is really a resource exhaustion issue.