I have a .NET 4.5 ASP.NET Web API application, deployed in IIS with 1 worker process on an 8 GB VM with 4 CPUs.
I made changes to it recently (upgraded ServiceStack.Interfaces, ServiceStack.Common, ServiceStack.Redis and a bunch of dependencies) and started noticing that the IIS app pool this app is deployed on recycles about once an hour (give or take a few minutes).
There is nothing in my application logs that indicates any kind of issue. I collect metrics using Telegraf and I do NOT see memory usage increase at all; as far as every metric I look at goes, everything looks absolutely normal right up until the app pool recycles.
I looked at Event Viewer and filtered the System log by the WAS source and see events with ID 5011, which, as I understand it, basically means the IIS worker process crashed.
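For completeness, here is roughly how that same filter can be expressed programmatically with the stock EventLogReader API (a sketch only; I believe the WAS events register under the "Microsoft-Windows-WAS" provider name, but verify that against what your Event Viewer shows):

```csharp
// Sketch: enumerate WAS 5011 events from the System log programmatically
// instead of filtering by hand in Event Viewer.
using System;
using System.Diagnostics.Eventing.Reader;

class Was5011Events
{
    static void Main()
    {
        // XPath filter: WAS provider, event ID 5011 (worker process failure).
        var query = new EventLogQuery(
            "System",
            PathType.LogName,
            "*[System[Provider[@Name='Microsoft-Windows-WAS'] and (EventID=5011)]]");

        using (var reader = new EventLogReader(query))
        {
            for (var record = reader.ReadEvent(); record != null; record = reader.ReadEvent())
            {
                using (record)
                {
                    Console.WriteLine("{0}: {1}", record.TimeCreated, record.FormatDescription());
                }
            }
        }
    }
}
```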
So then I ran DebugDiag on my local box with the app deployed there (I can reproduce the issue locally). It ran for a while and I eventually got the same event in Event Viewer. I looked at the crash analysis report from DebugDiag and all I see is a bunch of logged exceptions, but nothing concrete right before the crash.
At this point I'm not entirely sure what else I can do to figure out what's causing the crash, so I'm hoping for more suggestions on how to get more visibility into it.
What I think is happening is that there is some incompatibility between one of my dependencies and one of the upgraded packages, which causes an exception that is not handled by anything and crashes the IIS worker process.
My application is otherwise working perfectly fine: all API endpoints function with no issues, memory is NOT increasing, and CPU is fine. So as far as I can tell there are no issues up to the moment of the crash.
I'm wondering if anyone knows any tricks to find what's causing the crash and/or handle it, i.e. prevent this exception from escaping and crashing the worker.
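One thing that did help me get a bit more visibility was wiring up the process-wide exception hooks so that anything escaping on a background thread at least gets logged before the worker dies. This is only a minimal sketch assuming the standard Web API 2 Global.asax template; the Log() helper is a placeholder for whatever logger you use, and note that the UnhandledException handler can only record the error, it cannot stop the process from terminating:

```csharp
// Global.asax.cs -- sketch only; Log() is a placeholder for a real logger.
using System;
using System.Threading.Tasks;
using System.Web;
using System.Web.Http;

public class WebApiApplication : HttpApplication
{
    protected void Application_Start()
    {
        GlobalConfiguration.Configure(WebApiConfig.Register);

        // Exceptions thrown on non-request (background) threads bypass ASP.NET's
        // error handling and take down w3wp.exe; at least record them first.
        AppDomain.CurrentDomain.UnhandledException += (s, e) =>
            Log("Unhandled exception (process is terminating): " + e.ExceptionObject);

        // Faulted Tasks whose exceptions are never observed.
        TaskScheduler.UnobservedTaskException += (s, e) =>
        {
            Log("Unobserved task exception: " + e.Exception);
            e.SetObserved(); // stops escalation on .NET 4.0; harmless on 4.5+
        };
    }

    private static void Log(string message)
    {
        // Placeholder: route to your actual logging pipeline.
        System.Diagnostics.Trace.TraceError(message);
    }
}
```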
I was able to narrow down with some confidence that the issue lies somewhere within the ServiceStack.Redis RedisPubSubServer. What the actual issue is, I don't know, as that would take a lot more time to dig into and I've already spent too much time on this.
However, piggybacking on some existing code I had (from before ServiceStack supported sentinel), I created a new implementation of the Redis client wrapper, which I call LazySentinelServiceStackClientWrapper. Instead of using the built-in sentinel manager, it relies on a custom sentinel provider I created, LazySentinelApiSentinelProvider. This implementation interrogates the available sentinel hosts in random order for the master and slave nodes, then constructs a pool from the retrieved read/write and read-only hosts, and that pool is used to run the Redis operations. The pool is refreshed whenever an error occurs (e.g. after a failover). This is in contrast to the built-in sentinel manager that comes with ServiceStack.Redis, which instantiates a RedisPubSubServer, listens for messages from sentinel whenever configuration changes such as fail-overs occur, and updates the managed Redis connection pool accordingly.
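To give an idea of the approach (this is a heavily simplified sketch, not my actual code: the LazySentinelPoolFactory name and the parsing of the sentinel replies are illustrative only, and the refresh-on-error logic is omitted), the provider asks each sentinel directly for the current master and slaves and builds a plain PooledRedisClientManager from the answer:

```csharp
// Sketch of the "lazy" sentinel lookup: ask the sentinels directly for the
// current master and slaves, then build a pooled client manager from the
// answer. Names like LazySentinelPoolFactory are illustrative only.
using System;
using System.Collections.Generic;
using System.Linq;
using ServiceStack.Redis;

public class LazySentinelPoolFactory
{
    private readonly string[] sentinelHosts;   // e.g. "10.0.0.1:26379"
    private readonly string masterName;        // e.g. "mymaster"
    private readonly Random random = new Random();

    public LazySentinelPoolFactory(string[] sentinelHosts, string masterName)
    {
        this.sentinelHosts = sentinelHosts;
        this.masterName = masterName;
    }

    public IRedisClientsManager CreatePool()
    {
        // Try sentinels in random order until one answers.
        foreach (var sentinel in sentinelHosts.OrderBy(_ => random.Next()))
        {
            var parts = sentinel.Split(':');
            try
            {
                using (var client = new RedisClient(parts[0], int.Parse(parts[1])))
                {
                    // SENTINEL get-master-addr-by-name <name> => [host, port]
                    var master = client.Custom("SENTINEL", "get-master-addr-by-name", masterName);
                    var masterHost = master.Children[0].Text + ":" + master.Children[1].Text;

                    // SENTINEL slaves <name> => one reply per slave, as name/value pairs
                    var slaves = client.Custom("SENTINEL", "slaves", masterName);
                    var slaveHosts = new List<string>();
                    foreach (var slave in slaves.Children)
                    {
                        var fields = slave.Children.Select(c => c.Text).ToList();
                        var ip = fields[fields.IndexOf("ip") + 1];
                        var port = fields[fields.IndexOf("port") + 1];
                        slaveHosts.Add(ip + ":" + port);
                    }

                    // Read/write pool points at the master, read-only pool at the slaves.
                    return new PooledRedisClientManager(
                        new[] { masterHost },
                        slaveHosts.Count > 0 ? slaveHosts.ToArray() : new[] { masterHost });
                }
            }
            catch (Exception)
            {
                // This sentinel was unreachable; try the next one.
            }
        }
        throw new InvalidOperationException("No sentinel host could be reached.");
    }
}
```

The key difference is that there is no background pub/sub subscriber at all; the sentinels are only queried when the pool is (re)built, i.e. at startup and after an error.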
I installed my version of this Redis client wrapper into my application and have seen no app pool recycle events since (other than the scheduled ones).
Above is the log of app pool recycle events before I disabled the ServiceStack.Redis sentinel manager.
And here's the log of app pool recycle events after installing my new lazy sentinel manager
The first spike is me recycling the app manually, and the second one is the scheduled 1am recycle. So clearly the issue is solved.
What the actual reason is for the sentinel manager (via RedisPubSubServer) causing IIS Rapid-Fail Protection to fire and recycle the app pool, I do not know; maybe someone with much more Redis and/or IIS experience can speak to that. Also, I did not test this on .NET Core; I only tested a .NET 4.5.1 application deployed in IIS, but on many different machines, including my local development machine and beefy production machines.
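If someone does want to keep the built-in sentinel manager and just get more visibility into what its background worker is doing, RedisSentinel exposes callbacks that can at least surface those errors in your logs. This is only a sketch; I have not verified that it prevents the crash, only that the hooks exist, and the hosts and master name below are examples:

```csharp
// Sketch: wire up the built-in RedisSentinel's callbacks so that errors from its
// background pub/sub worker are logged instead of disappearing silently.
using System.Diagnostics;
using ServiceStack.Redis;

public static class SentinelBootstrap
{
    public static IRedisClientsManager Init()
    {
        // Example sentinel hosts and master name; replace with your own.
        var sentinelHosts = new[] { "10.0.0.1:26379", "10.0.0.2:26379", "10.0.0.3:26379" };

        var sentinel = new RedisSentinel(sentinelHosts, masterName: "mymaster")
        {
            OnFailover = manager =>
                Trace.TraceInformation("Sentinel reported a failover; pool was reconfigured."),
            OnWorkerError = ex =>
                Trace.TraceError("Sentinel worker error: " + ex),
            OnSentinelMessageReceived = (channel, msg) =>
                Trace.TraceInformation("Sentinel message on " + channel + ": " + msg)
        };

        // Start() resolves the current master/slaves and returns the managed pool.
        return sentinel.Start();
    }
}
```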
Finally, one last note: the first image, which shows all the recycle events, is from my CI machine, which takes barely any traffic, maybe 1 request every few minutes. So the issue is not some memory leak or resource exhaustion. Whatever the issue is, it happens regardless of traffic, CPU load, or memory load; it just happens periodically.
Needless to say, I will not be using the built-in sentinel manager, at least for now.