In IIS we had a website with 3 applications, each configured with its own AppPool. After a refactoring we decided to unify these applications into one. To let clients that have not yet been updated keep using the old URLs, I decided to configure URL Rewrite and Application Request Routing (ARR) and to use the ARR proxy to accept requests on the old URLs and forward them to the new ones.
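To give an idea, the rewrite rule in the web.config of each old application looks roughly like this (the rule name and the target URL below are simplified examples, not the real ones; the real rules just map the old paths onto the corresponding new ones):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <!-- Simplified example: forward everything that hits the old application
             to the corresponding path of the new application via the ARR proxy -->
        <rule name="OldToNew" stopProcessing="true">
          <match url="(.*)" />
          <action type="Rewrite" url="http://localhost/N/{R:1}" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>
```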
On Monday (the first Monday after we set up the new configuration; in the previous days everything worked fine) at 8 am the clients start a procedure that sends some data to the server (to one of the old applications). We have more than 1000 clients and the procedure makes about 10 calls to the API (each request is normally resolved in less than 10 ms), so we get a burst of more than 10,000 calls (plus the ordinary traffic) in 1-2 minutes.
With the old configuration (the 3 separate applications and no proxy) we seldom had any issues, but this morning we had a system outage.
In the IIS logs I see that from 7:59:00 to 7:59:59 the server handled around 7000 requests with no problems (about half are the original requests from the clients, the other half the ones made by the proxy), but shortly after that I see only the original requests, all ending in a timeout after more than 120 s with a 502.3 status and Win32 status 64. The change was sudden, from 1-10 ms to more than 120,000 ms within 2 requests, so it is quite obvious that some limit was hit.
I am not much of a server expert; I searched for limits on the ARR proxy but didn't find much useful information. I suspect, however, that the issue is not in the proxy at all, but in the AppPool configuration.
The current configuration is:
- 1 website with the 3 applications (let's call the new one N and the old ones O1 and O2),
- the directories of the O1 and O2 applications contain only a web.config file with the URL Rewrite rules,
- 3 application pools (NAP, O1AP, O2AP), so each application still has its own application pool,
- the configuration of the 3 AppPools is pretty much the default one (so a queue length of 1000), with no managed code and the integrated pipeline,
- the timeouts are 120 s on both the ARR side and the IIS side (see the sketch after this list).
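For reference, this is roughly how I understand the relevant parts of applicationHost.config (I haven't changed anything beyond what the list above says, so please correct me if I am misreading these settings):

```xml
<!-- <system.applicationHost> section: the 3 pools with the default queue length -->
<applicationPools>
  <add name="NAP"  queueLength="1000" managedRuntimeVersion="" managedPipelineMode="Integrated" />
  <add name="O1AP" queueLength="1000" managedRuntimeVersion="" managedPipelineMode="Integrated" />
  <add name="O2AP" queueLength="1000" managedRuntimeVersion="" managedPipelineMode="Integrated" />
</applicationPools>

<!-- <system.webServer> section (server level): the ARR proxy with its 120 s timeout -->
<proxy enabled="true" timeout="00:02:00" />
```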
What I don't know is exactly what happens when an old URL is hit and which AppPools are involved. From what I understand, the request made to O1 lives in O1AP, then the proxy makes its own request to N and that request lives in NAP, so there shouldn't be much difference between this configuration and the old one with the 3 separate applications.
I think the culprit is the queue length of 1000 requests, but the problem I have with that theory is that in the advanced settings of the AppPool, when the "Queue Length" row is selected, the description in the bottom pane says that when the queue is full new requests are rejected with an HTTP 503, whereas in the IIS logs I see 502.3.
Hoping the situation is clear enough, my questions are:
- is there something I am missing?
- what are my options for monitoring the situation?
- what can I do to solve the issue?
Some versions:
- Windows server 2019
- IIS 10.0.17763.1
- IIS URL Rewrite Module 2 7.2.1993
- ARR 3.0 3.0.05311