How to debug high latency for clients for my REST API hosted on IIS

180 Views Asked by At

I have a OData REST API application running on IIS, using http.sys as the web server. I notice for around 5-10% of the requests my clients are noticing high latency (> 300 ms), but when I look at server side application logs, I see I processed the request in less than 50 ms. My clients are NOT geographically far from the web server. I think there is some non-network bottleneck, but I don't know where to look.

So far, I ran an Http.Sys ETW trace on the machine to capture a single request with high client reported latency. Here are the results:

Client Reported Latency: 323 ms
Server application reported processing time: 38 ms                           

Http.Sys trace:

1.RecvReq: 2023-08-14T22:33:39.653675000Z (“Request received”)

2.Parse:   2023-08-14T22:33:39.653723100Z (“Parsed request with URI …”)

3.Deliver: 2023-08-14T22:33:39.653755300Z (“Delivered request to server application from request queue IISPOOL_<AppName>”)

4.Deliver: 2023-08-14T22:33:39.653811300Z (“Delivered request to server application from request queue IISPOOL_<AppName>”)

5.RecvResp: 2023-08-14T22:33:39.697113000Z (“Server application passed response”) (EntityChunkCount: 23, HeaderLength: 329, StatusCode: 200)
                                                                                                                                                                                                                                            
6.RecvBody: 2023-08-14T22:33:39.701603000Z (“Server application passed entity body”)

7.FastRespLast: 2023-08-14T22:33:39.859224500Z (“Server Application passed the last response”)

8.FastSend: 2023-08-14T22:33:39.859284800Z (“Queued last response for sending. Status code is 200”)

I don't know fully how to interpret these logs, so I am not finding them helpful. (If you know where I can find resources to help interpret these logs, please share that).

Do these logs reveal anything insightful, like there is some bottleneck in my application logic, or maybe on the client side? What additional thing you would you suggest I look at next? I also checked performance counters on my machine (CPU/Memory) and it was low.

1

There are 1 best solutions below

0
Alois Kraus On

You can record CPU and Network traces with ETW. Download ETWController:

Note: ETWController and ETWAnalyzer are written by me.

Start it as Administrator and select as Preset CSWitch

enter image description here

Then give the recording file a descriptive file name like

C:\temp\SlowHttpSys_%TS%_%COMPUTERNAME%.etl

After that select Manual Editing to tweak the recording settings enter image description here Change the Profile to also record Network Profiling with this command line

xxwpr -start "MultiProfile.wprp!CSwitch" -start "MultiProfile.wprp!Network" -start HookEvents.wprp

Now press Start and let your clients have slow response times. This generates quite a bit of data (1-2 GB/minute) in a up to 4 GB Ring buffer so your time horizon of the recording is no more than 1-2 minutes. Be sure to stop recording soon enough after something bad did happen.

When the recording is finished you can analyze the ETW data with a command line tool called ETWAnalyzer.

Download and extract it to a folder and add the PATH to the folder it so you can execute the tool from anywhere.

Now you can extract the relevant data with

EtwAnalyzer -extract all -fd "C:\Temp\SlowHttpSys_2023-10-03_212630_MAGNON.etl" -symserver ms 

The first time you should use the -symsver ms option to download Microsoft symbols for all Windows code. When you did not patch/update Windows in between you can omit this next time which will extract the data quicker.

Now you can check TCP metrics with an IP filter of one of your client IPs

EtwAnalyzer -dump tcp -fd C:\Temp\Extract\SlowHttpSys_2023-10-03_212630_MAGNON.json -topn 5 -ipport *8b87* 

If there are showing up Retrans Count/%/Delay then TCP retransmissions did occur which indicate network overload. Your reported number of 300ms is the default TCP retransmission delay for Windows TCP sockets connected to the Internet. For more information see https://github.com/Siemens-Healthineers/ETWAnalyzer/blob/main/ETWAnalyzer/Documentation/DumpTCPCommand.md

If it is not the network you should consider using WPA (Windows Performance Analyzer) to look deeper inside IIS if there are some waits due to e.g. AV solutions. A generic query could help also like

EtwAnalyzer -fd C:\Temp\Extract\SlowHttpSys_2023-10-03_212630_MAGNON.json -dump cpu -stacktags *virus* -minmaxwaitms 100  -pn w3wp

to get some first indications if some Antivirus solution is interfering in one of your w3wp.exe processes.

enter image description here enter image description here