I am benchmarking my production application with and without Java 21 virtual threads.
Note that this is not a JMH-style benchmark. I am redirecting 50% of the traffic to each of two sets of Kubernetes deployments: one where the code is served by Tomcat with the virtual threads executor, and one where I am using Tomcat NIO as usual.
I have kept the same HPA for both fleets and I am monitoring the following:
- Number of Pods in each fleet
- Throughput
- CPU utilisation
- p99, p95, p75, p50 latency metrics
- Memory utilisation
- Load Average
Node Details
I am using Google Cloud n2d instances (2nd generation AMD EPYC processors): 16 vCPUs (8 physical cores) and 16 GB of memory.
Application Details
My application is both CPU & IO intensive. It receives a lot of requests; for each one it calls various services to enrich the data, splits the request and sends the parts to another service, and then joins the responses back together to produce the final response. From profiling, most of the CPU time goes into serialisation & deserialisation.
Since the application is IO intensive, virtual threads looked like a good fit for us.
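For illustration, a minimal sketch of switching Tomcat to a virtual-thread-per-task executor (this assumes embedded Tomcat managed by Spring Boot 3.x on Java 21; on Spring Boot 3.2+ the `spring.threads.virtual.enabled=true` property achieves the same thing, and the actual wiring in our deployment may differ):

```java
import java.util.concurrent.Executors;

import org.springframework.boot.web.embedded.tomcat.TomcatProtocolHandlerCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class VirtualThreadConfig {

    // Replace Tomcat's platform-thread worker pool with a virtual-thread-per-task
    // executor, so each request runs on its own virtual thread instead of one of
    // the 300-500 platform worker threads.
    @Bean
    public TomcatProtocolHandlerCustomizer<?> tomcatVirtualThreads() {
        return protocolHandler ->
                protocolHandler.setExecutor(Executors.newVirtualThreadPerTaskExecutor());
    }
}
```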
The implementation without virtual threads is optimised as well: when initiating multiple requests in parallel it forks them asynchronously and then waits for the results in the current thread (a rough sketch of this pattern is shown below). However, we are running 300-500 worker threads in Tomcat. For the async calls we use the Apache HTTP async client and gRPC, both configured with threads = number of processors.
- Average throughput: 2500 qps
- p99 latency: 195 ms
- p95 latency: 115 ms
- p75 latency: 50 ms
- p50 latency: 30 ms
- Average active requests: 84
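To make the fork-then-join pattern concrete, here is a rough sketch of that flow; the client interfaces and method names are illustrative placeholders, not our real code:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ScatterGatherSketch {

    // Hypothetical async clients standing in for the Apache HTTP async client
    // and gRPC stubs used by the real application.
    interface EnrichmentClient { CompletableFuture<String> enrich(String request); }
    interface DownstreamClient { CompletableFuture<String> process(String part); }

    private final EnrichmentClient enrichment;
    private final DownstreamClient downstream;

    ScatterGatherSketch(EnrichmentClient enrichment, DownstreamClient downstream) {
        this.enrichment = enrichment;
        this.downstream = downstream;
    }

    String handle(String request, List<String> parts) {
        // Fork the enrichment call and the per-part downstream calls asynchronously...
        CompletableFuture<String> enriched = enrichment.enrich(request);
        List<CompletableFuture<String>> partResults =
                parts.stream().map(downstream::process).toList();

        // ...then block in the current Tomcat worker thread until everything
        // completes, which is why 300-500 platform worker threads are needed
        // to keep enough requests in flight.
        StringBuilder response = new StringBuilder(enriched.join());
        partResults.forEach(f -> response.append(f.join()));
        return response.toString();
    }
}
```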
Pod Scheduling
Each pod is scheduled on its own 16-core node, which means one node runs exactly one pod. There are no resource limits set on the pods, so a pod can consume the entire resources of its node.
Observations
Before we conducted this experiment we assumed throughput would increase, since we would no longer be wasting time context switching between 300-500 platform threads.
We did see the following:
- A 50-100 qps improvement in throughput
- Memory reduction (expected, as the number of threads dropped)
- Latency was similar; a slight reduction, but nothing significant
- Load average was down from 17 to 14
The thing I was not able to understand here was: since my threads are now only doing CPU processing, why was my load average not aligning with my CPU utilisation, which was close to 50% (8000-9000m)?
On investigating further and reading more about load averages, I tried supplying:
-XX:ActiveProcessorCount=8
I did this so that it matches my number of physical cores and surprisingly everything changed:
- The throughput improvement remained the same (50-100 qps), but this time it was a consistent difference, whereas earlier it fluctuated up and down
- Further memory reduction as the thread count dropped again: where the common pool, ForkJoinPool, HTTP client, gRPC, etc. were previously sized at 16 threads, each now uses 8 (see the sketch after this list)
- Latency improvements became clearer, with 2-3% gains across all the metrics
- CPU utilisation reduced by about 1 core
- Load average was down to 8 and much closer to the CPU utilisation
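For context on those pool sizes: `-XX:ActiveProcessorCount` changes the value that `Runtime.getRuntime().availableProcessors()` returns, and the JDK and most libraries derive their default thread counts from it. A minimal sketch to confirm this on a pod (the `ioThreads` line is an illustrative stand-in for our own "threads = number of processors" configuration):

```java
import java.util.concurrent.ForkJoinPool;

public class ProcessorCountCheck {

    public static void main(String[] args) {
        // With no flag on a 16-vCPU node this prints 16; with
        // -XX:ActiveProcessorCount=8 the JVM reports 8 instead.
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("availableProcessors = " + cpus);

        // The common ForkJoinPool defaults to one less than availableProcessors(),
        // so it shrinks along with the flag.
        System.out.println("commonPool parallelism = "
                + ForkJoinPool.getCommonPoolParallelism());

        // Libraries and application code that size their pools as
        // "number of processors" typically call the same method,
        // so they drop from 16 to 8 threads as well.
        int ioThreads = cpus; // hypothetical sizing mirroring our config
        System.out.println("io threads = " + ioThreads);
    }
}
```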
I want to understand why this happened and how things are playing out underneath the system. In all my applications that are either async or use virtual threads, I normally set thread count = number of processors, but here, when I instead set thread count = number of physical processors, I see noticeably better results and a lower load average.