java how to avoid stop-the-world garbage collection


We have a market app that streams market data and gets a spike of data on a daily basis. Due to the allocations during the spike, the VM ends up doing a stop-the-world garbage collection. From the usage graphs it looks like all the allocation during the spike goes directly to old gen. Are there any G1GC parameters that can be used to change this behavior? I am thinking that by keeping the objects in Eden space we'll be able to avoid the STW GC.

Tried -XX:MaxTenuringThreshold=15; it does not seem to have an effect.

Using Java 8 with G1GC.

[Image: Total Heap Usage]

[Image: G1 Eden space: during the spike the usage is going down]

[Image: G1 old gen: looks like all allocation during the spike is going directly to old gen]

1 Answer

dchristle:

The problem is likely that your application is allocating a very large amount of long-lived objects in a short amount of time. As other commenters have mentioned, you cannot stop all STW pauses, but given your large heap & the fact that you're posting here, you probably hit a very long STW pause - maybe even a Full GC - during the spike. Enabling detailed logs with -verbose:gc -XX:+PrintGCDetails would give us more detail to work with.
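For reference, here's a sketch of a Java 8 GC logging setup (these are the standard pre-unified-logging HotSpot flags; the log path and jar name are placeholders, not from the question):

```shell
# Java 8 / HotSpot GC logging flags (pre-JDK-9, pre-unified-logging syntax).
# /var/log/myapp/gc.log and market-app.jar are placeholders.
java -verbose:gc \
     -XX:+PrintGCDetails \
     -XX:+PrintGCDateStamps \
     -XX:+PrintGCApplicationStoppedTime \
     -Xloggc:/var/log/myapp/gc.log \
     -jar market-app.jar
```

-XX:+PrintGCApplicationStoppedTime is worth adding because it reports total stopped time per pause, which is exactly what you're trying to reduce.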

Since you didn't provide GC logs, it's difficult to say exactly, but there are some observations from your plots that can help us figure it out:

The Eden space usage gets very small, relative to the steady state.

Under normal conditions in G1, a "Young GC" is by far the most common collection, and it includes a STW pause that should ideally be short. Most of the points on your plots are probably Young GCs; when a GC event happens, the GC log prints the region sizes & a log analyzer/plotter can process them. In a Young GC, G1 starts at your application's reference roots, scans the reference tree deeper and deeper to find all "live" objects, and copies them to one of two "Survivor" spaces.

Roughly speaking, because the length of a Young GC STW pause is proportional to the amount of live objects, G1 will shrink the size of Eden to try to ensure future pauses meet the latency target.
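That latency target is G1's pause-time goal, controlled by -XX:MaxGCPauseMillis (default 200 ms on Java 8). A minimal sketch, with a placeholder jar name:

```shell
# G1 sizes Eden/young gen adaptively to try to meet this pause goal.
# Lowering the value makes G1 shrink young gen more aggressively;
# raising it permits larger young collections with longer individual pauses.
java -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -jar market-app.jar
```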

The Old Gen usage increases rapidly during the traffic spike, by approx. 16GB. It is nearly monotonic, but contains a very small decrease midway. Increasing MaxTenuringThreshold to 15 didn't improve things.

Since your application continues to allocate lots of memory quickly, and the Eden space is now smaller, it fills up fast. The time between when an object is allocated & when a Young GC processes it is now even shorter, meaning those objects have less time to die. On average, this means a higher fraction will need to be copied to Survivor. By default, Survivor is only a small fraction of Eden's size. When the fraction of live Eden objects is high, existing Survivor objects will get prematurely promoted to Old to free up space. Your application may allocate so much that objects promote directly from Young -> Old. This explains the rapid increase in the size of Old Gen, and since objects don't stay in Young or Survivor long enough to reach even the default MaxTenuringThreshold, it also explains why increasing the threshold did nothing.
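One way to check this theory against your own logs is to print the tenuring distribution, which shows object ages and the threshold G1 actually computes at each Young GC. If most bytes sit at age 1 and the "new threshold" collapses to 1, survivor pressure is forcing promotion long before MaxTenuringThreshold matters:

```shell
# -XX:+PrintTenuringDistribution is a standard HotSpot flag on Java 8.
# Each Young GC then logs lines like:
#   Desired survivor size ... bytes, new threshold 1 (max 15)
# which tell you whether survivor-space pressure is causing early promotion.
java -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -jar market-app.jar
```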

Normally, G1 tries to collect Old concurrently with a series of Mixed collections. The small dip during the spike suggests at least one Mixed collection may have happened, but it freed up almost no space. Since there are a few more points on the plot before the big drop, that may mean no further Mixed collections ran or they ran but freed up very little space. The large drop could be a Mixed collection, but given that you're asking about how to stop bad pause behavior, it's probably a Full GC. With your heap size (>40GB) and the amount of live data in your Old gen (~26GB), a Full GC would generally be quite long.

Suggested strategies:

A: (Avoid Full GC): If it's true the large drop is a Full GC, because your Old gen live set size is already quite large, you need to either increase your heap by at least ~5GB, or refactor your application to keep fewer objects in memory long term (reduce the steady state 26GB Old size). Theoretically, if you set -XX:G1NewSizePercent to 35 so that the ~16GB spike fits within Eden, it could stay in Eden long enough to die, avoiding a long Young GC pause. This isn't likely to work well in practice, though, since a larger Eden will make your Old even smaller, and increase the chance of Full GC. It also relies on luck that Eden is almost empty exactly at the time the spike starts. Otherwise, large fractions of the spike will get copied to Old anyway & the Young GC pause will be large.
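A sketch of what those two knobs look like on the command line (the heap size and percentage are illustrative, not tuned values, and the jar name is a placeholder; note G1NewSizePercent is an experimental flag on Java 8 and must be unlocked):

```shell
# Option 1: grow the heap so the ~26GB live set plus the ~16GB spike
# fit with headroom. 48g here is illustrative, not a tuned number.
java -XX:+UseG1GC -Xms48g -Xmx48g -jar market-app.jar

# Option 2: force a larger minimum young gen so the spike can fit in Eden.
# Experimental flag on Java 8; requires unlocking first.
java -XX:+UseG1GC \
     -XX:+UnlockExperimentalVMOptions \
     -XX:G1NewSizePercent=35 \
     -jar market-app.jar
```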

B: (Application refactor): This is the best approach, if it's possible to do. G1 is designed under the assumption that most objects in Eden are dead at the time of a Young GC, and that's not true for your application. The large amount of data you're receiving seems to stay live for a few minutes & then get collected. Perhaps your code reads the entire incoming dataset into memory, and then only after it's fully in-memory, copies it to a database or does some other aggregation, before discarding it. Refactoring to process the data incrementally would mean the chunks die quickly in Eden & little promotion happens. This is the ideal operating condition for G1 and would eliminate the Full GC.
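As a rough illustration of the refactor (the class and method names here are invented for the sketch, not from your codebase): instead of buffering the whole incoming dataset and aggregating afterwards, fold each record into the aggregate as it arrives, so the record itself becomes garbage immediately and dies young in Eden:

```java
import java.util.stream.IntStream;

// Hypothetical sketch: aggregate records one at a time instead of
// materializing the full dataset. Only the small aggregate stays live
// across Young GCs; each record is unreachable as soon as it's processed.
public class IncrementalAggregator {
    // Running sum; this is all that survives long term.
    private long total = 0;

    // Process one record immediately. After this returns, 'record' is
    // unreachable and can be reclaimed in the next Young GC instead of
    // being copied to Survivor or promoted to Old.
    public void onRecord(int[] record) {
        for (int v : record) {
            total += v;
        }
    }

    public long total() {
        return total;
    }

    public static void main(String[] args) {
        IncrementalAggregator agg = new IncrementalAggregator();
        // Simulate a spike of many small, short-lived records.
        IntStream.range(0, 1000).forEach(i -> agg.onRecord(new int[]{i, i}));
        System.out.println(agg.total()); // prints 999000
    }
}
```

The contrast is with a pattern like "append every record to a List, then sum the List at the end": there, every record stays reachable for the duration of the spike and must be copied by each Young GC, which is exactly the promotion behavior described above.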