Ignite keeps crashing with OOM


We are running a cluster with 2 clients and 3 servers. We keep hitting OutOfMemoryError even after giving the nodes more heap. Below is a snapshot from the heap dump; any idea why there are so many CacheEvent objects and what is really happening here?

Looking at the source of 'GridSelectorNioSessionImpl', it seems to have an unbounded queue that is accumulating WriteRequests. Any idea why these are not being flushed? I was reading about the message queue limit setting; the Ignite logs even warn that leaving it unbounded can result in an OOM. But I could not correlate that setting to the queue initialization happening in 'GridSelectorNioSessionImpl'.
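
For reference, this is how I understand that setting would be applied, assuming it is TcpCommunicationSpi#setMessageQueueLimit (the limit values below are only illustrative, not what we run with):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class BoundedCommQueueStart {
    public static void main(String[] args) {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();

        // Bound the per-connection outbound message queue; 0 (the default) means
        // unbounded, which is what the startup warning about a possible OOM refers to.
        commSpi.setMessageQueueLimit(1024);

        // Optionally disconnect clients that cannot keep up instead of buffering for them.
        commSpi.setSlowClientQueueLimit(1000);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCommunicationSpi(commSpi);

        Ignite ignite = Ignition.start(cfg);
    }
}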

[heap dump snapshot: large number of retained CacheEvent instances]

Below is the thread stack that resulted in the OOM:

"sys-#300%ip-192-168-4-45_us-west-2_compute_internal.cache-2%" prio=5 tid=367 RUNNABLE
at java.lang.OutOfMemoryError.<init>(OutOfMemoryError.java:48)
at org.apache.ignite.internal.binary.streams.BinaryHeapOutputStream.arrayCopy(BinaryHeapOutputStream.java:101)
   local variable: org.apache.ignite.internal.binary.streams.BinaryHeapOutputStream#1
at org.apache.ignite.internal.binary.BinaryWriterExImpl.array(BinaryWriterExImpl.java:239)
at org.apache.ignite.internal.binary.GridBinaryMarshaller.marshal(GridBinaryMarshaller.java:253)
   local variable: org.apache.ignite.internal.binary.BinaryWriterExImpl#1
at org.apache.ignite.internal.binary.BinaryMarshaller.marshal0(BinaryMarshaller.java:84)
at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.marshal(AbstractNodeNameAwareMarshaller.java:56)
   local variable: org.apache.ignite.internal.binary.BinaryMarshaller#1
   local variable: java.lang.String#77501
at org.apache.ignite.internal.util.IgniteUtils.marshal(IgniteUtils.java:10824)
at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1953)
   local variable: org.apache.ignite.internal.processors.continuous.GridContinuousProcessor#1
   local variable: java.util.Collections$SingletonList#198
   local variable: org.apache.ignite.internal.processors.continuous.GridContinuousMessage#72196
   local variable: org.apache.ignite.internal.processors.continuous.GridContinuousProcessor$9#72196
   local variable: org.apache.ignite.internal.processors.continuous.GridContinuousMessage#72196
at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1934)
at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1916)
at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1321)
at org.apache.ignite.internal.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1258)
at org.apache.ignite.internal.GridEventConsumeHandler$2$1.run(GridEventConsumeHandler.java:250)
   local variable: org.apache.ignite.internal.GridEventConsumeHandler$2$1#1
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   local variable: org.apache.ignite.thread.IgniteThreadPoolExecutor#53
   local variable: java.util.concurrent.ThreadPoolExecutor$Worker#47
   local variable: org.apache.ignite.thread.IgniteThread#78
   local variable: org.apache.ignite.internal.GridEventConsumeHandler$2$1#1
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   local variable: java.util.concurrent.ThreadPoolExecutor$Worker#47
at java.lang.Thread.run(Thread.java:834)

The complete thread dump of the OOM is available here: https://easyupload.io/fs3y8x

TIA


1 Answer

Alexandr Shapkin

GridContinuousMessage comes from a Continuous Query that needs to notify the listeners about an update.
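
For illustration, these notifications originate from a registration like the one below (a remote event listener over IgniteEvents, which uses the same continuous processor; the event types are hypothetical, and cache events must be enabled on the servers via IgniteConfiguration#setIncludeEventTypes for them to fire):

import java.util.UUID;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.events.CacheEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgniteBiPredicate;
import org.apache.ignite.lang.IgnitePredicate;

public class CacheEventRemoteListen {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Local listener: runs on this node for every event pushed by the servers.
        // Each pushed batch arrives wrapped in a GridContinuousMessage.
        IgniteBiPredicate<UUID, CacheEvent> locLsnr = (nodeId, evt) -> {
            System.out.println("Got event from " + nodeId + ": " + evt.name());
            return true; // keep the subscription alive
        };

        // Remote filter: evaluated on the server side before the event is sent.
        IgnitePredicate<CacheEvent> rmtFilter = evt -> true;

        UUID lsnrId = ignite.events().remoteListen(locLsnr, rmtFilter,
            EventType.EVT_CACHE_OBJECT_PUT, EventType.EVT_CACHE_OBJECT_REMOVED);

        System.out.println("Registered remote listener: " + lsnrId);
    }
}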

It looks like the queue grows too big under heavy load, and the message payload ([dataBytes]) is large as well.

I think you can play around with the following properties:

/** Maximum size of buffer for pending events. Default value is {@code 10_000}. */
public static final int MAX_PENDING_BUFF_SIZE =
    IgniteSystemProperties.getInteger("IGNITE_CONTINUOUS_QUERY_PENDING_BUFF_SIZE", 10_000);

/** Batch buffer size. */
private static final int BUF_SIZE =
    IgniteSystemProperties.getInteger("IGNITE_CONTINUOUS_QUERY_SERVER_BUFFER_SIZE", 1000);

The most interesting one is IGNITE_CONTINUOUS_QUERY_SERVER_BUFFER_SIZE.

Internally, a continuous query stores updates in buffers, one buffer per partition. Each buffer fills up with events (up to 1000 by default) and is recreated only when it is full, i.e. there may be up to 1000 live event objects per partition. A node might have up to 1024 partitions (the default value), so the total number of retained events can reach roughly one million. Now multiply that by your object size, which is 100+ KB, and check the result.
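
As a back-of-envelope check (the numbers are illustrative; substitute your actual partition count and average event size):

public class ContinuousQueryHeapEstimate {
    public static void main(String[] args) {
        int partitions = 1024;            // default partition count of the cache
        int bufferSize = 1_000;           // IGNITE_CONTINUOUS_QUERY_SERVER_BUFFER_SIZE default
        long avgEventBytes = 100L * 1024; // ~100 KB per event, as in the question

        long events = (long) partitions * bufferSize; // up to ~1,024,000 buffered events
        long bytes = events * avgEventBytes;          // roughly 100 GB in the worst case

        System.out.printf("Up to %,d events, ~%.1f GiB of heap%n",
            events, bytes / (1024.0 * 1024 * 1024));
    }
}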

Most likely setting IGNITE_CONTINUOUS_QUERY_SERVER_BUFFER_SIZE to 100 or 50 will solve the issue.
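
These are ordinary Ignite system properties, so they can be supplied as a JVM option on the server nodes, e.g. -DIGNITE_CONTINUOUS_QUERY_SERVER_BUFFER_SIZE=100, or set programmatically before the node starts. A minimal sketch (the config file path is made up):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class StartWithSmallerCqBuffer {
    public static void main(String[] args) {
        // Equivalent to passing -DIGNITE_CONTINUOUS_QUERY_SERVER_BUFFER_SIZE=100 on the
        // JVM command line; must run before the Ignite classes that read it are loaded.
        System.setProperty("IGNITE_CONTINUOUS_QUERY_SERVER_BUFFER_SIZE", "100");

        Ignite ignite = Ignition.start("server-config.xml"); // illustrative config path
    }
}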