I am working on migrating my team's service from Flink 1.8 to Flink 1.15.2. I've had luck running the new Flink on a single host (1 JobManager and 1 TaskManager on the same node); however, on a multi-node cluster with 1 JobManager and 3 TaskManagers, the TaskManagers don't seem to process anything beyond source event deserialization. I think this may be due to an out-of-memory issue caused by the RocksDB state backend. I have tried numerous combinations suggested in https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/large_state_tuning/#tuning-rocksdb-memory, but none of them work. Does anyone have suggestions, or similar experiences with RocksDB backend memory growing out of control and killing TaskManager hosts?
All the hosts are r7i.12xlarge, with 48 vCPUs and 384 GiB of memory.
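For context, here is a sketch of the memory-related part of our flink-conf.yaml on the TaskManager hosts (the slot count and sizes below are illustrative placeholders, not our literal values):

# illustrative values only; the real sizes on our hosts differ
taskmanager.numberOfTaskSlots: 16
taskmanager.memory.process.size: 300g        # total memory budget for the TaskManager process
taskmanager.memory.managed.fraction: 0.4     # share of Flink memory given to managed memory (used by RocksDB)
state.backend: rocksdb
state.backend.incremental: true              # incremental checkpoints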
I have tried the option
state.backend.rocksdb.memory.managed: false
and have also tried increasing managed memory to about 90% of the total Flink memory (note that we do not use Kubernetes; this is a standalone deployment). I have also tried disabling the RocksDB block cache as suggested in https://stackoverflow.com/a/75883508, with no improvement, as well as the solutions listed in "Flink RocksDB custom options factory config error disable block cache".
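To be concrete, the combinations I have cycled through look roughly like the following, one at a time (values are placeholders, not recommendations):

# A: keep RocksDB on Flink managed memory, but hand it most of the TaskManager memory
state.backend.rocksdb.memory.managed: true
taskmanager.memory.managed.fraction: 0.9

# B: take RocksDB off managed memory entirely (it then allocates natively, uncapped)
state.backend.rocksdb.memory.managed: false

# C: cap RocksDB per slot instead of sharing the managed pool
state.backend.rocksdb.memory.fixed-per-slot: 4gb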
Currently the Flink job starts with high backpressure, since our source is Kinesis data with a minimum retention of 1 day, and the TaskManagers do not move past deserializing events (whereas the JobManager does). Throughput is badly affected, and after a few minutes it is only the JobManager trying to keep up with the traffic.
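One thing I can still try is enabling RocksDB's native memory metrics to see which component (block cache, memtables, table readers) is actually growing. A minimal sketch of the switches, assuming the 1.15 metric option names:

# assumption: 1.15 native-metric keys; these expose per-column-family memory gauges
state.backend.rocksdb.metrics.block-cache-usage: true
state.backend.rocksdb.metrics.block-cache-capacity: true
state.backend.rocksdb.metrics.cur-size-all-mem-tables: true
state.backend.rocksdb.metrics.size-all-mem-tables: true
state.backend.rocksdb.metrics.estimate-table-readers-mem: true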