I am working on migrating my team's service from Flink 1.8 to Flink 1.15.2. I've had luck running the new Flink on a single host (1 JobManager and 1 TaskManager on the same node); however, on a multi-node cluster with 1 JobManager and 3 TaskManagers, the TaskManagers don't seem to process anything beyond source event deserialization. I think this may be due to an out-of-memory issue caused by the RocksDB state backend. I have tried numerous combinations suggested in https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/large_state_tuning/#tuning-rocksdb-memory, but none of them work. Does anyone have suggestions, or similar experiences with RocksDB backend memory growing out of control and killing TaskManager hosts?
All the hosts are r7i.12xlarge, with 48 vCPUs and 384 GiB of memory.
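For context, here is a sketch of the memory-related part of our flink-conf.yaml on the TaskManager hosts (the slot count and sizes below are illustrative placeholders, not our literal values):

# illustrative values only; the real sizes on our hosts differ
taskmanager.numberOfTaskSlots: 16
taskmanager.memory.process.size: 300g        # total memory budget for the TaskManager process
taskmanager.memory.managed.fraction: 0.4     # share of Flink memory given to managed memory (used by RocksDB)
state.backend: rocksdb
state.backend.incremental: true              # incremental checkpoints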
I have tried the option
state.backend.rocksdb.memory.managed: false
and have also tried increasing managed memory to about 90% of the total Flink memory (note that we do not use Kubernetes; this is a standalone deployment). I have also tried disabling the RocksDB block cache as suggested in https://stackoverflow.com/a/75883508, with no improvement, as well as the solutions listed in "Flink RocksDB custom options factory config error disable block cache".
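To be concrete, the combinations I have cycled through look roughly like the following, one at a time (values are placeholders, not recommendations):

# A: keep RocksDB on Flink managed memory, but hand it most of the TaskManager memory
state.backend.rocksdb.memory.managed: true
taskmanager.memory.managed.fraction: 0.9

# B: take RocksDB off managed memory entirely (it then allocates natively, uncapped)
state.backend.rocksdb.memory.managed: false

# C: cap RocksDB per slot instead of sharing the managed pool
state.backend.rocksdb.memory.fixed-per-slot: 4gb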
Currently the Flink job starts with high backpressure, since our source is Kinesis data with a minimum retention of 1 day, and the TaskManagers do not move past deserializing events (whereas the JobManager does). Throughput is badly affected, and after a few minutes it is only the JobManager trying to keep up with the traffic.
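One thing I can still try is enabling RocksDB's native memory metrics to see which component (block cache, memtables, table readers) is actually growing. A minimal sketch of the switches, assuming the 1.15 metric option names:

# assumption: 1.15 native-metric keys; these expose per-column-family memory gauges
state.backend.rocksdb.metrics.block-cache-usage: true
state.backend.rocksdb.metrics.block-cache-capacity: true
state.backend.rocksdb.metrics.cur-size-all-mem-tables: true
state.backend.rocksdb.metrics.size-all-mem-tables: true
state.backend.rocksdb.metrics.estimate-table-readers-mem: true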