We are using the Spark Cassandra Connector (com.datastax.spark:spark-cassandra-connector_2.12:3.2.0) to connect to Amazon Keyspaces. We are observing a very strange issue where the streaming application gets stuck after processing a certain amount of data across multiple micro-batches. There is no reliable way to reproduce the issue; it can happen at any time after the application has been running for a while.
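For context, here is a minimal sketch of how such a streaming write to Keyspaces is typically wired up with the connector. The endpoint, credentials, keyspace, table and source below are placeholders, not our actual configuration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Placeholder Keyspaces endpoint and credentials (region, user and password are hypothetical).
val spark = SparkSession.builder()
  .appName("keyspaces-streaming-sketch")
  .config("spark.cassandra.connection.host", "cassandra.us-east-1.amazonaws.com")
  .config("spark.cassandra.connection.port", "9142")
  .config("spark.cassandra.connection.ssl.enabled", "true")
  .config("spark.cassandra.auth.username", "<service-user>")
  .config("spark.cassandra.auth.password", "<service-password>")
  .getOrCreate()

// Example source only; the real job reads from a streaming source such as Kafka.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "100")
  .load()

// Each micro-batch is written to Keyspaces through the Cassandra data source.
// The placeholder table is assumed to have columns matching the batch schema.
val query = events.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/tmp/checkpoints/keyspaces-sketch")
  .start()

query.awaitTermination()
```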
We can see that all Cassandra Connector threads are stuck around the lines below:
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:281)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:105)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:98) => holding Monitor(com.datastax.oss.driver.shaded.netty.channel.nio.SelectedSelectionKeySet@-1275491359})
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:109)
com.datastax.oss.driver.shaded.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:814)
com.datastax.oss.driver.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
com.datastax.oss.driver.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.lang.Thread.run(Thread.java:830)
Can you have a look at the above thread dump and help us understand any possible reason for this?
Unfortunately, the stack trace you posted is too generic: it only shows the driver's shaded Netty event-loop threads parked in epollWait, which is what they look like whenever they are waiting on network I/O, so on its own it isn't very helpful.
You will need to do a bit more investigation to narrow it down. Try to identify a specific Spark application/job that is problematic, then use the Spark UI to narrow down the code where you think it is getting stuck (for example, by logging per-batch progress as sketched below). Cheers!
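A minimal sketch of one way to do that, assuming Structured Streaming: register a StreamingQueryListener so the driver log records each completed micro-batch, which makes it obvious which batch was in flight when the job stalled (object and message names are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

object ProgressLogging {
  // Call once before starting the streaming query.
  def register(spark: SparkSession): Unit = {
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit =
        println(s"query started: id=${event.id} run=${event.runId}")

      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        // durationMs breaks down where each micro-batch spent its time
        println(s"batch=${p.batchId} rows=${p.numInputRows} durations=${p.durationMs}")
      }

      override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
        println(s"query terminated: id=${event.id} exception=${event.exception}")
    })
  }
}
```

If the progress messages stop while the executors keep running, compare the timestamp of the last logged batch with the stuck stage in the Spark UI, and take a thread dump of the corresponding executor at that point rather than of the Netty I/O threads.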