Python Spark application does not end properly in Azure HDInsight (ERROR RawSocketSender, java.net.SocketException: Broken pipe)

140 Views Asked by At

I submit several python Spark applications using a shell script like this on headnode:

spark-submit app1.py > app1.log
spark-submit app2.py > app2.log
...
spark-submit app<N>.py > app<N>.log

This worked fine in our in-premise cluster and in Amazon AWS. When run in Azure HDInsight, after the python program finishes running and all expected results correctly obtained, the system enters in a what looks like an infinite cycle of attempts / errors. The result is that the yarn process still appears to be in RUNNING state and the shell script does not switch to the next application until I manually kill the current one using "yarn application -kill <APP_ID>"

This is a partial, representative content of the log.

After the "ERROR RawSocketSender" stage, it restarts over and over again flushing the same messages like this. Can you help me solve this issue ?

23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=GAUGE, name=application_1674235371030_0122.driver.jvm.total.used, value=742471344
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=COUNTER, name=application_1674235371030_0122.driver.HiveExternalCatalog.fileCacheHits, count=0
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=COUNTER, name=application_1674235371030_0122.driver.HiveExternalCatalog.filesDiscovered, count=5514
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=COUNTER, name=application_1674235371030_0122.driver.HiveExternalCatalog.hiveClientCalls, count=0
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=COUNTER, name=application_1674235371030_0122.driver.HiveExternalCatalog.parallelListingJobCount, count=0
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=COUNTER, name=application_1674235371030_0122.driver.HiveExternalCatalog.partitionsFetched, count=0
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=COUNTER, name=application_1674235371030_0122.driver.LiveListenerBus.numEventsPosted, count=779086
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=COUNTER, name=application_1674235371030_0122.driver.LiveListenerBus.queue.appStatus.numDroppedEvents, count=0
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=COUNTER, name=application_1674235371030_0122.driver.LiveListenerBus.queue.eventLog.numDroppedEvents, count=149557
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=COUNTER, name=application_1674235371030_0122.driver.LiveListenerBus.queue.executorManagement.numDroppedEvents, count=0
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=COUNTER, name=application_1674235371030_0122.driver.LiveListenerBus.queue.shared.numDroppedEvents, count=0
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=HISTOGRAM, name=application_1674235371030_0122.driver.CodeGenerator.compilationTime, count=803, min=3, max=261, mean=9.150622723510025, stddev=1.2956431643848236, median=9.0, p75=11.0, p95=11.0, p98=11.0, p99=11.0, p999=11.0
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=HISTOGRAM, name=application_1674235371030_0122.driver.CodeGenerator.generatedClassSize, count=1635, min=542, max=27404, mean=3265.8857185687893, stddev=2830.01431389721, median=3350.0, p75=6524.0, p95=7317.0, p98=7317.0, p99=7317.0, p999=7317.0
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=HISTOGRAM, name=application_1674235371030_0122.driver.CodeGenerator.generatedMethodSize, count=5763, min=1, max=8729, mean=249.27456538109183, stddev=415.13290882444375, median=297.0, p75=301.0, p95=301.0, p98=908.0, p99=2919.0, p999=2919.0
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=HISTOGRAM, name=application_1674235371030_0122.driver.CodeGenerator.sourceCodeSize, count=803, min=966, max=147208, mean=15462.819959544297, stddev=4712.348922117468, median=18242.0, p75=20643.0, p95=20643.0, p98=20643.0, p99=20643.0, p999=20643.0
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=TIMER, name=application_1674235371030_0122.driver.DAGScheduler.messageProcessingTime, count=708275, min=0.0012, max=15.237273, mean=0.8333984633220494, stddev=2.5007734950774014, median=0.172903, p75=0.19320299999999999, p95=7.305616, p98=9.421037, p99=13.102708999999999, p999=13.102708999999999, mean_rate=28.08797667425645, m1=5.0770928049716195E-58, m5=1.2116851459485948E-10, m15=0.013328324529832, rate_unit=events/second, duration_unit=milliseconds
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=TIMER, name=application_1674235371030_0122.driver.LiveListenerBus.listenerProcessingTime.com.microsoft.hdinsight.spark.metrics.SparkMetricsListener, count=779086, min=0.013099999999999999, max=0.14810199999999998, mean=0.01677254442182652, stddev=0.0039667861875592155, median=0.0159, p75=0.0168, p95=0.024700999999999997, p98=0.032799999999999996, p99=0.032799999999999996, p999=0.0332, mean_rate=30.916663002987704, m1=0.4000000000000003, m5=0.40000000014472287, m15=0.4145886531127122, rate_unit=events/second, duration_unit=milliseconds
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=TIMER, name=application_1674235371030_0122.driver.LiveListenerBus.listenerProcessingTime.org.apache.spark.HeartbeatReceiver, count=779086, min=0.0015, max=0.069701, mean=0.002689762461835416, stddev=5.545635169906004E-4, median=0.0026999999999999997, p75=0.003, p95=0.0034, p98=0.0036999999999999997, p99=0.0036999999999999997, p999=0.0036999999999999997, mean_rate=30.895990124838672, m1=0.39167148585090295, m5=0.39833337205564195, m15=0.41398157742799496, rate_unit=events/second, duration_unit=milliseconds
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=TIMER, name=application_1674235371030_0122.driver.LiveListenerBus.listenerProcessingTime.org.apache.spark.scheduler.EventLoggingListener, count=629529, min=0.0025, max=0.084601, mean=0.0031536850172346445, stddev=0.0014603660201172616, median=0.003, p75=0.0033, p95=0.003601, p98=0.0042, p99=0.0042, p999=0.0157, mean_rate=24.981735615149216, m1=0.4000000000000003, m5=0.4000000000935715, m15=0.40697885181970933, rate_unit=events/second, duration_unit=milliseconds
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=TIMER, name=application_1674235371030_0122.driver.LiveListenerBus.listenerProcessingTime.org.apache.spark.sql.scheduler.EnhancementSparkListener, count=779086, min=0.0014, max=0.124802, mean=0.002198263875790473, stddev=0.0032177000831646076, median=0.0019, p75=0.0021, p95=0.0023, p98=0.0026, p99=0.0121, p999=0.076402, mean_rate=30.91666271135494, m1=0.4000000000000003, m5=0.40000000014472226, m15=0.4145886312041239, rate_unit=events/second, duration_unit=milliseconds
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=TIMER, name=application_1674235371030_0122.driver.LiveListenerBus.listenerProcessingTime.org.apache.spark.status.AppStatusListener, count=779086, min=0.0036999999999999997, max=0.343406, mean=0.042705296969570414, stddev=0.022447108482984236, median=0.0506, p75=0.057301, p95=0.061901, p98=0.067801, p99=0.077001, p999=0.128802, mean_rate=30.89481211761614, m1=0.39167148585090295, m5=0.39833337205608005, m15=0.41399773754297203, rate_unit=events/second, duration_unit=milliseconds
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=TIMER, name=application_1674235371030_0122.driver.LiveListenerBus.queue.appStatus.listenerProcessingTime, count=779086, min=0.028999999999999998, max=0.8874139999999999, mean=0.11084211566285655, stddev=0.15811620063670803, median=0.08060099999999999, p75=0.087102, p95=0.496608, p98=0.8874139999999999, p99=0.8874139999999999, p999=0.8874139999999999, mean_rate=30.89480911989612, m1=0.39167148585090295, m5=0.3983333720560811, m15=0.4139977748162365, rate_unit=events/second, duration_unit=milliseconds
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=TIMER, name=application_1674235371030_0122.driver.LiveListenerBus.queue.eventLog.listenerProcessingTime, count=629529, min=0.0118, max=0.191803, mean=0.01682499698961803, stddev=0.004276015583111248, median=0.016201, p75=0.0172, p95=0.0249, p98=0.0253, p99=0.025401, p999=0.062901, mean_rate=24.98173382512888, m1=0.4000000000000003, m5=0.4000000000935715, m15=0.4069788530626226, rate_unit=events/second, duration_unit=milliseconds
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=TIMER, name=application_1674235371030_0122.driver.LiveListenerBus.queue.executorManagement.listenerProcessingTime, count=779086, min=0.010201, max=0.750612, mean=0.04737932670567148, stddev=0.13098762337832828, median=0.019101, p75=0.022699999999999998, p95=0.177703, p98=0.750612, p99=0.750612, p999=0.750612, mean_rate=30.89598698218755, m1=0.39167148585090295, m5=0.39833337205564195, m15=0.4139815784109392, rate_unit=events/second, duration_unit=milliseconds
23/02/06 23:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=TIMER, name=application_1674235371030_0122.driver.LiveListenerBus.queue.shared.listenerProcessingTime, count=779086, min=0.0327, max=0.19630299999999998, mean=0.04590931789886219, stddev=0.011412958934052975, median=0.042900999999999995, p75=0.049901, p95=0.062902, p98=0.070801, p99=0.11340199999999999, p999=0.131002, mean_rate=30.916658838912987, m1=0.4000000000000003, m5=0.40000000014472287, m15=0.41458865614212337, rate_unit=events/second, duration_unit=milliseconds
23/02/07 00:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=GAUGE, name=application_1674235371030_0122.driver.BlockManager.disk.diskSpaceUsed_MB, value=0
23/02/07 00:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=GAUGE, name=application_1674235371030_0122.driver.BlockManager.memory.maxMem_MB, value=10022
23/02/07 00:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=GAUGE, name=application_1674235371030_0122.driver.BlockManager.memory.maxOffHeapMem_MB, value=0
23/02/07 00:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=GAUGE, name=application_1674235371030_0122.driver.BlockManager.memory.maxOnHeapMem_MB, value=10022
23/02/07 00:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=GAUGE, name=application_1674235371030_0122.driver.BlockManager.memory.memUsed_MB, value=1370
23/02/07 00:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=GAUGE, name=application_1674235371030_0122.driver.BlockManager.memory.offHeapMemUsed_MB, value=0
23/02/07 00:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=GAUGE, name=application_1674235371030_0122.driver.BlockManager.memory.onHeapMemUsed_MB, value=1370
23/02/07 00:42:45 ERROR RawSocketSender [MdsLoggerSenderThread]: org.fluentd.logger.sender.RawSocketSender
java.net.SocketException: Broken pipe (Write failed)
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at org.fluentd.logger.sender.RawSocketSender.flush(RawSocketSender.java:196)
    at org.fluentd.logger.sender.RawSocketSender.send(RawSocketSender.java:184)
    at org.fluentd.logger.sender.RawSocketSender.emit(RawSocketSender.java:149)
    at org.fluentd.logger.sender.RawSocketSender.emit(RawSocketSender.java:131)
    at org.fluentd.logger.sender.RawSocketSender.emit(RawSocketSender.java:126)
    at org.fluentd.logger.FluentLogger.log(FluentLogger.java:101)
    at org.fluentd.logger.FluentLogger.log(FluentLogger.java:86)
    at com.microsoft.mdsdclient.MessageSendingRunnable$1.call(Unknown Source)
    at com.microsoft.mdsdclient.MessageSendingRunnable$1.call(Unknown Source)
    at com.microsoft.mdsdclient.RetryUtil.retry(Unknown Source)
    at com.microsoft.mdsdclient.RetryUtil.retry(Unknown Source)
    at com.microsoft.mdsdclient.MessageSendingRunnable.sendMessage(Unknown Source)
    at com.microsoft.mdsdclient.MessageSendingRunnable.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:750)
23/02/07 00:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=GAUGE, name=application_1674235371030_0122.driver.BlockManager.memory.remainingMem_MB, value=8652
23/02/07 00:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=GAUGE, name=application_1674235371030_0122.driver.BlockManager.memory.remainingOffHeapMem_MB, value=0
23/02/07 00:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=GAUGE, name=application_1674235371030_0122.driver.BlockManager.memory.remainingOnHeapMem_MB, value=8652
23/02/07 00:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=GAUGE, name=application_1674235371030_0122.driver.DAGScheduler.job.activeJobs, value=0
23/02/07 00:42:45 INFO metrics [metrics-logger-reporter-1-thread-1]: type=GAUGE, name=application_1674235371030_0122.driver.DAGScheduler.job.allJobs, value=2224

I checked the result of the python program it all looked fine. Data are updated as expected and user reports are correctly generated. I used to run the same script in Amazon AWS and on in-premise cluster and it always ran with no issue.

0

There are 0 best solutions below