What could cause my Spark History Server to start, but then the pod completes immediately and crashes into CrashLoopBackOff?


To start with, a bit of context: in my Kubernetes cluster there is a Spark app running, and I want to add a deployment that starts the Spark History Server, which will read the logs generated by that app on a shared volume.

For security reasons in the project I can't use the Spark operator image directly in my Dockerfile, so I install Spark via a conda env and pyspark in my Dockerfile. I also export the env var SPARK_HISTORY_OPTS (via ENV) instead of using the config file, as they should be equivalent.

SPARK_HISTORY_OPTS='-Dspark.history.fs.logDirectory=/execution-events -Dspark.eventLog.dir=/execution-events -Dspark.eventLog.enabled=true -Dspark.history.fs.cleaner.enabled=true -Dspark.history.ui.port=4039'
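For reference, a minimal sketch of how the image gets Spark, assuming the env name and Python version visible in the log paths further down; the exact version pins are an assumption:

# Install pyspark into a conda env instead of using the Spark operator image.
# Env name and versions inferred from the paths in the pod log; adjust to your setup.
conda create -y -n spark-env-3.1.2 python=3.7
conda run -n spark-env-3.1.2 pip install pyspark==3.1.2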

The shared volume mounted on the deployment uses the same path, /execution-events.

In my custom entrypoint.sh file there are a few steps:

- export the SPARK_HOME
- start the Spark History Server with a simple: exec /usr/bin/tini -s -- $SPARK_HOME/sbin/start-history-server.sh (see the sketch below)
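A minimal sketch of that entrypoint, assuming SPARK_HOME points into the pyspark package installed in the conda env (the path matches the one visible in the pod log below):

#!/usr/bin/env bash
set -euo pipefail

# pyspark installed via conda/pip ships the sbin scripts inside the package.
export SPARK_HOME=/opt/conda/envs/spark-env-3.1.2/lib/python3.7/site-packages/pyspark

# Hand PID 1 to tini, which launches the history server start script.
exec /usr/bin/tini -s -- "$SPARK_HOME/sbin/start-history-server.sh"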

When I watch the deployment being created, the pod starts the server, but then it dies in the Completed state and restarts into CrashLoopBackOff, which is something I don't understand.

The Spark History Server should stay alive until I execute the stop-history-server.sh script, so why can't it stay alive?

Thanks in advance for any answers.

PS: When I add a sleep of around 5 minutes to debug, exec into the pod manually, and start the server by hand, I can see the message that the Spark History Server started.

And I can see in the logs folder that the files are created.
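For reference, the manual check looks roughly like this (pod name taken from the log below; substitute your own namespace and pod):

# Exec into the sleeping pod, then start the server by hand.
kubectl exec -it -n ***NAMESPACE*** spark-history-deployment-65dd4dd6f5-wk27t -- /bin/bash
$SPARK_HOME/sbin/start-history-server.sh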

This is the message in the pod's log:

+ exec /usr/bin/tini -s -- /opt/conda/envs/spark-env-3.1.2/lib/python3.7/site-packages/pyspark/sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/conda/envs/spark-env-3.1.2/lib/python3.7/site-packages/pyspark/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-spark-histor
Stream closed EOF for ***NAMESPACE***/spark-history-deployment-65dd4dd6f5-wk27t (spark-history-container)

There is 1 best solution below

thomas (BEST ANSWER)

The problem was something I found recently: in the entrypoint.sh file where I start the start-history-server.sh script, I need to set an env var used by the daemon script so that the server runs in the foreground instead of the background, which keeps the pod alive.

Add this before executing start-history-server.sh:

export SPARK_NO_DAEMONIZE=true

(Note: spark-daemon.sh only checks whether SPARK_NO_DAEMONIZE is set, not its value, so even =false happens to work; =true states the intent less confusingly.)
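Putting it together with the entrypoint sketched in the question:

export SPARK_HOME=/opt/conda/envs/spark-env-3.1.2/lib/python3.7/site-packages/pyspark
# Run the history server in the foreground so the container's main process stays alive.
export SPARK_NO_DAEMONIZE=true
exec /usr/bin/tini -s -- "$SPARK_HOME/sbin/start-history-server.sh"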

Hope this helps anyone else with the same problem.