I'm running into an issue with a Spark job that fails on roughly every other run with the following error message:
org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ResultStage XYZ to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again.
This happens on Databricks 13.3 LTS (based on Apache Spark 3.4.1). I started by removing calls to repartition(...) one at a time until none were left, but I still get the above error. My next hypothesis was that it's caused by adaptive query execution (AQE), which may change partitioning on the fly. But turning off AQE didn't help either.
What else could be leading to the above error if not explicit calls to repartition or AQE, and what can be done to prevent it?
I faced the same issue. It was resolved after I disabled autoscaling both on the cluster and for disk storage (autoscaling local storage). With autoscaling on, executors can be decommissioned mid-job, which loses shuffle output and forces stage retries that trigger exactly this indeterminacy error.
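For reference, in a Databricks cluster spec this corresponds to pinning a fixed worker count instead of an autoscale range, and setting enable_elastic_disk to false. A sketch of the relevant fields (field names are from the Databricks Clusters API; the values are only examples):

```json
{
  "cluster_name": "fixed-size-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4,
  "enable_elastic_disk": false
}
```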