I'm running into an issue with a Spark job that fails on roughly every other run with the following error message:
org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ResultStage XYZ to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again.
This happens on Databricks 13.3 LTS (based on Apache Spark 3.4.1). I started by removing calls to repartition(...) one at a time until none were left, but I still get the above error. My next hypothesis was that it's caused by adaptive query execution (AQE), which may change partitioning on the fly. But turning off AQE didn't help either.
What else could be leading to the above error if not explicit calls to repartition or AQE, and what can be done to prevent it?
I faced the same issue. It was resolved after I disabled autoscaling both on the cluster and for disk storage (autoscaling local storage). With autoscaling on, executors can be decommissioned mid-job, which loses shuffle output and forces stage retries that trigger exactly this indeterminacy error.
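For reference, in a Databricks cluster spec this corresponds to pinning a fixed worker count instead of an autoscale range, and setting enable_elastic_disk to false. A sketch of the relevant fields (field names are from the Databricks Clusters API; the values are only examples):

```json
{
  "cluster_name": "fixed-size-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4,
  "enable_elastic_disk": false
}
```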