EMR job run stays in RUNNING state even after the Spark job has finished


I was running a PySpark job (with Apache Hudi) on AWS EMR on EKS. The driver code looked like this:

import sys

from pyspark.sql import SparkSession

with (SparkSession.builder
        .appName('App')
        .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
        .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension')
        .getOrCreate()) as spark:
    # Add a new column to my Hudi table
    spark.sql('alter table my_table add columns (my_date date)')
    # Merge a data set into my Hudi table
    spark.sql('merge into my_table ...')

    # Explicit stop; the SparkSession context manager also calls stop() on exit
    spark.stop()

print('FINISH')
sys.exit(0)
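
For context, the job is submitted to the virtual cluster through the EMR on EKS StartJobRun API. A minimal sketch of the submission with boto3 follows; the virtual cluster ID, execution role ARN, release label, and S3 entry point are placeholders, not my real values:

import boto3

emr = boto3.client('emr-containers')

# Submit the PySpark driver above as an EMR on EKS job run
response = emr.start_job_run(
    virtualClusterId='<virtual-cluster-id>',  # placeholder
    name='hudi-merge-job',                    # placeholder name
    executionRoleArn='arn:aws:iam::123456789012:role/<job-role>',  # placeholder
    releaseLabel='emr-6.15.0-latest',         # placeholder release
    jobDriver={
        'sparkSubmitJobDriver': {
            'entryPoint': 's3://<bucket>/scripts/driver.py',  # placeholder path
        }
    },
)
print(response['id'], response['virtualClusterId'])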

This job stays in the RUNNING state in EMR, but it has actually finished and exited. I can see in the Spark UI that the job completed, and the output log shows FINISH printed as the last line of my script. I have also checked S3, and the data modification is complete. Yet the job run in EMR stays RUNNING unless I cancel it manually.
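
To double-check that this is not just a console display issue, the state can also be polled through the API. A minimal sketch of the check, assuming boto3's emr-containers client (the IDs are placeholders):

import boto3

emr = boto3.client('emr-containers')

# Poll the job-run state directly instead of relying on the console
job_run = emr.describe_job_run(
    virtualClusterId='<virtual-cluster-id>',  # placeholder
    id='<job-run-id>',                        # placeholder
)['jobRun']

# Possible states include SUBMITTED, PENDING, RUNNING, COMPLETED, FAILED, CANCELLED
print(job_run['state'])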

[Screenshot: Spark job finished in 2.3 minutes]

The Spark History Server shows the job finished in 2.3 minutes, but in the AWS EMR console it still shows as running until I stopped it after 50 minutes:

[Screenshot: EMR job run keeps running until I cancel it manually]

Does anyone know what causes this problem?
