I am facing a common problem when trying to write a large table to Hive/HDFS from a source table containing 24 billion rows.
The Spark job applies a window function to retrieve only the latest state for each key in the source table, which leaves around 11 billion rows to insert into the target table.
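For context, the "latest state per key" step looks roughly like the sketch below. This is a minimal illustration, not the exact job: the column names key and update_ts are placeholders, spark is the active SparkSession, and sourceTable names the 24-billion-row source.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank each key's rows by recency and keep only the most recent one.
// `key` and `update_ts` are hypothetical column names, not the real schema.
val latestPerKey = spark.table(sourceTable)
  .withColumn("rn", row_number().over(
    Window.partitionBy("key").orderBy(col("update_ts").desc)))
  .filter(col("rn") === 1)
  .drop("rn")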
Both tables are partitioned on a date field in YYYYMM format named dt_part.
The screenshot below shows the data writing stage:
This stage runs for a very long time and stalls towards the end without any data actually being written.
The write call is as follows:
df.write
  .partitionBy("dt_part")
  .mode(SaveMode.Overwrite)
  .saveAsTable(targetTable)
For information, the Spark job runs with the following resources under dynamic allocation (a configuration sketch follows the list):
- Driver memory = 15 G
- Executor memory = 16 G
- Executor cores = 8
- Max number of executors = 80
- spark.sql.shuffle.partitions = 10000
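Expressed as SparkSession settings, the configuration looks roughly like this (a sketch only; in practice these values are passed on the spark-submit command line, and the app name is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("latest-state-load")                          // placeholder name
  .config("spark.driver.memory", "15g")
  .config("spark.executor.memory", "16g")
  .config("spark.executor.cores", "8")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "80")
  .config("spark.sql.shuffle.partitions", "10000")
  .enableHiveSupport()                                   // required for saveAsTable into Hive
  .getOrCreate()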
I have tested several values of spark.sql.shuffle.partitions (250, 3000 and 20000), but the result remains the same: I can't write the data in a reasonable timeframe.
Do you have any suggestions for optimising the writing process?
