My question: Need to understand the time complexity of dynamic forward filling and back filling in spark Hello, I have a scala job that reads Delta Table A, transforms Data Frame and writes to Delta Table B (empty table)
The job runs processes 94,356,726 records and finishes in 16 minutes.
However, after adding dynamic filling logic to the Dataframe transform, the job executes for 2 hours and 20 minutes.
What's a big gap in performance, what's the time complexity of the backfilling?
Details Purpose of the backfilling
I have data like this:
| id | version | data_column |
|---|---|---|
| 0 | 4 | null |
| 0 | 3 | "test 2" |
| 0 | 2 | null |
| 0 | 1 | "test 1" |
I want to get the data like this:
| id | version | data_column |
|---|---|---|
| 0 | 4 | "test 2"// filled from version 3 |
| 0 | 3 | "test 2" |
| 0 | 2 | "test 1" // filled from version 2 |
| 0 | 1 | "test 1" |
My method Following method here: Pyspark : forward fill with last observation for a DataFrame
// before using backfilling, the job uses the "partitionWindow " for dedup
val partitionWindow = Window
.partitionBy("id")
.orderBy(desc("version"), desc("timestamp'))
val backfillWindow = partitionWindow
.rowsBetween(0, Window.unboundedFollowing)
df.withColumn("data_filled", coalesce(first("data_column", true).over(backfillWindow), col("data_column")))
My Observation
- Execution time: It works as expected, but huge performance downgrade, from 16 minutes to 2hours and 20 minutes
- Spark stage: In spark history server, the bottleneck is the "Delta Lake touching files: low shuffle merge", gap is 1.8 minute vs. 1.1 hours
- Physical plan: I use winmerge to compare the physical plan of the "Delta Lake touching files: low shuffle merge" stage, between 2 jobs, no big difference
- I check pyspark ffill and bfill, they shall have similar time complexity, the doc doesn't mention time complexity, only says: " Avoid this method against very large dataset."
More Thoughts: I'm doing one-direction backfilling, it shall be close to O(Nlog(N)), right? Maybe a bi-directional filling (forward and backward at the same time) would be different
Thank you!