What's the time complexity of forward filling and backward filling in spark?

60 Views Asked by At

My question: Need to understand the time complexity of dynamic forward filling and back filling in spark Hello, I have a scala job that reads Delta Table A, transforms Data Frame and writes to Delta Table B (empty table)

The job runs processes 94,356,726 records and finishes in 16 minutes.

However, after adding dynamic filling logic to the Dataframe transform, the job executes for 2 hours and 20 minutes.

What's a big gap in performance, what's the time complexity of the backfilling?

Details Purpose of the backfilling

I have data like this:

id version data_column
0 4 null
0 3 "test 2"
0 2 null
0 1 "test 1"

I want to get the data like this:

id version data_column
0 4 "test 2"// filled from version 3
0 3 "test 2"
0 2 "test 1" // filled from version 2
0 1 "test 1"

My method Following method here: Pyspark : forward fill with last observation for a DataFrame

// before using backfilling, the job uses the "partitionWindow " for dedup
val partitionWindow = Window
  .partitionBy("id")
  .orderBy(desc("version"), desc("timestamp'))

val backfillWindow = partitionWindow
  .rowsBetween(0, Window.unboundedFollowing)

df.withColumn("data_filled", coalesce(first("data_column", true).over(backfillWindow), col("data_column")))

My Observation

  • Execution time: It works as expected, but huge performance downgrade, from 16 minutes to 2hours and 20 minutes
  • Spark stage: In spark history server, the bottleneck is the "Delta Lake touching files: low shuffle merge", gap is 1.8 minute vs. 1.1 hours
  • Physical plan: I use winmerge to compare the physical plan of the "Delta Lake touching files: low shuffle merge" stage, between 2 jobs, no big difference
  • I check pyspark ffill and bfill, they shall have similar time complexity, the doc doesn't mention time complexity, only says: " Avoid this method against very large dataset."

More Thoughts: I'm doing one-direction backfilling, it shall be close to O(Nlog(N)), right? Maybe a bi-directional filling (forward and backward at the same time) would be different

Thank you!

0

There are 0 best solutions below