I'm somewhat new to PySpark. I understand that it uses lazy evaluation, meaning that execution of a series of transformations will be deferred until some action is requested, at which point the Spark engine optimizes the entire set of transformations and then executes them.
Therefore, from both a functional and a performance standpoint, I would expect this:
Approach A
df = spark.read.parquet(path)
df = df.filter(F.col('state') == 'CA')
df = df.select('id', 'name', 'subject')
df = df.groupBy('subject').count()
to be the same as this:
Approach B
df = spark.read.parquet(path)\
    .filter(F.col('state') == 'CA')\
    .select('id', 'name', 'subject')\
    .groupBy('subject').count()
I like the style of Approach A because it would allow me to break up my logic into smaller (and re-usable) functions.
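For example (an illustrative sketch of my own, not taken from any of the posts; DataFrame.transform requires Spark 3.0+), Approach A lets each step live in its own small, reusable function:

from pyspark.sql import functions as F

def only_california(df):
    # reusable filter step
    return df.filter(F.col('state') == 'CA')

def core_columns(df):
    # reusable projection step
    return df.select('id', 'name', 'subject')

def counts_by_subject(df):
    # reusable aggregation step
    return df.groupBy('subject').count()

df = (spark.read.parquet(path)
      .transform(only_california)
      .transform(core_columns)
      .transform(counts_by_subject))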
However, I have come across a few blog posts (e.g. here, here and here) that are confusing to me. These posts deal specifically with the use of successive withColumn statements and suggest using a single select instead. The underlying rationale seems to be that, since DataFrames are immutable, successive withColumn calls are detrimental to performance.
So my question is... would I have that same performance issue when using Approach A? Or is it just an issue specific to withColumn?
There are two questions here, in fact.
First question, with code samples. I disagree with the first answer.
Try both code segments with an .explain() and you will see that the generated physical plan for execution is exactly the same. Spark is based on lazy evaluation.
The upshot of all this is that I ran code similar to yours with two filters applied, and because the action .count causes just-in-time evaluation, Catalyst filtered based on both the first and the second filter. This is known as "Code Fusing", which is possible because of late execution, aka lazy evaluation.
Snippet 1 and Physical Plan
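The original snippet and its plan output were not preserved in this copy; below is a minimal stand-in for the non-chained version (the second filter on subject == 'math' is my own illustrative addition), printing the physical plan:

from pyspark.sql import functions as F

# Non-chained: each transformation is assigned back to df
df = spark.read.parquet(path)
df = df.filter(F.col('state') == 'CA')
df = df.filter(F.col('subject') == 'math')   # second filter
df = df.select('id', 'name', 'subject')
df = df.groupBy('subject').count()
df.explain()   # both filters appear together in a single Filter node (and are typically pushed down to the parquet scan)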
Snippet 2 and Same Physical Plan
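Again, the original snippet is missing here; a chained version of the same pipeline would look like this, and its .explain() output matches snippet 1:

# Chained: same transformations written as one expression
df2 = (spark.read.parquet(path)
       .filter(F.col('state') == 'CA')
       .filter(F.col('subject') == 'math')
       .select('id', 'name', 'subject')
       .groupBy('subject').count())
df2.explain()   # same physical plan as the non-chained version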
Second Question - withColumn
Doing this:
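(The original snippet was not preserved; the sketch below, with derived columns I made up, stands in for a few successive withColumn calls in the assign-back style.)

df = spark.read.parquet(path)
df = df.withColumn('name_upper', F.upper(F.col('name')))
df = df.withColumn('subject_lower', F.lower(F.col('subject')))
df = df.withColumn('is_ca', F.col('state') == 'CA')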
or writing it via chaining gives the same result. That is, it does not matter how you write the transformations; the final physical plan is what counts. That optimization does have overhead, though, and depending on how you look at it, select is the alternative.
This article https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015#:~:text=Summary,number%20of%20transforms%20on%20DataFrame%20 takes a different perspective: the cost is in the optimization itself, as opposed to a 'projection' issue. I was surprised that 3 withColumns would cause any issue; if it did, that would point to poor optimization. I suspect that looping over many features is the real issue, not a small number of withColumn calls.
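To make the select alternative concrete (my own illustrative sketch, not code from the linked article), the derived columns from the withColumn example above can be expressed as a single projection:

base = spark.read.parquet(path)

# Each withColumn adds another projection for the analyzer/optimizer to process
df_wc = (base
         .withColumn('name_upper', F.upper(F.col('name')))
         .withColumn('subject_lower', F.lower(F.col('subject')))
         .withColumn('is_ca', F.col('state') == 'CA'))

# A single select expresses the same derived columns as one projection
df_sel = base.select(
    '*',
    F.upper(F.col('name')).alias('name_upper'),
    F.lower(F.col('subject')).alias('subject_lower'),
    (F.col('state') == 'CA').alias('is_ca'),
)

df_wc.explain()
df_sel.explain()   # the optimized plans come out effectively the same; the cost difference is in analysis/optimization, not execution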