How many jobs, stages, and tasks will be created by this PySpark snippet, and why?


I have this code:

from pyspark.sql.functions import *
df1 = spark.read.option('header', 'true').csv('/FileStore/tables/ds_salaries.csv') \
    .withColumn('salary', col('salary').cast('int'))

df1=df1.filter(col('salary')>30000)
df1=df1.groupBy('work_year').agg(sum('salary').alias('total_salary'))
display(df1)

When I execute this code, I see 3 jobs, but there are only 2 actions here (the read and the display()), so why is there an extra job, and what is it for? I am using Databricks Community Edition, which is a single node with default configs.


1 Answer

Answered by ASR:

spark.read is eagerly evaluated here because it needs to find the number and names of the columns (due to option('header', 'true')). Internally it is similar to df.limit(1).collect(). So the first job you see is the one that reads the first row of the file to get the header.
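
A rough sketch of what that eager step amounts to (a simplification for illustration, not Spark's actual internal code):

# Read just the first line of the file to discover the column names.
# This is roughly what option('header', 'true') forces Spark to do up front;
# the real implementation lives inside Spark's CSV data source.
first_row = spark.read.text('/FileStore/tables/ds_salaries.csv').limit(1).collect()
column_names = first_row[0].value.split(',')
print(column_names)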

The second job is triggered by display(df1). The transformations in between (filter, groupBy, agg) never launch jobs on their own; only the action at the end does.
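
A minimal sketch of that laziness, using .count() as the action since display() is Databricks-specific:

from pyspark.sql.functions import col

df = spark.range(10).withColumn('double', col('id') * 2)  # no job yet: lazy
df = df.filter(col('id') > 3)                             # still no job
df.count()                                                # action: triggers a job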

Note:

  • If you set inferSchema=true, you will see one more job; it is for finding the datatype of each column, for which Spark has to read through the file.
  • If you supply the schema yourself, there will be only one job, for the display(df1) (see the sketch after this list).
  • If you pass a glob pattern, e.g. spark.read.csv("/parent/folder/"), you will see one extra job, Listing leaf files...
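
For the second note, here is a minimal sketch of supplying the schema up front. The column names and types are assumptions about ds_salaries.csv (the real file has more columns than these two); the point is that with an explicit schema, Spark has no eager work to do, so the only job comes from the action:

from pyspark.sql.functions import col, sum
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema; adjust to match the real columns of ds_salaries.csv.
schema = StructType([
    StructField('work_year', StringType(), True),
    StructField('salary', IntegerType(), True),
])

df1 = spark.read.schema(schema).option('header', 'true') \
    .csv('/FileStore/tables/ds_salaries.csv')
df1 = df1.filter(col('salary') > 30000)  # no cast needed: salary is already int
df1 = df1.groupBy('work_year').agg(sum('salary').alias('total_salary'))
display(df1)  # the only job(s) now come from this action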