I have this code:
from pyspark.sql.functions import *
df1 = spark.read.option('header', 'true').csv('/FileStore/tables/ds_salaries.csv') \
    .withColumn('salary', col('salary').cast('int'))
df1 = df1.filter(col('salary') > 30000)
df1 = df1.groupBy('work_year').agg(sum('salary').alias('total_salary'))
display(df1)
When I execute this code, I can see 3 jobs, but there are only 2 actions here (the read and display()). So why is there an extra job, and what is it for? I am using Databricks Community Edition, which is a single node with default configs.
spark.read is eagerly evaluated here because it needs to find the number and names of the columns (option('header', 'true')). Internally it is similar to df.limit(1).collect(). So the first job that you see is the one that reads the first row to get the header. The second job is triggered by display(df1).
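You can check that the first job really comes from schema resolution by supplying an explicit schema, in which case spark.read has nothing to infer and stays lazy. A minimal sketch (df2 and the two-column schema are illustrative; in practice you would declare every column that ds_salaries.csv actually contains):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Abbreviated, illustrative schema -- list all real columns in practice.
schema = StructType([
    StructField('work_year', StringType(), True),
    StructField('salary', IntegerType(), True),
])

df2 = spark.read.schema(schema) \
    .option('header', 'true') \
    .csv('/FileStore/tables/ds_salaries.csv')
# Nothing to infer, so no job runs on this line; the header row is still
# skipped at read time, and the first job only appears at an action
# such as display(df2).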
Note:
With inferSchema = true, you will see one more job; this job is for finding the datatype of each column, for which Spark has to read the file. You may also see a short job described as "Listing leaf files..." in the Spark UI, which Spark runs to resolve the input path before reading.
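For reference, a minimal sketch of the inferSchema variant (same path as in the question; df3 is an illustrative name):

df3 = spark.read.option('header', 'true') \
    .option('inferSchema', 'true') \
    .csv('/FileStore/tables/ds_salaries.csv')
# This line alone triggers an extra job: Spark scans the data to work out
# each column's type before any action of yours has run.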