How many jobs, stages, and tasks will be created by this PySpark snippet, and why?


I have this code:

from pyspark.sql.functions import *
df1 = spark.read.option('header', 'true').csv('/FileStore/tables/ds_salaries.csv') \
    .withColumn('salary', col('salary').cast('int'))

df1=df1.filter(col('salary')>30000)
df1=df1.groupBy('work_year').agg(sum('salary').alias('total_salary'))
display(df1)

When I execute this code, I see 3 jobs, but there are only 2 actions here (the read and the display()), so why is there an extra job, and what is it for? I am using Databricks Community Edition, which is a single node with default configs.


1 Answer

Answered by ASR:

spark.read is eagerly evaluated here because it needs to find the number and names of the columns (due to option('header', 'true')). Internally it is similar to df.limit(1).collect(). So the first job you see is the one that reads the first row of the file to get the header.
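
A rough sketch of what that eager step amounts to (a simplification for illustration, not Spark's actual internal code):

# Read just the first line of the file to discover the column names.
# This is roughly what option('header', 'true') forces Spark to do up front;
# the real implementation lives inside Spark's CSV data source.
first_row = spark.read.text('/FileStore/tables/ds_salaries.csv').limit(1).collect()
column_names = first_row[0].value.split(',')
print(column_names)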

The second job is triggered by display(df1). The transformations in between (filter, groupBy, agg) never launch jobs on their own; only the action at the end does.
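
A minimal sketch of that laziness, using .count() as the action since display() is Databricks-specific:

from pyspark.sql.functions import col

df = spark.range(10).withColumn('double', col('id') * 2)  # no job yet: lazy
df = df.filter(col('id') > 3)                             # still no job
df.count()                                                # action: triggers a job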

Note:

  • If you set inferSchema=true, you will see one more job; it is for finding the datatype of each column, for which Spark has to read through the file.
  • If you supply the schema yourself, there will be only one job, for the display(df1) (see the sketch after this list).
  • If you pass a glob pattern, e.g. spark.read.csv("/parent/folder/"), you will see one extra job, Listing leaf files...
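
For the second note, here is a minimal sketch of supplying the schema up front. The column names and types are assumptions about ds_salaries.csv (the real file has more columns than these two); the point is that with an explicit schema, Spark has no eager work to do, so the only job comes from the action:

from pyspark.sql.functions import col, sum
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema; adjust to match the real columns of ds_salaries.csv.
schema = StructType([
    StructField('work_year', StringType(), True),
    StructField('salary', IntegerType(), True),
])

df1 = spark.read.schema(schema).option('header', 'true') \
    .csv('/FileStore/tables/ds_salaries.csv')
df1 = df1.filter(col('salary') > 30000)  # no cast needed: salary is already int
df1 = df1.groupBy('work_year').agg(sum('salary').alias('total_salary'))
display(df1)  # the only job(s) now come from this action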