I am trying to understand whether there is any difference between the following approaches in terms of memory usage, optimisation, parallelism, etc.
Scenario: CSV files in an S3 bucket. 100 columns, more than 200 million rows in total
Read Option 1:
val df = spark.read.csv("s3://some-location")
Read Option 2: First create an external table in Hive over the S3 location, then read it in Spark
# Hive - CREATE EXTERNAL TABLE ... LOCATION 's3://some-location'
# Spark - val df = spark.sql("select * from external_table")
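For reference, a minimal sketch of what Option 2 might look like, assuming the SparkSession is built with Hive support enabled and using hypothetical column definitions (the real DDL would declare all 100 columns):

// Option 2 sketch: register an external Hive table over the CSV files,
// then read it through the metastore. Column names here are made up.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS external_table (
    id BIGINT,
    col_1 STRING
    -- ... remaining columns ...
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 's3://some-location'
""")

val df = spark.sql("select * from external_table")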
I would like to know the optimised way to read in such a scenario, and what the considerations are.
Is one approach better than the other when further transformations and actions are applied downstream?
Thanks!
They will both suffer, because CSV is such an inefficient way to store data: every query has to scan the entire file, and, because numbers and dates are stored as text, the files are far bigger than they would be in a binary format.
Do not use CSV as a data format except for a one-off ETL operation. Prefer Avro as an exchange format, and use Parquet for the tables you actually run queries against.
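As a rough illustration of that split (the output paths are made up, and writing Avro needs the external spark-avro package on the classpath):

// Avro as an exchange format (requires the org.apache.spark:spark-avro package)
df.write.format("avro").save("s3://exchange-location/data-avro")

// Parquet for the tables you query: columnar and far smaller than CSV
df.write.parquet("s3://warehouse-location/data-parquet")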
For that single ETL stage, don't use schema inference: Spark would have to read the data twice, once to determine the schema and again to process it.
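A sketch of what that could look like, with a hypothetical handful of the 100 columns; passing an explicit schema means Spark reads the files only once:

import org.apache.spark.sql.types._

// Hypothetical subset of the 100 columns; declare all of them in practice.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("created_at", TimestampType)
))

val df = spark.read
  .option("header", "true")   // assumption: the CSV files have a header row
  .schema(schema)             // no inference pass over the data
  .csv("s3://some-location")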
200M rows with 100 columns isn't that big, so you could download a sample locally and experiment with standalone Spark until you have it exporting to an Avro/Parquet layout you like, then apply the same job to the entire directory, after which you can create a new table from the output.
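A rough sketch of that one-off conversion under the same assumptions (explicit schema as above, hypothetical output path and table name):

// Read the CSV once with the explicit schema, then persist it as a
// Parquet-backed table that downstream queries use instead of the CSV.
spark.read
  .option("header", "true")
  .schema(schema)
  .csv("s3://some-location")
  .write
  .mode("overwrite")
  .format("parquet")
  .option("path", "s3://some-location-parquet")   // hypothetical output location
  .saveAsTable("events_parquet")                  // hypothetical table name

Downstream transformations then run against the Parquet table, which only scans the columns a query actually touches.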