Using repartition in PySpark for a huge set of data

142 Views Asked by Sidhant Gupta At 30 March 2022 at 08:30

I have a large amount of data in a few Oracle tables (around 50 GB in total). I need to join these tables and perform some calculations, and none of the tables are partitioned. I need to read the result into a PySpark DataFrame and then write it to S3 as a CSV file. Running the query on the database, fetching the data, and writing it directly to S3 takes a long time, even though the data returned by the query is only around 100 MB.
Would calling repartition on this DataFrame improve the query performance in any way?
Or is there another way to make this operation faster?
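
For context, here is a rough sketch of the kind of setup being asked about. It is not a tested solution. Calling repartition only reshuffles rows after Spark has already fetched them, so it is unlikely to shorten the time Oracle spends on the join itself. The options that usually matter more are on the JDBC read: passing the join as a query so that only the ~100 MB result crosses the network, raising `fetchsize` (the Oracle driver defaults to 10 rows per round trip), and optionally splitting the read over a numeric column. The helper names, the subquery alias `q`, and the connection values are hypothetical; the option keys are Spark's standard JDBC data source options.

```python
def build_jdbc_options(url, query, user, password, fetchsize=10000,
                       partition_column=None, lower_bound=None,
                       upper_bound=None, num_partitions=None):
    """Build the option dict for spark.read.format("jdbc").

    Hypothetical helper: `query` is the full join/aggregation SQL
    (without a trailing semicolon), so Oracle does the heavy work
    and Spark only receives the result set.
    """
    options = {
        "url": url,
        "user": user,
        "password": password,
        "driver": "oracle.jdbc.OracleDriver",
        # More rows per round trip than the Oracle default of 10.
        "fetchsize": str(fetchsize),
    }
    if partition_column is None:
        # Single connection: fine for a result of roughly 100 MB.
        options["query"] = query
    else:
        if None in (lower_bound, upper_bound, num_partitions):
            raise ValueError("partitioned reads need both bounds and numPartitions")
        # Spark does not allow "query" together with "partitionColumn",
        # so the SQL is wrapped as an aliased subquery in "dbtable".
        options["dbtable"] = f"({query}) q"
        options["partitionColumn"] = partition_column
        options["lowerBound"] = str(lower_bound)
        options["upperBound"] = str(upper_bound)
        options["numPartitions"] = str(num_partitions)
    return options


def export_query_to_csv(spark, options, s3_path):
    """Read the query result over JDBC and write it to S3 as CSV.

    `spark` is an existing SparkSession; `s3_path` is a placeholder
    such as "s3a://some-bucket/some-prefix/".
    """
    df = spark.read.format("jdbc").options(**options).load()
    # The result is small, so coalesce(1) yields one CSV part file
    # without the full shuffle that repartition(1) would trigger.
    df.coalesce(1).write.mode("overwrite").option("header", True).csv(s3_path)
```

If the query itself is slow inside Oracle, none of these options change that. In that case the execution plan and indexes on the join columns would be the place to look first.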


