Is it possible (& more efficient) to read in distinct rows using PySpark?


**Background:**

New to PySpark and big data. About 10% of the rows in my data (read in from a CSV to a pandas DataFrame) are duplicates, and drop_duplicates() does not remove all of them. My troubleshooting step was to split the DataFrame, deduplicate the pieces, and concatenate the results, since the issue seems related to the size of the data (13M+ rows); a rough sketch of that approach is shown below. But this seems inefficient, especially since I will later have to run this operation on many CSVs of similar size. Some research suggests that parallelizing the operation with PySpark could lead to a more efficient solution overall.
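Roughly, the splitting-and-concatenating approach I've been trying looks something like this (the chunk size and file names are placeholders, not my actual values):

import pandas as pd

# read the CSV in chunks, de-duplicating each chunk, then de-duplicating
# once more across the concatenated result; chunk size is a placeholder
chunks = pd.read_csv("<data>.csv", sep=",", chunksize=1_000_000)

deduped = pd.concat(
    (chunk.drop_duplicates() for chunk in chunks),
    ignore_index=True,
).drop_duplicates()

deduped.to_csv("<deduped>.csv", index=False)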

**Questions:** What is an efficient approach to reading in data from a large CSV file (1GB+) so that duplicate rows are removed?

Is it possible to read in only distinct rows from a CSV using PySpark, and would this be more efficient than reading in the entire dataset, applying PySpark's distinct(), and writing to a new file? My intuition is that reading in only distinct rows, with parallelization, would reduce the amount of memory used during each iteration; or maybe I'm missing key concepts that make this inaccurate.

**Code example:**

**Update:** I finally installed PySpark and its dependencies (seemingly correctly) and ran the code below on a test DataFrame. I got hit with a very long ConnectionRefusedError message and a warning that the task size exceeds the recommended size. Maybe it's because I didn't include any code to parallelize the operation? Brain needs a break.
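In case the setup matters, my understanding is that the session can also be built with an explicit local master and driver-memory setting; the values below are placeholders, not something I've actually tuned:

from pyspark.sql import SparkSession

# explicit local session; "local[*]" uses all available cores,
# and the driver memory value is a placeholder
spark = (
    SparkSession.builder
    .appName('sparkdf')
    .master("local[*]")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)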

Here is my understanding of how the second option could work (getting distinct rows after reading in the data) on just one CSV:

import pandas as pd
from pyspark.sql import SparkSession

# read the CSV into a pandas DataFrame
df = pd.read_csv("<data>.csv", sep=",")

# create spark session
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# convert the pandas DataFrame to a spark DataFrame
df_spark = spark.createDataFrame(df)

# keep only distinct rows
df_distinct = df_spark.distinct()

# write de-duplicated data to a new CSV (convert back to pandas first,
# since a Spark DataFrame has no to_csv method)
df_distinct.toPandas().to_csv("<filename>.csv", index=False)

I've tried splitting the data, using drop_duplicates(), and concatenating the pieces into a new DataFrame. I've also been researching spark.read.format and other PySpark methods.
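For example, my understanding is that spark.read.format could be used to read the CSV directly into Spark, something like the sketch below (the header and schema-inference options are assumptions about my file):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# read the CSV directly into Spark; header/inferSchema are assumptions about the file
df_spark = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load("<data>.csv")
)

df_distinct = df_spark.distinct()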


There is 1 answer below.

Answered by Jeremy:

Don't use pandas to read the CSV; use Spark directly.

df_distinct = spark.read.csv("<data>.csv").distinct()

Then write with:

df_distinct.write.csv("<filename>.csv")

You do have to read in the entire file, though; Spark needs to scan all of the data in order to figure out which records are distinct. You can't get around that when loading a CSV file into Spark.
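One more practical note, as a minimal sketch assuming the CSV has a header row: header=True keeps the header from being treated as a data row, and since write.csv produces a directory of part files rather than a single CSV, coalesce(1) can be used when a single output file is needed.

# read with header handling and schema inference, then de-duplicate
df_distinct = (
    spark.read.csv("<data>.csv", header=True, inferSchema=True)
    .distinct()
)

# coalesce(1) forces a single output part file; drop it for very large data
df_distinct.coalesce(1).write.csv("<filename>.csv", header=True)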