How can I optimize ORC snappy compression in Spark?


My ORC-with-snappy dataset was 3.3 GB when it was originally constructed via a series of small writes to 128 KB files. It totals 400 million rows across 9 columns: one timestamp column and eight string columns no longer than 200 characters each.

When I read the whole folder with spark.read.orc("myfolder/*") and simply write it out to another folder with no changes, the dataset balloons to roughly 4 times its original size, even though the same default compression (snappy) is used.
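For context, this is roughly the round trip (the folder names are placeholders for my actual paths):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # ORC writes default to snappy compression
dataframe_out = spark.read.orc("myfolder/*")  # read the folder of small 128 KB ORC files into one DataFrame
dataframe_out.write.orc("myfolder_out")  # write back out with defaults; result is ~4x the original 3.3 GB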

This seems to be a known problem. I've tried the following, to no avail:

dataframe_out.write.orc(dirname_out) # default write options, 4x increase
dataframe_out.write.option("maxRecordsPerFile", 50000).orc(dirname_out) # 4x increase
dataframe_out.write.orc(dirname_out, compression="zlib") # results in 3x instead of 4x
dataframe_out.write.mode("overwrite").orc(dirname_out) # 4x increase
dataframe_out.coalesce(10000).write.mode("overwrite").orc(dirname_out) # 4x increase
dataframe_out.repartition(10000).write.mode("overwrite").orc(dirname_out) # 4x increase
dataframe_out.repartition(10000, "name_column").write.mode("overwrite").orc(dirname_out) # 4x increase
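For reference, my understanding is that the codec can also be checked and forced explicitly, roughly like this (I'm assuming spark.sql.orc.compression.codec is the relevant session setting):

print(spark.conf.get("spark.sql.orc.compression.codec"))  # session-level ORC codec, "snappy" by default
spark.conf.set("spark.sql.orc.compression.codec", "snappy")  # force snappy for the whole session
dataframe_out.write.option("compression", "snappy").orc(dirname_out)  # or force it per write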

Can someone give a brief overview of how best to optimize compression when writing ORC with snappy? This is not a question about which compression codec is best; I would just like to get to the bottom of why using the same compression format gives such inconsistent results. I'd like to get as close to the original dataset size as possible.
