To write a Parquet file compressed with the LZO codec, I used the following code:
df.coalesce(1).write.option("compression","lzo").option("header","true").parquet("PARQUET.parquet")
But I am getting this error:
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.compress.lzo.LzoCodec
The same error occurs when compressing with the Brotli codec. According to the Spark documentation, Brotli requires BrotliCodec to be installed, but no installation steps are given.
How can I install/add the required codecs so this works in PySpark?
EDIT - LZO compression works with ORC but not with Parquet
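For comparison, a minimal sketch of the ORC write that does succeed (assuming an active SparkSession, an existing DataFrame `df`, and the LZO codec available on the classpath; the output path is a placeholder):

```python
# Hypothetical sketch: the same compression option, but writing ORC
# instead of Parquet. Assumes `df` is an existing DataFrame and the
# LZO codec is on the classpath.
df.coalesce(1) \
  .write \
  .option("compression", "lzo") \
  .orc("ORC_OUTPUT")  # placeholder output path
```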
To write Parquet with LZO, you need the steps below:

1. Install lzop:

sudo apt-get install -y lzop

2. Download the hadoop-lzo jar into PySpark's jars directory:

wget https://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.20/hadoop-lzo-0.4.20.jar -P /usr/local/lib/python3.7/dist-packages/pyspark/jars/

3. Set the Parquet compression codec in your Spark configuration:

spark.conf.set("spark.sql.parquet.compression.codec", "lzo")

Now you should be able to write Parquet with lzo compression.
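Putting the steps together, a minimal end-to-end sketch (assuming steps 1 and 2 above have been done, so the hadoop-lzo jar is in PySpark's jars/ directory; the app name, DataFrame contents, and output path are placeholders):

```python
# Minimal sketch: configure LZO as the Parquet codec and write a DataFrame.
# Assumes native lzop is installed (step 1) and the hadoop-lzo jar is in
# PySpark's jars/ directory (step 2).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lzo-parquet-demo")  # hypothetical app name
    .config("spark.sql.parquet.compression.codec", "lzo")
    .getOrCreate()
)

# Placeholder data for illustration.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# With the codec on the classpath, this writes LZO-compressed Parquet.
df.coalesce(1).write.mode("overwrite").parquet("PARQUET_OUT.parquet")

spark.stop()
```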