To write a Parquet file compressed with the LZO codec, I used the following code:
df.coalesce(1).write.option("compression","lzo").option("header","true").parquet("PARQUET.parquet")
But I am getting this error:
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.compress.lzo.LzoCodec
The same error occurs when compressing with the Brotli codec. According to the Spark documentation, Brotli requires BrotliCodec to be installed, but no installation steps are given.
How can I install/add the required codecs so this works in PySpark?
EDIT - LZO compression works with ORC but not with Parquet
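For comparison, a minimal sketch of the ORC write that does succeed (assuming an active SparkSession, an existing DataFrame `df`, and the LZO codec available on the classpath; the output path is a placeholder):

```python
# Hypothetical sketch: the same compression option, but writing ORC
# instead of Parquet. Assumes `df` is an existing DataFrame and the
# LZO codec is on the classpath.
df.coalesce(1) \
  .write \
  .option("compression", "lzo") \
  .orc("ORC_OUTPUT")  # placeholder output path
```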
To write Parquet with LZO, you need the steps below:

1. Install lzop:

sudo apt-get install -y lzop

2. Download the hadoop-lzo jar into PySpark's jars directory:

wget https://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.20/hadoop-lzo-0.4.20.jar -P /usr/local/lib/python3.7/dist-packages/pyspark/jars/

3. Set the Parquet compression codec in your Spark configuration:

spark.conf.set("spark.sql.parquet.compression.codec", "lzo")

Now you should be able to write Parquet with lzo compression.
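Putting the steps together, a minimal end-to-end sketch (assuming steps 1 and 2 above have been done, so the hadoop-lzo jar is in PySpark's jars/ directory; the app name, DataFrame contents, and output path are placeholders):

```python
# Minimal sketch: configure LZO as the Parquet codec and write a DataFrame.
# Assumes native lzop is installed (step 1) and the hadoop-lzo jar is in
# PySpark's jars/ directory (step 2).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lzo-parquet-demo")  # hypothetical app name
    .config("spark.sql.parquet.compression.codec", "lzo")
    .getOrCreate()
)

# Placeholder data for illustration.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# With the codec on the classpath, this writes LZO-compressed Parquet.
df.coalesce(1).write.mode("overwrite").parquet("PARQUET_OUT.parquet")

spark.stop()
```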