How to fix "java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol" Pyspark


Below are the runtime versions in PyCharm.

Java Home       /Library/Java/JavaVirtualMachines/jdk-11.0.16.1.jdk/Contents/Home
Java Version    11.0.16.1 (Oracle Corporation)
Scala Version   2.12.15
Spark Version   spark-3.3.1
Python Version  3.9

I am trying to write a PySpark DataFrame to CSV as below:

df.write.csv("/Users/data/data.csv")

and get the following error:

Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
  File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/pyspark/sql/readwriter.py", line 1240, in csv
    self._jwrite.csv(path)
  File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/Users/mambaforge-pypy3/envs/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o747.csv.
: java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol

And the Spark conf is as below:

spark_conf = SparkConf()
spark_conf.setAll(parameters.items())
spark_conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4')
spark_conf.set('spark.hadoop.fs.s3.aws.credentials.provider',
               'org.apache.hadoop.fs.s3.TemporaryAWSCredentialsProvider')
spark_conf.set('spark.hadoop.fs.s3.access.key', os.environ.get('AWS_ACCESS_KEY_ID'))
spark_conf.set('spark.hadoop.fs.s3.secret.key', os.environ.get('AWS_SECRET_ACCESS_KEY'))
spark_conf.set('spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled', 'true')
spark_conf.set("com.amazonaws.services.s3.enableV4", "true")
spark_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark_conf.set("fs.s3a.aws.credentials.provider",
               "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
spark_conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
spark_conf.set("hadoop.fs.s3a.path.style.access", "true")
spark_conf.set("hadoop.fs.s3a.fast.upload", "true")
spark_conf.set("hadoop.fs.s3a.fast.upload.buffer", "bytebuffer")
spark_conf.set("fs.s3a.path.style.access", "true")
spark_conf.set("fs.s3a.multipart.size", "128M")
spark_conf.set("fs.s3a.fast.upload.active.blocks", "4")
spark_conf.set("fs.s3a.committer.name", "partitioned")
spark_conf.set("spark.hadoop.fs.s3a.committer.name", "directory")
spark_conf.set("spark.sql.sources.commitProtocolClass",
               "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
spark_conf.set("spark.sql.parquet.output.committer.class",
               "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
spark_conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")

Any help to fix this issue is appreciated. Thanks!!


There are 2 answers below.

Sean Owen

It looks like you do not have the hadoop-cloud module added. The class is not part of core Spark: https://search.maven.org/artifact/org.apache.spark/spark-hadoop-cloud_2.12/3.3.1/jar
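
For example, a minimal sketch of one way to pull that module in at session start (assuming a local PySpark session; the Maven coordinates below use the Scala 2.12 / Spark 3.3.1 versions listed in the question and must match your build):

from pyspark.sql import SparkSession

# Ask Spark to resolve the spark-hadoop-cloud module (which contains
# PathOutputCommitProtocol) from Maven when the session starts.
spark = (
    SparkSession.builder
    .appName("csv-write-example")  # hypothetical app name
    .config("spark.jars.packages", "org.apache.spark:spark-hadoop-cloud_2.12:3.3.1")
    .getOrCreate()
)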

Boyan

You have to add the required JAR to the Spark job's packages. Downloading it locally won't change anything on its own; the JAR must be on the job's classpath. To do that, change your code from:

spark_conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4')

to

spark_conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4,org.apache.spark:spark-hadoop-cloud_2.12:3.3.1')
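
As a rough sketch (not the exact code from the question), the corrected packages line wired into a session could look like the following; the remaining S3A and committer settings from the question would be set on the same conf object:

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_conf = SparkConf()
# Resolve both hadoop-aws and spark-hadoop-cloud so that
# org.apache.spark.internal.io.cloud.PathOutputCommitProtocol is on the classpath.
spark_conf.set(
    "spark.jars.packages",
    "org.apache.hadoop:hadoop-aws:3.3.4,org.apache.spark:spark-hadoop-cloud_2.12:3.3.1",
)
spark_conf.set(
    "spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
)

spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
# With the module on the classpath, the CSV write from the question should work:
# df.write.csv("/Users/data/data.csv")

Note that the "_2.12" suffix and the "3.3.1" version need to match the Scala and Spark versions listed at the top of the question.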