I'm trying to run Apache Hudi on Amazon EMR on EKS using the aws emr-containers start-job-run command, but I'm encountering a NoSuchMethodError with the following error message:
pod emr-on-eks-spark.spark-000000033dvo7gou032-driver exited with code 1 Error: 24/02/19 05:09:43 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 9873) (10.3.0.113 executor 1): java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf$.LEGACY_AVRO_REBASE_MODE_IN_WRITE()Lorg/apache/spark/internal/config/ConfigEntry;
24/02/19 05:10:19 WARN TaskSetManager: Lost task 0.1 in stage 2.0 (TID 9874) (10.3.0.113 executor 1): java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf$.LEGACY_AVRO_REBASE_MODE_IN_WRITE()Lorg/apache/spark/internal/config/ConfigEntry;
24/02/19 05:10:57 WARN TaskSetManager: Lost task 0.2 in stage 2.0 (TID 9875) (10.3.1.191 executor 2): java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf$.LEGACY_AVRO_REBASE_MODE_IN_WRITE()Lorg/apache/spark/internal/config/ConfigEntry;
The command I am using is:
aws emr-containers start-job-run \
  --name orders \
  --virtual-cluster-id <clusterId> \
  --region us-east-1 \
  --execution-role-arn arn:aws:iam::<accountId>:role/execution-role \
  --release-label emr-6.10.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://<bucket_location>/hudi-utilities-bundle_2.12-0.12.2.jar",
      "entryPointArguments": [
        "--table-type", "COPY_ON_WRITE",
        "--source-ordering-field", "created_time",
        "--props", "s3://<bucket_location>/config/orders.properties",
        "--source-class", "org.apache.hudi.utilities.sources.ParquetDFSSource",
        "--target-table", "orders",
        "--target-base-path", "s3://<bucket_location>/orders",
        "--transformer-class", "org.apache.hudi.utilities.transform.AWSDmsTransformer",
        "--transformer-class", "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer",
        "--schemaprovider-class", "org.apache.hudi.utilities.schema.FilebasedSchemaProvider",
        "--payload-class", "org.apache.hudi.payload.AWSDmsAvroPayload",
        "--op", "UPSERT"
      ],
      "sparkSubmitParameters": "--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --packages org.apache.spark:spark-avro_2.12:3.5.0 --jars s3://<bucket_location>/config/hudi-utilities-bundle_2.12-0.12.2.jar,s3://<bucket_location>/config/hudi-spark3.3-bundle_2.12-0.12.2.jar --conf spark.driver.memory=2G --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.sql.catalogImplementation=hive --conf spark.serializer=org.apache.spark.serializer.KryoSerializer"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": { "logUri": "s3://<bucket_location>/elasticmapreduce/emr-containers" }
    }
  }'
I think the problem is with the spark-avro package I'm pulling in via --packages, but I haven't been able to get it working. Any help would be much appreciated.
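For context, here is my reasoning for suspecting a version mismatch. As I understand it (from the EMR release notes; please correct me if this is wrong), emr-6.10.0 ships Spark 3.3.x, while my --packages line pulls spark-avro built for Spark 3.5.0, and spark-avro is supposed to match the Spark line of the runtime. A minimal sketch of the coordinate I believe I should be using instead (the version mapping below is my assumption, not something I've verified end to end):

```shell
#!/usr/bin/env bash
# Map of EMR release -> bundled Spark version (my assumption, taken from
# the EMR release notes; worth double-checking for your exact release).
declare -A EMR_SPARK=( ["emr-6.10.0"]="3.3.1" ["emr-6.15.0"]="3.4.1" )

release="emr-6.10.0"
spark_version="${EMR_SPARK[$release]}"

# spark-avro must be built against the same Spark line as the runtime;
# pulling spark-avro_2.12:3.5.0 onto a Spark 3.3 runtime is what I
# suspect triggers the NoSuchMethodError above.
echo "org.apache.spark:spark-avro_2.12:${spark_version}"
```

If that reasoning is right, the fix would just be changing the --packages coordinate in sparkSubmitParameters to match the runtime's Spark version, but I'd appreciate confirmation from someone who has hit this on EMR on EKS.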