Not able to read data from azurite using pyspark for local testing


I am trying to read a Parquet file stored in Azurite from PySpark locally, and I am getting the error below:

23/05/21 14:53:19 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-azure-file-system.properties,hadoop-metrics2.properties
23/05/21 14:53:19 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: wasbs://[email protected]/employees.     
org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.retrieveMetadata(AzureNativeFileSystemStore.java:2250)
        at org.apache.hadoop.fs.azure.NativeAzureFileSystem.getFileStatusInternal(NativeAzureFileSystem.java:2699)
        at org.apache.hadoop.fs.azure.NativeAzureFileSystem.getFileStatus(NativeAzureFileSystem.java:2644)
        at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1777)
        at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
        at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.lang.Thread.run(Unknown Source)
Caused by: com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
        at com.microsoft.azure.storage.StorageException.translateFromHttpStatus(StorageException.java:175)
        at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:94)
        at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:305)
        at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:175)
        at com.microsoft.azure.storage.blob.CloudBlobContainer.downloadAttributes(CloudBlobContainer.java:565)
        at org.apache.hadoop.fs.azure.StorageInterfaceImpl$CloudBlobContainerWrapperImpl.downloadAttributes(StorageInterfaceImpl.java:255)
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.checkContainer(AzureNativeFileSystemStore.java:1355)
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.retrieveMetadata(AzureNativeFileSystemStore.java:2166)
        ... 22 more

I have added the following configuration:

"spark.hadoop.fs.azure.account.key.devstoreaccount1.dfs.core.windows.net":  "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==",
"spark.jars.packages" : "org.apache.hadoop:hadoop-azure:3.2.4,com.microsoft.azure:azure-storage:3.1.0",
"spark.hadoop.fs.azure.account.key.devstoreaccount1.blob.core.windows.net":  "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==",
"spark.executor.extraClassPath" : "C:\\Users\\RUPAGARW\\Downloads\\spark-3.2.4-bin-hadoop3.2\\jars\\*"
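For comparison, here is a minimal sketch of how the Hadoop-side settings could look when targeting Azurite rather than a real storage account. The `fs.azure.storage.emulator.account.name` property comes from hadoop-azure's emulator support; the container name (`employeecontainer`) and the exact key layout are assumptions based on the defaults above, not a verified recipe:

```python
# Sketch: Hadoop configuration for reading from Azurite (local emulator).
# Assumes Azurite is running on the default blob port 10000 with the
# well-known devstoreaccount1 account and key. Property names are taken
# from hadoop-azure's emulator support; treat this as a starting point.

AZURITE_ACCOUNT = "devstoreaccount1"
AZURITE_KEY = (
    "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq"
    "/K1SZFPTOtr/KBHBeksoGMGw=="
)

azurite_conf = {
    # Route this account to the local emulator endpoint instead of
    # <account>.blob.core.windows.net, so the signature is accepted.
    "fs.azure.storage.emulator.account.name": AZURITE_ACCOUNT,
    # Account key, keyed by the bare account name (no real-cloud suffix).
    f"fs.azure.account.key.{AZURITE_ACCOUNT}": AZURITE_KEY,
}

# Azurite serves plain HTTP by default, so wasb:// (not wasbs://) is the
# scheme that matches its endpoint.
path = f"wasb://employeecontainer@{AZURITE_ACCOUNT}/employees"
```

With a live session, each pair would be applied via `spark.sparkContext._jsc.hadoopConfiguration().set(key, value)` before calling `spark.read.parquet(path)`.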

I am not sure why this error occurs, because when I use the Python SDK to list data in the storage account, it works fine.
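The working Python-SDK check mentioned above presumably uses the well-known Azurite connection string, which pins the blob endpoint to the local emulator. A sketch of that connection string (the container name is an assumption); it can be passed to `BlobServiceClient.from_connection_string` from the `azure-storage-blob` package to list blobs:

```python
# Well-known Azurite development connection string: default account,
# default key, and an explicit BlobEndpoint pointing at the local
# emulator. This explicit endpoint is why the SDK check succeeds while
# the wasbs:// path, which targets *.blob.core.windows.net, does not.
AZURITE_CONN_STR = (
    "DefaultEndpointsProtocol=http;"
    "AccountName=devstoreaccount1;"
    "AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq"
    "/K1SZFPTOtr/KBHBeksoGMGw==;"
    "BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
)

# With a running Azurite instance and azure-storage-blob installed:
#   service = BlobServiceClient.from_connection_string(AZURITE_CONN_STR)
#   for blob in service.get_container_client("employeecontainer").list_blobs():
#       print(blob.name)
```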

I have already tried adding the following configs:

# Note: keys set directly on hadoopConfiguration() are Hadoop keys and
# should not carry the "spark.hadoop." prefix.
self.spark.sparkContext._jsc.hadoopConfiguration().set("fs.azure.account.key.devstoreaccount1.dfs.core.windows.net", "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==")
self.spark.sparkContext._jsc.hadoopConfiguration().set("fs.azure.account.key.devstoreaccount1.blob.core.windows.net", "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==")