Currently, in my Azure Databricks workspace, Unity Catalog is enabled with external locations configured.
I can read the file using the DataFrame API:
testDf = spark.read.option("header", True).format("csv").load('abfss://container_name@storage_account_name.dfs.core.windows.net/RDD_testing/input.txt')
However, if I read the same file using the RDD API:
rdd_in = sc.textFile('abfss://container_name@storage_account_name.dfs.core.windows.net/RDD_testing/input.txt')
it fails with the error:
Failure to initialize configuration for storage account storage_account_name.dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.keyInvalid configuration value detected for fs.azure.account.key
I also tried setting the configuration below, but with no luck:
accessKey = 'my_access_key'
spark.conf.set("spark.hadoop.fs.azure.account.key.storage_account_name.dfs.core.windows.net", accessKey)
I want to read the file using the RDD API directly, instead of reading it as a DataFrame and then converting it to an RDD.
How can I fix this?
Thanks
The error suggests the account key never reached the configuration that the RDD API actually reads. spark.conf.set only updates the Spark session configuration, which the DataFrame reader consults; sc.textFile goes through the SparkContext's underlying Hadoop configuration, which that call does not touch (and the spark.hadoop. prefix is only applied when the key is set in the cluster's Spark config at startup, not at runtime). Setting the key directly on the Hadoop configuration, as early as possible after initializing your Spark session, is the more direct method and often resolves this. Here is a minimal sketch, reusing the placeholder account, container, and key names from your question:
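accessKey = 'my_access_key'  # placeholder, as in the question

# Set the key on the SparkContext's Hadoop configuration (note: no
# 'spark.hadoop.' prefix here), since this is the configuration that
# sc.textFile consults when it resolves an abfss:// path.
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.storage_account_name.dfs.core.windows.net",
    accessKey
)

# The RDD read should now find the account key:
rdd_in = sc.textFile('abfss://container_name@storage_account_name.dfs.core.windows.net/RDD_testing/input.txt')
print(rdd_in.take(5))

An equivalent alternative is to add the key in the cluster's Spark config (in the cluster UI) as spark.hadoop.fs.azure.account.key.storage_account_name.dfs.core.windows.net, so that it lands in the Hadoop configuration at startup. One caveat: as far as I know, Unity Catalog clusters in shared access mode do not expose the RDD API at all, so the above assumes a single-user (assigned) cluster.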
Hope this helps solve your problem.