We have a process that writes Spark SQL to a file; in the production environment this process will generate thousands of Spark SQL files. These files will be created in an ADLS Gen2 directory.
Background: This process was previously used to generate Pig scripts, and a MapReduce (MR) job read these scripts and submitted each Pig script to the HDInsight cluster using the PigServer API. We are migrating from HDInsight to Databricks and are trying to achieve the same there.
sample spark file:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._  // needed for the $"col" column syntax
// Backticks allow an identifier that starts with a digit
val `2023_I` = spark.sql("select rm.* from reu_master rm where rm.year = 2023 and rm.system_part='I'")
val criteria1_r1 = `2023_I`.filter($"field_id" === "abcned" || $"field_id" === "gei")
criteria1_r1.write.mode("overwrite").save("<path_to_adls_dir>")
We are exploring the best way to invoke these files from Azure Databricks. We would like to avoid reading each file into a variable through Python and then using that variable in a spark.sql statement.
The best approach depends on your resources; from my experience handling similar setups, two options come to mind. To avoid reading the files into a variable through Python and using that variable in a spark.sql statement, you can use the following:
The first option is the %sql magic command in a Databricks notebook. The procedure is simple and reduces overhead. For example:
a. Store your Spark SQL scripts in separate files on ADLS Gen2 (this assumes you are working in a Databricks notebook).
b. In your Databricks notebook, use the %sql magic command to execute the SQL directly. You can reference the data at the ADLS Gen2 path by creating a temporary view over it; sample code for creating such a view is sketched below.
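This sketch assumes the underlying reu_master data is stored as Parquet under a hypothetical container and storage account; adjust the path, format, and names to your environment:

%sql
-- Hypothetical ADLS Gen2 location; replace container, account, and directory with your own
CREATE OR REPLACE TEMPORARY VIEW reu_master
USING parquet
OPTIONS (
  path "abfss://data@mystorageacct.dfs.core.windows.net/reu_master/"
);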
c. After you have created the view, you can run your Spark SQL queries on it using the %sql magic command.
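For example, a query against that view might look like the following (a sketch that mirrors the filters in the question's sample file; adjust table and column names to your scripts):

%sql
-- Query the temporary view created above; field names and values come from the sample file
SELECT *
FROM reu_master
WHERE year = 2023
  AND system_part = 'I'
  AND field_id IN ('abcned', 'gei');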
You should be fine if your Databricks cluster has the necessary permissions to access the ADLS Gen2 storage and read the SQL script files. Also, adjust the SQL scripts and commands as needed for your specific use case and file locations.
My second option to avoid reading the files into a variable through Python and using that variable in a spark.sql statement: