We have a process that writes Spark SQL to a file; in the production environment this process will generate thousands of Spark SQL files. These files will be created in an ADLS Gen2 directory.
Background: This process was previously used to generate Pig scripts, and a MapReduce (MR) job read these scripts and submitted each Pig script to the HDInsight cluster using the PigServer API. We are migrating from HDInsight to Databricks and are trying to achieve the same there.
sample spark file:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._  // needed for the $"col" column syntax
// Backticks allow an identifier that starts with a digit
val `2023_I` = spark.sql("select rm.* from reu_master rm where rm.year = 2023 and rm.system_part='I'")
val criteria1_r1 = `2023_I`.filter($"field_id" === "abcned" || $"field_id" === "gei")
criteria1_r1.write.mode("overwrite").save("<path_to_adls_dir>")
We are exploring the best way to invoke these files from Azure Databricks. We would like to avoid reading each file into a variable through Python and then using that variable in a spark.sql statement.
The best approach depends on your resources; from my experience handling similar setups, two options come to mind. To avoid reading the files into a variable through Python and using that variable in a spark.sql statement, you can use the following:
The first option is the %sql magic command in a Databricks notebook. The procedure is simple and reduces overhead. For example:
a. Store your Spark SQL scripts in separate files on ADLS Gen2 (this assumes you are working in a Databricks notebook).
b. In your Databricks notebook, use the %sql magic command to execute the SQL directly. You can reference the data at the ADLS Gen2 path by creating a temporary view over it; sample code for creating such a view is sketched below.
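This sketch assumes the underlying reu_master data is stored as Parquet under a hypothetical container and storage account; adjust the path, format, and names to your environment:

%sql
-- Hypothetical ADLS Gen2 location; replace container, account, and directory with your own
CREATE OR REPLACE TEMPORARY VIEW reu_master
USING parquet
OPTIONS (
  path "abfss://data@mystorageacct.dfs.core.windows.net/reu_master/"
);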
c. After you have created the view, you can run your Spark SQL queries on it using the %sql magic command.
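For example, a query against that view might look like the following (a sketch that mirrors the filters in the question's sample file; adjust table and column names to your scripts):

%sql
-- Query the temporary view created above; field names and values come from the sample file
SELECT *
FROM reu_master
WHERE year = 2023
  AND system_part = 'I'
  AND field_id IN ('abcned', 'gei');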
You should be fine if your Databricks cluster has the necessary permissions to access the ADLS Gen2 storage and read the SQL script files. Also, adjust the SQL scripts and commands as needed for your specific use case and file locations.
My second option to avoid reading the files into a variable through Python and using that variable in a spark.sql statement: