In short, my need: create a Spark DataFrame from a more or less complex T-SQL query (SQL Server) and/or from the output of a SQL Server stored procedure.
As far as I understand, Spark does not allow executing queries in the dialect of the underlying data source. Yes, there is a way to obtain a low-level object and run stored procedures, but that way I don't get a Spark DataFrame as output.
So I thought I would run the query the classical pyodbc way, fetch the results, and then build the Spark DataFrame with SparkSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True), providing both the data and the schema. I can obtain the data, but I can't build the schema (a list of (column name, data type) pairs) from the output cursor. Here is a working example that (generates and) extracts sample data from a local instance of SQL Server:
import pyodbc
connection_string = "Driver={SQL Server};Server=LOCALHOST;Database=master;Trusted_Connection=yes;"
db_connection = pyodbc.connect(connection_string)
sql_query = """
SET NOCOUNT ON
DECLARE @TBL_TEST AS TABLE (
column_1 INT NOT NULL PRIMARY KEY CLUSTERED IDENTITY(1, 1),
column_2 VARCHAR(10) NOT NULL,
column_3 VARCHAR(20) NULL,
column_4 INT NOT NULL
)
INSERT INTO @TBL_TEST (column_2, column_3, column_4)
VALUES
('test1_col2', 'test1_col3', 100),
('test2_col2', 'test2_col3', 200),
('test3_col2', NULL, 300)
SET NOCOUNT OFF
SELECT t.* FROM @TBL_TEST AS t
"""
cursor = db_connection.cursor()
rows = cursor.execute(sql_query).fetchall()
cursor.close()
db_connection.close()
print(rows)
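For completeness, the cursor does expose column metadata through the standard DB-API attribute cursor.description (it has to be read before the cursor is closed), but I don't see how to map it to a schema that createDataFrame() accepts. A variation of the example above:
# Same connection string and query as above, but keep the cursor open
# long enough to read its metadata.
db_connection = pyodbc.connect(connection_string)
cursor = db_connection.cursor()
cursor.execute(sql_query)
rows = cursor.fetchall()
# cursor.description is a sequence of 7-tuples:
# (name, type_code, display_size, internal_size, precision, scale, null_ok).
# In pyodbc, type_code is the Python class used for the column values (int, str, ...).
for name, type_code, _, _, precision, scale, null_ok in cursor.description:
    print(name, type_code, precision, scale, null_ok)
cursor.close()
db_connection.close()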
How can I extract the schema from the returned cursor and obtain a schema object to give to the createDataFrame() function?
Remember that my goal is the one stated at the top, so other approaches are also welcome!
Thank you in advance!



If you use pyodbc, the data is pulled through a single connection on one node, not across the whole cluster. For larger data sets, this prevents full use of the cluster and leads to performance issues.
It is better to use a Spark connector for JDBC. Microsoft has one:
https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16
Create the complex T-SQL as views and just read them. That is what Spark was made for: reading data. Using the Spark JDBC connector, you can also read in parallel if needed by changing the partitioning options (see the commented options in the read sketch below).
Install the Maven library for the correct version of Spark.
I am using a Spark version > 3.1.
I have the AdventureWorks database with a view called v.
Make a typical spark.read() call with the JDBC driver.
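Here is a minimal sketch of that call. The server name, credentials, and view name are placeholders for my AdventureWorks setup; adjust them for your environment, and note that spark is the SparkSession already available in a notebook session.
# Minimal sketch of reading the view through the Microsoft Spark connector.
# Server, database, credentials and view name below are placeholders.
url = "jdbc:sqlserver://your-server-name:1433;databaseName=AdventureWorks;"

df = (
    spark.read
    .format("com.microsoft.sqlserver.jdbc.spark")   # Microsoft's connector
    .option("url", url)
    .option("dbtable", "dbo.v")                      # the view with the complex T-SQL
    .option("user", "your_user")
    .option("password", "your_password")
    # Optional: read in parallel by partitioning on a numeric column.
    # .option("partitionColumn", "SomeIntColumn")
    # .option("lowerBound", "1")
    # .option("upperBound", "1000000")
    # .option("numPartitions", "8")
    .load()
)

df.show()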
Here are the results of displaying the dataframe.
Is the dataframe strictly typed? The answer is yes, since it gets the field information from SQL Server.
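A quick way to confirm the typing is to print the schema of the dataframe:
# The column types come from the SQL Server metadata, not from sampling the data,
# e.g. INT columns map to integer and VARCHAR columns to string.
df.printSchema()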
Last but not least, is the view complex? The image below shows that 8 tables are joined and aggregated to get the final result for the view.
In summary, use views in the database to pre-compile your data sets for Spark. Use the JDBC connector from Microsoft to read from and write to SQL Server using dataframes.
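Writing back is the mirror image of the read; a rough sketch with the same placeholder connection details (the target table name here is just an assumption):
# Rough sketch of a write with the same connector; url, credentials and
# the target table name are placeholders.
(
    df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("append")                                  # or "overwrite"
    .option("url", url)
    .option("dbtable", "dbo.SomeTargetTable")
    .option("user", "your_user")
    .option("password", "your_password")
    .save()
)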
As for the stored procedure, there is a way to use the driver to execute non-queries. I will have to look for the code. Stay tuned for an update or part 2.