I am working on a Databricks interactive cluster and I want to query CSV files stored in ADLS or DBFS using SQL.
Issue:
I can easily do that for Delta tables and Parquet files. However, when I do this for a CSV with specific delimiters, all the columns get loaded as one column.
%sql SELECT *
FROM delta.`dbfs:/path/to/delta/table`
^^ displays the output correctly
%sql SELECT *
FROM csv.`dbfs:/path/to/file.csv`
^^ displays the output as one column, without headers. Basically it needs some options passed to specify the separator and header.
Potential Solutions:
I can easily do this in PySpark, spark.read.csv("dbfs:/path/to/file.csv", header=True, sep="|"), but how can I do this in Spark SQL (the %sql magic command)?
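One workaround I'm aware of is to do the read in Python and register a temporary view that %sql can then query (a sketch; csv_view is just an illustrative name, and spark is the session that Databricks notebooks predefine):

# Read the pipe-delimited file with a header row, then expose it to SQL as a temp view
df = spark.read.csv("dbfs:/path/to/file.csv", header=True, sep="|")
df.createOrReplaceTempView("csv_view")

%sql SELECT * FROM csv_view

But that still requires a Python cell before the SQL one.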
I can create a table specifying options
%sql CREATE TABLE schema.table
USING CSV
OPTIONS (path "dbfs:/path/to/file.csv", header "true", delimiter "|")
and then I can query the data.
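For completeness, querying and later cleaning up that intermediate table would look like this (reusing the illustrative schema.table name from above; because a path was given, the table is external, so DROP TABLE removes only the definition and leaves the CSV file in place):

%sql SELECT * FROM schema.table

%sql DROP TABLE schema.table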
But is there a way to just SELECT * from the CSV data without creating these intermediate tables?
Directly executing a SQL query with custom options on a CSV file, without using spark.read, is not supported in Databricks. To access a file stored in DBFS (Databricks File System) using Databricks SQL, you first need to create a table or view that references the file. This does not copy the data; the table or view simply points at the CSV file. Once the table or view is created, you can query it to retrieve the data from the file.
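If you want to avoid a permanent table, a temporary view works the same way and disappears with the session (a minimal sketch, reusing the illustrative pipe-delimited path from the question; csv_file_view is an illustrative name):

%sql CREATE TEMPORARY VIEW csv_file_view
USING CSV
OPTIONS (path "dbfs:/path/to/file.csv", header "true", delimiter "|")

%sql SELECT * FROM csv_file_view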
I have tried your approach and I am able to read the data from the CSV using the header, with
OPTIONS (header "true", delimiter ",")
But I am getting an ERROR when I try the below approach:
%sql SELECT * FROM csv.`dbfs:/FileStore/tables/sample.csv`
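Applying the view workaround from the answer above to this particular file would look like the following (a sketch; sample_csv is an illustrative name, and the path is the one from the comment):

%sql CREATE TEMPORARY VIEW sample_csv
USING CSV
OPTIONS (path "dbfs:/FileStore/tables/sample.csv", header "true", delimiter ",")

%sql SELECT * FROM sample_csv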
Reference: Databricks - Read CSV file from folder