Parquet, select from all files that have a specific column

417 Views Asked by At

I need to "SELECT DISTINCT COLUMN_X" from all parquet files that have "COLUMN_X"

I tried with joining the OPENROWSET with sys.tables and sys.columns but it fails with "column doesn't exist" It seems that the JOIN operation comes after the files opening (OPENROWSET)

SELECT distinct COLUMN_X
FROM OPENROWSET(
    BULK 'db/schema/*/**'
    , FORMAT = 'PARQUET', DATA_SOURCE= 'datalake' ) x
inner join sys.tables t on t.name= x.filepath(1)
where exists ( select 1
    from sys.columns c
    where c.object_id= t.object_id
        and c.name= 'COLUMN_X' )

Is it possible to filter on OPENROWSET phase based on file metadata? that is, accepting the file if column exists in that file?

ADDED #1:

I have hundreds of different external tables (from parquet files), with different structure each I need to do a select distintct of COLUMN_X across all tables, as long as COLUMN_X exists

1

There are 1 best solutions below

3
Ziya Mert Karakas On

In synapse serverless, the OPENROWSET function with BULK operations doesn't support dynamic filtering based on file metadata during the file selection phase. The OPENROWSET function operates on the files as-is, and filtering is done post-file opening. This is why your approach using JOINs with system tables is not working as expected.

Query below will execute based on the metadata provided by the external table definition, and it should only select from the files that match the schema.

Define an external table and make it match the exact schema of your files such as this statement (This is not CETAS):

CREATE EXTERNAL TABLE X
(
    COLUMN_X datatype,
    # and all other columns with datatypes
)
USING (
    LOCATION = 'db/schema',
    DATA_SOURCE = 'datalake',
    FILE_FORMAT = (TYPE = 'PARQUET')
);

SELECT DISTINCT COLUMN_X
FROM X;

Source: Documentation

Giving error:

enter image description here