Azure Synapse > Serverless SQL > UPSERT parquet files is possible?

1k Views Asked by At

I'm using an Azure Synapse Serverless SQL database.

I have a first copy activity that uses Azure SQL tables as sources and an Azure storage gen2, where I store .parquet files as sink. From these .parquet files, I use CETAS to create External tables in my Serverless SQL DB (for the context: I'm working with several Azure SQL databases, so these external tables will allow me to write cross db queries). In other words, this Serverless SQL DB is my ODS database.

I have then a second copy activity that identifies incremental changes in the source tables (using the SYS_CHANGE_VERSION of the corresponding CHANGETABLE of the source DB). This second copy activity also outputs .parquet files.

In the end, I have 2 parquet files: 1 with the full content of the source table + 1 with the content to insert or update. External tables in a Serverless SQL DB are only metadata, so no way to do DML operations on them, so my question would be: is there a way to "merge" my 2 parquet files into 1 single one (without duplicates of course) that I could use to recreate an updated external table?

Alternatively, I see that I can choose in the copy activity sink the copy method "Upsert" and provide a KeyColumn (PK of my table), but it doesn't work, saying that: "Message=INSERT operation is not allowed for this type of table." (which looks normal as the associated sink dataset is pointing on my external table, that is read only)

copy activity: copy activity

Any idea on how to solve this? thanks!

1

There are 1 best solutions below

15
Aswin On BEST ANSWER

Since external tables in Azure Synapse Serverless SQL database are read-only, you cannot use the Upsert copy method to update the external table directly.

  • If the full load file and incremental files are in the same folder in ADLS, then while creating the external tables you can give ** in place of filename.

Sample external table script:

CREATE  EXTERNAL  TABLE [dbo].[external_table] (
[PK] nvarchar(4000),
[name] nvarchar(4000),
[ingestion_time] nvarchar(4000)
)
WITH (
LOCATION = '<folder-name>/**',
DATA_SOURCE = <datasource-name>,
FILE_FORMAT = <fileformatname>
)

This script will make sure to combine all the data under that folder.

  • When the old data loaded in the initial version got replaced with new data, It is necessary to take only the latest record.

Example, File1 data:

PK,name,ingestion_time
1,Karikala,2023-05-01
4,Kalyani,2023-05-01
7,Sindhu,2023-05-01

File2 data:

PK,name,ingestion_time
1,Aadhi,2023-06-01

Here, File2 data has the updated record for PK=1. When the query select * from external_table is executed, all four records will be displayed. Thus, Create a view on top of this external table to display only the latest record for each primary key.

Sample Query:

with cte as(
SELECT *,RANK() over (partition  by PK order  by [ingestion_time] desc) as  Rank  FROM [dbo].[external_table])
select PK,name,[ingestion_time] from cte where  rank=1
PK name ingestion_time
1 Aadhi 2023-06-01
2 Kalyani 2023-05-01
3 Sindhu 2023-05-01

This query returns the rows with the highest ingestion_time value for each PK value in the external_table.