Azure Synapse > Serverless SQL > UPSERT parquet files is possible?

Question

Azure Synapse > Serverless SQL > UPSERT parquet files is possible?

1k Views Asked by Jean-Christophe At 25 August 2023 at 15:24

I'm using an Azure Synapse Serverless SQL database.

I have a first copy activity that uses Azure SQL tables as sources and an Azure storage gen2, where I store .parquet files as sink. From these .parquet files, I use CETAS to create External tables in my Serverless SQL DB (for the context: I'm working with several Azure SQL databases, so these external tables will allow me to write cross db queries). In other words, this Serverless SQL DB is my ODS database.

I have then a second copy activity that identifies incremental changes in the source tables (using the SYS_CHANGE_VERSION of the corresponding CHANGETABLE of the source DB). This second copy activity also outputs .parquet files.

In the end, I have 2 parquet files: 1 with the full content of the source table + 1 with the content to insert or update. External tables in a Serverless SQL DB are only metadata, so no way to do DML operations on them, so my question would be: is there a way to "merge" my 2 parquet files into 1 single one (without duplicates of course) that I could use to recreate an updated external table?

Alternatively, I see that I can choose in the copy activity sink the copy method "Upsert" and provide a KeyColumn (PK of my table), but it doesn't work, saying that: "Message=INSERT operation is not allowed for this type of table." (which looks normal as the associated sink dataset is pointing on my external table, that is read only)

copy activity:

Any idea on how to solve this? thanks!

Original Q&A

There are 1 best solutions below

**Aswin** · Accepted Answer · 2023-08-26T07:05:08.747000

Since external tables in Azure Synapse Serverless SQL database are read-only, you cannot use the Upsert copy method to update the external table directly.

If the full load file and incremental files are in the same folder in ADLS, then while creating the external tables you can give ** in place of filename.

Sample external table script:

CREATE  EXTERNAL  TABLE [dbo].[external_table] (
[PK] nvarchar(4000),
[name] nvarchar(4000),
[ingestion_time] nvarchar(4000)
)
WITH (
LOCATION = '<folder-name>/**',
DATA_SOURCE = <datasource-name>,
FILE_FORMAT = <fileformatname>
)

This script will make sure to combine all the data under that folder.

When the old data loaded in the initial version got replaced with new data, It is necessary to take only the latest record.

Example, File1 data:

PK,name,ingestion_time
1,Karikala,2023-05-01
4,Kalyani,2023-05-01
7,Sindhu,2023-05-01

File2 data:

PK,name,ingestion_time
1,Aadhi,2023-06-01

Here, File2 data has the updated record for PK=1. When the query select * from external_table is executed, all four records will be displayed. Thus, Create a view on top of this external table to display only the latest record for each primary key.

Sample Query:

with cte as(
SELECT *,RANK() over (partition  by PK order  by [ingestion_time] desc) as  Rank  FROM [dbo].[external_table])
select PK,name,[ingestion_time] from cte where  rank=1

PK	name	ingestion_time
1	Aadhi	2023-06-01
2	Kalyani	2023-05-01
3	Sindhu	2023-05-01

This query returns the rows with the highest ingestion_time value for each PK value in the external_table.

Azure Synapse > Serverless SQL > UPSERT parquet files is possible?

There are 1 best solutions below

Related Questions in SERVERLESS

Related Questions in AZURE-SYNAPSE

Related Questions in EXTERNAL-TABLES

Related Questions in INCREMENTAL

Trending Questions

Popular # Hahtags

Popular Questions