Upsert using DuckDB

57 Views Asked by At

Let assume I have two parquet file aka main and updates with some data as shown below. I need to implement upsert kind of operation in duck db.

parquet file main: id, name, city

data:
   id     name     city 
    1      a         p
    2      b         q
    3      c         r
    

parquet file updates: id, name, city

data:
   id     name     city
    1      a         m
    4      b         q

desired output:-

   id     name     city
    1      a         m  <----update city
    2      b         q
    3      c         r
    4      b         q  <----insert

currently I am doing the same using below queries:-

create table main as select * from '/tmp/main.parquet';
create table stage as  select * from '/tmp/updates.parquet';
delete from main using stage where main.id=stage.id;
insert into main select * from stage;
COPY main TO '/tmp/final.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');

but only thing is create table will load and keep all the data in memory what I don't want as the main file may contain 8-10 million records, is there any way where I can achive the same using joins only and avoid creating main and stage tables.

2

There are 2 best solutions below

1
Thijs Bruineman On

DuckDB has upsert support, see the docs: https://duckdb.org/docs/sql/statements/insert#on-conflict-clause

0
Kumar-Sandeep On

Here is the query, one point here is that this will consume less memory and will be fast as compared to creating duck db tables as given in original question but consume more cpu cores.

COPY(
     select
     CASE WHEN a.id is NULL  THEN b.id ELSE a.id END as id,
     CASE WHEN b.id is NOT NULL THEN b.name ELSE a.name END as name,
     CASE WHEN b.id is NOT NULL THEN b.city ELSE a.city END as city
     from (SELECT  * FROM 'main.parquet') a full outer join (SELECT  * 
     FROM 'updates.parquet') b on a.id=b.id
    )
    TO '/tmp/output.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');