Snowflake Bulk Inserts vs. Single Row Inserts

I'm a bit confused about the definition of single-row inserts vs. bulk inserts in Snowflake.

I want to understand what is happening under the hood when executing insert statements in Snowflake, specifically with respect to micropartitions. It's my understanding that bulk inserting new records into a table has the same effect on the micropartitions of that table as a COPY INTO statement. However, when loading the same amount of records into a table via single-row inserts, we increase the number of micropartitions (e.g. first insert creates a single row partition, second insert creates a new 2 row partition, etc.)

My question is: what exactly is a bulk insert?

insert into table1 select * from table2

^^ seems like a bulk insert

insert into table1 values (val1, val2), (val3, val4)

^^ this inserts two rows of data, so is that considered bulk? If not, why not?

insert into table1 values (val1, val2)

^^ pretty sure this is a single row insert

Thanks for any help!

2 Answers

Answer from NickW

“Bulk insert” does not have a technical definition; it is just generally used to mean inserting more than one record in a single statement.

Answer from Simeon Pilgrim

If you have 1 million rows to insert per day, a bulk insert is doing it in one command, once a day.

But if the data is only ready once per hour, 1M/24 rows per command is still a bulk insert.

But if you have 1,000 rows per day to insert, and you do it in 1,000 insert commands, that is not bulk.

The point is that there is no mathematical definition of these operations, but there are "costs" to be paid for doing work, and some of those costs can be shared across bulk operations. Some of those costs are time, some CPU, some disk usage, some fragmentation of micro-partitions.

You might have infinite money and thus not care about any of those things. Or you might not care, but your boss thinks Snowflake "costs too much" because it is being used poorly.

But really, it is hard to answer your implicit question of "what bit of knowledge am I missing?", because when I read that Snowflake was an immutable data store with micro-partitions (etc., etc.), my reaction was: oh, we will want to write in batches as large as we can while still meeting our timeliness needs. And I got on with it.

We then experimented and showed that too many micro-partitions gave poor pruning performance and slowed query compilation, as we had expected; and sure enough, as tooling was added, this came to be "the documented best practice".
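
As a side note, you can see how many micro-partitions a table currently has via SYSTEM$CLUSTERING_INFORMATION; a minimal sketch, where table1 and col1 are just placeholders for your own table and column:

select system$clustering_information('table1', '(col1)');
-- the JSON result includes "total_partition_count", which climbs quickly
-- when rows trickle in through many single-row inserts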

insert into table1 select * from table2

This is as bulk as it gets, and thus is good.

insert into table1 values 
     (val1, val2), 
     (val3, val4);

This is bulk compared to:

insert into table1 values 
     (val1, val2);
insert into table1 values 
     (val3, val4);

But if you are inserting data via INSERT commands, i.e. sending large INSERT text statements to the database, these can be slow for larger numbers of rows. At one point we were inserting 100K's of rows via 10MB text blocks of explicit INSERTs (this was in 2016 when we lifted and shifted, so it was not good practice, but it got us working). Anyway, 100K rows "is bulk", but 10MB SQL text blocks are gross, and the SQL compile time was 3-5 minutes for a command that would complete in 10s.

Swapping to writing all 100K rows into a CSV file and using a single COPY INTO command was bulk AND performant. We went from 20 x 10MB text commands, i.e. 20 x ~180s compile time plus 20 x 10s of inserts, to <1s compile time and ~50s execution: the 20 x 10s of inserts became much faster, all the compile time disappeared, and the number of partitions was lower.
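
A minimal sketch of that CSV + COPY INTO pattern (the file path and table name are placeholders; PUT is run from a client such as SnowSQL):

put file:///tmp/rows.csv @%table1;
copy into table1 from @%table1 file_format = (type = csv);
-- a single COPY INTO over a staged file loads all the rows in one bulk operation,
-- instead of compiling megabytes of INSERT ... VALUES text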

So this is all pointing at: at any point in time, insert the largest block you can. If, as you say, you have 1K inserts over a day, that is new data every 86400/1000 ≈ 86 seconds; if that data is read every 8 seconds, each row will be read about 10 times before the next row arrives, so writing those rows one at a time makes sense. But if your data is only ever read once a day at 1am, why not load the data "just before" the read, rather than every minute? That will save you a lot...
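
As a sketch of "load just before the read": one way (not the only one) is a scheduled task that runs a single bulk load shortly before 1am. The task, warehouse, and stage names below are made up.

create task nightly_load
  warehouse = my_wh
  schedule = 'USING CRON 50 0 * * * UTC'  -- 00:50 UTC, just ahead of the 1am read
as
  copy into table1 from @daily_stage file_format = (type = csv);

alter task nightly_load resume;  -- tasks are created suspended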