Batch insertion in MariaDb ColumnStore taking longer time

My team decided to go with MariaDB ColumnStore for running OLAP queries. Before committing to it, we wanted to test MariaDB ColumnStore with bulk and batch insertions of 100,000 records (1 lakh records).

I created a table with 11 columns using the ColumnStore engine.

I ran a Python script to perform batch insertions; each batch inserts 1000 records. The time taken for each batch, in seconds, is listed below:

[287.0853614807129, 281.05055260658264, 282.64506244659424, 331.4887454509735, 348.7496454715729, 353.62516021728516, 347.6788556575775, 348.5816104412079, 353.4380421638489, 353.4889008998871, 354.2835190296173, 352.46736669540405, 360.3770363330841, 362.3567490577698, 359.73296880722046, 359.29212188720703, 358.81954050064087, 358.2558786869049, 355.0806622505188, 358.75686407089233, 361.61275911331177, 360.9823422431946, 361.2905898094177, 360.9722273349762, 357.3613495826721, 366.31693053245544, 365.2138879299164, 364.80778098106384, 370.3709137439728, 362.18855333328247, 368.99038791656494, 374.2518558502197, 370.6084198951721, 370.33627557754517, 366.5031726360321, 365.6407914161682, 365.10843682289124, 365.73114371299744, 369.5207598209381, 373.7039930820465, 368.9340612888336, 366.8793954849243, 370.7075254917145, 368.6313920021057, 367.10168743133545, 367.0975866317749, 373.3658838272095, 372.6547067165375, 376.8877205848694, 418.06233167648315, 394.1724989414215, 384.1936047077179, 378.3561038970947, 380.23631024360657, 377.93196201324463, 380.34552478790283, 381.915967464447, 384.0738854408264, 383.0759401321411, 380.92330598831177, 390.85334849357605, 391.03555250167847, 388.80859565734863, 392.8234450817108, 389.6291012763977, 384.38167452812195, 388.52447509765625, 394.38368034362793, 392.903005361557, 362.5258505344391, 309.23055624961853, 309.36455821990967, 311.11726665496826, 313.3339145183563, 312.9061908721924, 317.48958563804626, 313.0095570087433, 315.8379123210907, 313.1757471561432, 313.1741600036621, 315.13149428367615, 315.31139969825745, 319.4831624031067, 319.8994839191437, 325.9803538322449, 327.67448115348816, 318.8332529067993, 317.948855638504, 318.19195556640625, 320.73410272598267, 319.8331866264343, 320.14869451522827, 317.2805619239807, 323.0316562652588, 327.16980743408203, 315.70853662490845, 316.0078499317169, 329.8362789154053, 321.79836106300354, 320.2696611881256]

So on average each batch took around 300 seconds, i.e. my script needed roughly 300 seconds to insert 1000 records.

Whereas with a row-oriented MariaDB table, the same script took only about 0.3 seconds per batch on average.

300 seconds per batch insertion feels far too high. Is this behaviour expected, or is it caused by a wrong configuration or a problem with the installation?

Script used for batch insertion:

import pymysql
import csv
from time import time
import sys

import constants

# Connection parameters come from a local constants.py module
conn = pymysql.connect(
    user=constants.db_user,
    password=constants.db_pass,
    host=constants.db_host,
    database=constants.db_name
)

cur = conn.cursor()

with open("../records.csv", "r") as csvfile:
    csvreader = csv.reader(csvfile)
    next(csvreader)

    start = time()
    index = 1
    for row in csvreader:
        # One single-row INSERT per CSV record
        query = '''
                INSERT INTO 
                columnar_batch(id, first_name, last_name, sem, dept, age, weight, height, id_card, state, nationality) 
                VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        '''
        cur.execute(query, row)

        # Commit every 1000 rows (row[0] is the sequential id column from the CSV)
        if int(row[0]) % 1000 == 0:
            conn.commit()
            end = time()
            print("Inserted " + str(index))
            with open("../columnar_results.txt", "a") as txt:
                txt.write("Time taken to insert batch " + str(index) + " records in COLUMN ORIENTED table: " + str(end - start) + "\n")
            start = time()
            index = index + 1


conn.close()

Sample Records

id,first_name,last_name,sem,dept,age,weight,height,id_card,state,nationality
1,Elaine,Creach,8,CI,22,50,6.98,ALV5W58,TN,IN
2,Emma,Craft,1,PS,18,69,5.2,90NIGBP,AP,IN
3,Karen,Race,6,MECH,22,56,6.41,JWKD43H,GA,IN

There are 2 best solutions below

gigaplant (BEST ANSWER)

According to the documentation, transactional row-based inserts like the ones you are doing are slower in ColumnStore.

Here’s what the documentation says:

DML, i.e. INSERT, UPDATE, and DELETE, provide row level changes. ColumnStore is optimized towards bulk modifications and so these operations are slower than they would be in say InnoDB.

In transactional mode DML inserts are performed which will be significantly slower plus it will consume both binlog transaction files and ColumnStore VersionBuffer files.

Bulk DML operations will in general perform better than multiple individual statements.

These statements suggest that your current approach, inserting the data row by row, is the least efficient way to import large amounts of data, particularly from a CSV file.

To run a bulk import it is better to use the cpimport tool, but you can also try other methods such as LOAD DATA INFILE, which is mentioned there as well.
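For illustration, here is a minimal sketch of the LOAD DATA LOCAL INFILE route using the same pymysql setup as the question (the constants module, file path, and table layout are taken from the question; the local_infile flag is an assumption and has to be allowed on the server side as well). cpimport itself is run from the shell on the ColumnStore node, roughly along the lines of cpimport <db_name> columnar_batch records.csv -s ',' (check the options for your ColumnStore version).

import pymysql

import constants  # same constants module as in the question

conn = pymysql.connect(
    user=constants.db_user,
    password=constants.db_pass,
    host=constants.db_host,
    database=constants.db_name,
    local_infile=True  # required for LOAD DATA LOCAL INFILE
)

try:
    with conn.cursor() as cur:
        # One statement loads the whole CSV; IGNORE 1 LINES skips the header row
        cur.execute("""
            LOAD DATA LOCAL INFILE '../records.csv'
            INTO TABLE columnar_batch
            FIELDS TERMINATED BY ','
            IGNORE 1 LINES
        """)
    conn.commit()
finally:
    conn.close()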

Rick James

That's not "batched"; that's one row at a time, just with 1000 single-row statements per transaction. A "batched" INSERT is a single INSERT with 1000 rows in it. It may be 10 times as fast.
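For illustration, a minimal sketch of such a multi-row batch with the question's pymysql setup (the connection, table, and CSV layout are assumed from the question; BATCH_SIZE and insert_batches are just illustrative names):

import csv

BATCH_SIZE = 1000
INSERT_SQL = """
    INSERT INTO columnar_batch
        (id, first_name, last_name, sem, dept, age, weight, height, id_card, state, nationality)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
"""

def insert_batches(conn, csv_path="../records.csv"):
    with open(csv_path, "r") as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip the header row

        batch = []
        with conn.cursor() as cur:
            for row in reader:
                batch.append(row)
                if len(batch) == BATCH_SIZE:
                    # PyMySQL rewrites INSERT ... VALUES in executemany into a
                    # single multi-row INSERT, i.e. one statement with 1000 rows
                    cur.executemany(INSERT_SQL, batch)
                    conn.commit()
                    batch = []
            if batch:  # flush any trailing partial batch
                cur.executemany(INSERT_SQL, batch)
                conn.commit()

Each executemany call here should send one statement carrying up to 1000 rows, which is the kind of batching meant above.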

However, ColumnStore has to do a lot of work to generate the indexes. It is best to provide all the rows before doing any indexing. Inserting a single row probably does "a lot of work" for each row instead of waiting until all 1000 are finished.

300 / 0.3 == 1000. (I'm surprised that it comes out exactly 1000.)