I am collecting live data from a website which populates a data frame in R. The rows could have the same unique id's, or new rows could be introduced. I want to send the dynamic data frame to a MariaDB database table, where the rows with existing unique id's update the columns that I have specified, rows without existing unique id's get inserted in the table as new rows. I can get this to work with the MariaDB INSERT ON DUPLICATE KEY UPDATE statement, and a function that generates the needed values from the dynamic data frame.
MWE:
install.packages("odbc")
insall.packages("RMariaDB")
library(odbc)
library(RMariaDB)
con <- dbConnect(RMariaDB::MariaDB(), host = Sys.getenv('MARIADB_DBHOST'),
port = Sys.getenv('MARIADB_DBPORT'), user = Sys.getenv('MARIADB_DBUSER'),
password = Sys.getenv('MARIADB_DBPW'), db = Sys.getenv('MARIADB_DBNAME'),
timeout = 5)
# Database table for mwe to work.
db_live <- data.frame(id = c(12, 22, 32), car_name = c("rolls royce","nissan","mercedes benz"), km = c(123,100,150), temp = c(78,60,80))
# Get table from database, id column is unique index.
db_live <- dbReadTable(con, "db_live")
print(db_live)
id car_name km temp
1 12 rols royce 123 78
2 22 nissan 100 60
3 32 mercedes benz 150 80
# Build dynamic dataframe
df_live <- data.frame(id = c(12, 22, 32, 42),
car_name = c("rolls royce","nissan","mercedes benz", "aston martin"),
km = c(140,120,200,40), temp = c(81,65,85,50))
print(df_live)
id car_name km temp
1 12 rols royce 140 81
2 22 nissan 120 65
3 32 mercedes benz 200 85
4 42 aston martin 40 50
# create function that generates a string with values for dbSendQuery.
gen_insert_values <- function(df) {
for(i in 1:nrow(df)) {
row_string <- paste(df[i,1], paste0("'",df[i,2],"'"), df[i,3], df[i,4],
collapse = ", ")
if(exists("df_string")) {
df_string <- paste0(df_string,", ",paste0("(",row_string,")"))
} else {
df_string <- paste0("(",row_string,")")
}
}
df_string
}
values <- gen_insert_values(df_live)
print(values)
"(12 'rolls royce' 140 81), (22 'nissan' 120 65), (32 'mercedes benz' 200 85), (42 'aston martin' 40 50)"
# Send query.
res <- dbSendQuery(con, paste0("INSERT INTO db_live (id,car_name,km,temp) VALUES ", values," ON DUPLICATE KEY UPDATE km = VALUES(km), temp = VALUES(temp);"))
dbClearResult(res)
#Check db table after sent query.
new_db_live <- dbReadTable(con, "db_live")
print(new_db_live)
id car_name km temp
1 12 rolls royce 140 81
2 22 nissan 120 65
3 32 mercedes benz 200 85
4 42 aston martin 40 50
This does not seem very efficient, as I have to change the query and the function in case I want to update more columns, and I include a for loop in my function which can cause the script to be slow.
Is there a more efficient way to solve this problem?
Here's a method that is likely more efficient: the use of a temporary table instead of manually encoding the data as a string of
(a,b,c),(a,b,c)data sets.For the sake of complete-demonstration, I've modified the
df_livedata slightly so that we have one row with no change, one row with updated data, and one row that is new. This process also works unaltered with your originaldf_live, I just wanted to highlight the three modes.Technically, though, the "no change" row does update the database, but it is not obvious. If the table has a "lastmodified" field that updates with the current timestamp when something in the row is updated, then you can see a little more of what is happening.
In fact, I'll add (just for demonstration) two fields:
createdandmodified, which show when the row was first created and when the last update occurred. These are not required for normal UPSERTs.Setup
This section should not be necessary, unless you don't have a primary key(s) on the table (in which case, add one).
I'll name the main table
"mydata", and upload thedb_livedataset into it. I believe (without extensive testing) that MariaDB requires UPSERTs to find the duplicate or conflict rows based on pre-existing keys. This means we'll need to set a (primary) key; I'll assume your table already has this (and show how I do it with manually-uploaded data).If you don't have a
primary keyon the main tablemydata, then the "UPSERT" operation will simply insert (add) all rows without updating. I don't know if there's a way to trick mariadb into faking keys for the purpose of correctly triggering your intended "update if exists" logic.UPSERT
We'll use a temporary table so that the data to be upserted will not persist; this is good for several reasons, and if you do it correctly then your DBAs will thank you :-)
(If you aren't familiar with temp tables ... they are not visible to other users on the db, often not visible to different connections for the same user, and will be dropped when the connection is closed.)
If you note, even though the values for
"nissan"were not different, the row was still allegedly updated as evidenced by themodifiedtimestamp. The "change" row we had,"rolls royce", shows the appropriatemodifiedtime. Themercedes benzwas uploaded the first time and not updated, and theaston martinwas updated the second time, so itscreatedtime is different than the others.Reproduction
I did this with the
mariadb:latestdocker image. These steps below are purely for demonstration, and are not provided as the canonical way to manage a database (for security or performance). Yes I'm connecting to the"mysql"database, which is not where user data should be going, I believe ... it was hasty, please forgive me.In R, connecting is straight-forward: