I have multiple huge CSV files that I have to export to Apache Parquet format and split into smaller files based on multiple criteria/keys (= column values). As I understand it, arrow is the R package that allows working with Apache Parquet files.
I work in a shared lab environment and, given the limited RAM (compared to the number of users who work simultaneously in this same environment), we are advised to create our dataframes in local SQLite databases rather than importing them in-memory (into RAM).
The following pseudo-code shows how I import my CSV files into my local SQLite database. In the following code I use the sqldf and tidyverse packages.
library(sqldf)
library(tidyverse)

input_file_path <- "D:/tmp/mydata.csv"
db_file_path <- "D:/tmp/db_tmp_sqlite.db"

## start from a fresh database file and attach it
unlink(db_file_path)
sqldf(str_c("attach '", db_file_path, "' as new"))

## read the CSV straight into a table of the SQLite database,
## without materializing it as a data.frame in R
read.csv.sql(
  file = input_file_path,
  sql = "
    create table mytable as
    select
      . . .
    from
      file
  ",
  `field.types` = list(
    . . .
  ),
  ##
  header = TRUE,
  sep = ",",
  eol = "\n",
  dbname = db_file_path,
  drv = "SQLite"
)
This works as expected: my table is created and I can run all required SQL queries, in particular adding supplementary variables (columns in my tables) which will later be used as keys to export my tables to Apache Parquet format. However, based on the Apache Arrow for R Cheatsheet, the function write_dataset that allows exporting my data to Apache Parquet format requires a dataframe.
And that is precisely my problem, because a dataframe in R is in-memory whereas my data, as I explained earlier, are in a local SQLite database. This means that I would first have to do a SELECT to load the whole data into RAM, something like
df <- sqldf("select * from mytable", dbname = ...)
Only then would I be able to use write_dataset with the created df dataframe as its first argument in order to export and split my data to Apache Parquet format. But this is not what I want to do. The whole point was to put the data in SQLite and not in-memory (RAM), given the existing resource limitations (lack of memory) in our shared environment.
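For illustration, the in-memory export I am trying to avoid would look roughly like this (the output directory and the partitioning key are placeholders):

library(arrow)

## the approach I want to avoid: `df` already holds the whole table in RAM
write_dataset(
  dataset      = df,
  path         = "D:/tmp/parquet_out",   # placeholder output directory
  format       = "parquet",
  partitioning = c("my_key")             # placeholder key column(s) used to split the files
)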
Is there any way to convert to Apache Parquet directly from SQLite within an R program, without first putting the whole data in a dataframe before the export, or am I trying to do something which is simply not possible?
DuckDB has several great features, including the ability to both import and export CSV and parquet formats natively without affecting R memory.
TL;DR
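A minimal sketch of the idea, with placeholder file names (adjust the SELECT, paths, and options to your data):

library(duckdb)   # attaches DBI as well

con <- dbConnect(duckdb())
## read the CSV and write parquet entirely inside duckdb;
## the rows are streamed through duckdb, never materialized in R
dbExecute(con, "
  COPY (SELECT * FROM read_csv_auto('quux.csv'))
  TO 'quux.parquet' (FORMAT PARQUET)
")
dbDisconnect(con, shutdown = TRUE)

If you also need the split-by-key output that arrow::write_dataset produces, DuckDB's COPY ... TO accepts a PARTITION_BY option for hive-style partitioned parquet output, which should cover that requirement.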
And that is all. The data is never imported into R. (Now, whether duckdb can do it itself without exhausting memory is another issue I did not validate locally ...)
Caveat emptor: before you trust this blindly, however, I strongly urge you to do some validation of classes. Most of it can be done easily in a "lazy" fashion using duckdb without having to load the whole frame into R. I encourage you to read more of its documentation for querying CSV/parquet files natively (without loading into R).

Methodology
To make a comparison of the two methods (via a data.frame, which you don't want to do, and via duckdb), we'll use "RSS" (from ps::ps_memory_info()) to indicate the current R process memory usage; see ?ps::ps_memory_info for what that field measures. Though an imperfect measure of the true impact on R, it does indicate a significantly smaller impact on R when using DuckDB.
Also, each method is done in a fresh instance of R --vanilla. No .Rprofile or site-init files are loaded. The code you see is the code that is executed, nothing more.

In R via data.frame
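A sketch of how this measurement can be reproduced (quux.csv and the output path are placeholders; the exact calls in the original run may have differed):

library(ps)
library(arrow)

p <- ps_handle()
ps_memory_info(p)[["rss"]]             # RSS before the import

df <- read.csv("quux.csv")             # load the whole CSV into RAM as a data.frame
write_dataset(df, "quux_df", format = "parquet")

ps_memory_info(p)[["rss"]]             # RSS after; the difference is the memory cost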
This indicates R is 1490MB larger after reading in the full data. (FYI, data.table::fread instead of read.csv results in only 408MB of memory gain, under the same austere conditions. I'm not trying to optimize this part, though :-)

(FYI, these numbers vary for me from run to run and are likely to be different based on other factors outside the scope of this answer. My laptop has 64GB of RAM, so it might not be comparable to exactly what you see.)
DuckDB, read from CSV, write to parquet
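The same measurement for the duckdb route, again a sketch with placeholder paths:

library(duckdb)   # attaches DBI as well
library(ps)

p <- ps_handle()
ps_memory_info(p)[["rss"]]             # RSS before

con <- dbConnect(duckdb())
dbExecute(con, "
  COPY (SELECT * FROM read_csv_auto('quux.csv'))
  TO 'quux_duckdb.parquet' (FORMAT PARQUET)
")
dbDisconnect(con, shutdown = TRUE)

ps_memory_info(p)[["rss"]]             # RSS after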
showing an increase of only 23MB in this process.
Comparing the resulting files:
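A quick way to do that, using the placeholder paths from the sketches above:

## on-disk sizes of the two outputs
sum(file.info(list.files("quux_df", full.names = TRUE))$size)   # arrow::write_dataset output directory
file.info("quux_duckdb.parquet")$size                           # duckdb output file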
The larger file is due to the differences in class noted below. My guess is that if we force some of the character columns to be logical, then its file-size might be reduced.

A little more in-depth look at the contents:
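One way to take that closer look lazily, without reading either file fully into R (again with the placeholder paths):

library(arrow)

## open both outputs lazily and compare the column types arrow reports
ds1 <- open_dataset("quux_df")               # from the data.frame route
ds2 <- open_dataset("quux_duckdb.parquet")   # from the duckdb route
ds1$schema
ds2$schema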
Some interesting things to deduce from this:
- The columns that are logical in ds1 and character in ds2 are all null (sorry, it was the data I had), so there was nothing to infer a type from and duckdb used character for them; the fact that they are different classes indicates that duckdb defaults to string-like nulls instead of "bit", which may or may not be a factor for you.
- numeric-vs-integer: V11 was truly integer, it's fine; the second one, V42, shows that the heuristic used for differentiating between numeric and integer missed something. The first row of V42 that contained any fractional component was on row 37159.

Fixing data discrepancies
Column V42 indicates that we need to be very cognizant of what is going in and out of that parquet generator. My guess is that it's in the "CSV Import" step, so looking at CSV Loading suggests the need to change the SAMPLE_SIZE. While relatively inefficient, I'll use -1, indicating that it needs to look at all values in a column to determine its class. Slower, yes, but also safer.

Validation of this assumption:
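A sketch of the rerun with the exhaustive sample size (same placeholder paths):

library(duckdb)

con <- dbConnect(duckdb())
## force duckdb to scan every row when inferring column types
dbExecute(con, "
  COPY (SELECT * FROM read_csv_auto('quux.csv', SAMPLE_SIZE=-1))
  TO 'quux_duckdb.parquet' (FORMAT PARQUET)
")
dbDisconnect(con, shutdown = TRUE)

## re-inspect the inferred schema
arrow::open_dataset("quux_duckdb.parquet")$schema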
Indeed, V11 is still good, and V42 changes from int to num.

After rerunning with this new parameter, offline validation confirmed that all values are correct.