I have a table of ~750 MB (5.5M rows) and want to write it to a single file that I'll use for both reading and writing. I used to use RDS files, but I'd like to move to parquet for cross-language support.
Creating a parquet file with arrow::write_parquet() results in a file of ~50 MB, while creating a dataset with arrow::write_dataset() results in ~600 MB on disk.
I'm using the default compression ("snappy") in both cases.
Why is the dataset approach so much larger? I would expect a difference in size but not this much.
This is the code I used:
# Single parquet file (~50 MB)
arrow::write_parquet(
  dta,
  file.path(outpath, "data.parquet"),
  compression = "snappy"
)

# Dataset directory (~600 MB)
arrow::write_dataset(
  dta,
  file.path(outpath, "dataset"),
  hive_style = FALSE,
  compression = "snappy"
)
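
For what it's worth, a quick way to reproduce the size comparison (assuming outpath as in the calls above; this only uses base R, not arrow):

# Size of the single parquet file, in MB
single_mb <- file.size(file.path(outpath, "data.parquet")) / 1024^2

# Total size of every file written into the dataset directory, in MB
dataset_files <- list.files(
  file.path(outpath, "dataset"),
  recursive = TRUE,
  full.names = TRUE
)
dataset_mb <- sum(file.size(dataset_files)) / 1024^2

c(single_file = single_mb, dataset = dataset_mb)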