R arrow package write_parquet VS write_dataset size


I have a table of ~750 MB (5.5M rows) and want to write it to a file that I can later use for reading and writing. I used to use RDS files, but would like to move to Parquet for cross-language support.

Creating a Parquet file with arrow::write_parquet() results in a file of ~50 MB, while creating a dataset with arrow::write_dataset() results in a file of ~600 MB.

I'm using the default compression "snappy" in both cases.

Why is the dataset approach so much larger? I would expect some difference in size, but not this much.

This is the code I used:

# Single Parquet file (~50 MB on disk)
arrow::write_parquet(
  dta,
  file.path(outpath, "data.parquet"),
  compression = "snappy"
)

# Multi-file dataset directory (~600 MB on disk); compression is
# passed through to the Parquet writer via ...
arrow::write_dataset(
  dta,
  file.path(outpath, "dataset"),
  hive_style = FALSE,
  compression = "snappy"
)
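
For reference, here is a quick way to compare the two outputs on disk and to inspect how each file was split into row groups (a minimal sketch in base R plus arrow; it assumes the same outpath and output names as in the code above, and that the dataset directory contains at least one Parquet fragment; differing row-group sizes are one possible factor in how well dictionary encoding and snappy compression work, not a confirmed explanation):

# Total size on disk, in MB
single_mb  <- file.size(file.path(outpath, "data.parquet")) / 1e6
ds_files   <- list.files(file.path(outpath, "dataset"),
                         recursive = TRUE, full.names = TRUE)
dataset_mb <- sum(file.size(ds_files)) / 1e6
c(single_file = single_mb, dataset = dataset_mb)

# Row-group counts for the single file and the first dataset fragment
arrow::ParquetFileReader$create(
  file.path(outpath, "data.parquet")
)$num_row_groups
arrow::ParquetFileReader$create(ds_files[1])$num_row_groups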