I have a table of ~750 MB (5.5M rows) and want to write it to a single file that I'll use for both reading and writing. I used to use RDS files, but I'd like to move to parquet for cross-language support.
Creating a parquet file with arrow::write_parquet() results in a file of ~50 MB, while creating a dataset with arrow::write_dataset() results in ~600 MB on disk.
I'm using the default compression ("snappy") in both cases.
Why is the dataset approach so much larger? I would expect a difference in size but not this much.
This is the code I used:
# Single parquet file (~50 MB)
arrow::write_parquet(
  dta,
  file.path(outpath, "data.parquet"),
  compression = "snappy"
)

# Dataset directory (~600 MB)
arrow::write_dataset(
  dta,
  file.path(outpath, "dataset"),
  hive_style = FALSE,
  compression = "snappy"
)
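
For what it's worth, a quick way to reproduce the size comparison (assuming outpath as in the calls above; this only uses base R, not arrow):

# Size of the single parquet file, in MB
single_mb <- file.size(file.path(outpath, "data.parquet")) / 1024^2

# Total size of every file written into the dataset directory, in MB
dataset_files <- list.files(
  file.path(outpath, "dataset"),
  recursive = TRUE,
  full.names = TRUE
)
dataset_mb <- sum(file.size(dataset_files)) / 1024^2

c(single_file = single_mb, dataset = dataset_mb)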