[EDIT: Please see the answer below to understand the error, and this SO question if you need to open the .parquet file]
My institution is slowly transitioning from SAS to R; most of the code is written with arrow/dplyr or data.table, using the .parquet format as its main storage format. In my own work I usually store and analyse data of 1 to 10 million rows and up to 150-200 columns. The parquet format is great for this kind of usage, but an unusual error has been occurring recently, and I couldn't find any resource about it on the internet:
library(arrow)
library(tidyverse)
open_dataset(data_error)
Error in `open_dataset()`:
! IOError: Error creating dataset. Could not read schema from 'path/example.parquet'.
Is this a 'parquet' file?:
Could not open Parquet input source 'path/example.parquet':
Couldn't deserialize thrift: TProtocolException: Exceeded size limit
The same would happen with the function read_parquet.
What is data_error?
data_error is just a typical data.frame, extracted from a bigger data source (let's call it data_clean) through a few data.table operations and saved unpartitioned with write_parquet. Please note that this error does not occur if the parquet file is partitioned.
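For reference, here is a rough sketch of the two write paths; the partitioning column "year" is purely illustrative, my real data uses other variables.
library(arrow)
# Unpartitioned: a single file (this is the one that later fails to open)
write_parquet(data_error, "path/example.parquet")
# Partitioned: a directory of smaller files (this one always opens fine)
write_dataset(data_error, "path/example_partitioned", partitioning = "year")
open_dataset("path/example_partitioned")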
This error first happened in a data.table program that I didn't write, and I'm not familiar enough with data.table to understand the underlying issue.
reprex:
library(arrow)
library(data.table)
# Seed
set.seed(1L)
# Big enough data.table
dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))
# Save in parquet format
write_parquet(dt, "example_ok.parquet")
# Readable
dt_ok <- open_dataset("example_ok.parquet")
# Simple filter
dt[x == 989L]
# Save in parquet format
write_parquet(dt, "example_error.parquet")
# Error
dt_error <- open_dataset("example_error.parquet")
Thank you all for your help!
The culprit is that once you call dt[x == 989L], an index is created in the data.table. Notice the addition of the index attribute.
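Something along these lines, re-running the reprex and checking the attributes around the filter step (a sketch; exact output will differ):
names(attributes(dt))
# Before the filter: "names", "row.names", "class", ".internal.selfref"
dt[x == 989L]              # data.table's auto-indexing builds a secondary index on x
names(attributes(dt))
# After the filter: the same names plus "index"
str(attr(dt, "index"))
# The index carries an ordering vector with roughly one integer per row (~10 million values)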
The default action of arrow is to store attributes; one nice side-effect of this is that dt_ok will actually be of class data.table. The file size is also adversely affected (not sure if you are aware of this).
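Something like this (a sketch; exact sizes depend on the arrow version and compression settings):
# The stored attributes are restored on read, class included
class(read_parquet("example_ok.parquet"))
# Should include "data.table"
# Compare the two files on disk
file.size("example_ok.parquet")
file.size("example_error.parquet")
# example_error.parquet comes out noticeably larger than example_ok.parquet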
Clearly the _error file has something more. The normal efficiency of binary data storage in parquet files is not afforded to R attributes, so it makes sense that 10Mi values in a vector stored less efficiently would take up that space.
If we remove the index, the problem goes away. One way to remove the index is to manually set the order.
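For example (a sketch; either call should clear the index before writing):
setorder(dt, x)            # reordering the rows drops the secondary index
# or, without touching the row order:
setindex(dt, NULL)         # explicitly removes all secondary indices
names(attributes(dt))      # "index" should be gone now
write_parquet(dt, "example_fixed.parquet")
open_dataset("example_fixed.parquet")   # opens without the Thrift error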
My immediate thought is that this is a bug, perhaps due to the size of the attribute. For demonstration, if we instead repeat this with 100 rows, we have no problem (a small sketch is shown below). I suggest (request, even) that you submit a bug report.
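A sketch of that small-scale check (same steps, 100 rows):
dt_small <- data.table(x = sample(1e5L, 100L, TRUE), y = runif(100L))
dt_small[x == 989L]        # the index attribute should be created here as well
write_parquet(dt_small, "example_small.parquet")
open_dataset("example_small.parquet")   # no error at this size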