Creating a Parquet Folder from a very large CSV with R

I am trying to work with a 200 GB CSV using R. I'm exploring the arrow package and have been able to use the open_dataset() function to point to the file.

library(arrow)
library(dplyr)

arrow_data <- open_dataset(
  sources = "large.csv",
  format = "csv",
  schema = schema(
    col1 = string(),
    col2 = string(),
    col3 = string(),
    col4 = string()
  )
)

I would like to write the data out to a Parquet folder for more efficient analysis. The code below looks like what I need: write_dataset() should create a folder for each value in the group_var column.

arrow_data %>% 
  group_by(group_var) %>% 
  write_dataset(path = pq_path, format = "parquet")

However, when I run this it produces a few folders and then fails with the error below.

Error: Invalid: In CSV column #3: Row #12250498: CSV conversion error to string: invalid UTF8 data

Suggestions online are to pre-clean the CSV, but that seems difficult given the file's size and the fact that I can't read it efficiently.

Can anyone offer some guidance on how I might be able to proceed?
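
One possible way to do the pre-cleaning mentioned above without loading the whole file is to stream it through base R in chunks and strip the bytes that are not valid UTF-8. The sketch below is only an illustration under some assumptions: clean_utf8_csv and chunk_lines are hypothetical names, it assumes that silently dropping the invalid bytes is acceptable for the analysis, and it relies on the platform's iconv rejecting invalid sequences in a UTF-8 to UTF-8 conversion (which is the usual behaviour, but worth checking on a small sample first).

```r
# Hypothetical helper (not part of arrow): copy a CSV to a new file in chunks,
# removing any bytes that are not valid UTF-8. Only chunk_lines lines are held
# in memory at a time.
clean_utf8_csv <- function(in_path, out_path, chunk_lines = 1e6) {
  con_in  <- file(in_path, open = "r")
  con_out <- file(out_path, open = "w")
  on.exit({ close(con_in); close(con_out) }, add = TRUE)

  repeat {
    lines <- readLines(con_in, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0) break
    # sub = "" replaces every non-convertible byte with nothing (drops it);
    # sub = "byte" would keep it visible as a hex escape such as <ff> instead.
    cleaned <- iconv(lines, from = "UTF-8", to = "UTF-8", sub = "")
    # useBytes = TRUE writes the cleaned bytes as-is, without re-encoding.
    writeLines(cleaned, con_out, useBytes = TRUE)
  }
  invisible(out_path)
}

# Tiny demonstration on a temporary file containing one invalid byte (0xff).
tmp_in  <- tempfile(fileext = ".csv")
tmp_out <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("col1,col2\nab"), as.raw(0xff), charToRaw("c,d\n")),
  tmp_in
)
clean_utf8_csv(tmp_in, tmp_out, chunk_lines = 1)
result <- readLines(tmp_out)
print(result)
```

If this works on a sample, the same call could be run once on large.csv to produce a cleaned copy, and open_dataset() pointed at that copy before calling write_dataset(). A single pass over 200 GB will be slow and needs room on disk for the second file, so it may be worth testing the chunk size on the first few million rows.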
