I am trying to work with a 200GB CSV in R. I'm exploring the arrow package and have been able to use the open_dataset() function to point to the file:
library(arrow)
library(dplyr)

arrow_data <- open_dataset(
  sources = "large.csv",
  format = "csv",
  schema = schema(
    col1 = string(),
    col2 = string(),
    col3 = string(),
    col4 = string()
  )
)
I would like to write the data out as a Parquet dataset for more efficient analysis. The code below looks like what I need: write_dataset() should create a folder for each value in the group_var column.
arrow_data %>%
  group_by(group_var) %>%
  write_dataset(path = pq_path, format = "parquet")
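If I understand write_dataset() correctly, grouping first should produce Hive-style partition folders under pq_path, something like this (the values A and B are just illustrative):

pq_path/
  group_var=A/part-0.parquet
  group_var=B/part-0.parquet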
However, when I run this, it produces a few folders and then fails with the error below.
Error: Invalid: In CSV column #3: Row #12250498: CSV conversion error to string: invalid UTF8 data
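In case it helps, I believe the bytes around the failing row can be pulled out without decoding them, e.g. with readr's read_lines_raw(); the skip value below assumes one CSV record per physical line, and may be off by one depending on whether the header row counts:

library(readr)

# Grab the reported row plus a neighbour on each side as raw vectors,
# so the invalid UTF-8 can't trip up the read itself; sub = "byte"
# prints any non-UTF-8 bytes as hex escapes like <e9>.
rows <- read_lines_raw("large.csv", skip = 12250497, n_max = 3)
sapply(rows, function(r) iconv(rawToChar(r), "UTF-8", "UTF-8", sub = "byte"))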
Online suggestions are to pre-clean the CSV; however, doing that seems difficult given the file's size and the fact that I can't read it into memory.
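The only pre-cleaning idea I've come up with so far is a chunked pass in base R that strips invalid bytes with iconv() and streams the result to a new file (large_clean.csv is a placeholder name, and I haven't tested how long this would take at 200GB):

# Stream the CSV through iconv() in chunks of 1e6 lines so nothing
# close to 200GB ever has to fit in memory; invalid UTF-8 bytes are
# simply dropped (sub = "").
infile  <- file("large.csv", open = "r")
outfile <- file("large_clean.csv", open = "w")
while (length(chunk <- readLines(infile, n = 1e6, warn = FALSE)) > 0) {
  writeLines(iconv(chunk, from = "UTF-8", to = "UTF-8", sub = ""), outfile)
}
close(infile)
close(outfile)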
Can anyone offer some guidance on how I might be able to proceed?