How to handle data with no space between separators when using fread in R


I am reading a large .txt file (>1GB) into R via fread. I read the file directly from a .zip archive by passing a shell command to fread:

base = fread('unzip -p Folder.zip File.txt', sep = '|', header = FALSE, 
stringsAsFactors = FALSE, na.strings="", quote = "", col.names = col_namesMain)

The text file separates entries via | so that a typical line might look like:

RRX|||02020||333293||||12123

However, there are many places where empty entries are denoted by separators with no space between them, e.g. || in the example line above.

When using fread, these adjacent separators are typically merged into the neighboring fields, so that the above line returns the following entries:

RRX, ||02020|, 333293|||, 12123

when it should read in as:

RRX, NA, NA, 02020, NA, 333293, NA, NA, NA, 12123

I have tried using read.table with the option skipNul = TRUE, and this works perfectly. However, there doesn't seem to be any option similar to skipNul for fread. I would much prefer to use fread over read.table if possible, since I have several very large files. Despite my searching, I haven't come across much discussion of this problem. Any help much appreciated.

Best answer

"I have tried using read.table with the option skipNul = TRUE, and this works perfectly. However, there doesn't seem to be any option similar to skipNul for fread."

This was fixed in the development version of data.table (1.12.3) on 15 Apr 2019 (see NEWS):

  1. fread() now skips embedded NUL (\0), #3400. Thanks to Marcus Davy for reporting with examples, and Roy Storey for the initial PR.
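
To confirm that embedded NULs are what is breaking the parse, and that the installed data.table is recent enough, something like the following base R sketch may help. Note that count_nul is a hypothetical helper (not part of data.table), the temporary file only stands in for File.txt, and the commented-out lines reuse the file and column names from the question as assumptions.

```r
# Hypothetical helper: count NUL (\0) bytes in the first `n` bytes
# of a file path or binary connection.
count_nul <- function(con, n = 1e6) {
  bytes <- readBin(con, what = "raw", n = n)
  sum(bytes == as.raw(0))
}

# Self-contained demo: a pipe-delimited line with NULs between separators.
tmp <- tempfile(fileext = ".txt")
demo_bytes <- c(charToRaw("RRX|"), as.raw(0),
                charToRaw("|"), as.raw(0),
                charToRaw("|02020\n"))
writeBin(demo_bytes, tmp)
nul_count <- count_nul(tmp)  # number of NUL bytes found in the demo file

# For the real data (names taken from the question), a sample could be
# checked straight from the archive:
# con <- unz("Folder.zip", "File.txt", open = "rb")
# count_nul(con)
# close(con)

# Check whether the installed data.table is at least the version
# named in the NEWS entry above.
if (requireNamespace("data.table", quietly = TRUE)) {
  has_fix <- utils::packageVersion("data.table") >= "1.12.3"
  message("data.table includes the NUL fix: ", has_fix)
  # With a recent enough version, the original call should then work:
  # base <- data.table::fread(cmd = "unzip -p Folder.zip File.txt", sep = "|",
  #                           header = FALSE, na.strings = "", quote = "",
  #                           col.names = col_namesMain)
}
```

If count_nul returns 0 on the real file, the NULs are probably not the cause and the separators themselves are worth a closer look.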