Parsing csv in R with internal quotes

77 Views Asked by At

I'm trying to parse a csv file in R, ideally using read_csv(). Here is a sample of my data where the error occurs:

"ScheduleDId","ReportId","CommitteeContactId","FirstName","MiddleName","LastOrCompanyName","Prefix","Suffix","AddressLine1","AddressLine2","City","StateCode","ZipCode","IsIndividual","TransactionDate","Amount","AuthorizingName","ItemOrService","ScheduleId","ReportUID"
"2543592","168976","568106","","","MacGyver Group Solution Consultants","","","6113 Sunlight Mountain Road","","Spotsylvania","VA","22553","False","06/03/2019","1250.00","Raymond A. Bell, Jr.","Consulting Fees 10 hours @ $125 per hour","","{0A439685-464F-29DC-879D-9D9793AA2E5F}"
"2543593","168976","568106","","","MacGyver Group Solution Consultants","","","6113 Sunlight Mountain Road","","Spotsylvania","VA","22553","False","06/10/2019","750.00","Raymond A. Bell, Jr.","Printed Material - 2k 4/4 gloss rack cards - $345 Yard Signs - 24 18x24" 1/0 coroplast yard signs/stakes/delivery - $405","","{0A439685-464F-29DC-879D-9D9793AA2E5F}"
"2543594","168976","572005","","","Freelancer Limited - Freelancer.CO.UK","","","680 George Street","","Sidney, New South Wales - Australia","N/A","20000","False","06/07/2019","204.90","Raymond A. Bell, Jr.","Website design","","{0A439685-464F-29DC-879D-9D9793AA2E5F}"

I should get three rows, but R returns two. I think the double quote denoting inches in one of the strings is causing the problem as it disappears from the output, but it throws off the results. See the last columns below:

# A tibble: 2 × 7
  IsIndividual TransactionDate Amount  AuthorizingName      ItemOrService                                                                                      ScheduleId ReportUID
  <chr>        <chr>           <chr>   <chr>                <chr>                                                                                              <chr>      <chr>    
1 False        06/03/2019      1250.00 Raymond A. Bell, Jr. "Consulting Fees 10 hours @ $125 per hour"                                                          NA        {0A43968…
2 False        06/10/2019      750.00  Raymond A. Bell, Jr. "Printed Material - 2k 4/4 gloss rack cards - $345 Yard Signs - 24 18x24 1/0 coroplast yard signs… "New Sout… N/A      

Google Sheets doesn't have a problem parsing the data, so I assume I should be able to in R. I tried fread() without luck. I've also tried read_delim(data, delim = ",", escape_double = FALSE)

This occurs through the file, so I can't just go in manually to remove the quote. There are over 27K rows, and I want to import multiple files. You can download the one I'm referencing here: https://apps.elections.virginia.gov/SBE_CSV/CF/2019_07/ScheduleD.csv

UPDATE

Robert's answer below go me closer — I was able to read in several more lines without issue, but there were still a few problem rows. Here's the code I used...

parse_csv <- function(x){
  dat <- readLines(x)
  dat <- gsub('(?<!",|^)"(?!,"|$)', '\\\\"', dat, perl=T)
  dat <- read_delim(I(dat), escape_backslash=T)
}


test <- parse_csv("test.csv")

problems(test)

and the problems...

# A tibble: 2 × 5
    row   col expected   actual                                                                                   
  <int> <int> <chr>      <chr>                                                                                    
1 27349    18 20 columns 18 columns 
2 27350     5 20 columns 5 columns  

I opened the csv file in a text editor and went to the problem rows, and I found that the end of the csv row was on a separate line.

enter image description here

There were dozens of rows like this in the file — and once I cleaned them up, the custom function worked perfectly.

But I'm processing more than 200 files, so I can't go through each file manually to ensure that each row is on one line. The line break appears to always be in the ItemOrService column. Is there a way to programmatically clean this up?

Here's a link to another csv with rows on two lines. The first break is at line 89.

https://gist.github.com/jrcloutier/cc48a230d99a983f40c94f7800e69952

1

There are 1 best solutions below

2
Robert Hacken On

With a bit of luck, the following will work. You can use gsub to escape all quotes which are likely not the external quotes. I define these as those which are not at a line's start or end and are not preceded with ", or followed by ,".

I am unable to connect to virginia.gov so I can only test this on the small sample you show in your question, where it works. It is, however, possible that something will break when applied to the complete data set or there may be some performance issues with the 27K lines.

# dat <- readLines('ScheduleD.csv')  # I used the dat object defined below
dat <- gsub('(?<!",|^)"(?!,"|$)', '\\\\"', dat, perl=T)
dat <- read_delim(I(dat), escape_backslash=T)

dim(dat)
# [1]  3 20
dat$ItemOrService[2]
[1] "Printed Material - 2k 4/4 gloss rack cards - $345 Yard Signs - 24 18x24\" 1/0 coroplast yard signs/stakes/delivery - $405"

Data:

dat <- c("\"ScheduleDId\",\"ReportId\",\"CommitteeContactId\",\"FirstName\",\"MiddleName\",\"LastOrCompanyName\",\"Prefix\",\"Suffix\",\"AddressLine1\",\"AddressLine2\",\"City\",\"StateCode\",\"ZipCode\",\"IsIndividual\",\"TransactionDate\",\"Amount\",\"AuthorizingName\",\"ItemOrService\",\"ScheduleId\",\"ReportUID\"", 
"\"2543592\",\"168976\",\"568106\",\"\",\"\",\"MacGyver Group Solution Consultants\",\"\",\"\",\"6113 Sunlight Mountain Road\",\"\",\"Spotsylvania\",\"VA\",\"22553\",\"False\",\"06/03/2019\",\"1250.00\",\"Raymond A. Bell, Jr.\",\"Consulting Fees 10 hours @ $125 per hour\",\"\",\"{0A439685-464F-29DC-879D-9D9793AA2E5F}\"", 
"\"2543593\",\"168976\",\"568106\",\"\",\"\",\"MacGyver Group Solution Consultants\",\"\",\"\",\"6113 Sunlight Mountain Road\",\"\",\"Spotsylvania\",\"VA\",\"22553\",\"False\",\"06/10/2019\",\"750.00\",\"Raymond A. Bell, Jr.\",\"Printed Material - 2k 4/4 gloss rack cards - $345 Yard Signs - 24 18x24\" 1/0 coroplast yard signs/stakes/delivery - $405\",\"\",\"{0A439685-464F-29DC-879D-9D9793AA2E5F}\"", 
"\"2543594\",\"168976\",\"572005\",\"\",\"\",\"Freelancer Limited - Freelancer.CO.UK\",\"\",\"\",\"680 George Street\",\"\",\"Sidney, New South Wales - Australia\",\"N/A\",\"20000\",\"False\",\"06/07/2019\",\"204.90\",\"Raymond A. Bell, Jr.\",\"Website design\",\"\",\"{0A439685-464F-29DC-879D-9D9793AA2E5F}\"")

To merge lines divided inside a column, updating your parse_csv() function like this should work:

parse_csv <- function(x) {
  
  dat <- readLines(x)
  to.merge <- grep('^[^"]', dat)
  if (length(to.merge) > 0) {
    dat[to.merge - 1] <- paste0(dat[to.merge - 1], dat[to.merge])
    dat <- dat[-to.merge]
  }
  dat <- gsub('(?<!",|^)"(?!,"|$)', '\\\\"', dat, perl=T)
  dat <- read_delim(I(dat), escape_backslash=T)
}

I didn't test this but it should find all lines which don't begin with a quote, paste them to the preceding line and delete them. Should the line breaks occur somewhere else than inside a field, this likely wouldn't work.