I'm trying to parse a csv file in R, ideally using read_csv(). Here is a sample of my data where the error occurs:
"ScheduleDId","ReportId","CommitteeContactId","FirstName","MiddleName","LastOrCompanyName","Prefix","Suffix","AddressLine1","AddressLine2","City","StateCode","ZipCode","IsIndividual","TransactionDate","Amount","AuthorizingName","ItemOrService","ScheduleId","ReportUID"
"2543592","168976","568106","","","MacGyver Group Solution Consultants","","","6113 Sunlight Mountain Road","","Spotsylvania","VA","22553","False","06/03/2019","1250.00","Raymond A. Bell, Jr.","Consulting Fees 10 hours @ $125 per hour","","{0A439685-464F-29DC-879D-9D9793AA2E5F}"
"2543593","168976","568106","","","MacGyver Group Solution Consultants","","","6113 Sunlight Mountain Road","","Spotsylvania","VA","22553","False","06/10/2019","750.00","Raymond A. Bell, Jr.","Printed Material - 2k 4/4 gloss rack cards - $345 Yard Signs - 24 18x24" 1/0 coroplast yard signs/stakes/delivery - $405","","{0A439685-464F-29DC-879D-9D9793AA2E5F}"
"2543594","168976","572005","","","Freelancer Limited - Freelancer.CO.UK","","","680 George Street","","Sidney, New South Wales - Australia","N/A","20000","False","06/07/2019","204.90","Raymond A. Bell, Jr.","Website design","","{0A439685-464F-29DC-879D-9D9793AA2E5F}"
I should get three rows, but R returns two. I think the double quote denoting inches in one of the strings is causing the problem as it disappears from the output, but it throws off the results. See the last columns below:
# A tibble: 2 × 7
IsIndividual TransactionDate Amount AuthorizingName ItemOrService ScheduleId ReportUID
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 False 06/03/2019 1250.00 Raymond A. Bell, Jr. "Consulting Fees 10 hours @ $125 per hour" NA {0A43968…
2 False 06/10/2019 750.00 Raymond A. Bell, Jr. "Printed Material - 2k 4/4 gloss rack cards - $345 Yard Signs - 24 18x24 1/0 coroplast yard signs… "New Sout… N/A
Google Sheets doesn't have a problem parsing the data, so I assume I should be able to in R. I tried fread() without luck. I've also tried read_delim(data, delim = ",", escape_double = FALSE)
This occurs through the file, so I can't just go in manually to remove the quote. There are over 27K rows, and I want to import multiple files. You can download the one I'm referencing here: https://apps.elections.virginia.gov/SBE_CSV/CF/2019_07/ScheduleD.csv
UPDATE
Robert's answer below go me closer — I was able to read in several more lines without issue, but there were still a few problem rows. Here's the code I used...
parse_csv <- function(x){
dat <- readLines(x)
dat <- gsub('(?<!",|^)"(?!,"|$)', '\\\\"', dat, perl=T)
dat <- read_delim(I(dat), escape_backslash=T)
}
test <- parse_csv("test.csv")
problems(test)
and the problems...
# A tibble: 2 × 5
row col expected actual
<int> <int> <chr> <chr>
1 27349 18 20 columns 18 columns
2 27350 5 20 columns 5 columns
I opened the csv file in a text editor and went to the problem rows, and I found that the end of the csv row was on a separate line.
There were dozens of rows like this in the file — and once I cleaned them up, the custom function worked perfectly.
But I'm processing more than 200 files, so I can't go through each file manually to ensure that each row is on one line. The line break appears to always be in the ItemOrService column. Is there a way to programmatically clean this up?
Here's a link to another csv with rows on two lines. The first break is at line 89.
https://gist.github.com/jrcloutier/cc48a230d99a983f40c94f7800e69952

With a bit of luck, the following will work. You can use
gsubto escape all quotes which are likely not the external quotes. I define these as those which are not at a line's start or end and are not preceded with",or followed by,".I am unable to connect to virginia.gov so I can only test this on the small sample you show in your question, where it works. It is, however, possible that something will break when applied to the complete data set or there may be some performance issues with the 27K lines.
Data:
To merge lines divided inside a column, updating your
parse_csv()function like this should work:I didn't test this but it should find all lines which don't begin with a quote, paste them to the preceding line and delete them. Should the line breaks occur somewhere else than inside a field, this likely wouldn't work.