I am attempting to parse numerous CSV files (comma-separated, UTF-8 encoding) and have encountered a recurring issue.
Consider a scenario where I have three fields with the values: : A, "B", C. According to RFC 4180
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote
Based on this, my understanding is that the correct CSV representation should be A,"""B""",C .However, many files generated by various tools are formatted as: : "A, ""B"", C"
This causes CSV parsers (CsvHelper in c# for instance) to interpret such lines as a single field, rather than three separate fields.
Am I missing something here? Why is this seemingly "incorrect" format so prevalently used across different tools?
For discussion, here is a more realistic example:
"00AA12345,30/11/2023,30/11/2023,01/12/2023,01/12/2023,""BAS"",1 111 000.27,""NRT"",""Test, ok"","""","""","""","""","""""
I need to read
- field 0: id
- field 1 to 4 : date
- field 5 : date
- field 6 : string
- field 7 and + : string (can contains double quotes)
CsvMode.Escapecomes close to what you are looking for. It works for your very simple example of"A, ""B"", C". However, for""Test, ok"", it creates two fields"Testandok", which I suspect is supposed to be one field. But maybe I'm wrong and that does work for you, so I thought I would at least suggest it.I tend to agree with @PanagiotisKanavos that this file has been encoded twice. Which is why I would suggest reading it twice. Read it first as if each line is supposed to be a single field and then read that field to get the record.
This removes those double quotes that you say are supposed to be in the field, but I'm not convinced they are supposed to be a part of the data.