Understanding CSV Formatting : Why Common Tools Deviate from RFC 4180 Standards?

Question

Understanding CSV Formatting : Why Common Tools Deviate from RFC 4180 Standards?

143 Views Asked by NicoD At 08 December 2023 at 11:04

I am attempting to parse numerous CSV files (comma-separated, UTF-8 encoding) and have encountered a recurring issue.

Consider a scenario where I have three fields with the values: : A, "B", C. According to RFC 4180

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote

Based on this, my understanding is that the correct CSV representation should be A,"""B""",C .However, many files generated by various tools are formatted as: : "A, ""B"", C"

This causes CSV parsers (CsvHelper in c# for instance) to interpret such lines as a single field, rather than three separate fields.

Am I missing something here? Why is this seemingly "incorrect" format so prevalently used across different tools?

For discussion, here is a more realistic example:

"00AA12345,30/11/2023,30/11/2023,01/12/2023,01/12/2023,""BAS"",1 111 000.27,""NRT"",""Test, ok"","""","""","""","""","""""

I need to read

field 0: id
field 1 to 4 : date
field 5 : date
field 6 : string
field 7 and + : string (can contains double quotes)

Original Q&A

There are 2 best solutions below

Joel Coehoorn On 08 December 2023 at 14:48

Rather than a specific code issue, I'll just address this high-level question:

Why Common Tools Deviate from RFC 4180 Standards?

The answer is this standard wasn't published until 2005, but CSV data has been in use for FAR LONGER. For much of the time we've been using CSV data, there was no standard! (And we got by just fine, thankyouverymuch).

This means there is not only a LOT of non-compliant CSV data still in the wild, but also still a significant number of legacy systems continuing to produce new non-compliant outputs today, and even new applications created to match non-compliant behavior so they function within a larger system (a complete Ship of Theseus of non-compliant CSV).

CSV parsers need to be able to handle this.

**David Specht** · Accepted Answer · 2023-12-08T14:35:04.283000

CsvMode.Escape comes close to what you are looking for. It works for your very simple example of "A, ""B"", C". However, for ""Test, ok"", it creates two fields "Test and ok", which I suspect is supposed to be one field. But maybe I'm wrong and that does work for you, so I thought I would at least suggest it.

var config = new CsvConfiguration(CultureInfo.InvariantCulture) {
    Mode = CsvMode.Escape   
};
using (var reader = new StreamReader("path\\to\\file.csv"))
using (var csv = new CsvReader(reader, config))

I tend to agree with @PanagiotisKanavos that this file has been encoded twice. Which is why I would suggest reading it twice. Read it first as if each line is supposed to be a single field and then read that field to get the record.

This removes those double quotes that you say are supposed to be in the field, but I'm not convinced they are supposed to be a part of the data.

void Main()
{
    var sb = new StringBuilder();
    sb.Append("\"00AA12345,30/11/2023,30/11/2023,01/12/2023,01/12/2023");
    sb.Append(",\"\"BAS\"\",1 111 000.27,\"\"NRT\"\",\"\"Test, ok\"\"");
    sb.Append(",\"\"\"\",\"\"\"\",\"\"\"\",\"\"\"\",\"\"\"\"\"");
    sb.AppendLine();
    sb.Append("\"00AA12345,30/11/2023,30/11/2023,01/12/2023,01/12/2023");
    sb.Append(",\"\"BAS\"\",1 111 000.27,\"\"NRT\"\",\"\"Test, ok\"\"");
    sb.Append(",\"\"\"\",\"\"\"\",\"\"\"\",\"\"\"\",\"\"\"\"\"");

    var records = new List<Foo>();

    var config = new CsvConfiguration(CultureInfo.InvariantCulture) {
        HasHeaderRecord = false
    };
    
    using (var reader = new StringReader(sb.ToString()))
    using (var csv = new CsvReader(reader, config))
    {
        while(csv.Read())
        {
            var line = csv.GetRecord<SingleLine>().Line;

            using (var reader2 = new StringReader(line))
            using (var csv2 = new CsvReader(reader2, config))
            {
                if(csv2.Read())
                {
                    var options = new TypeConverterOptions { Formats = new[] { "dd/MM/yyyy" } };
                    csv2.Context.TypeConverterOptionsCache.AddOptions<DateTime>(options);

                    var record = csv2.GetRecord<Foo>();
                    records.Add(record);
                }               
            }
        }       
    }
    records.Dump();
}

public class SingleLine
{
    public string Line { get; set; }
}

public class Foo
{
    [Index(0)]
    public string Field0 { get; set; }
    [Index(1)]
    public DateTime Field1 { get; set; }
    [Index(2)]
    public DateTime Field2 { get; set; }
    [Index(3)]
    public DateTime Field3 { get; set; }
    [Index(4)]
    public DateTime Field4 { get; set; }
    [Index(5)]
    public string Field5 { get; set; }
    [Index(6)]
    public string Field6 { get; set; }
    [Index(7)]
    public string Field7 { get; set; }
    [Index(8)]
    public string Field8 { get; set; }
    [Index(9)]
    public string Field9 { get; set; }
    [Index(10)]
    public string Field10 { get; set; }
    [Index(11)]
    public string Field11 { get; set; }
    [Index(12)]
    public string Field12 { get; set; }
    [Index(13)]
    public string Field13 { get; set; }
}

Understanding CSV Formatting : Why Common Tools Deviate from RFC 4180 Standards?

There are 2 best solutions below

Related Questions in C#

Related Questions in CSV

Related Questions in CSVHELPER

Trending Questions

Popular # Hahtags

Popular Questions