Different numbers of commas between fields in CSV files, throwing errors with pd.read_csv


I'm using the NOAA weather dataset to build a machine learning model that predicts weather. I can't read this data into Python because (a) there are commas inside the fields, and (b) the number of commas differs from field to field.

Here are the headers and the first line: "STATION","DATE","SOURCE","REPORT_TYPE","CALL_SIGN","QUALITY_CONTROL","AA1","AJ1","AL1","CIG","DEW","GA1","KA1","MA1","MF1","OC1","RH1","SLP","TMP","VIS","WND"

"72503014732","2022-01-01T00:00:00","4","FM-12","99999","V020",,,,"99999,9,9,N","+0078,1","99,9,+00450,1,99,9","120,M,+0128,1","99999,9,10129,1",,,,"10141,1","+0106,1","016000,1,9,9","160,1,N,0046,1"

When I open this in Excel, this is how it looks:

Image of rendered data on excel sheet


I have tried regex, and I have tried setting the delimiter to ",", but it still doesn't work.


There is 1 best solution below

mozway On

Because your fields are quoted, the commas inside them are not an issue for pandas. The parser treats everything between a pair of double quotes as a single field:

import pandas as pd
df = pd.read_csv('yourfile.csv', sep=',')  # sep=',' is already the default

output:

       STATION                 DATE  SOURCE REPORT_TYPE  CALL_SIGN  \
0  72503014732  2022-01-01T00:00:00       4       FM-12      99999   

  QUALITY_CONTROL  AA1  AJ1  AL1          CIG  ...                 GA1  \
0            V020  NaN  NaN  NaN  99999,9,9,N  ...  99,9,+00450,1,99,9   

             KA1              MA1 MF1  OC1  RH1      SLP      TMP  \
0  120,M,+0128,1  99999,9,10129,1 NaN  NaN  NaN  10141,1  +0106,1   

            VIS             WND  
0  016000,1,9,9  160,1,N,0046,1  

[1 rows x 21 columns]
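
If you want to reproduce this without the file on disk, and then break one of the compound fields into its own columns for modelling, here is a minimal sketch. It parses the header and the record from the question out of an in-memory string. Reading with `dtype=str` is my choice, not something the question requires: it keeps codes such as `+0106` exactly as written. The column names `TMP_VALUE` and `TMP_QUALITY` are hypothetical, and I am assuming the first part of `TMP` is the reading and the second a quality flag, so check the dataset documentation before relying on that.

```python
import io

import pandas as pd

# Header and first record copied from the question.
raw = (
    '"STATION","DATE","SOURCE","REPORT_TYPE","CALL_SIGN","QUALITY_CONTROL",'
    '"AA1","AJ1","AL1","CIG","DEW","GA1","KA1","MA1","MF1","OC1","RH1",'
    '"SLP","TMP","VIS","WND"\n'
    '"72503014732","2022-01-01T00:00:00","4","FM-12","99999","V020",,,,'
    '"99999,9,9,N","+0078,1","99,9,+00450,1,99,9","120,M,+0128,1",'
    '"99999,9,10129,1",,,,"10141,1","+0106,1","016000,1,9,9","160,1,N,0046,1"\n'
)

# Quoted fields are parsed as single values, so the embedded commas are harmless.
# dtype=str (an assumption of this sketch) keeps every value as text.
df = pd.read_csv(io.StringIO(raw), dtype=str)
n_parsed_columns = df.shape[1]

# Split the compound TMP field on its inner commas into separate columns.
# The new column names are hypothetical labels chosen for illustration.
tmp_parts = df["TMP"].str.split(",", expand=True)
tmp_parts.columns = ["TMP_VALUE", "TMP_QUALITY"]
df = df.join(tmp_parts)

print(n_parsed_columns)
print(df[["TMP", "TMP_VALUE", "TMP_QUALITY"]])
```

The same `str.split(",", expand=True)` step should work for the other compound columns (`CIG`, `WND`, and so on), with one new column name per sub-field.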