How to split raw data into Excel/CSV and identify rows that are not properly formed (Python)?


I have raw data that I want to split into CSV/Excel. After that, if the data in a row is not stored correctly (e.g. 0 was entered instead of 121324), I want Python to identify that row. In other words, while splitting the raw data into CSV with Python code, some rows might be formed incorrectly. How can I identify those rows with Python?

Example: S.11* N. ENGLAND L -8' 21-23 u44'\n S.18 TAMPA BAY W -7 40-7 u49'\n S.25 Buffalo L -4' 18-33 o48

Result I want: S,11,*,N.,ENGLAND,L,-8',21-23,u44'\n S,18,,TAMPA,BAY,W,-7,40-7,u49'\n S,25,,Buffalo,L,-4',18-33,o48\n

Suppose the output is like this: S,11,N.,ENGLAND,L,-8',21-23u,44'\n S,18,,TAMPA,BAY,W,-7,40-7,u49'\n S,25,,Buffalo,L,-4',18-33,o48\n

You can see that in the first row the * is missing and u44' is stored as only 44', with the u appended to the previous column.

The Python code should identify this row and return it to me.

Likewise, I want all rows that contain errors.

This is what I have done so far.

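import csv

import pandas as pd

input_filename = 'rawsample.txt'
output_filename = 'spreads.csv'

# Split each space-separated raw line and write it out as a comma-separated row
with open(input_filename, 'r', newline='') as infile, \
     open(output_filename, 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter=' ', skipinitialspace=True)
    writer = csv.writer(outfile, delimiter=',')
    for row in reader:
        if not row:  # skip blank lines
            continue
        # 'S.11*' -> ['S', '11*']
        new_cols = row[0].split('.')
        if not new_cols[1].endswith('*'):
            new_cols.extend([''])
        else:
            # move the trailing '*' into its own column
            new_cols[1] = new_cols[1][:-1]
            new_cols.extend(['*'])
        row = new_cols + row[1:]
        #print(row)
        writer.writerow(row)

# Read the CSV back; rows can differ in length, so build the frame from lists
with open(output_filename, newline='') as f:
    df = pd.DataFrame(list(csv.reader(f)))

er = []
for index, row in df.iterrows():
    # flag the row once if any cell is missing or empty
    if any(pd.isna(i) or i == '' for i in row):
        er.append(row)
# I was able to check for null values, but nothing more.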

Please help me.


There is 1 best solution below

Chen Brestel On

@mozway is right: you should give an example input and the expected result.
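Going only by the sample rows in the question, one possible way to flag malformed rows is to check every field against the shape it is expected to have. The sketch below is a guess at those shapes, not a definitive rule set: the column names, the regular expressions, and the helper names `row_problems` and `find_bad_rows` are all assumptions made for illustration, and the team name is assumed to be whatever sits between the first three and the last four fields.

```python
import re

# Assumed field shapes, inferred from the three sample rows only.
HEAD = [
    ("day", re.compile(r"[A-Z]")),       # e.g. 'S'
    ("date", re.compile(r"\d{1,2}")),    # e.g. '11'
    ("flag", re.compile(r"\*?")),        # '' or '*'
]
TAIL = [
    ("result", re.compile(r"[WLT]")),        # e.g. 'L'
    ("spread", re.compile(r"[+-]?\d+'?")),   # e.g. "-8'"
    ("score", re.compile(r"\d+-\d+")),       # e.g. '21-23'
    ("total", re.compile(r"[ou]\d+'?")),     # e.g. "u44'"
]
TEAM_TOKEN = re.compile(r"[A-Za-z.]+")       # e.g. 'N.', 'ENGLAND'


def row_problems(fields):
    """Return a list of problems for one row; an empty list means it looks fine."""
    if len(fields) < len(HEAD) + len(TAIL) + 1:
        return [f"too few fields ({len(fields)})"]
    head = fields[:len(HEAD)]
    tail = fields[-len(TAIL):]
    team = fields[len(HEAD):-len(TAIL)]  # one or more tokens
    problems = []
    for (name, pattern), value in zip(HEAD + TAIL, head + tail):
        if not pattern.fullmatch(value):
            problems.append(f"{name}={value!r}")
    for token in team:
        if not TEAM_TOKEN.fullmatch(token):
            problems.append(f"team={token!r}")
    return problems


def find_bad_rows(rows):
    """Return (row_number, fields, problems) for every row that fails a check."""
    bad = []
    for number, fields in enumerate(rows, start=1):
        problems = row_problems(fields)
        if problems:
            bad.append((number, fields, problems))
    return bad


rows = [
    ["S", "11", "*", "N.", "ENGLAND", "L", "-8'", "21-23", "u44'"],
    ["S", "11", "N.", "ENGLAND", "L", "-8'", "21-23u", "44'"],  # the bad row
    ["S", "18", "", "TAMPA", "BAY", "W", "-7", "40-7", "u49'"],
    ["S", "25", "", "Buffalo", "L", "-4'", "18-33", "o48"],
]
bad = find_bad_rows(rows)
for number, fields, problems in bad:
    print(number, fields, problems)
```

With real data the patterns would probably need loosening (other result codes, missing spreads, and so on), so treat them as a starting point.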

Anyway, if you're dealing with a variable number of columns in the input, please refer to "Handling Variable Number of Columns with Pandas - Python".
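If you would rather not depend on pandas for this step, a similar effect can be had with the standard `csv` module by padding every row to the width of the widest one. This is a stdlib alternative, not the method from the article above; the helper name `read_ragged_csv` and the `fill` parameter are made up for this sketch, and the sample text is the expected result from the question.

```python
import csv
import io


def read_ragged_csv(handle, fill=""):
    """Read CSV rows of differing lengths and pad them to the widest row."""
    rows = [row for row in csv.reader(handle) if row]  # drop blank lines
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]


# Expected output from the question; the third row has one field fewer.
sample = io.StringIO(
    "S,11,*,N.,ENGLAND,L,-8',21-23,u44'\n"
    "S,18,,TAMPA,BAY,W,-7,40-7,u49'\n"
    "S,25,,Buffalo,L,-4',18-33,o48\n"
)
padded = read_ragged_csv(sample)
for row in padded:
    print(len(row), row)
```

The padded rows all have the same length, so they can be handed to `pandas.DataFrame` afterwards if a frame is needed. Note that padding on the right does not line up the columns when the team name has a different number of words.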

Best