Error when iterating through a folder and converting all text files to CSV files


I ran a PowerShell script that output a bunch of text files.

The text files look like this:

This is my aText.txt

    Clark Kent
    Dolly Parten
    Charlie Brown
    Gary Numan

They are just text files with names and no header. I want to convert them to CSV files, so I turned to Python and wrote this code:

    import os
    import pandas as pd
    
    # raw strings, so backslashes are not treated as escapes (a raw string cannot end in a backslash)
    folder = r'\path\text'
    csvFolder = r'\path\csv'
    
    for filename in os.listdir(folder):
    
        if filename.endswith('.txt'):
            file_path = os.path.join(folder, filename)
            csvpath = os.path.join(csvFolder, filename)
            
            #if file is empty
            if os.stat(file_path).st_size == 0:
                df = pd.DataFrame()
    
            #for other files
            else:
                df = pd.read_csv(file_path, header=0, names=None)
    
            csv_path = os.path.splitext(csvpath)[0] + '.csv'
    
            df.to_csv(csv_path, index=False)
    
    
    print("Text files have been converted to csv")

When I ran it, I got this error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I did some research but found nothing specific to pandas, only answers about the csv module. One response suggested this:

    df = pd.read_csv(file_path, encoding='cp1252', header=0, names=None)

I tried it and the program ran, but the CSV files were full of strange characters. On a test folder of text files I created myself, the code ran fine and the output was good. With the text files created by PowerShell, the code runs with no error messages, but the output isn't correct.

Here is an example of what I am seeing in the csv files after the conversion:

    ¿ Ã Ÿâ

The problem seems to be in the else branch, since that is where the file is read. I printed df:

    df = pd.read_csv(file_path, encoding='cp1252', header=0, names=None)
    print("This is df: ", df)

This is the sample output:

    This is df:      ÿþA
    0   NaN
    1   NaN
    2   NaN
    3   NaN
    4   NaN
    5   NaN
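
The `ÿþ` at the start of that header is what the bytes 0xFF 0xFE look like when decoded as cp1252. Those two bytes are the UTF-16 little-endian byte order mark, which Windows PowerShell commonly writes at the start of redirected output, and the 0xFF also matches the byte named in the UnicodeDecodeError. A minimal sketch for checking the first bytes of a file follows; `sniff_bom` is a hypothetical helper name and the sample bytes are made up for illustration:

```python
import codecs
from typing import Optional


def sniff_bom(raw: bytes) -> Optional[str]:
    """Return a codec name if raw starts with a known byte order mark, else None.

    Hypothetical helper for illustration only.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # The UTF-32 LE mark begins with the UTF-16 LE mark, so test UTF-32 first.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


# Made-up sample: the UTF-16 LE mark followed by one name, as PowerShell might write it.
sample = codecs.BOM_UTF16_LE + "Clark Kent\r\n".encode("utf-16-le")

print(sniff_bom(sample))            # utf-16
print(sample[:2].decode("cp1252"))  # ÿþ
```

For a real file, the same check can be run on `open(file_path, 'rb').read(4)`.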

There is 1 solution below

noobCoder

I think I blew this issue out of proportion. Playing around with the encoding while I waited for a response seemed to fix it. I simply set the encoding to utf-16:

    df = pd.read_csv(file_path, encoding='utf-16', header=0)
    print("this is df: \n", df)

The output:

    this is df:
    Clark Kent
    Dolly Parten
    Charlie Brown
    Gary Numan
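
One thing to watch: with `header=0`, pandas treats the first line of each file as the column header, so "Clark Kent" ends up as a column name rather than a row. Since the files have no header, `header=None` may be closer to what is wanted. Below is a rough sketch of the whole loop under that assumption. The function name `convert_folder`, the column label `name`, and the temporary demo folders are all made up for illustration, and it assumes every non-empty file is UTF-16:

```python
import os
import tempfile

import pandas as pd


def convert_folder(folder, csv_folder, encoding="utf-16"):
    """Convert every .txt file in folder to a .csv file in csv_folder (sketch)."""
    os.makedirs(csv_folder, exist_ok=True)
    written = []
    for filename in sorted(os.listdir(folder)):
        if not filename.endswith(".txt"):
            continue
        src = os.path.join(folder, filename)
        dst = os.path.join(csv_folder, os.path.splitext(filename)[0] + ".csv")
        if os.stat(src).st_size == 0:
            # empty input: write an empty frame, as in the original code
            df = pd.DataFrame()
        else:
            # header=None keeps the first line as data; "name" is an assumed label
            df = pd.read_csv(src, encoding=encoding, header=None, names=["name"])
        df.to_csv(dst, index=False)
        written.append(dst)
    return written


# Demo on throwaway folders with made-up data, not the real paths from the question.
with tempfile.TemporaryDirectory() as tmp:
    text_dir = os.path.join(tmp, "text")
    csv_dir = os.path.join(tmp, "csv")
    os.makedirs(text_dir)
    with open(os.path.join(text_dir, "aText.txt"), "w", encoding="utf-16", newline="") as f:
        f.write("Clark Kent\r\nDolly Parten\r\n")
    open(os.path.join(text_dir, "empty.txt"), "w").close()

    written = [os.path.basename(p) for p in convert_folder(text_dir, csv_dir)]
    result = pd.read_csv(os.path.join(csv_dir, "aText.csv"))["name"].tolist()

print(written)
print(result)
```

A file that contains only a byte order mark is not zero bytes long, so it would slip past the size check; catching `pd.errors.EmptyDataError` around `read_csv` is one way to handle that case.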