What is the best way to convert all non-display characters in a pandas dataframe?


I am loading data into a pandas dataframe from an Excel sheet, and many columns contain non-display characters that I want to convert.

The most prevalent is an apostrophe used in a contraction; e.g. doesn't, which comes out as doesn’t.

In the past I have used:

str.encode('ascii', errors='ignore').decode('utf-8')

but this required me to know which columns I needed to fix.
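
For a single column that looks something like this (the column name 'comments' here is just an example, not one of my real columns):

df['comments'] = df['comments'].str.encode('ascii', errors='ignore').str.decode('utf-8')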

In this case I have 103 columns, each of which could contain this or other issues like it.

I am looking for a way to just replace any and all issues across the entire dataframe.

Is there a quick and easy way to do this over the entire dataframe without having to pass each column to a function?
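
What I have in mind is something along these lines, applied in one pass over every text column (a rough, untested sketch; the helper function name is just my own):

def clean_text(col):
    # only touch text (object) columns, leave numeric/date columns alone
    if col.dtype == object:
        return col.str.encode('ascii', errors='ignore').str.decode('utf-8')
    return col

df = df.apply(clean_text)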


There are 2 best solutions below

NYC Coder

While reading the Excel file you should add encoding='utf-8':

df = pd.read_excel('App Stuff.xlsx', encoding='utf-8')

or use encoding='unicode-escape'
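
i.e. the same call with the other codec:

df = pd.read_excel('App Stuff.xlsx', encoding='unicode-escape')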

Ehsan

Try to find the encodings that are suitable for your file with:

import pandas as pd
from encodings.aliases import aliases

# every codec name Python knows about
alias_values = set(aliases.values())

# print the codecs that can read the file without raising an error
for value in alias_values:
    try:
        df = pd.read_csv(your_file, encoding=value)  # or pd.read_excel
        print(value)
    except Exception:
        continue

Then open your file with each of the printed encodings and see which one works best!
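
For example, if cp1252 is one of the values that prints (cp1252 is only a placeholder here; use whichever encoding the loop reported for your file):

df = pd.read_csv(your_file, encoding='cp1252')  # or pd.read_excel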