What is the best way to convert all non-display characters in a pandas dataframe?


I am loading data into a pandas dataframe from an Excel sheet, and many columns contain non-display characters that I want to convert.

The most prevalent is an apostrophe used in a contraction; e.g. doesn't, which comes out as doesn’t.

In the past I have used:

str.encode('ascii', errors='ignore').decode('utf-8')

but this required me to know which columns I needed to fix.
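
For a single column that looks something like this (the column name 'comments' here is just an example, not one of my real columns):

df['comments'] = df['comments'].str.encode('ascii', errors='ignore').str.decode('utf-8')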

In this case I have 103 columns, each of which could contain this or other issues like it.

I am looking for a way to just replace any and all issues across the entire dataframe.

Is there a quick and easy way to do this over the entire dataframe without having to pass each column to a function?
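
What I have in mind is something along these lines, applied in one pass over every text column (a rough, untested sketch; the helper function name is just my own):

def clean_text(col):
    # only touch text (object) columns, leave numeric/date columns alone
    if col.dtype == object:
        return col.str.encode('ascii', errors='ignore').str.decode('utf-8')
    return col

df = df.apply(clean_text)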


There are 2 best solutions below

NYC Coder

While reading the Excel file you should add encoding='utf-8':

df = pd.read_excel('App Stuff.xlsx', encoding='utf-8')

or use encoding='unicode-escape'
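
i.e. the same call with the other codec:

df = pd.read_excel('App Stuff.xlsx', encoding='unicode-escape')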

Ehsan

Try to find the encodings that are suitable for your file with:

import pandas as pd
from encodings.aliases import aliases

# every codec name Python knows about
alias_values = set(aliases.values())

# print the codecs that can read the file without raising an error
for value in alias_values:
    try:
        df = pd.read_csv(your_file, encoding=value)  # or pd.read_excel
        print(value)
    except Exception:
        continue

Then open your file with each of the printed encodings and see which one works best!
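
For example, if cp1252 is one of the values that prints (cp1252 is only a placeholder here; use whichever encoding the loop reported for your file):

df = pd.read_csv(your_file, encoding='cp1252')  # or pd.read_excel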