Problem processing visually identical characters (umlauts)


This is probably a more general issue related to character encoding, but since I came across it while coding an outer join of two dataframes, I am posting it with a Python code example.

The bottom-line question is: why is one 'ö' technically not the identical character as another 'ö', and how can I make sure both are not only visually identical but also technically identical? If you copy and paste both characters into a text editor and search for one of them, you will never find both!

Now the Python example, which does a simple outer join of two dataframes on the column 'filename' (the dataframes are shown here as CSV data):

df1:

filename;abstract
problematic_ö.txt;abc
non-problematic_ö.txt;yxz

df2:

bytes;filename
374;problematic_ö.txt
128;non-problematic_ö.txt

Python code:

import csv
import pandas as pd

df1 = pd.read_csv('df1.csv', header=0, sep=';')
df2 = pd.read_csv('df2.csv', header=0, sep=';')

print(df1)
print(df2)

# Outer join on the shared column 'filename', keeping a merge indicator
df_outerjoin = pd.merge(df1, df2, how='outer', indicator=True)
df_outerjoin.to_csv('df_outerjoin.csv', sep=';', index=False, header=True, quoting=csv.QUOTE_NONNUMERIC)

print(df_outerjoin)

Output:

                filename abstract  bytes      _merge
0      problematic_ö.txt      abc    NaN   left_only
1  non-problematic_ö.txt      yxz  128.0        both
2      problematic_ö.txt      NaN  374.0  right_only

So the 'ö' in the problematic filename isn't recognised as the same character as the 'ö' in the non-problematic filename.

What is happening here?

What can I do to overcome this issue? Can I do something "smart", such as importing the data files with a special encoding setting, or will I have to do a dumb search-and-replace?

3 Answers

Answer by Lover of Structure

The issue is that characters with diacritics can often be represented in Unicode either with so-called combining diacritics or as so-called precomposed characters. The first instance, in problematic_ö.txt, is U+006F U+0308 ('o' followed by a combining diaeresis), while the second instance, in non-problematic_ö.txt, is the precomposed U+00F6.

To normalize, try this:

import unicodedata

oUml_combDiac = '\u006F\u0308' # 'ö' (combining diacritic)
oUml_precomp = '\u00F6' # 'ö' (precomposed)
oUml_combDiac == oUml_precomp
# False
unicodedata.normalize('NFC', oUml_combDiac) == \
  unicodedata.normalize('NFC', oUml_precomp)
# True

(oUml here stands for "o with umlaut".)
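
If you want to check which form a given string actually uses, one way is to list its code points along with their official Unicode names (a minimal sketch using only the standard library; show_codepoints is just an illustrative helper name):

import unicodedata

def show_codepoints(s):
    # Print each code point with its official Unicode name
    for ch in s:
        print(f'U+{ord(ch):04X}  {unicodedata.name(ch)}')

show_codepoints('\u006F\u0308')
# U+006F  LATIN SMALL LETTER O
# U+0308  COMBINING DIAERESIS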

Answer by Nejc

There are various ways to represent the same character in Unicode. In your situation, the problematic filename contains an 'ö' that is actually represented by two Unicode code points: 'o' (Latin Small Letter O) followed by the combining character U+0308 (Combining Diaeresis). The non-problematic filename, on the other hand, uses the character 'ö' (Latin Small Letter O with Diaeresis), which is a single Unicode code point.

You can use the unicodedata library, specifically unicodedata.normalize. It works like this:

import unicodedata

a = "ö"  # these two strings look identical ...
b = "ö"  # ... but use different underlying code points
print(a == b)

# Normalize both strings to the composed (NFC) form
a = unicodedata.normalize('NFC', a)
b = unicodedata.normalize('NFC', b)

print(a == b)

Output:

False
True
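
As a side note, which normalization form you pick matters less than applying the same form to both sides; the decomposed NFD form works just as well for this comparison:

import unicodedata

a = '\u006F\u0308'  # decomposed 'ö'
b = '\u00F6'        # precomposed 'ö'
print(unicodedata.normalize('NFD', a) == unicodedata.normalize('NFD', b))
# True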
Answer by Andj

Rather than using unicodedata directly, pandas provides the method Series.str.normalize(form), so you can run something like the following before doing the outer join:

df1['filename'] = df1['filename'].str.normalize('NFC')
df2['filename'] = df2['filename'].str.normalize('NFC')

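Putting this together with the question's own code, a minimal sketch (assuming the same df1.csv and df2.csv files):

import pandas as pd

df1 = pd.read_csv('df1.csv', header=0, sep=';')
df2 = pd.read_csv('df2.csv', header=0, sep=';')

# Normalize the join key so both spellings of 'ö' compare as equal
df1['filename'] = df1['filename'].str.normalize('NFC')
df2['filename'] = df2['filename'].str.normalize('NFC')

df_outerjoin = pd.merge(df1, df2, how='outer', indicator=True)
print(df_outerjoin)  # the problematic row should now show up as 'both'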