Problem processing visually identical characters (umlauts)


This is probably a more general issue related to character encoding, but since I came across it while coding an outer join of two dataframes, I am posting it with a Python code example.

The bottom-line question is: why is one 'ö' technically not the identical character as another 'ö', and how can I make sure both are not only visually identical but also technically identical? If you copy and paste both characters into a text editor and search for one of them, you will never find both!

Now the Python example, which does a simple outer join of two dataframes on the column 'filename' (the dataframes are shown here as CSV data):

df1:

filename;abstract
problematic_ö.txt;abc
non-problematic_ö.txt;yxz

df2:

bytes;filename
374;problematic_ö.txt
128;non-problematic_ö.txt

Python code:

import csv
import pandas as pd

df1 = pd.read_csv('df1.csv', header=0, sep=';')
df2 = pd.read_csv('df2.csv', header=0, sep=';')

print(df1)
print(df2)

# Outer join on the shared column 'filename', keeping a merge indicator
df_outerjoin = pd.merge(df1, df2, how='outer', indicator=True)
df_outerjoin.to_csv('df_outerjoin.csv', sep=';', index=False, header=True, quoting=csv.QUOTE_NONNUMERIC)

print(df_outerjoin)

Output:

                filename abstract  bytes      _merge
0      problematic_ö.txt      abc    NaN   left_only
1  non-problematic_ö.txt      yxz  128.0        both
2      problematic_ö.txt      NaN  374.0  right_only

So the 'ö' in the problematic filename isn't recognised as the same character as the 'ö' in the non-problematic filename.

What is happening here?

What can I do to overcome this issue? Can I do something "smart", such as importing the data files with a special encoding setting, or will I have to do a dumb search-and-replace?

3 Answers

Answer by Lover of Structure

The issue is that characters with diacritics can often be represented in Unicode either with so-called combining diacritics or as so-called precomposed characters. The first instance, in problematic_ö.txt, is U+006F U+0308 ('o' followed by a combining diaeresis), while the second instance, in non-problematic_ö.txt, is the precomposed U+00F6.

To normalize, try this:

import unicodedata

oUml_combDiac = '\u006F\u0308' # 'ö' (combining diacritic)
oUml_precomp = '\u00F6' # 'ö' (precomposed)
oUml_combDiac == oUml_precomp
# False
unicodedata.normalize('NFC', oUml_combDiac) == \
  unicodedata.normalize('NFC', oUml_precomp)
# True

(oUml here stands for "o with umlaut".)
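
If you want to check which form a given string actually uses, one way is to list its code points along with their official Unicode names (a minimal sketch using only the standard library; show_codepoints is just an illustrative helper name):

import unicodedata

def show_codepoints(s):
    # Print each code point with its official Unicode name
    for ch in s:
        print(f'U+{ord(ch):04X}  {unicodedata.name(ch)}')

show_codepoints('\u006F\u0308')
# U+006F  LATIN SMALL LETTER O
# U+0308  COMBINING DIAERESIS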

Answer by Nejc

There are various ways to represent the same character in Unicode. In your situation, the problematic filename contains an 'ö' that is actually represented by two Unicode code points: 'o' (Latin Small Letter O) followed by the combining character U+0308 (Combining Diaeresis). The non-problematic filename, on the other hand, uses the character 'ö' (Latin Small Letter O with Diaeresis), which is a single Unicode code point.

You can use the unicodedata library, specifically unicodedata.normalize. It works like this:

import unicodedata

a = "ö"  # these two strings look identical ...
b = "ö"  # ... but use different underlying code points
print(a == b)

# Normalize both strings to the composed (NFC) form
a = unicodedata.normalize('NFC', a)
b = unicodedata.normalize('NFC', b)

print(a == b)

Output:

False
True
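
As a side note, which normalization form you pick matters less than applying the same form to both sides; the decomposed NFD form works just as well for this comparison:

import unicodedata

a = '\u006F\u0308'  # decomposed 'ö'
b = '\u00F6'        # precomposed 'ö'
print(unicodedata.normalize('NFD', a) == unicodedata.normalize('NFD', b))
# True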
Answer by Andj

Rather than using unicodedata directly, pandas provides the method Series.str.normalize(form), so you can run something like the following before doing the outer join:

df1['filename'] = df1['filename'].str.normalize('NFC')
df2['filename'] = df2['filename'].str.normalize('NFC')

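Putting this together with the question's own code, a minimal sketch (assuming the same df1.csv and df2.csv files):

import pandas as pd

df1 = pd.read_csv('df1.csv', header=0, sep=';')
df2 = pd.read_csv('df2.csv', header=0, sep=';')

# Normalize the join key so both spellings of 'ö' compare as equal
df1['filename'] = df1['filename'].str.normalize('NFC')
df2['filename'] = df2['filename'].str.normalize('NFC')

df_outerjoin = pd.merge(df1, df2, how='outer', indicator=True)
print(df_outerjoin)  # the problematic row should now show up as 'both'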