This is probably a more general character-encoding issue, but since I came across it while coding an outer join of two dataframes, I am posting it with a Python code example.
The bottom-line question is: why is ö technically not the identical character as ö, and how can I make sure that both are not only visually but also technically identical?
If you copy-paste both characters into a text editor and search for one of them, you will never find both!
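You can see in Python that the two characters compare as unequal; a quick check (the escape sequences below are my assumption of one way such a visually identical pair can arise):

import unicodedata

o_one = 'o\u0308'    # 'o' followed by a combining mark; renders as ö
o_two = '\u00f6'     # a single code point; also renders as ö

print(o_one == o_two)                          # False, although both display as ö
print([unicodedata.name(c) for c in o_one])    # names of the underlying code points
print([unicodedata.name(c) for c in o_two])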
So here is the Python example, trying to do a simple outer join of two dataframes on the column 'filename' (here presented as CSV data):
df1:
filename;abstract
problematic_ö.txt;abc
non-problematic_ö.txt;yxz
df2:
bytes;filename
374;problematic_ö.txt
128;non-problematic_ö.txt
Python code:
import csv
import pandas as pd
df1 = pd.read_csv('df1.csv', header=0, sep=';')
df2 = pd.read_csv('df2.csv', header=0, sep=';')
print(df1)
print(df2)
df_outerjoin = pd.merge(df1, df2, how='outer', indicator=True)
df_outerjoin.to_csv('df_outerjoin.csv', sep=';', index=False, header=True, quoting=csv.QUOTE_NONNUMERIC)
print(df_outerjoin)
Output:
                filename abstract  bytes      _merge
0      problematic_ö.txt      abc    NaN   left_only
1  non-problematic_ö.txt      yxz  128.0        both
2      problematic_ö.txt      NaN  374.0  right_only
So the 'ö' in the problematic filename isn't recognised as the same character as 'ö' in the non-problematic filename.
What is happening here?
What can I do to overcome this issue? Can I do something "smart", such as importing the data files with a special encoding setting, or will I have to do a dumb search and replace?
The issue is that characters with diacritics can often be represented in Unicode either with so-called combining diacritics or as so-called precomposed characters. The first instance, in problematic_ö.txt, is U+006F U+0308 (o followed by a combining di(a)eresis), while the second instance, in non-problematic_ö.txt, is the precomposed U+00F6. To normalize, try this:
import unicodedata
oUml = unicodedata.normalize('NFC', oUml)

(oUml here stands for "o with umlaut".)
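Applied to the merge example above, you can normalize the join key in both dataframes before merging. A minimal sketch using pandas' Series.str.normalize, which wraps unicodedata.normalize (the composed form 'NFC' is the usual choice; 'NFD' would work too, as long as both sides use the same form):

import pandas as pd

df1 = pd.read_csv('df1.csv', header=0, sep=';')
df2 = pd.read_csv('df2.csv', header=0, sep=';')

# Bring both key columns to the same normalization form (NFC), so that
# 'o' + combining diaeresis and the precomposed 'ö' compare equal
df1['filename'] = df1['filename'].str.normalize('NFC')
df2['filename'] = df2['filename'].str.normalize('NFC')

df_outerjoin = pd.merge(df1, df2, how='outer', indicator=True)
print(df_outerjoin)

After normalization, the problematic filename should appear once with _merge equal to 'both' instead of the left_only/right_only pair.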