Unable to remove unicode in specific scenario and am completely lost on why this is happening

I was running into an issue where, given a string, I want to use unicodedata.normalize("NFKD", raw_data) to remove a particular problem character for my data cleanser. However, I have run into a huge issue that I am completely unable to figure out, and it seems paradoxical.

I got the idea to use the unicodedata.normalize function from this post: Normalizing Unicode

I figured this would work:

raw_data = unicodedata.normalize("NFKD",raw_data)

To try to solve the problem, I thought maybe I could apply normalization to each line individually, in case my string was too big. I went line by line and discovered it was not about the length of the string. I then thought maybe the function itself didn't work, so I tested it for myself in a new window running plain Python.

When I imported unicodedata, and entered the following line:

unicodedata.normalize("NFKD","Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt.")

The output was: "Clean the Ice Cream Maker – Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt." - EXACTLY what I wanted!

I will break down the issue into a scenario reproducible by any user:

Suppose raw_data = "\n\n1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt." (taken from Google Sheets using the pandas and gspread modules).

I would then take this raw_data string (which usually contains many lines with \n and list numbering) and create a list of lines (call this list input_lines) by splitting the string on "\\n". This is what the code looks like where my problem occurs:

for line in input_lines:
    print(line)
    print(unicodedata.normalize("NFKD",line))
    if line != unicodedata.normalize("NFKD",line):
        ...

This input_lines list contains "1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt.".

However, this is what happened when it came time to evaluate line = "1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt.":

print(line) = "1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt."

print(unicodedata.normalize("NFKD",line)) = "1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt."

Yet if I had simply manually put this string into the function:

print(unicodedata.normalize("NFKD","1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt.")) = "1. Clean the Ice Cream Maker – Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt."

I have tried everything I could to get around this. I fundamentally do not understand what is happening here, and it appears to be consuming my soul, so any help would be greatly appreciated.

To answer a question in the comments: when I call repr(line) inside the for loop with line = "1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt.", the result shows a double backslash rather than a single one.
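As a minimal sketch of the difference that repr() reveals: a real \xa0 is a single NO-BREAK SPACE character that normalize() converts, while the literal four characters \, x, a, 0 are plain ASCII text that normalize() leaves alone:

import unicodedata

real = "Maker –\xa0Use"      # contains the actual U+00A0 character
literal = "Maker –\\xa0Use"  # contains the four characters \, x, a, 0

print(repr(real))     # 'Maker –\xa0Use'   (single backslash)
print(repr(literal))  # 'Maker –\\xa0Use'  (double backslash)

print(unicodedata.normalize("NFKD", real))     # the no-break space becomes a plain space
print(unicodedata.normalize("NFKD", literal))  # unchanged - there is no \xa0 character to normalize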

3 Answers

Answer by Eric

I solved the issue myself, thanks to help from michael ruth in the comments.

I simply found the position of the literal escape sequence using the .find() method and removed it by slicing, offsetting by the expected length of the sequence:

line = line[:line.find("\\x")] + line[line.find("\\x")+4:]

I also did this for \u-style Unicode escape sequences, using the same logic:

line = line[:line.find("\\u")] + line[line.find("\\u")+6:]
Answer by furas

Using .encode("unicode-escape") and later .decode("unicode-escape") can't change it, because that round trip just goes back to the original value.

You may need to mix "unicode-escape" with "raw-unicode-escape":

text = 'Clean the Ice Cream Maker \\u2013\\xa0Use a damp cloth'

print(text)

Gives

Clean the Ice Cream Maker \u2013\xa0Use a damp cloth

But

print(text.encode('raw-unicode-escape').decode('unicode-escape'))

Gives

Clean the Ice Cream Maker – Use a damp cloth

You will have to test it on the rest of your text.
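One concrete reason to test, as a small sketch of the caveat: decode('unicode-escape') also interprets any genuine backslash sequences that were meant literally, for example in a Windows path:

path = r'C:\new\temp'   # real backslashes, not escape sequences

print(repr(path))   # 'C:\\new\\temp'
print(repr(path.encode('raw-unicode-escape').decode('unicode-escape')))
# 'C:\new\temp'  <- the \n and \t are now control characters, not backslash + letter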


BTW:

Sometimes you can get backslashes in the text when you receive the data as bytes and use str() to convert it to a string. But this should also add b' at the beginning.
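For example, a quick sketch of that pitfall (the byte string here is just an illustrative value):

data = b'Maker \xe2\x80\x93\xc2\xa0Use'   # raw UTF-8 encoded bytes

print(str(data))             # b'Maker \xe2\x80\x93\xc2\xa0Use'  <- b' prefix and literal backslashes
print(data.decode('utf-8'))  # Maker – Use                       <- decode() is the correct conversion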

As I remember, Python 2 had problems with this conversion because it kept strings as bytes instead of Unicode.

Answer by Andj

An alternative answer:

def codepoint(char: str) -> str:
    return f'U+{ord(char):04x}'

line = "1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt."
i = line.index('\xa0')
print(codepoint(line[i]))
# U+00a0

The function codepoint() is a convenience function to convert a single character to its codepoint designation using a minimum of four hexadecimal digits.

\xA0 resolves to U+00A0 (NO-BREAK SPACE). As a UTF-8 byte sequence it would be b'\xc2\xa0'.
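This is easy to verify:

print('\xa0'.encode('utf-8'))
# b'\xc2\xa0'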

In Python 3 string literals, the character can be written as the character itself, as a hex escape when the code point fits in two hexadecimal digits (\xA0), as the Unicode escape notation \u00A0 or \U000000A0, or alternatively by the character name: \N{NO-BREAK SPACE}:

codepoint(' ')
# 'U+00a0'
codepoint('\xA0')
# 'U+00a0'
codepoint('\u00A0')
# 'U+00a0'
codepoint('\U000000A0')
# 'U+00a0'
codepoint('\N{NO-BREAK SPACE}')
# 'U+00a0'

Each of these representations is treated identically; they all produce the same single character.

Using the code from the answer above:

oline = line[:line.find("\\x")] + line[line.find("\\x")+4:]
codepoint(oline[i])
# 'U+00a0'

Essentially, if the string is double-escaped (it contains the literal four characters \\xA0), the code removes them. If it is single-escaped (\xA0, i.e. the actual character), .find("\\x") returns -1, so the NO-BREAK SPACE remains (although the slices built from that -1 garble the string rather than leaving it untouched).

It's the same character either way:

print('\xA0' == ' ')
# True

In the OP, initially unicodedata.normalize('NFKD', line) was used to normalise or clean the data. This will convert U+00A0 to U+0020:

import unicodedata as ud
o2line = ud.normalize('NFKD', line)
codepoint(o2line[i])
# 'U+0020'

At this point NO-BREAK SPACE has been converted to SPACE, which is a common procedure in data cleaning. This is what the OP was attempting to do initially.

NFKD normalisation can change other things in the data, and unless you know exactly what is in your data before processing, unintended changes can occur.
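For example, a small sketch of the kind of side effects NFKD can introduce:

import unicodedata as ud

print(ud.normalize('NFKD', '²'))       # '2'    - superscript two becomes a plain digit
print(ud.normalize('NFKD', 'ﬁle'))     # 'file' - the ﬁ ligature is split into 'f' + 'i'
print(len(ud.normalize('NFKD', 'é')))  # 2      - 'é' becomes 'e' plus a combining acute accent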

An alternative is to use replacement and directly target the characters you want to change:

o3line = line.replace('\xA0', ' ')
codepoint(o3line[i]) 
# 'U+0020'

This will convert all NO-BREAK SPACE characters to a SPACE character.

Alternatively, we could use a regex:

import re
pattern = re.compile(r'\s')
o4line = re.sub(pattern, ' ', line)
codepoint(o4line[i])
# 'U+0020'

This will convert each whitespace character to a SPACE character, although I would use the regex pattern r'\s+' so that runs of whitespace are collapsed into a single space.
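For example, a small sketch of that variant:

import re

print(re.sub(r'\s+', ' ', '1.  Clean\u00a0 the  Ice  Cream Maker'))
# 1. Clean the Ice Cream Maker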

In my projects I tend to use PyICU instead.