Given a string, I want to use unicodedata.normalize("NFKD",raw_data) to remove a particular problem character in my data cleanser. However, I have run into an issue that I am completely unable to figure out, and it seems paradoxical.
I got the idea to use the unicodedata.normalize function from this post: Normalizing Unicode
I figured this would work:
raw_data = unicodedata.normalize("NFKD",raw_data)
To narrow the problem down, I first thought my string might be too big, so I applied the normalization line by line and found that the length of the string was not the cause. Then I thought maybe the function itself didn't work, so I tested it in a new window in a fresh Python session.
When I imported unicodedata, and entered the following line:
unicodedata.normalize("NFKD","Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt.")
The output was: "Clean the Ice Cream Maker – Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt." - EXACTLY what I wanted!
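For reference, a minimal sketch of why this works in the interactive session: in an ordinary string literal, \xa0 is a single character (U+00A0, NO-BREAK SPACE), and NFKD maps it to a plain space. The variable names here are only for illustration.

```python
import unicodedata

# In a normal (non-raw) literal, "\xa0" is ONE character: U+00A0 NO-BREAK SPACE.
text = "Maker –\xa0Use"
normalized = unicodedata.normalize("NFKD", text)

# NFKD replaces the no-break space with an ordinary space (U+0020),
# so the length stays the same but the character changes.
print(len(text), len(normalized))
print("\xa0" in text, "\xa0" in normalized)
```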
I will break down the issue into a scenario reproducible by any user:
Suppose raw_data = "\n\n1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt." (taken from Google Sheets using pandas module and gspread).
I would then take this raw_data string (which would usually contain many lines separated by \n, often as a numbered list) and create a list of its lines (call this list input_lines) by splitting the string on "\\n". This is what the code looked like where my problem occurs:
for line in input_lines[0]:
    print(line)
    print(unicodedata.normalize("NFKD",line))
    if line != unicodedata.normalize("NFKD",line):
        ...
This input_lines list contains "1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt.".
However, on the iteration where line = "1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt.", the two print calls produced the same text:
print(line) printed "1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt."
print(unicodedata.normalize("NFKD",line)) printed "1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt."
Yet when I manually put this string into the function as a literal:
print(unicodedata.normalize("NFKD","1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt.")) printed "1. Clean the Ice Cream Maker – Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt."
I have tried everything I could to get around this. I fundamentally do not understand what is happening here, and it appears to be consuming my soul, so any help would be greatly appreciated.
To answer the question in the comments: when I call repr(line) in the for loop with line = "1. Clean the Ice Cream Maker –\xa0Use a damp cloth or sponge to wipe down the outside and inside of the ice cream maker to remove dust or dirt.", the result shows a double backslash (\\xa0) rather than a single one.
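The double backslash in repr is the key to the paradox. A small sketch, assuming the data really holds a literal backslash followed by the characters x, a, 0 (the names real and literal are made up for illustration): the two strings print almost identically, but only one of them contains an actual no-break space, so normalize has nothing to change in the other.

```python
import unicodedata

real = "Maker –\xa0Use"      # contains the single character U+00A0
literal = "Maker –\\xa0Use"  # contains four characters: backslash, x, a, 0

# repr() exposes the difference: one backslash vs. a doubled one.
print(repr(real))
print(repr(literal))

# The literal version is three characters longer and has no U+00A0 in it.
print(len(real), len(literal))

# NFKD only changes the string that holds a real no-break space.
print(unicodedata.normalize("NFKD", real) == real)        # False
print(unicodedata.normalize("NFKD", literal) == literal)  # True
```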
I solved the issue myself thanks to the help from michael ruth in the comments.
I simply found the position of the Unicode escape sequence using the .find function and offset it by the expected values that follow the syntax:
I also did this for combinations of unicode using the same logic:
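For anyone landing here, a regex-based sketch of the same idea follows. It is a substitute for the .find approach described above, not a copy of it, and it assumes the only literal escapes in the data have the form \xNN or \uNNNN. The helper names decode_literal_escapes and clean are hypothetical.

```python
import re
import unicodedata

# Matches a literal backslash followed by xNN or uNNNN (hex digits).
# Assumption: these are the only escape forms present in the data.
_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})|\\u([0-9a-fA-F]{4})")


def decode_literal_escapes(text: str) -> str:
    """Turn literal backslash-x / backslash-u sequences into real characters."""
    def _to_char(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(_to_char, text)


def clean(text: str) -> str:
    """Decode literal escapes first, then apply NFKD normalization."""
    return unicodedata.normalize("NFKD", decode_literal_escapes(text))


# The doubled backslash mimics what repr(line) showed in the loop.
line = "1. Clean the Ice Cream Maker –\\xa0Use a damp cloth"
print(clean(line))
```

Decoding only these two patterns is deliberate: a blanket decode of every backslash sequence could mangle text that legitimately contains backslashes.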