I have a Python string, shown below, taken from the SEC filing of a US public company. I am trying to remove some annoying characters from it using the unicodedata.normalize function, but it does not remove all of them. What could be the reason for this behavior?
from unicodedata import normalize
s = '[email protected]\nFacsimile\nNo.:\xa0 312-233-2266\n\xa0\nJPMorgan Chase Bank,\nN.A., as Administrative Agent\n10 South Dearborn, Floor 7th\nIL1-0010\nChicago, IL 60603-2003\nAttention:\xa0 Hiral Patel\nFacsimile No.:\xa0 312-385-7096\n\xa0\nLadies and Gentlemen:\n\xa0\nReference is made to the\nCredit Agreement, dated as of May\xa07, 2010 (as the same may be amended,\nrestated, supplemented or otherwise modified from time to time, the \x93Credit Agreement\x94), by and among\nHawaiian Electric Industries,\xa0Inc., a Hawaii corporation (the \x93Borrower\x94), the Lenders from time to\ntime party thereto and JPMorgan Chase Bank, N.A., as issuing bank and\nadministrative agent (the \x93Administrative Agent\x94).'
normalize('NFKC', s)
'[email protected]\nFacsimile\nNo.: 312-233-2266\n \nJPMorgan Chase Bank,\nN.A., as Administrative Agent\n10 South Dearborn, Floor 7th\nIL1-0010\nChicago, IL 60603-2003\nAttention: Hiral Patel\nFacsimile No.: 312-385-7096\n \nLadies and Gentlemen:\n \nReference is made to the\nCredit Agreement, dated as of May 7, 2010 (as the same may be amended,\nrestated, supplemented or otherwise modified from time to time, the \x93Credit Agreement\x94), by and among\nHawaiian Electric Industries, Inc., a Hawaii corporation (the \x93Borrower\x94), the Lenders from time to\ntime party thereto and JPMorgan Chase Bank, N.A., as issuing bank and\nadministrative agent (the \x93Administrative Agent\x94).'
As the output shows, the character \xa0 is handled properly, but characters such as \x92, \x93 and \x94 are not normalized and remain unchanged in the result string.
Your data was decoded as ISO-8859-1 (aka latin1), but code points such as \x93 and \x94 are control characters in that encoding. In Windows-1252 (aka cp1252) the same bytes are so-called smart quotes (U+201C and U+201D). Normalization doesn't change those either, but at least they display correctly if the data is decoded properly.
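As a minimal sketch of that repair, assuming the string really came from cp1252 bytes that were mistakenly decoded as latin1: encoding back to latin1 recovers the original bytes, and decoding them as cp1252 yields the smart quotes. The variable names and the shortened sample string are illustrative, not taken from the original post.

```python
import unicodedata

# Illustrative excerpt of the string from the question.
s = 'the \x93Credit Agreement\x94), dated as of May\xa07, 2010'

# In latin1, \x93 is a C1 control character (general category "Cc").
print(unicodedata.category('\x93'))

# Undo the wrong latin1 decoding, then decode the same bytes as cp1252.
fixed = s.encode('latin1').decode('cp1252')

# ascii() makes the resulting code points visible regardless of terminal.
print(ascii(fixed))
print(fixed)
```

Note that a few byte values (for example 0x81 and 0x8d) are undefined in Python's cp1252 codec, so this round trip can raise UnicodeDecodeError on data that contains them.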
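Note that the \xa0 code point is U+00A0 (NO-BREAK SPACE), which normalizes to a SPACE under the compatibility forms (NFKC and NFKD) but is left alone by the canonical forms (NFC and NFD). It prints correctly even without normalization.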
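The behavior of the no-break space can be checked with a short sketch like the following. The sample strings are illustrative. It compares the NFKC and NFC forms and confirms that smart quotes have no decomposition, so normalization leaves them untouched.

```python
from unicodedata import name, normalize

text = 'May\xa07, 2010'

# Look up the official Unicode name of the code point.
print(name('\xa0'))

# Compatibility normalization maps NO-BREAK SPACE to a plain SPACE.
nfkc = normalize('NFKC', text)

# Canonical normalization leaves it unchanged.
nfc = normalize('NFC', text)

print(ascii(nfkc))
print(ascii(nfc))

# Smart quotes survive NFKC unchanged.
quotes = normalize('NFKC', '\u201cBorrower\u201d')
print(ascii(quotes))
```

If the goal is plain ASCII quotes, they have to be replaced explicitly (for example with str.translate), because no normalization form does it.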