I have trouble parsing an email which is encoded in win-1252 and contains the following header (literally like that in the file):
Subject: Счета на оплату по заказу . .
Here is a hexdump of that area:
000008a0 56 4e 4f 53 41 52 45 56 40 41 43 43 45 4e 54 2e |VNOSAREV@ACCENT.|
000008b0 52 55 3e 0d 0a 53 75 62 6a 65 63 74 3a 20 d1 f7 |RU>..Subject: ..|
000008c0 e5 f2 e0 20 ed e0 20 ee ef eb e0 f2 f3 20 ef ee |... .. ...... ..|
000008d0 20 e7 e0 ea e0 e7 f3 20 20 20 2e 20 20 2e 20 20 | ...... . . |
000008e0 20 20 0d 0a 58 2d 4d 61 69 6c 65 72 3a 20 4f 73 | ..X-Mailer: Os|
000008f0 74 72 6f 53 6f 66 74 20 53 4d 54 50 20 43 6f 6e |troSoft SMTP Con|
I realize that this encoding doesn't adhere to the usual RFC 1342 style encoding of =?charset?encoding?encoded-text?= but I assume that many email clients will still correctly display the subject and hence I would like to extract it correctly as well. For context: I am not making these emails up or creating them, they are given and I need to deal with them as is.
My approach so far was to use the email module that comes with Python:
import email
with open('data.eml', 'rb') as fp:
content = fp.read()
mail = email.message_from_bytes(content)
print(mail.get('subject'))
# ����� �� ������ �� ������ . .
print(mail.get('subject').encode())
# '=?unknown-8bit?b?0ffl8uAg7eAg7u/r4PLzIO/uIOfg6uDn8yAgIC4gIC4gICAg?='
My questions are:
- can I somehow convince the
emailmodule to parse mails with subjects like this correctly? - if not, can I somehow access the "raw" data of this header? i.e. the entries of
mail._headerswithout accessing private properties? - if not, can someone recommend a more versatile Python module for email parsing?
Some random observations:
a) Poking around in the internal data structure of mail, I arrived at [hd[1] for hd in mail._headers if hd[0] == 'Subject'] which is:
['\udcd1\udcf7\udce5\udcf2\udce0 \udced\udce0 \udcee\udcef\udceb\udce0\udcf2\udcf3 \udcef\udcee \udce7\udce0\udcea\udce0\udce7\udcf3 . . ']
b) According to the docs, mail.get_charsets() returns a list of character sets in case of multipart message, and it returns [None, 'windows-1251', None] here. So at least theoretically, the modules does have a chance to guessing the correct charset.
For completeness, the SHA256 has of the email file is 1aee4d068c2ae4996a47a3ae9c8c3fa6295a14b00d9719fb5ac0291a229b4038 (and I uploaded it to MalShare and VirusTotal).
The string you are seeing is just a normal unicode string which contains a lot of characters from the low surrogate range. I am quite sure that in this case, the string came about by using the
.decodemethod with asurrogateescapeerror handler. Indeed:To undo the damage, you should be able to use
.encode("utf8", "surrogateescape").decode("windows-1251").It is unclear to me whether they actually used
utf8with thesurrogateescapehandler, and you would have to match the charset that they (incorrectly) decode with. However, since the string matches yours perfectly, I thinkutf8is what is being used.