Python email module behaves unexpectedly when trying to parse "raw" subject lines


I have trouble parsing an email which is encoded in windows-1251 and contains the following header (literally like that in the file):

Subject: Счета на оплату по заказу   .  .    

Here is a hexdump of that area:

000008a0  56 4e 4f 53 41 52 45 56  40 41 43 43 45 4e 54 2e  |VNOSAREV@ACCENT.|
000008b0  52 55 3e 0d 0a 53 75 62  6a 65 63 74 3a 20 d1 f7  |RU>..Subject: ..|
000008c0  e5 f2 e0 20 ed e0 20 ee  ef eb e0 f2 f3 20 ef ee  |... .. ...... ..|
000008d0  20 e7 e0 ea e0 e7 f3 20  20 20 2e 20 20 2e 20 20  | ......   .  .  |
000008e0  20 20 0d 0a 58 2d 4d 61  69 6c 65 72 3a 20 4f 73  |  ..X-Mailer: Os|
000008f0  74 72 6f 53 6f 66 74 20  53 4d 54 50 20 43 6f 6e  |troSoft SMTP Con|

I realize that this header doesn't adhere to the usual RFC 1342 style encoding (since superseded by RFC 2047) of =?charset?encoding?encoded-text?=, but I assume that many email clients will still display the subject correctly, and hence I would like to extract it correctly as well. For context: I am not making these emails up or creating them; they are given and I need to deal with them as is.

My approach so far was to use the email module that comes with Python:

import email

with open('data.eml', 'rb') as fp:
    content = fp.read()

mail = email.message_from_bytes(content)

print(mail.get('subject'))
# ����� �� ������ �� ������   .  .    

print(mail.get('subject').encode())
# '=?unknown-8bit?b?0ffl8uAg7eAg7u/r4PLzIO/uIOfg6uDn8yAgIC4gIC4gICAg?='

My questions are:

  1. can I somehow convince the email module to parse mails with subjects like this correctly?
  2. if not, can I somehow access the "raw" data of this header? i.e. the entries of mail._headers without accessing private properties?
  3. if not, can someone recommend a more versatile Python module for email parsing?

Some random observations:

a) Poking around in the internal data structure of mail, I arrived at [hd[1] for hd in mail._headers if hd[0] == 'Subject'] which is:

['\udcd1\udcf7\udce5\udcf2\udce0 \udced\udce0 \udcee\udcef\udceb\udce0\udcf2\udcf3 \udcef\udcee \udce7\udce0\udcea\udce0\udce7\udcf3   .  .    ']

b) According to the docs, mail.get_charsets() returns a list of character sets in the case of a multipart message, and it returns [None, 'windows-1251', None] here. So at least theoretically, the module does have a chance of guessing the correct charset.

For completeness, the SHA256 hash of the email file is 1aee4d068c2ae4996a47a3ae9c8c3fa6295a14b00d9719fb5ac0291a229b4038 (and I uploaded it to MalShare and VirusTotal).

Answer by Jesko Hüttenhain:

The string you are seeing is just a normal unicode string which contains a lot of characters from the low surrogate range. I am quite sure that in this case, the string came about by using the .decode method with a surrogateescape error handler. Indeed:

In [1]: a = "Счета на оплату по заказу"

In [2]: a.encode("windows-1251").decode("utf8", "surrogateescape")
Out[2]: '\udcd1\udcf7\udce5\udcf2\udce0 \udced\udce0 \udcee\udcef\udceb\udce0\udcf2\udcf3 \udcef\udcee \udce7\udce0\udcea\udce0\udce7\udcf3'

To undo the damage, you should be able to use .encode("utf8", "surrogateescape").decode("windows-1251").

It is unclear to me whether they actually used utf8 with the surrogateescape handler, and you would have to match the charset that they (incorrectly) decode with. However, since the string matches yours perfectly, I think utf8 is what is being used.
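As a minimal sketch of the round trip described above: the helper name undo_surrogateescape is made up for illustration, and the code assumes, like the answer, that the mangled string was produced with the utf8 codec and the surrogateescape handler.

```python
def undo_surrogateescape(value: str, charset: str) -> str:
    """Turn a surrogate-escaped string back into bytes, then decode it properly."""
    # Each lone surrogate U+DC80..U+DCFF maps back to the byte 0x80..0xFF.
    raw = value.encode("utf-8", "surrogateescape")
    return raw.decode(charset)


original = "Счета на оплату по заказу"
# Simulate the mangling: windows-1251 bytes decoded with the wrong codec.
mangled = original.encode("windows-1251").decode("utf-8", "surrogateescape")
restored = undo_surrogateescape(mangled, "windows-1251")
print(restored == original)
```

The surrogateescape handler is designed so that decoding and then re-encoding with the same codec gives back the original bytes, which is why the second step can pick any charset it likes.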

Answer by JosefZ:

mail.get_charsets() probably returns the right value. Decoding the bytes from the hexdump you provided (hard-coded here) as windows-1251 gives readable text:

x = '56 4e 4f 53 41 52 45 56  40 41 43 43 45 4e 54 2e' + \
    '52 55 3e 0d 0a 53 75 62  6a 65 63 74 3a 20 d1 f7' + \
    'e5 f2 e0 20 ed e0 20 ee  ef eb e0 f2 f3 20 ef ee' + \
    '20 e7 e0 ea e0 e7 f3 20  20 20 2e 20 20 2e 20 20' + \
    '20 20 0d 0a 58 2d 4d 61  69 6c 65 72 3a 20 4f 73' + \
    '74 72 6f 53 6f 66 74 20  53 4d 54 50 20 43 6f 6e'

print(bytes.fromhex(x).decode('windows-1251'))
# VNOSAREV@ACCENT.RU>
# Subject: Счета на оплату по заказу   .  .
# X-Mailer: OstroSoft SMTP Con

Answer by tripleee:

Your mail.get('subject').encode() does return exactly the bytes you put in. There is no "correctly" beyond this point; you have to know, or guess, the correct encoding.

mail.raw_items() returns what purports to be the "raw" headers from the message, but they are actually encoded. @Jesko's answer shows how to take the encoded value and transform it back to the original bytes, provided you know which encoding to use.

(The surrogate encoding is apparently a hack to allow Python to keep the raw bytes in a form which cannot accidentally leak back into a proper decoded string. You have to know how it was assembled and explicitly request it to be undone.)
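To illustrate, here is a small sketch of getting at the original bytes through raw_items() instead of the private mail._headers. The message content is a shortened, made-up stand-in for the real file, and the code assumes the parser decoded the header as ASCII with the surrogateescape handler (for a string of ASCII characters and lone surrogates, the utf8 variant from @Jesko's answer yields the same bytes).

```python
import email

# Shortened stand-in for the real message; only the header bytes matter here.
content = (
    b"Subject: \xd1\xf7\xe5\xf2\xe0 \xed\xe0 \xee\xef\xeb\xe0\xf2\xf3\r\n"
    b"\r\n"
    b"body\r\n"
)
mail = email.message_from_bytes(content)

# raw_items() yields (name, value) pairs as stored, surrogates included.
raw_subject = next(v for k, v in mail.raw_items() if k.lower() == "subject")

# Map the surrogates back to the bytes that were in the file.
raw_bytes = raw_subject.encode("ascii", "surrogateescape")
decoded = raw_bytes.decode("windows-1251")
print(raw_bytes)
print(decoded)
```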

Going out on a limb, you can try all the encodings of the body of the message, and check if any of them return a useful decoding.

The following uses the modern EmailMessage API, where mail.get('subject').encode() no longer behaves as in your example (perhaps this is a bug?):

import email
from email.policy import default

content = b'''\
From: <[email protected]>
Subject: \xd1\xf7\xe5\xf2\xe0 \xed\xe0 \xee\xef\xeb\xe0\xf2\xf3 \xef\xee \xe7\xe0\xea\xe0\xe7\xf3   .  .    
Content-type: text/plain; charset="windows-1251"

\xef\xf0\xe8\xe2\xe5\xf2
'''

# notice use of modern EmailMessage API, by specifying a policy
mail = email.message_from_bytes(content, policy=default)

# print(mail.get("subject"))

charsets = set(mail.get_charsets()) - {None}
for header, value in mail.raw_items():
    if header == "Subject":
        value = value.encode("utf-8", "surrogateescape")
        for enc in charsets:
            try:
                print(value.decode(enc))
                break
            except (UnicodeEncodeError, UnicodeDecodeError):
                pass

This crude heuristic could still misfire in a number of situations. If you know the encoding, it is probably best to hard-code it.
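One way to package the heuristic with a hard-coded fallback is sketched below. The function name guess_header and its fallbacks parameter are invented for this example, and it assumes the raw header was stored as ASCII plus surrogateescape. It also catches LookupError, in case a message declares a charset name that Python does not know.

```python
import email


def guess_header(mail, name, fallbacks=("windows-1251",)):
    """Decode a raw 8-bit header using the body charsets, then the fallbacks."""
    for key, value in mail.raw_items():
        if key.lower() == name.lower():
            break
    else:
        return None  # header not present
    try:
        raw = str(value).encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        return str(value)  # already real text, nothing to undo
    candidates = [c for c in mail.get_charsets() if c] + list(fallbacks)
    for enc in candidates:
        try:
            return raw.decode(enc)
        except (LookupError, UnicodeDecodeError):
            continue
    return raw.decode("ascii", "replace")  # last resort: mark unknown bytes


sample = email.message_from_bytes(
    b"Subject: \xd1\xf7\xe5\xf2\xe0\r\n"
    b'Content-Type: text/plain; charset="windows-1251"\r\n'
    b"\r\n"
    b"\xef\xf0\xe8\xe2\xe5\xf2\r\n"
)
subject = guess_header(sample, "Subject")
print(subject)
```

Note that single-byte encodings such as windows-1251 accept almost any byte sequence without error, so the first candidate usually wins whether or not it is the right one. The order of the candidates therefore matters more than the error handling.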

To the extent that mail clients are able to display the raw header correctly, I'm guessing it's mainly pure luck. If the system they are running on is set up to use code page 1251 by default, that probably helps some mail clients. Some mail clients also let you manually select an encoding for each message, so you can play around until you get the right one (and perhaps leave it at that setting if you receive many messages with this problem).