Before Yahoo Groups was closed, you could download the content of a group to an mbox file. I am trying to convert the mbox file to a series of HTML files, one for each message. My problem is dealing with the encoding and special characters in the HTML. Here is my attempt:
import mailbox
the_dir = "/path/to/file"
mbox = mailbox.mbox(the_dir + "12394334.mbox")
html_header = """<!DOCTYPE html>
<html>
<head>
<title>Email message</title>
</head>
<body>"""
html_footer = '</body></html>'
for message in mbox:
    mess_from = message['from']
    subject = message['subject']
    time_received = message['date']
    if message.is_multipart():
        content = ''.join(str(part.get_payload(decode=True)) for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    content = str(content)[2:].replace('\\n', '<br/>')
    subject.replace('/', '-')
    fname = subject + " " + time_received + '.html'
    with open(the_dir + 'html/' + fname , 'w') as the_file:
        the_file.write(html_header)
        the_file.write('<br/>' + 'From: ' + mess_from)
        the_file.write('<br/>' + 'Subject: ' + subject)
        the_file.write('<br/>' + 'Received: ' + time_received + '<br/><br/>')
        the_file.write(content)
The content of the message has backslashes before apostrophes and other special characters like this:
star rating, currently going for \xa311.99 [ideal Xmas present]. Advert over - Seroiusly, if you don't have a decent book on small boat
My question is: what is the best way to get the email message content and write it to the HTML file with the correct characters? I can't be the first one to run into this problem.
I found the answer to this question.
First, I needed to identify HTML parts by their subtype, using part.get_content_subtype().
Then I needed to get the character set using part.get_charsets(). There is also a part.get_charset(), but it always returned None for me, so I take the first element of get_charsets().
The decode=True parameter of get_payload is confusingly named: it only undoes the Content-Transfer-Encoding (base64 or quoted-printable) and returns bytes; it does not turn those bytes into text. I then decode the bytes using the charset I got earlier. Otherwise, I call get_payload with decode=False.
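The charset step can be sketched roughly as follows. This is a minimal illustration, not the code from the original answer: decode_part is a hypothetical helper name, the sample message is made up to mirror the \xa3 example from the question, and the utf-8 and latin-1 fallbacks are assumptions added for robustness.

```python
import email
from email.message import Message


def decode_part(part: Message) -> str:
    """Return the body of a non-multipart part as text (hypothetical helper)."""
    # decode=True undoes base64 / quoted-printable and returns bytes (or None).
    raw = part.get_payload(decode=True)
    if raw is None:
        return ""
    # For a single part, get_charsets() returns a one-element list.
    # Falling back to utf-8 when no charset is declared is an assumption.
    charset = part.get_charsets()[0] or "utf-8"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The message named a charset Python does not know (assumed fallback).
        return raw.decode("latin-1")


# Made-up sample: a quoted-printable part in ISO-8859-1, where =A3 is the pound sign.
raw_msg = (
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"currently going for =A311.99\r\n"
)
part = email.message_from_bytes(raw_msg)
text = decode_part(part)
```

Calling str() on the bytes, as in the question, is what produces the literal \xa3 in the output; decoding the bytes with the part's charset gives a real pound sign instead. part.get_content_charset() is another standard way to read the same charset parameter.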
If it is text, I strip out linefeeds etc., add an HTML header, and then write it to the file.
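For the plain-text case, one possible shape is sketched below. It is an assumption-laden illustration rather than the original answer's code: text_to_html_page is a hypothetical name, the header adds a meta charset tag that the question's header did not have, and line breaks are converted to <br/> (as in the question's attempt) rather than stripped.

```python
import html

HTML_HEADER = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Email message</title>
</head>
<body>
"""
HTML_FOOTER = "</body></html>\n"


def text_to_html_page(text: str) -> str:
    """Wrap already-decoded plain text in a small HTML page (hypothetical helper)."""
    # Escape &, < and > so the text cannot be mistaken for markup.
    escaped = html.escape(text, quote=False)
    # Normalise line endings, then turn every newline into a line break.
    escaped = escaped.replace("\r\n", "\n").replace("\r", "\n")
    body = escaped.replace("\n", "<br/>\n")
    return HTML_HEADER + body + HTML_FOOTER


page = text_to_html_page("a < b\r\ncurrently going for £11.99")
```

When writing the page out, opening the file with open(path, "w", encoding="utf-8") keeps the bytes on disk consistent with the charset declared in the header.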
Next jobs,
text
print('Done!')