I'm trying to analyse some Facebook Messenger data and I'm having trouble with UTF-8 encoding.
import os
import json
import datetime
from tqdm import tqdm
import csv
from datetime import datetime
directory = "facebook-100071636101603/messages/inbox"
folders = os.listdir(directory)
if ".DS_Store" in folders:
    folders.remove(".DS_Store")
for folder in tqdm(folders):
    print(folder)
    for filename in os.listdir(os.path.join(directory,folder)):
        if filename.startswith("message"):
            data = json.load(open(os.path.join(directory,folder,filename), "r"))
            for message in data["messages"]:
                try:
                    date = datetime.fromtimestamp(message["timestamp_ms"] / 1000).strftime("%Y-%m-%d %H:%M:%S")
                    sender = message["sender_name"]
                    content = message["content"]
                    with open('output.csv', 'w', encoding="utf-8") as csv_file:
                        writer = csv.writer(csv_file)
                        writer.writerow([date,sender,content])
                except KeyError:
                    pass
This script runs, but the output CSV doesn't show the accented characters correctly.
I'm very new to this, so I haven't tried a lot. I've read the Python csv documentation and found this passage:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
But this doesn't seem to work.
Edit: This is the output I'm getting, but it should be Jørn and not JÃ¸rn, and quête, not quÃªte.
Try adding encoding="utf-8" to the open() call that reads each JSON file. This ensures that every file you import is decoded as UTF-8 rather than with the system default encoding.
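As a rough sketch of what that change could look like, the snippet below reads one export file with an explicit encoding. The helper name load_messages and the sample data are made up for illustration; only the "messages", "sender_name" and "content" keys come from the question.

```python
import json
import os
import tempfile


def load_messages(path):
    """Load one Messenger export file, decoding its bytes explicitly as UTF-8."""
    # Passing encoding="utf-8" avoids depending on the system default encoding.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)["messages"]


# Self-contained check using a throwaway file (hypothetical sample data).
sample = {"messages": [{"sender_name": "Jørn", "content": "quête"}]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "message_1.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    messages = load_messages(path)

print(messages[0]["sender_name"], messages[0]["content"])
```

This only controls how the file's bytes are decoded. If the text inside the JSON was already mis-encoded when it was exported, reading it as UTF-8 will not repair it on its own.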
EDIT:
You need to install ftfy using pip install ftfy. This package will fix your broken encoding. Change sender and content so that each value is passed through ftfy.fix_text() before it is written to the CSV. You can use ftfy.fix_text(string) for any other broken encoding as well.
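A minimal sketch of that change, assuming the garbled values look like the ones in the question. The fix wrapper, the sample message and the in-memory buffer are illustrative, not part of ftfy or of the original script. The fallback branch is my own assumption for machines without ftfy: it re-encodes the text as Latin-1 and decodes it as UTF-8, which covers this particular kind of damage but far less than ftfy does.

```python
import csv
import io

try:
    import ftfy  # third-party: pip install ftfy

    def fix(text):
        # ftfy detects and repairs mis-decoded text
        return ftfy.fix_text(text)
except ImportError:
    def fix(text):
        # Fallback (assumption): undo UTF-8 bytes that were read as Latin-1
        try:
            return text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return text  # leave text alone if it does not fit this pattern


# Hypothetical message shaped like the entries in data["messages"]
messages = [{"sender_name": "JÃ¸rn", "content": "quÃªte"}]

rows = []
for message in messages:
    sender = fix(message["sender_name"])
    content = fix(message["content"])
    rows.append([sender, content])

# StringIO stands in for open('output.csv', 'w', newline='', encoding='utf-8')
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

In the real script, opening output.csv once before the loops, rather than inside them with mode 'w', keeps each row from overwriting the previous one.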