How can I decode a string that was bytes-encoded three times?


I've been working with a pandas DataFrame in which one column was bytes-encoded. I decoded it once with .decode('utf-8'), which worked for most of the data, but some strings turned out to be encoded more than once. For example: b'b'b\'[{"charcName":"\\\\u0420\\\\u0438\\\\u0441\\\\u0443\\\\u043d\\\\u043e\\\\u043a","charcValues":["\\\\u043c\\\\u0438\\\\u043b\\\\u0438\\\\u0442\\\\u0430\\\\u0440\\\\u0438 \\\\u043a\\\\u0430\\\\u043c\\\\u0443\\\\u0444\\\\u043b\\\\u044f\\\\u0436"]}]\'''

I tried to decode it repeatedly (re-encoding in between to avoid the error 'str' object has no attribute 'decode'), but that doesn't seem to work. How can I decode such strings completely? In what order should utf-8 and unicode_escape decoding be applied?

Mark Tolonen On

The original string isn't a valid Python literal, so I manually stripped one malformed layer of bytes decoration and focused on decoding the remainder. Because that stripping was done by hand, this exact code won't work unchanged on the other entries. Tell the hacks upstream to fix it.

import ast
import json

s = b'b\'[{"charcName": "\\\\u0420\\\\u0438\\\\u0441\\\\u0443\\\\u043d\\\\u043e\\\\u043a", "charcValues": ["\\\\u043c\\\\u0438\\\\u043b\\\\u0438\\\\u0442\\\\u0430\\\\u0440\\\\u0438 \\\\u043a\\\\u0430\\\\u043c\\\\u0443\\\\u0444\\\\u043b\\\\u044f\\\\u0436"]}]\''
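# Layer 1: decode to a str holding a bytes literal, then evaluate it to get the inner bytes.
s = ast.literal_eval(s.decode())
# Layer 2: decode to a str holding a list literal, then evaluate it to get the list of dicts.
s = ast.literal_eval(s.decode())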

print('# Original object:')
print(s)
print('\n# Properly encoded in JSON (tell the hacks of the original data how to do it):')
print(json.dumps(s))
print('\n# Or this, but make sure to write this to a UTF-8-encoded database or file.')
print(json.dumps(s, ensure_ascii=False))

Output:

# Original object:
[{'charcName': 'Рисунок', 'charcValues': ['милитари камуфляж']}]

# Properly encoded in JSON (tell the hacks of the original data how to do it):
[{"charcName": "\u0420\u0438\u0441\u0443\u043d\u043e\u043a", "charcValues": ["\u043c\u0438\u043b\u0438\u0442\u0430\u0440\u0438 \u043a\u0430\u043c\u0443\u0444\u043b\u044f\u0436"]}]

# Or this, but make sure to write this to a UTF-8-encoded database or file.
[{"charcName": "Рисунок", "charcValues": ["милитари камуфляж"]}]
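
For a column with a varying number of wrapping layers, the same decode-then-evaluate idea can be put in a loop. The following is only a minimal sketch, not part of the answer above: the helper name `unwrap_bytes_layers`, the `max_layers` guard and the shortened sample value are made up for illustration, and it assumes every layer is a well-formed `b'...'` literal (the hand-fixed invalid string from the question would still fail). It also swaps the final `ast.literal_eval` for `json.loads`, which is an assumption that the innermost payload is JSON.

```python
import ast
import json


def unwrap_bytes_layers(value, max_layers=10):
    """Peel nested bytes-literal layers off `value` until a plain str remains.

    Hypothetical helper: assumes each layer is a well-formed b'...' literal.
    """
    for _ in range(max_layers):
        if isinstance(value, bytes):
            # A real bytes object: decode it to text.
            value = value.decode("utf-8")
            continue
        stripped = value.strip()
        if stripped[:2] in ("b'", 'b"'):
            # Text that looks like the repr of a bytes object: evaluate it safely.
            try:
                value = ast.literal_eval(stripped)
            except (ValueError, SyntaxError):
                break  # malformed layer, leave the value as it is
            continue
        break  # plain text, nothing left to unwrap
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    return value


# Shortened, made-up sample with the same layering as the cleaned string above.
raw = b'b\'[{"charcName": "\\\\u0420\\\\u0438\\\\u0441"}]\''
text = unwrap_bytes_layers(raw)
result = json.loads(text)  # json.loads resolves the \uXXXX escapes
print(result)
```

In this sketch no separate unicode_escape step is needed, because `json.loads` interprets the `\uXXXX` sequences itself once the bytes layers are gone. On a DataFrame, the helper could be applied per cell with something like `df["col"].map(unwrap_bytes_layers)`, where `"col"` is a placeholder column name.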