How to use UTF-8 encoding for Unicode objects in Python correctly

One of my Python modules has a name string that contains non-ASCII characters. Logging an object that holds this string raises a UnicodeDecodeError. For example:

# coding: UTF-8

import logging

root_logger = logging.getLogger()
root_logger.setLevel(logging.DEBUG) # or whatever
handler = logging.FileHandler('example.log', 'w', 'utf-8') # or whatever
formatter = logging.Formatter('%(name)s %(message)s') # or whatever
handler.setFormatter(formatter) # Pass handler as a parameter, not assign
root_logger.addHandler(handler)

class C(object):

  def __init__(self, name):
    self._name = name

  def __str__(self):
    print("__str__ start")
    return self.to_unicode().encode("utf-8")

  def __repr__(self):
    print("__repr__ start")
    return self.to_unicode().encode("utf-8")

  def to_unicode(self):
    print("to_unicode start")
    return u"name:{}".format(self._name)

obj = C(name="vm_nearsync_한국")

logging.debug(u"obj:{}".format(obj))

It returns the error below:

__str__ start
to_unicode start
Traceback (most recent call last):
  File "test.py", line 31, in <module>
    logging.debug(u"obj:{}".format(obj))
  File "test.py", line 19, in __str__
    return self.to_unicode().encode("utf-8")
  File "test.py", line 27, in to_unicode
    return u"name:{}".format(self._name)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 12: ordinal not in range(128)

Python is trying to decode the byte string name into unicode with the default ascii codec, whereas I expect it to use UTF-8.
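The implicit step can be reproduced explicitly. The sketch below is a minimal illustration, not the code from the question; the names `raw`, `failed_at`, and `decoded` are illustrative, and the `__future__` imports are only there so the same file behaves the same way on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# The UTF-8 bytes that a plain "..." literal holds on Python 2
# when the source file is saved as UTF-8.
raw = "vm_nearsync_한국".encode("utf-8")

# Formatting a byte string into a unicode template on Python 2 decodes it
# implicitly with the default codec, which is roughly raw.decode("ascii").
try:
    raw.decode("ascii")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first byte ascii cannot decode

# Naming the codec removes the dependency on the default encoding.
decoded = raw.decode("utf-8")

print(failed_at, len(raw), len(decoded))
```

The failing index matches the "position 12" in the traceback: the first twelve bytes are ASCII, and the thirteenth is the first byte of the UTF-8 sequence for 한.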

It only works when I change the default system encoding as below:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

My Python version is 2.7.5.

Is there any way to work around this without changing the system's default encoding? The object can hold many such strings, and the software logs data in many places.

1 Answer

Answered by Sandeep Parmar

What @deceze said in the comments is correct: unicode and byte strings should not be mixed. If mixing is really needed, the code should specify the encoding explicitly and not rely on the system's default encoding.
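One way to keep the encoding explicit in a single place is a small conversion helper. The sketch below is only an illustration: `to_text` is a hypothetical name, not part of the standard library or of the question's code, and it assumes that incoming byte strings are UTF-8:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings with an explicit codec."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value  # already text (or a non-string object): pass it through


# UTF-8 bytes of the name from the question, written with escapes.
name_bytes = b"vm_nearsync_\xed\x95\x9c\xea\xb5\xad"

# The template is unicode, and the argument is decoded before formatting,
# so no implicit ascii decode takes place.
message = "name:{}".format(to_text(name_bytes))
```

Calling such a helper wherever byte strings enter the program means the formatting and logging calls further down only ever see unicode.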

The approaches below work.

Approach 1 (use unicode everywhere):

class C(object):

  def __init__(self, name):
    self._name = name

  def __str__(self):
    print("__str__ start")
    return self.to_unicode()

  def __repr__(self):
    print("__repr__ start")
    return self.to_unicode()

  def to_unicode(self):
    print("to_unicode start")
    return u"name:{}".format(self._name)

obj = C(name=u"vm_nearsync_한국")

logging.debug(u"obj:{}".format(obj))

Approach 2 (explicitly decode the byte string to unicode with UTF-8):

class C(object):

  def __init__(self, name):
    self._name = name

  def __str__(self):
    print("__str__ start")
    return self.to_unicode().encode("utf-8")

  def __repr__(self):
    print("__repr__ start")
    return self.to_unicode().encode("utf-8")

  def to_unicode(self):
    print("to_unicode start")
    return u"name:{}".format(self._name)

obj = C(name=u"vm_nearsync_한국")

# Here is the change: decode the UTF-8 bytes returned by __str__ explicitly
logging.debug(u"obj:{}".format(unicode(str(obj), "utf-8")))

or

class C(object):

  def __init__(self, name):
    self._name = name

  def __str__(self):
    print("__str__ start")
    return self.to_unicode()

  def __repr__(self):
    print("__repr__ start")
    return self.to_unicode()

  def to_unicode(self):
    print("to_unicode start")
    
    # Here is the change: decode the byte string explicitly as UTF-8
    return u"name:{}".format(self._name.decode("utf-8"))

obj = C(name="vm_nearsync_한국")

logging.debug(u"obj:{}".format(obj))
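A further variation, not taken from the answer above: on Python 2 a class can define `__unicode__` for text and keep `__str__` returning UTF-8 bytes, so that both `unicode(obj)` and `str(obj)` work without touching the default encoding. This is a minimal sketch, assuming byte-string names are UTF-8; the `PY2` flag is an illustrative helper that lets the same file also run on Python 3:

```python
# -*- coding: utf-8 -*-
import sys

PY2 = sys.version_info[0] == 2  # illustrative flag, not from the original post


class C(object):

    def __init__(self, name):
        # Decode once at the boundary so the stored name is always text.
        if isinstance(name, bytes):
            name = name.decode("utf-8")
        self._name = name

    def to_unicode(self):
        return u"name:{}".format(self._name)

    if PY2:
        def __unicode__(self):
            # Used by unicode(obj) and by u"{}".format(obj) on Python 2.
            return self.to_unicode()

        def __str__(self):
            # Byte-string form, encoded explicitly as UTF-8.
            return self.to_unicode().encode("utf-8")
    else:
        def __str__(self):
            # On Python 3, str is already text.
            return self.to_unicode()


obj = C(name=b"vm_nearsync_\xed\x95\x9c\xea\xb5\xad")  # UTF-8 bytes of the name
line = u"obj:{}".format(obj)
```

With this split, call sites such as `logging.debug(u"obj:{}".format(obj))` can stay unchanged, because the class decides how it converts to text and to bytes.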