UnicodeEncodeError with in2csv

118 Views Asked by At

could you please help me with something?

I have very large .xlsx files (ranging from 200MB to 1.4GB) that I need to convert to .csv format.

I wanted to use a csvkit function in2csv:

(venv) C:\Users\Martin\csvkit_tutorial>in2csv first.xlsx > second.csv

However, when I run it, I get this error:

C:\Users\Martin\csvkit_tutorial\venv\Lib\site-packages\agate\utils.py:274: UnnamedColumnWarning: Column 0 has no name. Using "a".
UnicodeEncodeError: 'charmap' codec can't encode characters in position 137-146: character maps to <undefined>

I don't care if the first column is named "a". The main issue is the second error, UnicodeEncodeError.

The second.csv file is created, but only until the error has occured, i.e. only 133 lines out of thousands are converted into .csv.

I have looked online, and people have suggested that converting the encoding into, e.g. utf-8 standard would help, but I do not know how to do this within csvkit.

Thank you for your help.

TLDR: I tried to convert .xlsx file into .csv using csvkit function in2csv. It did not work properly, citing a UnicodeEncodeError, and only a small part of the file was converted.

2

There are 2 best solutions below

0
Alan On

There are 2 answers that come to mind quickly - they are easy to miss in the docs though.

in2csv has an encoding parameter: -e ENCODING, --encoding ENCODING

It also has --encoding-xls ENCODING_XLS - I am not sure what the difference is, but for a unicode error you probably want -e utf8

The second possibility is more universal (but I do not recommend) - Tips & Tricks explains:

Python standard output encoding errors

If, when running a command like csvlook dummy.csv | less you get an error like:

"ascii" codec can't encode character "\u0105" in position 2: ordinal not in range(128)

The simplest option is to set the encoding that Python uses for standard streams, using the PYTHONIOENCODING environment variable:

env PYTHONIOENCODING=utf8 csvlook dummy.csv | less

I don't recommend you set this universally, it is mentioned here for completeness.

0
Serge Ballesta On

csvkit always create the output csvfile with the default encoding. It has options for the input encoding (--encoding and --encoding-xls) but they are not used for output.

I could reproduce the error with a tiny file containing 2 cells with the chinese characters you have provided in your comment: 亚得科技工程有限公司 and 帝斯曼食品专业有限公司.

Setting the PYTHONIOENCODING environment variable to utf8 was enough to fix:

  • Windows syntax:

      set PYTHONIOENCODING=utf8
      in2csv first.xlsx > second.csv
    
  • Linux (or other Unix-like):

      PYTHONIOENCODING=utf8 in2csv first.xlsx > second.csv