Reports from Amazon's SP-API are generally in UTF-8, except for the ones from Japan, which are in CP932. I cannot figure out how to decode these into usable data.
I'm running Ruby 3.1.2 and using the amz_sp_api gem to connect to Amazon.
For CSV reports we are doing:
data = AmzSpApi.inflate_document(content, report_document)
csv_string = CSV.generate do |csv|
  data.gsub("\r", "").split("\n").each do |line|
    csv << line.split("\t")
  end
end
csv_string.force_encoding 'ASCII-8BIT'
csv = CSV.parse(csv_string, headers: true)
Which doesn't complain about anything, but the resulting data looks something like:
...
"ship-state"=>"\xE7\xA6\x8F\xE5\xB2\xA1\xE7\x9C\x8C",
If I force the encoding to be 'CP932' then when I try to parse the csv I get:
3.1.2/lib/ruby/3.1.0/csv/parser.rb:786:in `build_scanner': Invalid byte sequence in Windows-31J in line 2. (CSV::MalformedCSVError)
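One thing worth checking here: the bytes in the "ship-state" value above look like UTF-8 rather than CP932. The following is a minimal sketch, not a confirmed fix, that tests both interpretations and then parses a small tab-separated sample directly with CSV. The sample header row and the order id "123" are made up for illustration; only the "ship-state" bytes come from the output above.

```ruby
require "csv"

# Bytes copied from the "ship-state" value shown above (binary / ASCII-8BIT).
raw = "\xE7\xA6\x8F\xE5\xB2\xA1\xE7\x9C\x8C".b

# Re-tag copies of the same bytes; force_encoding changes the label only,
# it never rewrites the bytes.
as_utf8  = raw.dup.force_encoding(Encoding::UTF_8)
as_cp932 = raw.dup.force_encoding(Encoding::Windows_31J)

utf8_ok  = as_utf8.valid_encoding?   # do the bytes form valid UTF-8?
cp932_ok = as_cp932.valid_encoding?  # do the bytes form valid CP932?

# Made-up two-line report: header row plus one data row, tab separated.
tsv = "order-id\tship-state\n123\t".b + raw + "\n".b

# Tag the whole payload as UTF-8 and let CSV split on tabs itself,
# instead of rebuilding the CSV by hand.
rows  = CSV.parse(tsv.force_encoding(Encoding::UTF_8), col_sep: "\t", headers: true)
state = rows[0]["ship-state"]

puts "valid UTF-8: #{utf8_ok}, valid CP932: #{cp932_ok}"
puts state
```

If the UTF-8 check passes on the real report, that would suggest the payload is UTF-8 despite being labeled CP932, and tagging it as UTF-8 before parsing may be all that is needed.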
For the XML reports we are using Nokogiri and doing something like this:
data = AmzSpApi.inflate_document(content, report_document)
parsed_xml = Nokogiri::XML(data)
The resulting XML contains only part of the first node; the parse seems to fail silently.
In the above example data has:
data.encoding
=> #<Encoding:ASCII-8BIT>
You get the idea.
I obviously need to do SOMETHING to get all this to parse out properly but I am unclear what that something is.
I believe the data is being converted from a byte string to a regular string, but that must be happening automatically behind the scenes.
What doesn't work (but works for all Amazon reports in other regions that come down as UTF-8):
Output:
In the above, the XML is malformed and does not parse fully (hence only 1 order).
What works:
Output:
The issue seems to be that Nokogiri (and the other online parsers I tried) cannot handle the XML declaration that says the encoding is CP932:
<?xml version="1.0" encoding="CP932"?>
The above code with gsub also works for UTF-8 files (because it does nothing).
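For anyone reading this without the snippets, here is a rough sketch of the gsub idea, not necessarily the exact code used above. The helper name normalize_report_xml is hypothetical (it is not part of amz_sp_api or Nokogiri), and the sketch assumes the payload bytes are really UTF-8 even though the declaration says CP932. As an extra assumption, it falls back to transcoding from Windows-31J when the bytes are not valid UTF-8.

```ruby
# Hypothetical helper: returns a UTF-8 string whose XML declaration
# also says UTF-8, so the declaration and the bytes agree.
def normalize_report_xml(data)
  text = data.dup.force_encoding(Encoding::UTF_8)

  # Assumption: if the bytes are not valid UTF-8, treat them as real
  # CP932 (Windows-31J) and transcode them to UTF-8.
  unless text.valid_encoding?
    text = data.dup.force_encoding(Encoding::Windows_31J).encode(Encoding::UTF_8)
  end

  # Rewrite only the encoding attribute of a leading XML declaration.
  # Documents that already declare UTF-8 pass through unchanged.
  text.sub(/\A(<\?xml[^>]*encoding=)(["'])(?:CP932|Windows-31J|Shift_JIS)\2/i) do
    "#{$1}#{$2}UTF-8#{$2}"
  end
end

# Made-up sample document; the <State> bytes are the ones from the question.
sample = %(<?xml version="1.0" encoding="CP932"?><Order><State>\xE7\xA6\x8F\xE5\xB2\xA1\xE7\x9C\x8C</State></Order>).b

fixed = normalize_report_xml(sample)
puts fixed
```

The returned string can then be passed to Nokogiri::XML the same way as in the question.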
NOTE: If you use HTTParty instead of Faraday, the content encoding is UTF-8 instead of ASCII-8BIT, but the issue (and solution) remains the same.