I have a big text file with different entries, some are just plain ASCII, some are UTF-8 and some are like double-UTF-8.
Here's the content of the file as cat shows it:
'Böker'
'für'
And here's what less would show:
'BÃ<U+0083>¶ker'
'für'
This is what I would like to get (clean ISO-8859-1):
'Böker'
'für'
Using iconv --from-code=UTF-8 --to-code=ISO-8859-1, this is the result:
'Böker'
'für'
Using iconv --from-code=UTF-8 --to-code=ISO-8859-1 twice (with the same parameters), it gives the correct ö, but the ü gets reinterpreted as well (output from less):
'Böker'
'f<FC>r'
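Spelled out, the two runs look roughly like this (input.txt and the pass file names are just placeholders):

# first pass: the plain-UTF-8 entries come out as ISO-8859-1, the
# double-encoded ones only come down to single UTF-8
iconv --from-code=UTF-8 --to-code=ISO-8859-1 input.txt > pass1.txt
# second pass, same parameters: the formerly double-encoded entries are now
# correct, but anything that is already ISO-8859-1 after the first pass is
# no longer valid UTF-8 input
iconv --from-code=UTF-8 --to-code=ISO-8859-1 pass1.txt > pass2.txt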
One approach would be to test, in bash, which Unicode encoding each string currently is in. I searched quite a lot for this, but couldn't find a suitable answer.
Another approach would be to have a program that converts the strings directly to the correct format, but I couldn't find another program like iconv, and since <FC> is a perfectly valid character in ISO-8859-1, neither using "-c" nor appending "//IGNORE" to the --to-code value changes the output.
It's impossible to solve this in a general way (what if both 'Böker' and 'BÃ¶ker' could be valid input?), but usually you can find a heuristic that works for your data.

Since you seem to have only or mostly German-language strings, the problematic characters are ÄÖÜäöüß. One approach would be to search every entry for these characters in ISO-8859-1, in UTF-8 and in double-encoded UTF-8. If a match is found, simply assume that this is the correct encoding.

If you're using bash, you can grep for the byte sequences using the $'\xnn' syntax. You only have to make sure that grep uses the C locale. An example for the character ö follows below (output from a UTF-8 console), but it's probably easier to solve this with a scripting language like Perl or Python.
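A minimal sketch of that check, assuming the entries sit one per line in a file named file.txt (the filename, the labels and the script layout are my own, and only ö is tested; a real run would cover all of ÄÖÜäöüß):

#!/bin/bash
# Classify each line by the byte sequence its 'ö' is stored as:
#   double-encoded UTF-8: c3 83 c2 b6
#   UTF-8:                c3 b6
#   ISO-8859-1:           f6
# LC_ALL=C makes grep work on raw bytes instead of multibyte characters.
while IFS= read -r line; do
    if printf '%s' "$line" | LC_ALL=C grep -q $'\xc3\x83\xc2\xb6'; then
        echo "double-UTF-8: $line"
    elif printf '%s' "$line" | LC_ALL=C grep -q $'\xc3\xb6'; then
        echo "UTF-8:        $line"
    elif printf '%s' "$line" | LC_ALL=C grep -q $'\xf6'; then
        echo "ISO-8859-1:   $line"
    else
        echo "unknown:      $line"
    fi
done < file.txt

The longest byte sequence is tested first, so a double-encoded entry is never mistaken for plain UTF-8; entries that happen to contain no ö end up in the unknown bucket until the remaining umlauts are added.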