The problem is that openFile assumes UTF-8 and the handle returns that as the encoding. The real problem is that I am getting files submitted (by students) encoded in UTF-16LE, which I want to identify so I can convert them to UTF-8. The files don't actually have anything outside the ASCII range, apart from the BOM markers, which the conversion to UTF-8 sorts out. I tried the following:
import System.Exit (ExitCode)
import System.IO
import System.Process (system)

-- Copy fname to "safe" ++ fname, converting from UTF-16LE first
-- whenever the handle's reported encoding starts with "UTF-16".
fixFileEncoding :: FilePath -> IO ExitCode
fixFileEncoding fname = do
  hdl <- openFile fname ReadMode
  menc <- hGetEncoding hdl
  hClose hdl
  case menc of
    Nothing -> system ("cp " ++ fname ++ " safe" ++ fname)
    Just enc -> do
      let encstr = show enc
      putStrLn ("@@@@@@" ++ fname ++ " is " ++ encstr)
      if take 6 encstr == "UTF-16"
        then system ("iconv -f UTF-16LE -t UTF-8 " ++ fname ++ " > safe" ++ fname)
        else system ("cp " ++ fname ++ " safe" ++ fname)
The "@@@@@" line reports UTF-8 regardless of the files actual encoding. I verify this by using the unix file command to observe filetypes.
Normally you know the encoding of a file from how it was produced. There are ad hoc solutions like BOMs, but they still rely on the producer adhering to such a format. Without a priori knowledge about the source of a file (as is the case for files submitted by students), the only way is to use heuristics. That's what file does. You can also implement a simple solution in Haskell using the bytestring and text libraries: read the file as raw bytes with Data.ByteString.readFile, then try the various decoding functions on it (Data.Text.Encoding contains the UTF ones) and see which succeeds.

To explain the result you observed in your attempt: when you open a file, the encoding is simply guessed from environment variables on your OS (specifically, the locale). That's why you always get the same result with hGetEncoding. All openFile has to go on is the name of the file, which is not enough context to guess the encoding of its contents.
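Since the question says the UTF-16LE files carry a BOM, detecting that BOM in the raw bytes is enough here. Below is a minimal sketch of that idea (my own illustration, not code from the answer): it assumes BOM-less files are already plain ASCII/UTF-8, as stated in the question, and reuses the question's "safe" ++ fname output naming.

import qualified Data.ByteString as BS
import qualified Data.Text.Encoding as TE

-- Sketch: if the file starts with the UTF-16LE BOM (0xFF 0xFE),
-- decode the payload as UTF-16LE and re-encode it as UTF-8;
-- otherwise copy the bytes through unchanged.
fixFileEncoding :: FilePath -> IO ()
fixFileEncoding fname = do
  bytes <- BS.readFile fname
  let out = "safe" ++ fname
  if BS.take 2 bytes == BS.pack [0xff, 0xfe]
    then BS.writeFile out (TE.encodeUtf8 (TE.decodeUtf16LE (BS.drop 2 bytes)))
    else BS.writeFile out bytes

Note that decodeUtf16LE throws an exception on malformed input; for untrusted submissions you may prefer decodeUtf16LEWith with a lenient error handler from Data.Text.Encoding.Error.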