How can you determine a file's text encoding in Haskell without using openFile to get a handle?

The problem is that openFile assumes UTF-8, and the handle reports that as its encoding. What I actually need: students submit files encoded in UTF-16LE, and I want to identify those files so I can convert them to UTF-8. The files contain nothing outside the ASCII range apart from the BOM, which the conversion to UTF-8 takes care of. I tried the following:

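import System.Exit    (ExitCode)
import System.IO      (IOMode (ReadMode), hClose, hGetEncoding, openFile)
import System.Process (system)

-- Copy or convert fname to "safe" ++ fname, based on the handle's encoding.
fixFileEncoding :: FilePath -> IO ExitCode
fixFileEncoding fname =
  do hdl <- openFile fname ReadMode
     menc <- hGetEncoding hdl
     hClose hdl
     case menc of
       Nothing   ->  system ("cp "++fname++" safe"++fname)
       Just enc  ->
         do let encstr = show enc
            putStrLn ("@@@@@@" ++ fname ++ " is "++encstr)
            if take 6 encstr == "UTF-16"
            then
              system ("iconv -f UTF-16LE -t UTF-8 "++fname++" > safe"++fname)
            else
              system ("cp "++fname++" safe"++fname)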

The "@@@@@@" line reports UTF-8 regardless of the file's actual encoding. I verified this by using the Unix file command to check each file's type.

Li-yao Xia On

Normally you know the encoding of a file from how it was produced. There are ad hoc solutions like BOMs but they still rely on the producer adhering to such a format. Without a priori knowledge about the source of a file (as is the case for files submitted by students), the only way is to use heuristics. That's what file does. You can also implement a simple solution in Haskell using the libraries bytestring and text:

  1. read a file in binary as a bytestring, using Data.ByteString.readFile,
  2. try decoding it with some guessed encodings (Data.Text.Encoding contains the UTF ones),
  3. keep the one that succeeds; more heuristics may be needed if more than one encoding is applicable.
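
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a robust detector: the `Guess` type and the functions `guessEncoding` and `toUtf8` are hypothetical names introduced here, and the heuristic (check for a UTF-16 BOM first, then fall back to strict UTF-8 validation) assumes the files carry a BOM, as the question says they do. Only functions from `bytestring` and `text` are used.

```haskell
import qualified Data.ByteString          as BS
import           Data.Either              (isRight)
import qualified Data.Text.Encoding       as TE
import           Data.Text.Encoding.Error (lenientDecode)
import           System.Environment       (getArgs)

-- | Hypothetical result type for this sketch.
data Guess = Utf16LE | Utf16BE | Utf8 | Unknown
  deriving (Eq, Show)

-- | Guess the encoding from the raw bytes.
-- Assumption: UTF-16 files start with a BOM. Plain ASCII is reported
-- as Utf8, since every ASCII file is also valid UTF-8.
guessEncoding :: BS.ByteString -> Guess
guessEncoding bs
  | BS.pack [0xFF, 0xFE] `BS.isPrefixOf` bs = Utf16LE
  | BS.pack [0xFE, 0xFF] `BS.isPrefixOf` bs = Utf16BE
  | isRight (TE.decodeUtf8' bs)             = Utf8
  | otherwise                               = Unknown

-- | Re-encode as UTF-8 without a BOM, or Nothing if the guess failed.
-- Malformed UTF-16 is replaced by U+FFFD (lenientDecode) rather than
-- raising an exception.
toUtf8 :: BS.ByteString -> Maybe BS.ByteString
toUtf8 bs = case guessEncoding bs of
  Utf16LE -> Just (TE.encodeUtf8 (TE.decodeUtf16LEWith lenientDecode (BS.drop 2 bs)))
  Utf16BE -> Just (TE.encodeUtf8 (TE.decodeUtf16BEWith lenientDecode (BS.drop 2 bs)))
  Utf8    -> Just bs
  Unknown -> Nothing

-- | Print the guessed encoding of each file named on the command line.
main :: IO ()
main = do
  files <- getArgs
  mapM_ report files
  where
    report f = do
      bs <- BS.readFile f   -- binary read: no text decoding happens here
      putStrLn (f ++ ": " ++ show (guessEncoding bs))
```

Because the file is read as raw bytes, the locale plays no part in the result. A file without a BOM that happens to be UTF-16 would not be recognised by this sketch; that is where the extra heuristics mentioned in step 3 would come in.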

To explain the result you observed in your attempt: when you open a file, the encoding is not detected from the file's contents; it is simply taken from your OS environment (specifically, the locale). That's why you always get the same result with hGetEncoding. All openFile has to go on is the name of the file, which is not enough context to guess the encoding of a file.
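
A small experiment can make this visible. The sketch below writes a few UTF-16LE bytes to a scratch file, opens it with openFile, and compares the handle's encoding with the locale encoding from `GHC.IO.Encoding.getLocaleEncoding`. The file name `sample-utf16le.txt` and the helper `handleMatchesLocale` are made up for illustration.

```haskell
import GHC.IO.Encoding (getLocaleEncoding)
import System.IO

-- | Write UTF-16LE bytes (BOM, then "hi") to the given path, reopen it
-- in text mode, and check whether the handle's encoding has the same
-- name as the locale encoding.
handleMatchesLocale :: FilePath -> IO Bool
handleMatchesLocale path = do
  -- Binary mode: each Char below 256 is written as a single byte.
  withBinaryFile path WriteMode $ \h ->
    hPutStr h (map toEnum [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00])
  locale <- getLocaleEncoding
  menc   <- withFile path ReadMode hGetEncoding
  putStrLn ("locale encoding: " ++ show locale)
  putStrLn ("handle encoding: " ++ maybe "none (binary)" show menc)
  pure (fmap show menc == Just (show locale))

main :: IO ()
main = do
  same <- handleMatchesLocale "sample-utf16le.txt"  -- hypothetical scratch file
  putStrLn ("handle uses locale encoding: " ++ show same)
```

Note that hGetEncoding returns Nothing only for handles in binary mode, so the Nothing branch in the question's code would not be reached for a handle obtained from openFile.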