I realized that accents in my texts get converted to �. I boiled it down to the following example, which writes (and overwrites) the file test.txt.
It uses exclusively methods from Data.Text, which are supposed to handle Unicode text. I checked that both the source file and the output file are encoded in UTF-8.
{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (writeFile)
import Data.Text
import Data.Text.IO
someText :: Text
someText = "Université"
main :: IO ()
main = do
  writeFile "test.txt" someText
After running the code, test.txt contains: Universit�. In GHCi, I get the following:
*Main> someText
"Universit\233"
Is this already encoded incorrectly? I also found a comment on � in https://hackage.haskell.org/package/text-1.2.2.2/docs/Data-Text.html, but I still do not know how to correct the example above.
How do I use accents in an OverloadedString and correctly write them to a file?
This has nothing to do with Data.Text, and certainly not with OverloadedStrings; both handle Unicode just fine.
However, Data.Text.IO will not write a BOM or anything else that indicates the encoding, i.e. the file really just contains the text as-is. On any modern system, this means it will be in raw UTF-8 form, with é stored as the two bytes C3 A9.
So depending on which editor you open the file with, it may guess the wrong encoding, and that's apparently your issue. On Linux, UTF-8 has long been the standard, so there is no issue there, but Windows isn't so up to date. It should be possible to select the encoding manually in any editor, though.
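To see what "raw UTF-8 form" means for this particular string, here is a small sketch that encodes it explicitly and prints the resulting bytes in hex. It assumes the text and bytestring packages are available; hexBytes is a helper name made up for this illustration, not something from either library.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import Data.Text (Text)
import Data.Text.Encoding (encodeUtf8)
import Numeric (showHex)

someText :: Text
someText = "Université"

-- Hypothetical helper: render each byte as two lowercase hex digits.
hexBytes :: BS.ByteString -> String
hexBytes = unwords . map byteToHex . BS.unpack
  where
    byteToHex b =
      let h = showHex b ""
      in if length h < 2 then '0' : h else h

main :: IO ()
main = putStrLn (hexBytes (encodeUtf8 someText))
-- prints: 55 6e 69 76 65 72 73 69 74 c3 a9
```

The trailing c3 a9 pair is the é. An editor that assumes a single-byte encoding will display those two bytes as two separate characters, or as � if it cannot map them.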
In fact, Data.Text.IO.writeFile will use your locale to decide how to encode the file. Everybody should have UTF-8 as their locale nowadays; if you don't, please change that.
To get a BOM in your file and thus preclude such issues, use utf8_bom, e.g. by passing it to hSetEncoding on the output handle.
Regarding the output you see in GHCi: that's the Show instance at work; it escapes any string-like values to the safest conceivable form, i.e. anything that's not ASCII becomes an escape sequence, which for 'é' happens to be '\233'. Again, this is not specific to Text; in fact you get this even for single characters: evaluating 'é' at the GHCi prompt prints '\233'.
This escaping never happens when you use the direct IO output actions for your string types, i.e. putChar, putStr or putStrLn.
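As for the utf8_bom suggestion above, one way to apply it to the example from the question might look like the following. This is a minimal sketch rather than the only approach: it sets the encoding on the file handle explicitly instead of relying on the locale, and writeUtf8Bom is a hypothetical helper name introduced here for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text.IO as T
import System.IO (IOMode (WriteMode), hSetEncoding, utf8_bom, withFile)

someText :: Text
someText = "Université"

-- Hypothetical helper: write a Text value as UTF-8 with a leading BOM,
-- regardless of what the current locale encoding is.
writeUtf8Bom :: FilePath -> Text -> IO ()
writeUtf8Bom path txt =
  withFile path WriteMode $ \h -> do
    -- Override the locale-derived default encoding for this handle only.
    hSetEncoding h utf8_bom
    T.hPutStr h txt

main :: IO ()
main = writeUtf8Bom "test.txt" someText
```

If you want UTF-8 output without a BOM, passing utf8 (also exported by System.IO) to hSetEncoding instead should work the same way.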