How do I know if my Stata file needs unicode translate?


I suspect I do not quite grasp the difference between extended ASCII and Unicode, or what each means to Stata, so I cannot tell whether I need to run unicode translate on my Stata files. The files contain a lot of text with Nordic characters such as å, ä, and ö that I need to analyse, so this is potentially important.

In browse mode, the text in the string variables of the original file looks fine to me (e.g. "myös"). But running unicode analyze suggests using unicode translate. However, if I run unicode translate (with an encoding that supports the Nordic characters that matter in my data), the text in browse mode looks all wrong, with weird characters in place of å, ä, and ö (e.g. "myäs" instead of "myös").
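To make the distinction concrete, here is a minimal sketch in Python (outside Stata, and not part of what I ran). It assumes Latin-1 as a stand-in for "extended ASCII"; the actual legacy encoding of my file is unknown. It shows that the same word has different bytes in each scheme, and that reading bytes under the wrong scheme gives garbled text.

```python
# Same word, two different byte representations.
word = "myös"

# Latin-1 (assumed here as the "extended ASCII" example): one byte per character.
latin1_bytes = word.encode("latin-1")
# UTF-8: the character "ö" is stored as two bytes.
utf8_bytes = word.encode("utf-8")

print(latin1_bytes)  # b'my\xf6s'
print(utf8_bytes)    # b'my\xc3\xb6s'

# Reading UTF-8 bytes as if they were Latin-1 produces garbled text (mojibake).
misread = utf8_bytes.decode("latin-1")
print(misread)       # myÃ¶s


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The legacy bytes are not valid UTF-8, while the UTF-8 bytes are.
print(is_valid_utf8(latin1_bytes))  # False
print(is_valid_utf8(utf8_bytes))    # True
```

So whether text "looks fine" depends on which encoding the viewer assumes for the stored bytes, not only on the bytes themselves.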

I ran into this issue because someone else had already translated the data, and the text was messed up. Using unicode restore, I eventually found that the original looks fine. But I do not understand why that person wanted to translate it in the first place if it was already fine.

So my question is: if the data LOOKS fine to me, does that mean it is then fine for Stata as well? And if it LOOKS wrong to me, can it still be fine for Stata?

I have tried:

// browse mydata.dta: looks fine

unicode encoding set iso-8859_10-1998
unicode analyze mydata.dta

        1  file(s) specified
        1  file(s) to be examined ...

  File mydata.dta (Stata dataset)
        1 str# variable needs translation
          -------------------------------------------------------------------------------------------------------------------------------
          Some elements of the file appear to be UTF-8 already.  Sometimes elements that need translating can look like UTF-8.  Look at
          these example(s):
              variable name "xyzö"
              contents of str# variable ntrl
          Do they look okay to you?
          If not, the file needs translating or retranslating with the transutf8 option.  Type
              . unicode   translate "mydata.dta", transutf8
              . unicode retranslate "mydata.dta", transutf8
          -------------------------------------------------------------------------------------------------------------------------------
          File needs translation.  Use unicode translate on this file.

unicode translate mydata.dta

// browse mydata.dta: Nordic characters in text all messed up

I have tried different encodings as well, but none seems to work. Based on help encodings, I understood that because of the Nordic letters the encoding should be iso-8859_10-1998.
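One way to compare candidate encodings outside Stata is to decode a few raw bytes under each and look at the result. The following is only a rough Python sketch: the sample bytes and the list of candidates are hypothetical, the helper name `try_encodings` is made up for illustration, and Python's codec names differ from the encoding names Stata uses.

```python
from typing import Dict, List, Optional


def try_encodings(raw: bytes, candidates: List[str]) -> Dict[str, Optional[str]]:
    """Decode raw bytes under each candidate encoding.

    Maps each encoding name to the decoded text, or to None when the
    bytes are invalid in that encoding or the codec name is unknown.
    """
    results: Dict[str, Optional[str]] = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            results[name] = None
    return results


# Hypothetical sample: "myös" as it would be stored in a single-byte legacy encoding.
sample = b"my\xf6s"

decoded = try_encodings(sample, ["utf-8", "latin-1", "cp1252", "iso8859_10"])
for name, text in decoded.items():
    print(f"{name:12} -> {text!r}")
```

An encoding that yields None, or text with the wrong characters, is a poor candidate for those bytes. Several single-byte encodings can still agree on a given character, so a short sample may not single out one answer.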
