How to encode a character string in multiple languages in R?

I am having trouble with how R handles characters from different languages. I have a multilingual data set (PL, HR, EN, FR, GE, IT) and created a keyword string to filter it. However, R does not recognize all of my characters in every language; it converts some of them, which is a problem.

For example, suppose I want to look for the word "łapać" in my data using the keyword string. R then filters for "lapac" instead and does not find the word, because the data itself was read in correctly and still contains the original spelling:

catch <- "łapać"
catch
[1] "lapac"

I tried out different things and for some characters/languages it is working. For example:

things <- "ćłßöüžỳđčšśęıчуй"
things
[1] "clßöüžỳdcšseiчуй"

As you can see, some characters are displayed as they should be (ö, ü, ž, and even Cyrillic ones like ч or й), while others are converted (ćł to cl).
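
To tell whether a string was converted when it was created, rather than only when it was printed, it can help to look at its code points. The following is a minimal sketch using base R's `utf8ToInt()` and `enc2utf8()`; the helper name `has_non_ascii` is hypothetical, and the intended word is written with `\u` escapes so that the example does not depend on how the script file is decoded.

```r
# Hypothetical helper: TRUE if the string still holds any non-ASCII character.
has_non_ascii <- function(x) {
  any(utf8ToInt(enc2utf8(x)) > 127L)
}

converted <- "lapac"            # what R displayed in the question
intended  <- "\u0142apa\u0107"  # "łapać" written with Unicode escapes

has_non_ascii(converted)  # FALSE: only plain ASCII letters are left
has_non_ascii(intended)   # TRUE: ł and ć are still present
utf8ToInt(intended)       # 322 97 112 97 263 (ł is U+0142, ć is U+0107)
```

If the keyword typed in the script reports `FALSE` here, the characters were already lost when the string was parsed, so changing how it is displayed afterwards cannot bring them back.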

I tried reopening the document with a different encoding and changing the encoding settings:

options(encoding = "utf-8")

Encoding(things) <- "UTF-8"
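
A likely reason the `Encoding()` call has no effect: it only changes the declared encoding mark of a string and does not convert or restore any characters. The sketch below illustrates this and shows one possible workaround, which is to write the keyword with `\u` escapes; the variable names `keyword` and `haystack` and the sample values are made up for illustration.

```r
# Setting the mark on an already converted string changes nothing:
x <- "lapac"
Encoding(x) <- "UTF-8"
Encoding(x)            # "unknown": ASCII strings never carry an encoding mark
identical(x, "lapac")  # TRUE: the lost characters are not restored

# Writing the keyword with Unicode escapes avoids the conversion step.
keyword <- "\u0142apa\u0107"  # "łapać"
Encoding(keyword)             # "UTF-8"

# Made-up sample data: one converted entry, one with the original spelling.
haystack <- c("lapac", "\u0142apa\u0107 pi\u0142k\u0119")
grepl(keyword, haystack, fixed = TRUE)  # FALSE TRUE
```

With escapes the keyword may still print as `<U+0142>` in a console that cannot display the character, but the comparison against the data works on the actual code points.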

I also tried it with different R versions on two different Windows computers.

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit) 
Running under: Windows 10 x64 (build 17763)
locale:
[1] LC_COLLATE=German_Germany.1250  LC_CTYPE=German_Germany.1250   
[3] LC_MONETARY=German_Germany.1250 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1250

There is 1 solution below.

Answer by Doło:

When running

Sys.setlocale(category = "LC_ALL", locale = ".1250")

it works! See:

catch <- "łapać"
catch
[1] "łapać"

However, it does not work perfectly, as seen in

things <- "ćłßöüžỳđčšśęıчуй"
things
[1] "ćłßöüžỳđčšśęiчуй"

where the Turkish ı is still an i. Since I won't use Turkish, that's fine for me so far.
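
Because changing the locale affects the whole session, it may be worth saving the previous setting and checking whether the request succeeded. This is a minimal sketch under two assumptions: that `LC_CTYPE` alone is the relevant category (the call above sets `LC_ALL`), and that `.1250` is only available on Windows. The names `old_ctype` and `new_ctype` are illustrative.

```r
# Remember the current setting so it can be restored later.
old_ctype <- Sys.getlocale("LC_CTYPE")

# ".1250" is the Windows code page used above. If the request cannot be
# honoured, Sys.setlocale() warns and returns an empty string.
new_ctype <- suppressWarnings(Sys.setlocale("LC_CTYPE", ".1250"))

if (identical(new_ctype, "")) {
  message("Locale '.1250' is not available here; keeping ", old_ctype)
} else {
  message("LC_CTYPE is now ", new_ctype)
}

# ... run the keyword filtering here ...

# Restore the original setting when done.
Sys.setlocale("LC_CTYPE", old_ctype)
```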

Thank you!