How to correct extra characters in strings due to language specific special characters in R?

I have two virtually equivalent strings. They look the same.

str1<-"Diş Hekimliği Fakültesi"
str2<-"Diş Hekimliği Fakültesi"

But when I call nchar() on them, they return 26 and 23 characters respectively. And when I use strsplit():

strsplit(str1,split="")
[[1]]
 [1] "D" "i" "s" "̧"   " " "H" "e" "k" "i" "m" "l" "i" "g" "̆"   "i" " " "F" "a" "k" "u" "̈"   "l" "t" "e" "s" "i"

strsplit(str2,split="")
[[1]]
 [1] "D" "i" "ş" " " "H" "e" "k" "i" "m" "l" "i" "ğ" "i" " " "F" "a" "k" "ü" "l" "t" "e" "s" "i"

Each language specific special character is counted as two characters. How can I make str1 into str2? My only manual solution was using gsub().

P.S. Unfortunately I cannot reproduce this example here in full. When you copy and paste the code, both strings come out as 23 characters. It seems to be something to do with copy-pasting here.

There is 1 best solution below

IRTFM On

The iconv function is a system-specific function that converts character vectors between encodings. The companion function iconvlist returns a vector of the encoding names that your OS facility supports. I ran through all 419 such encodings on my system, with the help of sapply and try, to see whether any of them would convert str1 (23 characters) to 26 characters or vice versa, and found two such encodings on my machine. I use a Mac and you don't say which OS you are on, so I cannot give any assurances that these particular values will work for you:

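I was able to put together an MWE with just the strsplit output for your 26-character string (your str1) above. Note that in my code str1 is the 23-character version and str3c is the 26-character one: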

str1<-"Diş Hekimliği Fakültesi"
str3 <- scan(what="")
 "D" "i" "s" "̧"   " " "H" "e" "k" "i" "m" "l" "i" "g" "̆"   "i" " " "F" "a" "k" "u" "̈"   "l" "t" "e" "s" "i"
#27: 
#Read 26 items
> str3c <- paste0(str3, collapse="")
> nchar(str3c)
[1] 26
> str1
[1] "Diş Hekimliği Fakültesi"
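
Since copy-pasting tends to normalize these strings, here is a minimal sketch of how both forms could be built from Unicode escapes instead, so the example survives a round trip through a web page. The names composed and decomposed are mine, not from the question, and the code points are what I read off the strsplit output above:

```r
# Precomposed letters: U+015F (s with cedilla), U+011F (g with breve),
# U+00FC (u with diaeresis)
composed <- "Di\u015f Hekimli\u011fi Fak\u00fcltesi"

# Plain s, g, u, each followed by a combining mark:
# U+0327 (cedilla), U+0306 (breve), U+0308 (diaeresis)
decomposed <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"

nchar(composed)    # 23
nchar(decomposed)  # 26

# The two strings render the same but are not equal
composed == decomposed  # FALSE
```
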

After many error messages (which do not stop execution because of the enclosing try()), I got a list of 2 encodings using this code:

?iconv
which(sapply( try(utils::head(iconvlist(), n = 419)), function(xc) 
                                                  try(nchar(iconv(str1, to=xc))))==26)
#--------snipped large number of error messages-------
Error in nchar(iconv(str1, to = xc)) : invalid multibyte string 1
UTF-8-MAC  UTF8-MAC 
      400       402 

Then thinking that the reverse might succeed (since str1 started as a 23-char object) I successfully tried:

> iconv(str3c,from="UTF-8-MAC", to="UTF-8")
[1] "Diş Hekimliği Fakültesi"
> nchar(iconv(str3c,from="UTF-8-MAC", to="UTF-8"))
[1] 23
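
Since "UTF-8-MAC" is not available on every platform, a more portable route might be Unicode normalization to the composed form (NFC). The sketch below is not what I did above: it assumes the third-party stringi package is installed and uses its stri_trans_nfc function. The variable names are illustrative only:

```r
# 26-character version: base letters followed by combining marks
decomposed <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"

# Assumption: stringi is installed; skip quietly if it is not
if (requireNamespace("stringi", quietly = TRUE)) {
  # NFC merges each base letter + combining mark into one precomposed letter
  normalized <- stringi::stri_trans_nfc(decomposed)
  print(nchar(normalized))  # 23
  print(normalized == "Di\u015f Hekimli\u011fi Fak\u00fcltesi")  # TRUE
}
```

If this behaves on your system the way I expect, it should give the same result as the iconv call without depending on the encodings your OS happens to ship.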

Looking at the webpages for the Windows iconv, I see that there is a listing for {10081, "x-mac-turkish"}, /* Turkish (Mac) */. If you are on Windoze, perhaps that may be tried.

================

Earlier investigations below (I think it is useful to know how to pull apart character values.)

OK. I can actually put together an MWE with just your stuff above:

str1<-"Diş Hekimliği Fakültesi"
str3 <- scan(what="")
#1: "D" "i" "s" "̧"   " " "H" "e" "k" "i" "m" "l" "i" "g" "̆"   "i" " " "F" "a" "k" "u" "̈"   "l" "t" "e" "s" "i"
#27: 
#Read 26 items
> str3c <- paste0(str3, collapse="")
> nchar(str3c)
[1] 26
> str1
[1] "Diş Hekimliği Fakültesi"

Now to do some character hacking:

> ?charToRaw
> charToRaw(str3c)
 [1] 44 69 73 cc a7 20 48 65 6b 69 6d 6c 69 67 cc 86 69 20 46 61 6b 75 cc 88 6c 74 65
[28] 73 69
> charToRaw(str1)
 [1] 44 69 c5 9f 20 48 65 6b 69 6d 6c 69 c4 9f 69 20 46 61 6b c3 bc 6c 74 65 73 69

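So look at the three raw items that represent your third letter. In str3c the letter is stored as a plain base character "s" (hex 73) followed by the two bytes cc a7, which are the UTF-8 encoding of a single combining character (U+0327, COMBINING CEDILLA) that is drawn under the preceding letter. In str1 the same letter is the single precomposed character ş, stored as c5 9f. Now see if we can recognize the decomposed form with regex: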

 rawToChar( charToRaw(str3c) [3])
#[1] "s"
 rawToChar( charToRaw(str3c) [4])
#[1] "\xcc"
 rawToChar( charToRaw(str3c) [5])
#[1] "\xa7"
 grep("s\\xcc\\xa7", str3c)
#[1] 1   # Success!
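
If you would rather look at code points than raw bytes, base R's utf8ToInt may be easier to read. This is only a sketch, with the decomposed string rebuilt from escapes under a name of my own choosing:

```r
# 26-character version: base letters followed by combining marks
decomposed <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"

# One integer per Unicode code point
code_points <- utf8ToInt(decomposed)

# First five code points as U+XXXX labels
sprintf("U+%04X", code_points[1:5])
# "U+0044" "U+0069" "U+0073" "U+0327" "U+0020"

# U+0327 is one code point, stored as the two bytes cc a7 in UTF-8
charToRaw(intToUtf8(0x0327))
```
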

And here's a gsub that I think is probably more efficient than what you ended up with if you were working with the split-versions of those words:

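# Replace s + combining cedilla (U+0327) with the precomposed ş (U+015F);
# the other two letters would need the same treatment
gsub("s\u0327", "\u015f", str3c, fixed = TRUE)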
#[1] "Diş Hekimliği Fakültesi"
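
To handle all three letters in one go with base R only, something along these lines might do. The from/to vectors are a hypothetical lookup that covers just the lowercase letters in this example, so you would have to extend it (uppercase variants, other letters) for real data:

```r
# 26-character version: base letters followed by combining marks
decomposed <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"

# Hypothetical lookup: decomposed pairs and their precomposed replacements
from <- c("s\u0327", "g\u0306", "u\u0308")
to   <- c("\u015f",  "\u011f",  "\u00fc")

fixed_str <- decomposed
for (i in seq_along(from)) {
  # fixed = TRUE treats the pattern as a literal string, not a regex
  fixed_str <- gsub(from[i], to[i], fixed_str, fixed = TRUE)
}

nchar(fixed_str)  # 23
```
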

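Also note that there were actually 29 raw entries in the string that R reported as having 26 "characters" (and 26 in the one that supposedly had 23). That is because charToRaw returns bytes, while nchar by default counts characters (code points). Each combining mark in the 26-character string, and each accented letter in the 23-character string, is one character stored as two bytes in UTF-8, which accounts for the three extra entries in each case.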
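
The two counts can be compared directly with the type argument of nchar. A small sketch, again with both strings rebuilt from escapes under names of my own:

```r
composed   <- "Di\u015f Hekimli\u011fi Fak\u00fcltesi"     # precomposed letters
decomposed <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"  # base + combining marks

nchar(composed,   type = "chars")  # 23
nchar(composed,   type = "bytes")  # 26
nchar(decomposed, type = "chars")  # 26
nchar(decomposed, type = "bytes")  # 29

# The byte count matches the number of raw entries
length(charToRaw(decomposed))      # 29
```
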