How to correct extra characters in strings due to language specific special characters in R?

Question

245 Views Asked by berkorbay At 04 April 2015 at 17:44

I have two virtually equivalent strings. They look the same.

str1<-"Diş Hekimliği Fakültesi"
str2<-"Diş Hekimliği Fakültesi"

But when I call nchar() on them, they return 26 and 23 characters respectively. And when I use strsplit():

strsplit(str1,split="")
[[1]]
 [1] "D" "i" "s" "̧"   " " "H" "e" "k" "i" "m" "l" "i" "g" "̆"   "i" " " "F" "a" "k" "u" "̈"   "l" "t" "e" "s" "i"

strsplit(str2,split="")
[[1]]
 [1] "D" "i" "ş" " " "H" "e" "k" "i" "m" "l" "i" "ğ" "i" " " "F" "a" "k" "ü" "l" "t" "e" "s" "i"

Each language-specific special character in str1 is counted as two characters. How can I convert str1 into str2? The only solution I have found is manual replacement with gsub().

P.S. Unfortunately I cannot reproduce this example here in full. When you copy and paste the code, both strings come out as 23 characters; something changes during copy-pasting.
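Because copy-pasting loses the difference, the two strings can be rebuilt with Unicode escapes instead. This is a minimal sketch, assuming the 26-character string stores each accented letter as a base letter plus a combining mark (which is what the strsplit() output above suggests); the variable names are illustrative and not from the original post.

```r
# Decomposed form: base letter followed by a combining mark
# (U+0327 cedilla, U+0306 breve, U+0308 diaeresis).
str_decomposed <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"

# Precomposed form: a single code point per accented letter
# (U+015F, U+011F, U+00FC).
str_composed <- "Di\u015f Hekimli\u011fi Fak\u00fcltesi"

nchar(str_decomposed)  # 26
nchar(str_composed)    # 23

# The strings render alike but are not equal
str_decomposed == str_composed  # FALSE
```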

**IRTFM** · Answer 1 · 2015-04-04T18:49:05.190000

The iconv function is a system-specific function that converts between character encodings. The iconvlist function returns a vector of the encoding names that your OS facility supports. I ran through all 419 such encodings on my system, with the help of sapply and try, to see whether any conversion would take str1 (23 characters) to 26 characters or vice versa, and found two such encodings on my machine. Since I use a Mac and you don't say which OS you are on, I cannot give any assurance that these particular values will work for you.

I was able to put together an MWE using just the output of your first strsplit() result above (the 26-element one):

str1<-"Diş Hekimliği Fakültesi"
str3 <- scan(what="")
 "D" "i" "s" "̧"   " " "H" "e" "k" "i" "m" "l" "i" "g" "̆"   "i" " " "F" "a" "k" "u" "̈"   "l" "t" "e" "s" "i"
#27: 
#Read 26 items
> str3c <- paste0(str3, collapse="")
> nchar(str3c)
[1] 26
> str1
[1] "Diş Hekimliği Fakültesi"

After many error messages (which do not stop execution because of the enclosing try()), I got a list of 2 encodings using this code:

?iconv
which(sapply( try(utils::head(iconvlist(), n = 419)), function(xc) 
                                                  try(nchar(iconv(str1, to=xc))))==26)
#--------snipped large number of error messages-------
Error in nchar(iconv(str1, to = xc)) : invalid multibyte string 1
UTF-8-MAC  UTF8-MAC 
      400       402

Then thinking that the reverse might succeed (since str1 started as a 23-char object) I successfully tried:

> iconv(str3c,from="UTF-8-MAC", to="UTF-8")
[1] "Diş Hekimliği Fakültesi"
> nchar(iconv(str3c,from="UTF-8-MAC", to="UTF-8"))
[1] 23

Looking at the webpages for the Windows iconv, I see that there is a listing for {10081, "x-mac-turkish"}, /* Turkish (Mac) */. If you are on Windows, that may be worth trying.
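Where "UTF-8-MAC" is not available, the same conversion can be approached as Unicode normalization to the composed form (NFC). The sketch below is not from the answer: compose_utf8 is a hypothetical helper, and it assumes the optional stringi package may be installed, falling back to the macOS iconv encoding otherwise.

```r
# Hypothetical helper: turn "base letter + combining mark" sequences
# into precomposed characters.
compose_utf8 <- function(x) {
  # Assumption: stringi is optional; stri_trans_nfc() applies NFC normalization
  if (requireNamespace("stringi", quietly = TRUE)) {
    return(stringi::stri_trans_nfc(x))
  }
  # Fallback: the macOS-only encoding found in the answer
  has_mac <- tryCatch("UTF-8-MAC" %in% iconvlist(), error = function(e) FALSE)
  if (has_mac) {
    return(iconv(x, from = "UTF-8-MAC", to = "UTF-8"))
  }
  warning("no normalization backend found; returning input unchanged")
  x
}

decomposed <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"
nchar(compose_utf8(decomposed))
```

If either backend is present, the result should have 23 characters; with neither, the helper warns and returns the input as it was.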

================

Earlier investigations below (I think it is useful to know how to pull apart character values.)

OK. I can actually put together an MWE with just your stuff above:

str1<-"Diş Hekimliği Fakültesi"
str3 <- scan(what="")
#1: "D" "i" "s" "̧"   " " "H" "e" "k" "i" "m" "l" "i" "g" "̆"   "i" " " "F" "a" "k" "u" "̈"   "l" "t" "e" "s" "i"
#27: 
#Read 26 items
> str3c <- paste0(str3, collapse="")
> nchar(str3c)
[1] 26
> str1
[1] "Diş Hekimliği Fakültesi"

Now to do some character hacking:

> ?charToRaw
> charToRaw(str3c)
 [1] 44 69 73 cc a7 20 48 65 6b 69 6d 6c 69 67 cc 86 69 20 46 61 6b 75 cc 88 6c 74 65
[28] 73 69
> charToRaw(str1)
 [1] 44 69 c5 9f 20 48 65 6b 69 6d 6c 69 c4 9f 69 20 46 61 6b c3 bc 6c 74 65 73 69

So look at the three raw items that represent your third letter. The 26-character string stores the base character "s" (hex 73) followed by the two bytes cc a7, which are the UTF-8 encoding of U+0327, the combining cedilla; the 23-character string stores the single precomposed letter ş as c5 9f. Now see if we can recognize them with regex:

 rawToChar( charToRaw(str3c) [3])
#[1] "s"
 rawToChar( charToRaw(str3c) [4])
#[1] "\xcc"
 rawToChar( charToRaw(str3c) [5])
#[1] "\xa7"
 grep("s\\xcc\\xa7", str3c)
#[1] 1   # Success!
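An alternative to reading raw bytes is to look at code points directly. This is a sketch using base R only, with the decomposed string rebuilt from escapes as an assumption about what the original data contained; it locates the combining marks and matches the same sequence literally rather than through byte escapes in a regex.

```r
decomposed <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"

# One integer per code point
cp <- utf8ToInt(decomposed)
sprintf("U+%04X", cp[1:4])  # D, i, s, combining cedilla

# Positions that fall in the combining diacritical marks block (U+0300..U+036F)
which(cp >= 0x0300 & cp <= 0x036F)  # 4 14 21

# Literal match on "s" followed by U+0327
grepl("s\u0327", decomposed, fixed = TRUE)  # TRUE
```

The positions agree with the separate elements that strsplit() printed for the 26-character string.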

And here's a gsub that I think is probably more efficient than what you ended up with if you were working with the split-versions of those words:

# replace "s" + combining cedilla (U+0327) with the precomposed ş (U+015F)
gsub("s\u0327", "\u015f", str3c, fixed = TRUE)
#[1] "Diş Hekimliği Fakültesi"
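The gsub above targets only one letter. A minimal base-R sketch that handles all three letters in this example follows; the from/to vectors are illustrative and cover only the sequences that occur in this string, not Turkish text in general.

```r
decomposed <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"
composed   <- "Di\u015f Hekimli\u011fi Fak\u00fcltesi"

# Decomposed sequences and their precomposed replacements, in matching order
from <- c("s\u0327", "g\u0306", "u\u0308")
to   <- c("\u015f", "\u011f", "\u00fc")

repaired <- decomposed
for (k in seq_along(from)) {
  # fixed = TRUE treats the pattern as a literal string, not a regex
  repaired <- gsub(from[k], to[k], repaired, fixed = TRUE)
}

nchar(repaired)       # 23
repaired == composed  # TRUE
```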

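Also note that there were actually 29 raw entries in the string that R reported as having 26 "characters" (and 26 in the one that supposedly had 23). By default nchar() counts characters (code points), not bytes: each combining mark is one character stored as two bytes (cc plus one more), and each precomposed letter such as ş is also one character stored as two bytes.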
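The two counts can be compared side by side with the type argument of nchar(). This is a small sketch with both strings rebuilt from escapes, which is an assumption about the original data.

```r
decomposed <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"
composed   <- "Di\u015f Hekimli\u011fi Fak\u00fcltesi"

# Characters (code points) versus bytes of the UTF-8 encoding
c(chars = nchar(decomposed, type = "chars"),
  bytes = nchar(decomposed, type = "bytes"))  # 26 and 29
c(chars = nchar(composed, type = "chars"),
  bytes = nchar(composed, type = "bytes"))    # 23 and 26

# charToRaw() returns one element per byte
length(charToRaw(decomposed)) == nchar(decomposed, type = "bytes")  # TRUE
```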
