I have two virtually equivalent strings. They look the same.
str1<-"Diş Hekimliği Fakültesi"
str2<-"Diş Hekimliği Fakültesi"
But when I try nchar() on them, they return 26 and 23 characters respectively. And when I use strsplit():
strsplit(str1,split="")
[[1]]
[1] "D" "i" "s" "̧" " " "H" "e" "k" "i" "m" "l" "i" "g" "̆" "i" " " "F" "a" "k" "u" "̈" "l" "t" "e" "s" "i"
strsplit(str2,split="")
[[1]]
[1] "D" "i" "ş" " " "H" "e" "k" "i" "m" "l" "i" "ğ" "i" " " "F" "a" "k" "ü" "l" "t" "e" "s" "i"
In str1, each language-specific special character is counted as two characters. How can I make str1 into str2? My only manual solution was using gsub().
P.S. Unfortunately I cannot reproduce this example here in full. When you copy-paste the code, both strings come out as 23 characters. Something happens with copy-pasting here.
The iconv function is a system-specific function that manages transliterations among international encodings. There is a function iconvlist that can return a vector of the names that your OS facility uses; I ran through all 419 such encodings on my system with the help of sapply and try to see if I could get conversions of str1 (23 characters) to 26 or vice versa, and found two such encodings on my machine. Since I use a Mac, I cannot give any assurances that these particular values will work for you, since you don't disclose your OS.
I was able to put together an MWE with just the output from your strsplit result from str2 above. After many error messages (which do not stop execution because of the enclosing try()), I got a list of 2 encodings. Then, thinking that the reverse might succeed (since str1 started as a 23-char object), I tried the conversion in the other direction, and it worked.
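The search described above could look roughly like the following. This is only a sketch, not the code originally posted: the names decomposed, shrinks and hits are made up for illustration, the strings are written with \u escapes so that copy-pasting cannot normalise them, and which encodings (if any) turn up depends entirely on the platform's iconv.

```r
# Decomposed form: base letter followed by a combining mark.
decomposed <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"
nchar(decomposed)  # 26

# Hypothetical helper: does treating the input as encoding 'enc'
# yield a 23-character UTF-8 string?
shrinks <- function(enc) {
  out <- try(iconv(decomposed, from = enc, to = "UTF-8"), silent = TRUE)
  # Unsupported conversions raise an error; unconvertible input gives NA.
  if (inherits(out, "try-error") || length(out) != 1L || is.na(out)) {
    return(FALSE)
  }
  isTRUE(nchar(out, allowNA = TRUE) == 23L)
}

# iconvlist() returns the encoding names known to the local iconv.
encodings <- iconvlist()
hits <- encodings[vapply(encodings, shrinks, logical(1))]
hits  # platform-dependent; may well be empty
```

The try() wrapper is what lets the loop continue past encodings the local iconv refuses to handle.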
Looking at the webpages for the Windows iconv, I see that there is a listing for {10081, "x-mac-turkish"}, /* Turkish (Mac) */. If you are on Windows, perhaps that may be tried.
================
Earlier investigations below (I think it is useful to know how to pull apart character values.)
OK. I can actually put together an MWE with just your stuff above:
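A possible reconstruction of such an MWE, assuming the three odd-looking elements in the strsplit output of str1 are the combining cedilla (U+0327), breve (U+0306) and diaeresis (U+0308); the variable name pieces is only illustrative:

```r
# Pieces as printed by strsplit(str1, split = ""), with the combining
# marks written as \u escapes so they survive copy-pasting.
pieces <- c("D", "i", "s", "\u0327", " ", "H", "e", "k", "i", "m", "l", "i",
            "g", "\u0306", "i", " ", "F", "a", "k", "u", "\u0308", "l", "t",
            "e", "s", "i")
str1 <- paste(pieces, collapse = "")

# Precomposed letters: s-cedilla, g-breve, u-diaeresis.
str2 <- "Di\u015f Hekimli\u011fi Fak\u00fcltesi"

nchar(str1)            # 26
nchar(str2)            # 23
identical(str1, str2)  # FALSE
```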
Now to do some character hacking:
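One way to pull the strings apart is charToRaw(), which shows the underlying bytes. The sketch below is not the original code; it assumes UTF-8 strings built with \u escapes:

```r
str1 <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"  # decomposed
str2 <- "Di\u015f Hekimli\u011fi Fak\u00fcltesi"     # precomposed

# The third letter in each representation:
charToRaw("s\u0327")  # 73 cc a7  (plain "s", then a two-byte combining cedilla)
charToRaw("\u015f")   # c5 9f     (one precomposed character, two bytes)

# Byte counts versus character counts:
length(charToRaw(str1))  # 29
length(charToRaw(str2))  # 26
```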
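So look at the three raw items that represent your third letter. It appears that the second representation uses a plain base character followed by a two-byte sequence starting with hex "cc". That sequence is the UTF-8 encoding of a combining mark (here the cedilla), which is drawn onto the preceding letter instead of being shown as a separate character. Now see if we can recognize them with regex: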
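A minimal sketch of such a check, using fixed-string matching on the three combining marks rather than whatever pattern was originally posted (the vector name marks is only illustrative):

```r
str1 <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"  # decomposed
str2 <- "Di\u015f Hekimli\u011fi Fak\u00fcltesi"     # precomposed

marks <- c("\u0327", "\u0306", "\u0308")  # cedilla, breve, diaeresis

# Is each combining mark present in the string?
in1 <- vapply(marks, grepl, logical(1), x = str1, fixed = TRUE)
in2 <- vapply(marks, grepl, logical(1), x = str2, fixed = TRUE)
all(in1)  # TRUE
any(in2)  # FALSE
```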
And here's a gsub that I think is probably more efficient than what you ended up with if you were working with the split-versions of those words:
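Something along these lines would do it. This is a sketch rather than the original code, and it only covers the three lowercase letters that appear in this example; the names from, to and out are illustrative:

```r
str1 <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"  # decomposed
str2 <- "Di\u015f Hekimli\u011fi Fak\u00fcltesi"     # precomposed

# Base letter + combining mark, and the matching precomposed letter.
from <- c("s\u0327", "g\u0306", "u\u0308")
to   <- c("\u015f",  "\u011f",  "\u00fc")

out <- str1
for (i in seq_along(from)) {
  out <- gsub(from[i], to[i], out, fixed = TRUE)
}

nchar(out)   # 23
out == str2  # TRUE
```

For arbitrary text, a general Unicode normalisation to NFC (for example stringi::stri_trans_nfc(), if the stringi package is available) is likely to be more robust than a hand-written replacement table.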
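Also note that there were actually 29 raw entries in the one R was telling you there were 26 "characters" (and 26 in the one that supposedly had 23). The gap comes from bytes versus characters: nchar() counts characters by default, and each of the three combining marks introduced by a cc byte (like each of the three precomposed letters in the other string) takes two bytes in UTF-8.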