I've come across a problem when I use gsub() to remove punctuation from a string in R.
I need to remove punctuation from a Cyrillic string, but when I use the function, it also removes the letter "ч"!
I read the string from a .txt file, and only the locale shown below works for that.
Here's what I did:
Sys.setlocale("LC_CTYPE", "russian")
teststring <- 'человек часто! учитывает* черепашечек'
teststring
# [1] "человек часто! учитывает* черепашечек"
clean <- gsub("[[:punct:]]", "", teststring)
clean
# [1] "еловек асто уитывает ерепашеек"
As you can see, it treats 'ч' as a punctuation mark. How can I work around this issue?
As was mentioned in the comments, it is better to set the locale the right way:
Then, if you need to read .txt files using readLines(), make sure the file itself is saved in UTF-8 encoding. In my case that was the problem. Then, when calling readLines(), add the encoding argument:
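A minimal sketch of that reading step, under a few assumptions: the script itself is saved as UTF-8 and run in a UTF-8 session, and the file is a temporary one created here only so the example is self-contained (the names `path`, `sample_text` and `lines` are illustrative, not from my real code).

```r
# Hypothetical sample file, created only to make the example self-contained.
path <- tempfile(fileext = ".txt")
sample_text <- "человек часто! учитывает* черепашечек"

# useBytes = TRUE writes the bytes unchanged, so a UTF-8 script yields a UTF-8 file.
writeLines(sample_text, path, useBytes = TRUE)

# encoding = "UTF-8" marks the strings that are read as UTF-8; it does not re-encode them.
lines <- readLines(path, encoding = "UTF-8")

# Inspect how R has marked the result.
Encoding(lines)
```

If the file is stored in some other encoding (for example Windows-1251), marking it as UTF-8 will not help; it would have to be re-saved or converted first.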
Then removing punctuation with gsub() works just like it should.
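For completeness, here is roughly what the cleaned-up call looks like on a string that is marked as UTF-8. This is only a sketch: `strip_punct` is a hypothetical helper name I made up for illustration, and it assumes a UTF-8 session.

```r
# Hypothetical helper: remove POSIX punctuation from UTF-8 text.
strip_punct <- function(x) {
  # enc2utf8() leaves strings already marked as UTF-8 untouched.
  gsub("[[:punct:]]", "", enc2utf8(x))
}

teststring <- "человек часто! учитывает* черепашечек"
clean <- strip_punct(teststring)
clean
# Expected in a UTF-8 session:
# [1] "человек часто учитывает черепашечек"
```

The letter 'ч' is kept because the regex engine now sees whole characters instead of single bytes interpreted in the wrong code page.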