How to remove punctuation from a Cyrillic string in R?


I've run into a problem using gsub() to remove punctuation from a string in R. I need to strip punctuation from a Cyrillic string, but when I call the function it also removes the letter "ч"!

I read the string from a .txt file, and the locale shown below is the only one that works for that.

Here's what I did:

Sys.setlocale("LC_CTYPE", "russian")

teststring <- 'человек часто! учитывает* черепашечек'
teststring
# [1] "человек часто! учитывает* черепашечек"
clean <- gsub("[[:punct:]]", "", teststring)
clean
# [1] "еловек асто уитывает ерепашеек"

As you can see, 'ч' is treated as a punctuation mark. How can I work around this?

Answer by polifolli:

As was mentioned in the comments, it is better to set a UTF-8 locale:

Sys.setlocale("LC_CTYPE", "ru_RU.UTF-8")

Then, if you read .txt files with readLines(), make sure the file itself is saved in UTF-8 encoding; in my case that was the problem. Also pass the encoding argument to readLines():

happy = readLines("./happy.txt", encoding = "UTF-8")
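To check that a file really comes back as UTF-8, a small round trip like the sketch below may help. The temporary file and the variable names `path`, `con` and `txt` are made up for illustration; only base R functions are used.

```r
# Write one UTF-8 line to a temporary file, byte for byte.
# The \u escapes spell a Cyrillic word followed by "!".
path <- tempfile(fileext = ".txt")
con <- file(path, open = "w")
writeLines("\u0447\u0430\u0441\u0442\u043e!", con, useBytes = TRUE)
close(con)

# Read it back and declare the strings as UTF-8.
txt <- readLines(path, encoding = "UTF-8")

Encoding(txt)    # the encoding R has marked on each string
validUTF8(txt)   # TRUE if the bytes form valid UTF-8
```

If `validUTF8()` returns FALSE for a real file, the file was probably saved in another encoding (for example Windows-1251) and needs to be re-saved or converted first.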

After that, removing punctuation with gsub() works as expected.
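If changing the locale is not an option, a possible alternative (not part of the answer above) is to avoid the locale-dependent `[[:punct:]]` class and match Unicode properties with `perl = TRUE` instead. This is only a sketch: `strip_punct` is a hypothetical helper name, and `\p{S}` is included on the assumption that symbols such as `+` or `$` should be dropped too, as `[[:punct:]]` does for ASCII.

```r
# Hypothetical helper: remove punctuation (\p{P}) and symbols (\p{S})
# using Unicode properties, so the result does not depend on LC_CTYPE.
strip_punct <- function(x) {
  x <- enc2utf8(x)  # convert the input to UTF-8 before matching
  gsub("[\\p{P}\\p{S}]", "", x, perl = TRUE)
}

teststring <- "человек часто! учитывает* черепашечек"
clean <- strip_punct(teststring)
clean
# [1] "человек часто учитывает черепашечек"
```

The output shown assumes the script itself is read in the correct encoding; the letter "ч" is classified as a letter, so it is left alone.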