Topic Model LDA: Problem removing a special character

I want to remove a special character (a long dash) from my simple corpus. Unfortunately, it doesn't work in my case. I tried different variations of gsub, and I also tried copying the dash from my R object. I read XML data and convert it into a simple corpus. For this I used tm_map.

If I use

text <- c("Today is the weather nice — I want to go to the beach —")
text_new <- gsub("—", "", text)

The output is

Today is the weather nice — I want to go to the beach —

whereas I'd like my output to be

Today is the weather nice I want to go to the beach

If I define the text as a vector, then it works. But in a corpus, R doesn't recognise the symbol. How can I detect the long dash?

There is 1 solution below

LeaK On

It could well be that you are searching for a plain hyphen (-) with your gsub() call, while the text in your source document contains a long dash or some other type of dash that only looks similar. Have you tried opening the R object that holds the text and copy-pasting the dash you want to delete from there into your gsub() call?
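If copy-pasting is unreliable, one way to check which dash is actually in the text is to print the Unicode code points of its non-ASCII characters, and then match them by escape rather than by a pasted glyph. The following is only a minimal sketch in base R. The sample string, the helper name remove_dashes, and the choice of dash characters (U+2012 to U+2015) are assumptions for illustration, not something taken from your data.

```r
# Sample text with an em dash written as a Unicode escape (U+2014),
# so the encoding of the script file cannot silently change it.
text <- "Today is the weather nice \u2014 I want to go to the beach \u2014"

# Step 1: find out which non-ASCII characters are really in the text.
codes <- utf8ToInt(text)                      # integer code points
dash_codes <- sprintf("U+%04X", unique(codes[codes > 127]))
print(dash_codes)                             # "U+2014"

# Step 2: remove the dash look-alikes by code point, then tidy whitespace.
# Assumed set: figure dash, en dash, em dash, horizontal bar.
remove_dashes <- function(x) {
  x <- gsub("[\u2012\u2013\u2014\u2015]", " ", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x))                # collapse repeated spaces
}

print(remove_dashes(text))
# "Today is the weather nice I want to go to the beach"

# For a tm corpus (assuming the tm package is loaded and `corpus` is a
# hypothetical name for your corpus), the same function could be applied as:
# corpus <- tm_map(corpus, content_transformer(remove_dashes))
```

If step 1 reports something other than U+2014 for your corpus text, put that code point into the character class instead. If it reports several odd code points where you expected one dash, the text may have been read with the wrong encoding, which would need fixing at import time rather than with gsub().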