I have the following data frame:
report <- data.frame(Text = c("unit 1 crosses the street",
"driver 2 was speeding and saw driver# 1",
"year 2019 was the year before the pandemic",
"hey saw hei hei in the wood",
"hello: my kityy! you are the best"), id = 1:5)
report
Text id
1 unit 1 crosses the street 1
2 driver 2 was speeding and saw driver# 1 2
3 year 2019 was the year before the pandemic 3
4 hey saw hei hei in the wood 4
5 hello: my kityy! you are the best 5
Based on an earlier answer, stop words can be removed with the following code:
report$Text <- gsub(paste0('\\b',tm::stopwords("english"), '\\b',
collapse = '|'), '', report$Text)
report
Text id
1 unit 1 crosses street 1
2 driver 2 speeding saw driver# 1 2
3 year 2019 year pandemic 3
4 hey saw hei hei wood 4
5 hello: kityy! best 5
I want to remove words shorter than a given length (for example, words with fewer than 4 characters, such as hei and hey). I also need to remove manual stop words (for example, saw and kityy) and common noise (whitespace, numbers, and punctuation) before tokenization. The final outcome would be:
Text id
1 unit crosses street 1
2 driver speeding driver 2
3 year year pandemic 3
4 wood 4
5 hello best 5
A similar question regarding noise and manual stop words is posted here.
With the previous code, if we start by removing words whose nchar is less than or equal to 3 (with gsubfn), it should work.
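As a rough sketch of the whole cleanup, here is a base-R version that uses a plain regex for the length filter instead of gsubfn. The helper name clean_text and the variables stop_words, manual_words, and min_chars are made up for illustration. The short stop_words vector is only a stand-in so the snippet runs without extra packages. In practice tm::stopwords("english") from the question would take its place.

```r
# Example data from the question
report <- data.frame(
  Text = c("unit 1 crosses the street",
           "driver 2 was speeding and saw driver# 1",
           "year 2019 was the year before the pandemic",
           "hey saw hei hei in the wood",
           "hello: my kityy! you are the best"),
  id = 1:5
)

# Stand-in stop word list (assumption); swap in tm::stopwords("english") if tm is installed
stop_words   <- c("the", "was", "and", "before", "in", "my", "you", "are")
# Manual stop words, spelled as they occur in the data
manual_words <- c("saw", "kityy")
# Words with fewer characters than this are dropped
min_chars    <- 4

# Hypothetical helper: strips noise, listed words, and short words from a character vector
clean_text <- function(x, drop_words, min_chars) {
  # 1. Replace punctuation and digits with a space
  x <- gsub("[[:punct:][:digit:]]+", " ", x, perl = TRUE)
  # 2. Remove stop words and manual stop words as whole words
  drop_pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
  x <- gsub(drop_pattern, " ", x, perl = TRUE)
  # 3. Remove words with fewer than min_chars letters
  short_pattern <- sprintf("\\b[[:alpha:]]{1,%d}\\b", min_chars - 1)
  x <- gsub(short_pattern, " ", x, perl = TRUE)
  # 4. Collapse runs of whitespace and trim both ends
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

report$Text <- clean_text(report$Text, c(stop_words, manual_words), min_chars)
report
```

On the example data this should leave "unit crosses street", "driver speeding driver", "year year pandemic", "wood", and "hello best". Punctuation and digits are stripped first so that tokens like driver# and kityy! become plain words before the whole-word matching in steps 2 and 3.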