Remove Words with less than Certain Character Lengths plus Noise Reduction before Tokenization

Asked by S Das on 22 April 2022 at 15:46

I have the following data frame:

report <- data.frame(Text = c("unit 1 crosses the street", 
       "driver 2 was speeding and saw driver# 1", 
        "year 2019 was the year before the pandemic",
        "hey saw       hei hei in        the    wood",
        "hello: my kityy! you are the best"), id = 1:5)
report 
                                         Text id
1                   unit 1 crosses the street  1
2     driver 2 was speeding and saw driver# 1  2
3  year 2019 was the year before the pandemic  3
4 hey saw       hei hei in        the    wood  4
5           hello: my kityy! you are the best  5

Based on an earlier answer, stop words can be removed with the following code:

report$Text <- gsub(paste0('\\b',tm::stopwords("english"), '\\b', 
                          collapse = '|'), '', report$Text)
report
                                    Text id
1                 unit 1 crosses  street  1
2      driver 2  speeding  saw driver# 1  2
3            year 2019   year   pandemic  3
4 hey saw       hei hei             wood  4
5                 hello:  kityy!    best  5
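
The pattern passed to gsub is one long alternation of word-bounded stop words. As a minimal sketch of what paste0 builds, here is the same construction with a three-word stand-in list (an assumption, chosen so the example runs without tm):

```r
# Stand-in stop-word list (assumption; the real code uses tm::stopwords("english"))
stop_words <- c("the", "was", "and")

# paste0 wraps each word in \b...\b, then collapse joins them with "|"
pattern <- paste0("\\b", stop_words, "\\b", collapse = "|")

# Every match is replaced by the empty string; surrounding spaces stay behind
cleaned <- gsub(pattern, "", "driver 2 was speeding and saw driver# 1")
print(pattern)
print(cleaned)
```

Because only the words themselves are deleted, the spaces around them remain, which is why the output above has runs of blanks.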

I want to remove words shorter than a certain length (for example, words with fewer than 4 characters, such as hei and hey). I also need to remove manual stop words (for example, saw and kityy) and common noise (extra whitespace, numbers, and punctuation) before tokenization. The final outcome would be:

                                    Text id
1                   unit crosses  street  1
2                driver speeding  driver  2
3                     year year pandemic  3
4                                   wood  4
5                             hello best  5

A similar question regarding noise and manual stop words is posted here.

Answer by akrun on 22 April 2022 at 16:02 (best answer)

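Building on the previous code, we can first remove words with nchar less than or equal to 3 (with gsubfn), then strip punctuation and digits, remove the stop words (including the manual ones), and finally collapse the leftover whitespace:

library(gsubfn)  # provides gsubfn()
# innermost call runs first: short words -> punctuation/digits -> stop words -> spaces
trimws(gsub("\\s+", " ", gsub(paste0("\\b(", paste(union(c("saw", "kityy"), 
   tm::stopwords("english")), collapse="|"), ")\\b"), "", 
     gsub("[[:punct:]0-9]+", "", gsubfn("\\w+", function(x) 
     if(nchar(x) > 3) x else '', report$Text)))))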

-output

[1] "unit crosses street"    "driver speeding driver" 
[3] "year year pandemic"     "wood"                   "hello best"
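
For reuse, the same steps can be wrapped in a small base-R function. This is only a sketch, not the code from the answer above: clean_text, min_chars, and stop_words are hypothetical names, digits and punctuation are stripped before the length filter (a reordering), and the three-word stop list is a stand-in so the example runs without tm or gsubfn.

```r
# Hypothetical helper, not part of the original answer
clean_text <- function(x, min_chars = 4, stop_words = character()) {
  # 1. Drop punctuation and digits
  x <- gsub("[[:punct:][:digit:]]+", "", x)
  # 2. Drop words with fewer than min_chars characters
  if (min_chars > 1) {
    short <- sprintf("\\b\\w{1,%d}\\b", min_chars - 1)
    x <- gsub(short, "", x, perl = TRUE)
  }
  # 3. Drop the supplied stop words as whole words
  if (length(stop_words) > 0) {
    sw <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
    x <- gsub(sw, "", x, perl = TRUE)
  }
  # 4. Collapse runs of whitespace and trim both ends
  trimws(gsub("\\s+", " ", x))
}

text <- c("unit 1 crosses the street",
          "driver 2 was speeding and saw driver# 1",
          "year 2019 was the year before the pandemic",
          "hey saw       hei hei in        the    wood",
          "hello: my kityy! you are the best")

# Stand-in stop list (assumption): only the words of 4+ characters matter here
result <- clean_text(text, min_chars = 4,
                     stop_words = c("saw", "kityy", "before"))
print(result)
```

With tm installed, union(c("saw", "kityy"), tm::stopwords("english")) could be passed as stop_words instead of the stand-in list.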
