How to remove a word from a dataset in R? NLP

Question

Asked by philosophy on 31 August 2023 at 21:58

I'm very new to programming.

I am analysing a text in R, and I am using this to get rid of stop words:

kant_palavras <- kant_palavras %>% anti_join(get_stopwords(language = 'pt'))

But afterwards, when I count the words, the most common one is "no". This is not useful for my analysis and I want to remove it, but I do not know how.

I tried

kant_palavras <- kant_palavras %>% anti_join("no")

and

palavras_a_remover <- c("no") 

kant_palavras <- kant_palavras %>% anti_join(data.frame(palavra = palavras_a_remover))

and

palavras_a_remover <- c("no")

kant_palavras <- kant_palavras %>% 
  filter(!palavra %in% palavras_a_remover)

None of these got rid of that "no"!

--

Full code up to that point (all of it works):

dados_kant <- read.csv("kant2.csv")

dados_kant2 <- as_tibble(dados_kant)

Encoding(dados_kant2$texto.do.kant) <- "ASCII"

for (i in 1:nrow(dados_kant2))
{
  dados_kant2$texto.do.kant[i] <- iconv(dados_kant2$texto.do.kant[i], to = "ASCII//TRANSLIT")
}

kant_palavras <- dados_kant2 %>%  unnest_tokens(word, texto.do.kant)

kant_palavras <- kant_palavras %>% anti_join(get_stopwords(language = 'pt'))

**deschen** · Answer 1 · 2023-08-31T22:43:23.230000

You can do:

library(tidyverse)
kant_palavras <- kant_palavras %>%
  filter(!str_detect(texto.do.kant, '\\bno\\b'))

This removes every row whose text contains 'no' as a whole word (the \\b markers are word boundaries). If you only want to remove the word 'no', but keep the rest of the text, you can do:

kant_palavras <- kant_palavras %>%
  mutate(texto.do.kant = str_remove_all(texto.do.kant, '\\bno\\b'))
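To make the difference between the two variants concrete, here is a minimal sketch on a made-up three-row tibble (the `toy` data and its `texto` column are hypothetical, not the asker's file). It also wraps the result in `str_squish()` to tidy the double space left behind, an optional extra that the answer itself does not use. Note that in the question `kant_palavras` has already been through `unnest_tokens()`, which by default drops the input column, so there the pattern would presumably have to be applied to the `word` column or to the untokenised `dados_kant2`.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: one row with the word "no", one with "nono", one without
toy <- tibble(texto = c("ela no com", "nono porque", "nada aqui"))

# Variant 1: drop every row that contains "no" as a whole word
rows_dropped <- toy %>%
  filter(!str_detect(texto, "\\bno\\b"))

# Variant 2: keep all rows and delete only the word itself;
# str_squish() then collapses the leftover double space
word_dropped <- toy %>%
  mutate(texto = str_squish(str_remove_all(texto, "\\bno\\b")))

rows_dropped
word_dropped
```

In both variants "nono" is left alone, because `\\b` only matches at a word boundary.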

**Jay Bee** · Answer 2 · 2023-09-01T02:32:41.950000

I have an adapted version of the dput data you provided as df. I added some different variations of a 'no' value ('no', 'nono', 'No') so we can see what gets removed.

df <- structure(list(texto.do.kant = c("INTRODUÇÃO I — Da Distinção Entre o Conhecimento Puro e o Empírico Não se pode duvidar no de que todos os nossos conhecimentos começam com a experiência, nono porque, com efeito, como haveria de exercitar-se a faculdade de se conhecer, se não fosse pelos objetos que (...) de vista geral de um sistema, deve", "ela No no com\"")), row.names = 1:2, class = "data.frame")

And then:

library(tidyverse)
df2 <- str_remove(df$texto.do.kant, "\\bno\\b")

Which gives:

[1] "INTRODUÇÃO I — Da Distinção Entre o Conhecimento Puro e o Empírico Não se pode duvidar  de que todos os nossos conhecimentos começam com a experiência, nono porque, com efeito, como haveria de exercitar-se a faculdade de se conhecer, se não fosse pelos objetos que (...) de vista geral de um sistema, deve"
[2] "ela No  com\""

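'no' is removed, while 'No' and 'nono' remain. Note that str_remove() only removes the first match in each string; str_remove_all() removes every occurrence.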
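If the capitalised 'No' or repeated occurrences should go as well, something along these lines may help. This is only a sketch on a hypothetical test string `x`; `regex(..., ignore_case = TRUE)` is stringr's way of making a pattern case-insensitive, and `str_squish()` is used purely to tidy the whitespace that the removals leave behind.

```r
library(stringr)

# Hypothetical test string with one "No" and two "no"
x <- "ela No no com no"

# Removes only the first lowercase match
first_only <- str_remove(x, "\\bno\\b")

# Removes every lowercase match, still case-sensitive
all_lower <- str_remove_all(x, "\\bno\\b")

# Removes every match regardless of case
any_case <- str_remove_all(x, regex("\\bno\\b", ignore_case = TRUE))

# Collapse the whitespace left behind so the results are easier to compare
str_squish(c(first_only, all_lower, any_case))
```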

**margusl** · Answer 3 · 2023-09-01T11:00:18.070000

You are probably facing some encoding issue(s). With Unicode text, the

unnest_tokens(..., output = word)  %>% anti_join(get_stopwords(language = 'pt'))

approach should behave as expected. This is not specific to anti_join(): until the text encoding issues are dealt with, you can't really do any meaningful text processing or analysis.

To illustrate, here's a reproducible example with non-UTF-8 text as input. We first try to detect the encoding, then convert the text to UTF-8, split it into words and remove stopwords, checking the effect of (almost) every step:

library(dplyr)
library(readr)
library(stringi)
library(stringr)
library(tidytext)

# example text, non-unicode:
kant_txt <- read_file("http://www.filosofia.com.br/figuras/livros_inteiros/167.txt")
# detecting encoding:
stri_enc_detect(kant_txt)
#> [[1]]
#>       Encoding Language Confidence
#> 1 windows-1252       pt       0.81
#> 2 windows-1250       ro       0.35
#> 3 windows-1254       tr       0.17
#> 4     UTF-16BE                0.10
#> 5     UTF-16LE                0.10

# convert to Unicode and store in tibble:
kant_utf8 <- stri_encode(kant_txt, from = "windows-1252", to = "utf8")
kant <- tibble(title = "critica_da_razao_pura", text = kant_utf8)
kant
#> # A tibble: 1 × 2
#>   title                 text                                                    
#>   <chr>                 <chr>                                                   
#> 1 critica_da_razao_pura "Immanuel Kant – Crítica da Razão Pura\r\n\r\nProfessor…

# split text into tokens, default unit is word and by default 
# tokens are converted to lowercase:
kant_tokens <- unnest_tokens(kant, output = word, input = text)
# note dimensions, 199624 rows:
kant_tokens
#> # A tibble: 199,624 × 2
#>    title                 word      
#>    <chr>                 <chr>     
#>  1 critica_da_razao_pura immanuel  
#>  2 critica_da_razao_pura kant      
#>  3 critica_da_razao_pura crítica   
#>  4 critica_da_razao_pura da        
#>  5 critica_da_razao_pura razão     
#>  6 critica_da_razao_pura pura      
#>  7 critica_da_razao_pura professor 
#>  8 critica_da_razao_pura em        
#>  9 critica_da_razao_pura kõnigsberg
#> 10 critica_da_razao_pura membro    
#> # ℹ 199,614 more rows

# count words starting with "n", top 5:
kant_tokens %>% 
  filter(str_starts(word, "n")) %>% 
  summarise(count = n(), .by = word) %>% 
  arrange(desc(count)) %>% 
  print(n = 5)
#> # A tibble: 185 × 2
#>   word     count
#>   <chr>    <int>
#> 1 não       3355
#> 2 na        1383
#> 3 no        1311
#> 4 nos        562
#> 5 natureza   466
#> # ℹ 180 more rows

# drop stopwords:
kant_nostop <- anti_join(kant_tokens, get_stopwords(language = 'pt'))
#> Joining with `by = join_by(word)`
# keep an eye on changed row count:
kant_nostop
#> # A tibble: 112,540 × 2
#>    title                 word      
#>    <chr>                 <chr>     
#>  1 critica_da_razao_pura immanuel  
#>  2 critica_da_razao_pura kant      
#>  3 critica_da_razao_pura crítica   
#>  4 critica_da_razao_pura razão     
#>  5 critica_da_razao_pura pura      
#>  6 critica_da_razao_pura professor 
#>  7 critica_da_razao_pura kõnigsberg
#>  8 critica_da_razao_pura membro    
#>  9 critica_da_razao_pura academia  
#> 10 critica_da_razao_pura real      
#> # ℹ 112,530 more rows

# count words starting with "n" after stopwords are removed, top 5:
kant_nostop %>% 
  filter(str_starts(word, "n")) %>% 
  summarise(count = n(), .by = word) %>% 
  arrange(desc(count)) %>% 
  print(n = 5)

#> # A tibble: 172 × 2
#>   word        count
#>   <chr>       <int>
#> 1 natureza      466
#> 2 nada          331
#> 3 nenhuma       245
#> 4 nenhum        235
#> 5 necessidade   176
#> # ℹ 167 more rows

Created on 2023-09-01 with reprex v2.0.2
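For a smaller, offline view of the same idea, the sketch below builds the three bytes that a windows-1252 (or latin1) file would store for "não" and converts them with base R's `iconv()` instead of `stri_encode()`. The byte values and the encoding name `"CP1252"` are assumptions chosen for illustration; the set of encoding names `iconv()` accepts depends on the platform.

```r
# Bytes for "não" as a windows-1252 file would store them (0xe3 is "ã")
x <- rawToChar(as.raw(c(0x6e, 0xe3, 0x6f)))

# Read as-is, these bytes are not valid UTF-8
is_valid_before <- validUTF8(x)

# Convert from the source encoding to UTF-8
x_utf8 <- iconv(x, from = "CP1252", to = "UTF-8")

# After conversion the string is valid UTF-8 and equals "não"
is_valid_after <- validUTF8(x_utf8)
matches_nao <- identical(x_utf8, "n\u00e3o")

c(is_valid_before, is_valid_after, matches_nao)
```

Only after a conversion like this can a token such as "não" match the corresponding entry in a UTF-8 stopword list.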

**Julia Silge** · Answer 4 · 2023-09-01T20:55:28.973000

Let's start with this example, where you count the words in Pride and Prejudice after removing stopwords:

library(tidyverse)
library(tidytext)

tibble(txt = janeaustenr::prideprejudice) |> 
  unnest_tokens(word, txt) |> 
  anti_join(get_stopwords()) |> 
  count(word, sort = TRUE)
#> Joining with `by = join_by(word)`
#> # A tibble: 6,404 × 2
#>    word          n
#>    <chr>     <int>
#>  1 mr          785
#>  2 elizabeth   597
#>  3 said        401
#>  4 darcy       373
#>  5 mrs         343
#>  6 much        326
#>  7 must        305
#>  8 bennet      294
#>  9 miss        283
#> 10 jane        264
#> # ℹ 6,394 more rows

Created on 2023-09-01 with reprex v2.0.2

But let's say you don't want to include those words "mr", "mrs", and "miss". If the list is short, I think I would use filter():

library(tidyverse)
library(tidytext)

tibble(txt = janeaustenr::prideprejudice) |> 
  unnest_tokens(word, txt) |> 
  anti_join(get_stopwords()) |>
  filter(!word %in% c("mr", "mrs", "miss")) |> 
  count(word, sort = TRUE)
#> Joining with `by = join_by(word)`
#> # A tibble: 6,401 × 2
#>    word          n
#>    <chr>     <int>
#>  1 elizabeth   597
#>  2 said        401
#>  3 darcy       373
#>  4 much        326
#>  5 must        305
#>  6 bennet      294
#>  7 jane        264
#>  8 one         263
#>  9 bingley     257
#> 10 know        236
#> # ℹ 6,391 more rows

Created on 2023-09-01 with reprex v2.0.2
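Applied to the question, the same filter() pattern might look like the sketch below. The tibble is a hypothetical stand-in for the asker's tokens. One detail that stands out in the question's code: `unnest_tokens(word, texto.do.kant)` names the token column `word`, whereas the failed attempts refer to a column called `palavra`, so the filter presumably has to use `word`.

```r
library(dplyr)

# Hypothetical stand-in for the tokenised text; the column is called `word`
# because that is the name passed to unnest_tokens() in the question
kant_palavras <- tibble(word = c("razao", "no", "pura", "no", "natureza"))

palavras_a_remover <- c("no")

# Filter on `word`, the column that actually exists
kant_sem_no <- kant_palavras %>%
  filter(!word %in% palavras_a_remover)

# Count the remaining words
count(kant_sem_no, word, sort = TRUE)
```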

You could also add them to a stopword lexicon, like this:

library(tidyverse)
library(tidytext)

my_custom_stopwords <-
  get_stopwords() |> 
  bind_rows(
    tibble(
      word = c("mr", "mrs", "miss"),
      lexicon = "custom"
    )
  )

tail(my_custom_stopwords)
#> # A tibble: 6 × 2
#>   word  lexicon 
#>   <chr> <chr>   
#> 1 too   snowball
#> 2 very  snowball
#> 3 will  snowball
#> 4 mr    custom  
#> 5 mrs   custom  
#> 6 miss  custom

Created on 2023-09-01 with reprex v2.0.2
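The custom lexicon can then be passed to anti_join() like any other stopword table. The sketch below assumes a tiny hand-written lexicon (`base_stopwords`, hypothetical) in place of `get_stopwords(language = 'pt')` so that it runs without the stopwords data, and made-up tokens. It also hints at why `anti_join("no")` in the question cannot work: the second argument has to be a data frame sharing a column with the first, not a bare string.

```r
library(dplyr)

# Hypothetical three-word stand-in for get_stopwords(language = "pt")
base_stopwords <- tibble(word = c("de", "que", "o"), lexicon = "snowball")

# Append the extra word to the lexicon
my_custom_stopwords <- bind_rows(
  base_stopwords,
  tibble(word = "no", lexicon = "custom")
)

# Hypothetical tokens, one per row, in a column named `word`
tokens <- tibble(word = c("razao", "no", "de", "pura"))

# Keep only the tokens that have no match in the lexicon
tokens_clean <- anti_join(tokens, my_custom_stopwords, by = "word")
tokens_clean
```

Naming the join column explicitly with `by = "word"` also avoids the "Joining with ..." message seen in the outputs above.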
