R {quanteda}: remove accents in a dictionary


I want to remove accents and punctuation from a quanteda dictionary. For example, I want to transform "à l'épreuve" into "a l epreuve". The dictionary is this one: https://www.poltext.org/fr/donnees-et-analyses/lexicoder (.cat). There are explanations for data frames (Remove accents from a dataframe column in R), but I could not find a way to do the same for dictionaries.

My code so far:

dict_lg <- dictionary(file = "frlsd/frlsd.cat", encoding = "UTF-8")

Any suggestions?


I_O (best answer)

This should work:

library(quanteda)
library(stringi)
library(stringr)

dict_lg_ascii <-
  dict_lg |>
  rapply(
    f = \(term) term |>
      ## compose from string utilities as desired
      stri_trans_general(id = 'Latin-ASCII') |>
      str_replace_all(pattern = '[[:punct:]]', replacement = ' '),
    how = 'replace'
  )

Output:

## > dict_lg_ascii
Dictionary object with 2 primary key entries and 2 nested levels.
- [NEGATIVE]:
  - a cornes, a court de personnel , a l etroit, a peine , abais , 
## truncated

From the docs:

Dictionaries can be subsetted using [ and [[, operating the same as the equivalent list operators.

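Thus rapply (which recursively applies a function over nested lists) works on a dictionary. In this case, we apply stri_trans_general as suggested here: the 'Latin-ASCII' transform converts accented letters to their ASCII equivalents, and str_replace_all then replaces every punctuation character with a space.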
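
To see what the two string steps do without loading the Lexicoder file, here is a minimal sketch on a plain nested list. The list `terms` and the helper `to_ascii` are made up for illustration and are not part of quanteda. Base `gsub` stands in for `str_replace_all` so that stringi is the only dependency, and the accented letters are written as `\u` escapes so the snippet does not depend on the file encoding.

```r
library(stringi)

# Hypothetical stand-in for a dictionary: a named list of character vectors.
terms <- list(
  NEGATIVE = c("\u00e0 l'\u00e9preuve", "\u00e0 peine"),  # "à l'épreuve", "à peine"
  POSITIVE = c("tr\u00e8s bien")                          # "très bien"
)

# Illustrative helper: transliterate to ASCII, then turn punctuation into spaces.
to_ascii <- function(term) {
  term <- stri_trans_general(term, id = "Latin-ASCII")
  gsub("[[:punct:]]", " ", term)
}

# how = "replace" keeps the nested list structure and only swaps the leaves.
terms_ascii <- rapply(terms, to_ascii, how = "replace")

print(terms_ascii$NEGATIVE)
# [1] "a l epreuve" "a peine"
```

The same `to_ascii` function could be passed as `f` in the `rapply` call on `dict_lg` above.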

shghm

This post might help: Remove all special characters from a string in R?

The stringr package together with regular expressions is probably a good way to deal with it.