How to remove both Roman numbers and Arabic numbers in TermDocumentMatrix()?

Question

Asked by Tim on 25 May 2020 at 19:31

In TermDocumentMatrix(), the parameter removeNumbers=TRUE removes Arabic numbers from an English corpus. How can I remove both Arabic numbers and Roman numerals (such as "iii", "xiv" and "xiii", in upper or lower case)? What custom function can I pass to the removeNumbers parameter to accomplish that?

The code which I am trying to understand and modify:

library(gutenbergr)
library(stringr)
library(dplyr)
library(tidyr)

library(tm)
library(topicmodels)
library(tidyverse)
library(tidytext)
library(slam)

titles = c("Wuthering Heights", "A Tale of Two Cities",
  "Alice's Adventures in Wonderland", "The Adventures of Sherlock Holmes")

##read in those books
books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title") %>% 
  mutate(document = row_number())

create_chapters = books %>% 
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>% 
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter) 

by_chapter = create_chapters %>% 
  group_by(document) %>% 
  summarise(text=paste(text,collapse=' '))

import_corpus = Corpus ( VectorSource (by_chapter$text))

no_romans <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(s))]

import_mat = DocumentTermMatrix (import_corpus,
  control = list (stemming = TRUE, #create root words
  stopwords = TRUE, #remove stop words
  wordLengths = c(3, Inf), #cut out small words (replaces the old minWordLength option)
  removeNumbers = no_romans, #take out the numbers
  removePunctuation = TRUE)) #take out punctuation

The following check shows that Roman numerals such as "iii" and "xii" still remain among the terms:

> st = import_mat$dimnames$Term
> st[grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(st))]
 [1] "cli"    "iii"    "mix"    "vii"    "viii"   "xii"    "xiii"   "xiv"   
 [9] "xix"    "xvi"    "xvii"   "xviii"  "xxi"    "xxii"   "xxiii"  "xxiv"  
[17] "xxix"   "xxv"    "xxvi"   "xxvii"  "xxviii" "xxx"    "xxxi"   "xxxii" 
[25] "xxxiii" "xxxiv"

Original Q&A

**r2evans** · Answer 1 · 2020-05-25T21:11:21.160000

Try these options.

library(tm)
dat <- VCorpus(VectorSource(c("iv. Chapter Four", "I really want to discuss the proper mix of 17 ingredients.", "Nothing to remove here.")))

inspect( DocumentTermMatrix(dat) )
# <<DocumentTermMatrix (documents: 3, terms: 13)>>
# Non-/sparse entries: 13/26
# Sparsity           : 67%
# Maximal term length: 12
# Weighting          : term frequency (tf)
# Sample             :
#     Terms
# Docs chapter discuss four here. ingredients. iv. mix nothing proper really
#    1       1       0    1     0            0   1   0       0      0      0
#    2       0       1    0     0            1   0   1       0      1      1
#    3       0       0    0     1            0   0   0       1      0      0

One of Gregor's cautions, the word "I", does not show up among the terms here (presumably because terms shorter than three characters are dropped by default), so we won't worry about it for now. Another of Gregor's cautions was the word "mix", which is both a legitimate word and a valid Roman numeral. A basic function to remove simple/whole Roman numerals might be:

no_romans <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(s))]
inspect( DocumentTermMatrix(dat, control = list(removeNumbers = no_romans)) )
# <<DocumentTermMatrix (documents: 3, terms: 12)>>
# Non-/sparse entries: 12/24
# Sparsity           : 67%
# Maximal term length: 12
# Weighting          : term frequency (tf)
# Sample             :
#     Terms
# Docs chapter discuss four here. ingredients. iv. nothing proper really remove
#    1       1       0    1     0            0   1       0      0      0      0
#    2       0       1    0     0            1   0       0      1      1      0
#    3       0       0    0     1            0   0       1      0      0      1

That removes "mix" but leaves "iv.", because the trailing period keeps it from matching the regex. If you need to remove that as well, then perhaps:

no_romans2 <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[.]?$", toupper(s))]
inspect( DocumentTermMatrix(dat, control = list(removeNumbers = no_romans2)) )
# <<DocumentTermMatrix (documents: 3, terms: 11)>>
# Non-/sparse entries: 11/22
# Sparsity           : 67%
# Maximal term length: 12
# Weighting          : term frequency (tf)
# Sample             :
#     Terms
# Docs chapter discuss four here. ingredients. nothing proper really remove the
#    1       1       0    1     0            0       0      0      0      0   0
#    2       0       1    0     0            1       0      1      1      0   1
#    3       0       0    0     1            0       1      0      0      1   0

(The only difference is adding [.]? near the end of the regex.)
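
One detail that may explain why bare numerals like "iii" survive in the question's matrix: as far as I can tell from the tm documentation, termFreq() applies removePunctuation, removeNumbers, stopwords and stemming in the order they are listed in control, so a removeNumbers function listed before removePunctuation sees tokens such as "iii." with the period still attached. (The documentation also appears to say that a SimpleCorpus, which Corpus() can return, ignores custom functions altogether, so VCorpus() looks like the safer choice here.) The sketch below imitates the two orderings in base R only; strip_punct is a hypothetical stand-in for punctuation removal, not tm's own function.

```r
# Whole-token Roman numeral filter (same regex as no_romans, without the optional period)
no_romans <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(s))]

# Hypothetical stand-in for punctuation removal (an assumption, not tm's implementation)
strip_punct <- function(s) gsub("[[:punct:]]+", "", s)

tokens <- c("iii.", "chapter", "xii,", "mix")

# Numerals filtered first, punctuation stripped afterwards:
# "iii." and "xii," do not match the regex, so they slip through
numbers_first <- strip_punct(no_romans(tokens))

# Punctuation stripped first: every numeral is a bare token by then and is dropped
punct_first <- no_romans(strip_punct(tokens))

print(numbers_first)
print(punct_first)
```

If that reading of the documentation is right, listing removePunctuation = TRUE before removeNumbers = no_romans in control = list(...) would be an alternative to adding [.]? to the regex.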

(BTW: one can use grepl(..., ignore.case=TRUE) to get the same effect as toupper(s) as used here. It is a little slower in small-sample testing, but the effect is the same.)
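
To cover the Arabic half of the question as well: a function passed as removeNumbers presumably takes the place of the built-in digit removal instead of adding to it, so the custom function has to deal with digits itself. Here is a minimal base-R sketch that uses ignore.case = TRUE in place of toupper(). The names roman_rx and no_numbers, and the choice to strip digits before testing for Roman numerals, are my assumptions and not part of tm.

```r
# Roman numeral regex from no_romans2, with the optional trailing period
roman_rx <- "^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[.]?$"

# Hypothetical helper: removes Arabic digits and whole-token Roman numerals
no_numbers <- function(s) {
  s <- gsub("[[:digit:]]+", "", s)             # strip Arabic digits
  s <- s[nzchar(s)]                            # drop tokens that were only digits
  s[!grepl(roman_rx, s, ignore.case = TRUE)]   # drop whole-token Roman numerals
}

tokens <- c("iv.", "Chapter", "17", "mix", "xiii", "ingredients.", "XIV")
kept <- no_numbers(tokens)
print(kept)

# ignore.case = TRUE and toupper() give the same matches on these tokens
same <- identical(grepl(roman_rx, tokens, ignore.case = TRUE),
                  grepl(roman_rx, toupper(tokens)))
print(same)
```

It could then be tried as removeNumbers = no_numbers on a VCorpus. Note that, like no_romans, it still drops ordinary words such as "mix" that happen to be valid Roman numerals.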
