I have already read this and this question, but I still don't understand the use of stemDocument in tm_map. Consider this example:
q17 <- VCorpus(VectorSource(x = c("poder", "pode")),
               readerControl = list(language = "pt",
                                    load = TRUE))
lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"
If I use:
> stemDocument("poder", language = "portuguese")
[1] "pod"
> stemDocument("pode", language = "portuguese")
[1] "pod"
it does work! But if I use:
> q17 <- tm_map(q17, FUN = stemDocument, language = "portuguese")
> lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"
it doesn't work. Why so?
Unfortunately you stumbled on a bug.
stemDocument works if you pass the language when you call it directly on a character vector, as in your first example.

But when you use it in tm_map, the function starts off with stemDocument.PlainTextDocument. In this function the language of the corpus is checked against the language you supply in the function. This works correctly. But at the end of this function everything is passed on to the function stemDocument.character, without the language component. In stemDocument.character the default language is specified as English. So within the tm_map call (or DocumentTermMatrix) the language you supply reverts to English and the stemming doesn't work correctly.

A workaround could be to use the package quanteda.
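A minimal sketch of that quanteda workaround, assuming quanteda is installed and that its Snowball-based stemmer accepts "portuguese" as the language name, as stemDocument does. The object names (txt, toks, toks_stemmed, stems) are only illustrative:

```r
library(quanteda)

# Same two words as in the question
txt <- c("poder", "pode")

# Tokenise, then stem with the language passed explicitly
toks <- tokens(txt)
toks_stemmed <- tokens_wordstem(toks, language = "portuguese")
as.list(toks_stemmed)

# A document-feature matrix built from the stemmed tokens
dfm(toks_stemmed)

# For a plain character vector of single words, char_wordstem() is enough
stems <- char_wordstem(txt, language = "portuguese")
stems
```

Here the language goes straight to the stemmer, so nothing falls back to English along the way.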
Since you are working with Portuguese, I suggest using the packages quanteda, udpipe, or both. Both packages handle non-English languages a lot better than tm.
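If you would rather stay within tm, one possible sketch is to bypass stemDocument and call SnowballC::wordStem yourself through content_transformer, so the language is passed explicitly. This is an assumption on my part rather than an official fix; stem_pt and q17_stemmed are hypothetical names, and splitting on whitespace is a simplification:

```r
library(tm)
library(SnowballC)

q17 <- VCorpus(VectorSource(x = c("poder", "pode")),
               readerControl = list(language = "pt"))

# Hypothetical helper: stem every whitespace-separated word of each line,
# handing the language directly to wordStem()
stem_pt <- content_transformer(function(x) {
  vapply(strsplit(x, "[[:space:]]+"),
         function(words) paste(wordStem(words, language = "portuguese"),
                               collapse = " "),
         character(1), USE.NAMES = FALSE)
})

q17_stemmed <- tm_map(q17, stem_pt)
lapply(q17_stemmed, content)
```

Because the language is fixed inside the helper, it does not depend on how tm_map forwards extra arguments.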