Is there a way to add a dictionary of custom, user-defined words to the udpipe models?
For example, in the code below, which uses the default English model, several words that should have been
identified as keywords are missed: R, Python, SQL, javascript, Excel, noSQL.
I would like to augment the default English model with my own custom words so that the textrank_keywords function can better identify relevant keywords.
library(udpipe)
library(dplyr)
tagger <- udpipe_download_model("english")
tagger <- udpipe_load_model(tagger$file_model)
# read data
rawdata <- c("Automating and R/Python package development.","You have a sound knowledge of another data analysis language (R,Python, SQL, javascript) and you don't care in which relational database, Excel, bigdata or noSQL store your data is located.")
# annotate
rawdata_annotate <- udpipe_annotate(tagger, rawdata) %>% as_tibble()
keyw <- textrank_keywords(rawdata_annotate$lemma,
                          relevant = rawdata_annotate$upos %in% c("PROPN", "NOUN", "VERB", "ADJ"))
have <- keyw$terms
have
[1] "package" "analysis" "sound" "relational"
rawdata_annotate %>% dplyr::filter(token %in% c('R', 'Python', 'SQL', 'javascript', 'Excel', 'noSQL')) %>% dplyr::select(token, lemma, upos)
  token      lemma      upos
  <chr>      <chr>      <chr>
1 R          R          PROPN
2 Python     python     NOUN
3 R          r          NOUN
4 Python     python     NOUN
5 SQL        sql        NOUN
6 javascript javascript NOUN
7 Excel      Excel      PROPN
8 noSQL      nosql      AUX
I think I found the answer. Basically, I would need to create a custom
CoNLL-U file containing the custom annotations and then train a new model on it.
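
A minimal sketch of what that training step might look like with udpipe_train from the udpipe package. The file names custom-train.conllu and custom_english.udpipe are hypothetical placeholders, and I am assuming the training file already holds sentences in CoNLL-U format in which words such as R, Python and noSQL carry the tags I want (for example PROPN). A handful of custom sentences alone is unlikely to produce a usable model, so in practice they would probably have to be appended to an existing English treebank.

```r
library(udpipe)

# Hypothetical path: a CoNLL-U file that includes the custom annotations
train_file <- "custom-train.conllu"
stopifnot(file.exists(train_file))

# Train only the tokenizer and the tagger; the dependency parser is
# skipped because textrank_keywords only needs lemma and upos here
trained <- udpipe_train(
  file = "custom_english.udpipe",      # hypothetical output file name
  files_conllu_training = train_file,
  annotation_tokenizer = "default",
  annotation_tagger = "default",
  annotation_parser = "none"
)

# Load the newly trained model in place of the downloaded one
tagger <- udpipe_load_model(trained$file_model)
```

The rest of the pipeline (udpipe_annotate followed by textrank_keywords) should then work unchanged with the new tagger, although I have not verified how much the keyword output improves.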