I am trying to use openNLP to look through rows of text and classify sentences into thematic buckets. Here is a sample df:
dat <- data.frame(text=c("A fluffy crab discovered off the coast of Western Australia has been named after the ship that carried Charles Darwin around the world. The new species, Lamarckdromia beagle, belongs to the Dromiidae family, commonly known as sponge crabs. Crustaceans in this family fashion and use sea sponges and ascidians – animals including sea squirts – for protection. They trim the creatures using their claws and wear them like hats. ",
"The inadmissibility of such actions, which violate the relevant legal and political obligations of the European Union and lead to an escalation of tensions, was pointed out, the ministry said in a statement. Speaking shortly after the meeting, Ederer said he had called on the Russian government to remain calm and resolve this issue diplomatically, the Russian news agency Tass reported."),
date=c(as.Date("2020-12-26"),as.Date("2020-12-31")),
id= c("1", "2"))
I've gotten as far as splitting the text into sentences and then searching for the keywords, using the following code:
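library(NLP)        # as.String(), annotate()
library(openNLP)    # Maxent_Sent_Token_Annotator()
library(SnowballC)  # wordStem()
# split the text into sentences, then search for keywords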
all_sentence <- as.String(dat$text)
sent_annotator <- Maxent_Sent_Token_Annotator()
annotation <- annotate(all_sentence, sent_annotator)
split_text <- all_sentence[annotation]
# word list to search for
word_dat <- data.frame(words=c("animal", "species", "political", "government"),
theme=c("nature", "nature", "geopolitics", "geopolitics"))
stem_keyword <- wordStem(word_dat$words, language = "english")
for(kw in stem_keyword) {
x=grep(kw, split_text)
print(split_text[x])
print(stem_keyword[x])
}
However, my for loop doesn't print exactly what I'm looking for. For example, print(stem_keyword[x]) is giving me the wrong keyword for the wrong sentence. In the end I don't want to print; I want to write the results to a new data frame with this structure:
final_df <- data.frame(text=c("A fluffy crab discovered off the coast of Western Australia has been named after the ship that carried Charles Darwin around the world.", "The new species, Lamarckdromia beagle, belongs to the Dromiidae family, commonly known as sponge crabs.","Crustaceans in this family fashion and use sea sponges and ascidians – animals including sea squirts – for protection.", "They trim the creatures using their claws and wear them like hats.",
"The inadmissibility of such actions, which violate the relevant legal and political obligations of the European Union and lead to an escalation of tensions, was pointed out, the ministry said in a statement.",
"Speaking shortly after the meeting, Ederer said he had called on the Russian government to remain calm and resolve this issue diplomatically, the Russian news agency Tass reported."),
keyword=c("null", "species", "animal", "null", "political", "government"),
theme=c("null", "nature", "nature", "null", "geopolitics", "geopolitics"),
id= c("1", "1", "1", "1", "2", "2"))
Any advice or help getting my for loop to where I need it to be? TIA
EDIT: I would also like sentences that cannot be classified to appear in the final data frame with 'null' keywords and themes.
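What you want instead is print(kw). In your loop, x holds the positions of the matching sentences, so stem_keyword[x] indexes the keyword vector with sentence positions and returns unrelated keywords. I'm going to provide you with the complete solution for putting your data into a data frame anyway: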
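As a minimal sketch of the idea in base R only: the `sentences` data frame below is a hypothetical, hand-written stand-in for the openNLP sentence split (shortened versions of your sentences, one row per sentence, carrying the `id` of the source row), so it runs without the Java models. It also assumes plain substring matching on the raw keywords rather than on stems, and keeps only the first keyword that matches each sentence.

```r
# Hypothetical stand-in for the sentence-split output (not produced by openNLP here).
sentences <- data.frame(
  text = c("The new species belongs to the Dromiidae family.",
           "They trim the creatures using their claws.",
           "Ederer called on the Russian government to remain calm."),
  id = c("1", "1", "2"),
  stringsAsFactors = FALSE
)

word_dat <- data.frame(
  words = c("animal", "species", "political", "government"),
  theme = c("nature", "nature", "geopolitics", "geopolitics"),
  stringsAsFactors = FALSE
)

# Start with every sentence unclassified, as requested in the EDIT.
keyword <- rep("null", nrow(sentences))
theme   <- rep("null", nrow(sentences))

# Loop over keywords by index so keyword and theme stay aligned.
for (i in seq_len(nrow(word_dat))) {
  # Sentences containing this keyword that have not been classified yet.
  hit <- grepl(word_dat$words[i], sentences$text, fixed = TRUE) & keyword == "null"
  keyword[hit] <- word_dat$words[i]
  theme[hit]   <- word_dat$theme[i]
}

final_df <- data.frame(
  text = sentences$text,
  keyword = keyword,
  theme = theme,
  id = sentences$id,
  stringsAsFactors = FALSE
)
print(final_df)
```

The point of looping over the index i rather than over the keywords themselves is that the logical vector hit always refers to sentences, while i always refers to rows of word_dat, so the two are never mixed up. To match on stems instead, one option is to search for wordStem(word_dat$words[i]) while still storing word_dat$words[i] as the keyword.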