I'm trying to count the most commonly occurring bigrams across 1500 IDs (one ID per row, each with one event) without counting a bigram more than once within an ID (row). For example, given the IDs below, I would want 'work day' counted only once per ID, so its total in my analysis should be 2. Once 'work day' has been counted in an ID, I don't want it counted again.
ID Text
1 "The work day was horrible. On this particular work day, I made 5 mistakes....."
2 "This long work day was the best for me. I miss a long work day, because I get into a rhythm....."
This is my code. It plots the total counts of the 40 most frequently occurring bigrams as a bar chart showing each two-word bigram and its count. I'm not sure whether it counts a bigram more than once per ID as described above; I believe it is just taking all 'Events' and counting every occurrence of each bigram, regardless of ID.
library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)

Sum1 %>%
  unnest_tokens(word, "Event", token = "ngrams", n = 2) %>%
  separate(word, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(word, word1, word2, sep = " ") %>%
  count(word, sort = TRUE) %>%
  slice(1:40) %>%
  ggplot() +
  geom_bar(aes(x = reorder(word, n), y = n), stat = "identity", fill = "#de5833") +
  theme_minimal() +
  coord_flip()
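If the goal is to count each bigram at most once per ID, one minimal change to the pipeline above (a sketch, assuming the ID column is named ID) is to drop per-ID duplicates with distinct() before count():

```r
library(dplyr)
library(tidyr)
library(tidytext)

# toy data mirroring the two example IDs above
Sum1 <- data.frame(
  ID = c(1, 2),
  Event = c("The work day was horrible. On this particular work day, I made 5 mistakes.",
            "This long work day was the best for me. I miss a long work day, because I get into a rhythm."),
  stringsAsFactors = FALSE
)

bigram_counts <- Sum1 %>%
  unnest_tokens(word, "Event", token = "ngrams", n = 2) %>%
  separate(word, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(word, word1, word2, sep = " ") %>%
  distinct(ID, word) %>%        # keep each bigram at most once per ID
  count(word, sort = TRUE)

bigram_counts                   # 'work day' is counted once per ID, so n = 2
```

The only difference from the original pipeline is the distinct(ID, word) step, which collapses repeated occurrences of a bigram within the same ID before counting.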
Something like this? Using base R: strsplit to split the text into words and Filter to remove stopwords right from the original text, then distinct to retain unique bigrams per ID only. (strsplit returns a list whose single item, the word vector, has to be plucked with [[1]] before Filtering.) Finally, count the bigrams.
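A sketch of that approach (the toy data frame, column names, and punctuation cleanup are my assumptions; strsplit and Filter are base R, while distinct and count come from dplyr and stop_words from tidytext):

```r
library(dplyr)
library(tidytext)   # only for the stop_words lexicon

Sum1 <- data.frame(
  ID = c(1, 2),
  Text = c("The work day was horrible. On this particular work day, I made 5 mistakes.",
           "This long work day was the best for me. I miss a long work day, because I get into a rhythm."),
  stringsAsFactors = FALSE
)

# for each row: lowercase, strip punctuation, split into words,
# drop stopwords, then pair consecutive remaining words into bigrams
bigrams <- do.call(rbind, lapply(seq_len(nrow(Sum1)), function(i) {
  words <- strsplit(tolower(gsub("[[:punct:]]", "", Sum1$Text[i])), "\\s+")[[1]]
  words <- Filter(function(w) !w %in% stop_words$word, words)
  if (length(words) < 2) return(NULL)
  data.frame(ID   = Sum1$ID[i],
             word = paste(head(words, -1), tail(words, -1)),
             stringsAsFactors = FALSE)
}))

res <- bigrams %>%
  distinct(ID, word) %>%    # retain each bigram once per ID
  count(word, sort = TRUE)

res                         # 'work day' gets n = 2, once per ID
```

Note that removing stopwords before forming bigrams (as the answer describes) pairs words that were not adjacent in the original text, which differs slightly from tokenizing first and filtering afterwards.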