Count of most commonly occurring bigram in 1500 IDs without repeating the count within an ID

59 Views Asked by At

I'm trying to count the most commonly occurring bigram in 1500 IDS (1 ID per row with 1 event) without counting the bigram more than 1x in each ID (row). For example, if I have the IDs below, I would only want 'work day' counted 1 x in each ID. The summary for the number of times 'work day' should show-up in my analysis should be 2. Once 'work day' gets counted in an ID I don't want it counted again.

ID Text
1  "The work day was horrible. On this particular work day, I made 5 mistakes....."
2  "This long work day was the best for me. I miss a long work day, because I get into a rhythm....."

This is my code. It gives the total counts for the 40 most frequently occurring bigrams as a histogram showing the 2 word bigram and the count. I'm not sure if it is counting the occurrence of a bigram more than 1x per ID as listed above, although I do believe it's just taking all 'Events' and counting however many times the 2 word bigram is occurring regardless of ID.

Sum1 %>% 
    unnest_tokens(word, "Event", token = "ngrams", n = 2) %>% 
    separate(word, c("word1", "word2"), sep = " ") %>% 
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word) %>% 
    unite(word,word1, word2, sep = " ") %>% 
    count(word, sort = TRUE) %>% 
    slice(1:40) %>% 
    ggplot() + geom_bar(aes(x=reorder(word,n), y=n), stat = "identity", fill = "#de5833") +
    theme_minimal() +
    coord_flip()

1

There are 1 best solutions below

6
I_O On

Something like this?

    library(tidytext)
    library(dplyr)

    d <- data.frame(ID = 1:2,
                    txt = c('a particular word', 
                            'a particular word a phrase and a particular word')
                    )

## > d

      ID                                              txt
    1  1                                a particular word
    2  2 a particular word a phrase and a particular word

using base R strsplit and Filter to filter out stopwords right from the original text, finally distinct to retain unique bigrams per ID only:

d |>
  rowwise() |>
  mutate(txt = strsplit(txt, split = '\\s')[[1]] |> 
           Filter(f = \(x) !(x %in% get_stopwords()$word)) |>
           paste(collapse = ' ')
         ) |>
  unnest_tokens(input = txt, output = 'tokens',
                token = 'ngrams', n = 2) |>
  distinct(ID, tokens)

(strsplit returns a list whose single item, the word vector, has to be plucked with [[1]] before Filtering)

output:

+ # A tibble: 4 x 2
# Rowwise: 
     ID tokens           
  <int> <chr>            
1     1 particular word  
2     2 particular word  
3     2 word phrase      
4     2 phrase particular

Finally, count the bigrams like so:

## earlier steps (see above) 
## ... |>
count(tokens)
+ # A tibble: 3 x 2
# Rowwise: 
  tokens                n
  <chr>             <int>
1 particular word       2
2 phrase particular     1
3 word phrase           1