I have a data frame that looks like this:
| date | text |
|---|---|
| 201901 | Thank you for helping me |
| 201902 | You are amazing |
| 201902 | For helping with this |
My aim is to calculate the word frequency in each line, so that the result eventually looks like this:
| date | thank | you | for | helping | me | are | amazing | with | this |
|---|---|---|---|---|---|---|---|---|---|
| 201901 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 201902 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
The actual data set looks like this frame, but contains millions of text lines. So I was wondering how to automate this process in R, without typing out all those text lines by hand.
Using R and tidyverse:
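A minimal sketch of the tokenize-and-count step, assuming the `tidytext` and `dplyr` packages are available (the sample data is rebuilt inline; `unnest_tokens()` lowercases tokens by default):

```r
library(dplyr)
library(tidytext)  # assumed available; provides unnest_tokens()

df <- tibble(
  date = c(201901, 201902, 201902),
  text = c("Thank you for helping me",
           "You are amazing",
           "For helping with this")
)

word_counts <- df %>%
  unnest_tokens(word, text) %>%  # one row per word; lowercased by default
  count(date, word)              # frequency of each word per date
word_counts
```

This gives a long-format tibble with one row per `(date, word)` pair, which scales well to millions of lines because no wide matrix is built yet.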
If you want your data as a table of counts:
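One way to get a plain contingency table is base R's `table()` applied to the tokenized words (a sketch, again assuming `tidytext`):

```r
library(dplyr)
library(tidytext)  # assumed available

df <- tibble(
  date = c(201901, 201902, 201902),
  text = c("Thank you for helping me",
           "You are amazing",
           "For helping with this")
)

tokens <- df %>% unnest_tokens(word, text)       # one lowercased word per row
counts_table <- table(tokens$date, tokens$word)  # rows: date, columns: word
counts_table
```

The result is a `table` object with one row per date and one column per distinct word, columns in alphabetical order.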
If you want your output as a tibble:
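A sketch that pivots the counts into a wide tibble with `tidyr::pivot_wider()` (assuming `tidyr` is available; `values_fill = 0` writes 0 for words absent from a date's lines, matching the desired output):

```r
library(dplyr)
library(tidytext)
library(tidyr)  # assumed available; provides pivot_wider()

df <- tibble(
  date = c(201901, 201902, 201902),
  text = c("Thank you for helping me",
           "You are amazing",
           "For helping with this")
)

wide <- df %>%
  unnest_tokens(word, text) %>%                  # tokenize, lowercase
  count(date, word) %>%                          # frequency per date
  pivot_wider(names_from  = word,
              values_from = n,
              values_fill = 0)                   # 0 where a word is absent
wide
```

Because this stays a tibble, you can keep piping it into further `dplyr` verbs.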
edit: converted everything to lowercase to match your desired output.

edit2: you can also get the result as a tibble, so you can keep working with it.