I have a data frame that looks like this:
| date | text |
|---|---|
| 201901 | Thank you for helping me |
| 201902 | You are amazing |
| 201902 | For helping with this |
My aim is to calculate the word frequency in each line, so that the result eventually looks like this:
| date | thank | you | for | helping | me | are | amazing | with | this |
|---|---|---|---|---|---|---|---|---|---|
| 201901 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 201902 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
The actual data set looks like this frame, but contains millions of text lines. So I was wondering how to automate this process in R, without typing out all those text lines by hand.
Using R and tidyverse:
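A minimal sketch of the tokenize-and-count step, assuming the `tidytext` and `dplyr` packages are available (the sample data is rebuilt inline; `unnest_tokens()` lowercases tokens by default):

```r
library(dplyr)
library(tidytext)  # assumed available; provides unnest_tokens()

df <- tibble(
  date = c(201901, 201902, 201902),
  text = c("Thank you for helping me",
           "You are amazing",
           "For helping with this")
)

word_counts <- df %>%
  unnest_tokens(word, text) %>%  # one row per word; lowercased by default
  count(date, word)              # frequency of each word per date
word_counts
```

This gives a long-format tibble with one row per `(date, word)` pair, which scales well to millions of lines because no wide matrix is built yet.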
If you want your data as a table of counts:
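One way to get a plain contingency table is base R's `table()` applied to the tokenized words (a sketch, again assuming `tidytext`):

```r
library(dplyr)
library(tidytext)  # assumed available

df <- tibble(
  date = c(201901, 201902, 201902),
  text = c("Thank you for helping me",
           "You are amazing",
           "For helping with this")
)

tokens <- df %>% unnest_tokens(word, text)       # one lowercased word per row
counts_table <- table(tokens$date, tokens$word)  # rows: date, columns: word
counts_table
```

The result is a `table` object with one row per date and one column per distinct word, columns in alphabetical order.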
If you want your output as a tibble:
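A sketch that pivots the counts into a wide tibble with `tidyr::pivot_wider()` (assuming `tidyr` is available; `values_fill = 0` writes 0 for words absent from a date's lines, matching the desired output):

```r
library(dplyr)
library(tidytext)
library(tidyr)  # assumed available; provides pivot_wider()

df <- tibble(
  date = c(201901, 201902, 201902),
  text = c("Thank you for helping me",
           "You are amazing",
           "For helping with this")
)

wide <- df %>%
  unnest_tokens(word, text) %>%                  # tokenize, lowercase
  count(date, word) %>%                          # frequency per date
  pivot_wider(names_from  = word,
              values_from = n,
              values_fill = 0)                   # 0 where a word is absent
wide
```

Because this stays a tibble, you can keep piping it into further `dplyr` verbs.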
edit: converted everything to lowercase to match your desired output.

edit2: you can also get the result as a tibble, so you can keep working with it.