I have a data frame where columns have strings of text. I want to search within the text of multiple columns and count the occurrence of specific words, and then make a table that reports the co-occurence (by multiple columns).
Here's a small data frame to make this reproducible:
#create dataframe
TI <- c('Studies of pancreatic cancer',
'colon, breast, and pancreatic cancer',
'cancer and CVD paper',
'CVD paper'
)
AB <- c('Autoimmune pancreatitis (AIP) is now considered a disease to be treated by diet.',
'dietary patterns, positive associations were found between pancreatic cancer risk ',
'Cardiovascular diseases (CVD) is linked to Diet',
'making this match with pancreas'
)
df <- data.frame(TI, AB)
#TI = title (of a paper)
#AB = abstract of the same paper
I then want to see how many titles have either the word 'cancer' or 'CVD' in the title (TI column), and the co-occurrence with the words 'pancreatic (or some variation)' or 'diet' in the abstract (AB column)
I can count the single occurence of the words I want with grep
pancreatic.ab <- length(grep("pancreat*", df$AB, ignore.case = TRUE, perl = FALSE))
pancreatic.ab
diet.ab <- length(grep("diet*", df$AB, ignore.case = TRUE, perl = FALSE))
diet.ab
cancer.ti <- length(grep("cancer*", df$TI, ignore.case = TRUE, perl = FALSE))
cancer.ti
CVD.ti <- length(grep("CVD", df$TI, ignore.case = TRUE, perl = FALSE))
CVD.ti
but not sure how to do this for a complicated cross tabs table.
any suggestions?

You can bind the rows that have your targets in the
TIcolumns, providing a label for the columns belonging to each (in my exampleti). Then, grouping by that label (i.e.group_by(ti)), you can create the columns with the counts, usingsummarize()Output: