In this problem, I am trying to "score" a task from a computer administered questionnaire, where each person is asked to type 5, 3-letter words into a text box. The result in the data.frame looks like this:
dput(head(WRRsp, 10))
structure(list(Subject = c(100101, 100108, 100110, 100114, 100119,
100123, 100133, 100148, 100155, 100159), WRRsp = c("Elk,Mop,Bat,Sop,Car",
"Ely,Mop,Bat,Toe,Spy", "Bat,Mop,Spy,Elk,Top", "Spy,Bad,Toe",
"Elk,Mop,Toe", "Eelk,Spy", "Elk,Toe,Mop,Box,Car", "Mope,Eik",
"Ee,Elk,Mop,Bat,Fox", "E,L,K,Mop,Spy")), row.names = c(NA, -10L
), class = c("tbl_df", "tbl", "data.frame"))
The correct responses are the following
- elk
- mop
- bat
- toe
- spy
The idea is to create a 0-15 score taking into account points per letter for targets. For example, "Elk, Mop, Bat, Sky" would score as 11 out of 15 because "Sky" is not a correct word, but contains two of three appropriate letters. The major problem I'm encountering is that the order of the words doesn't matter, they could be entered in any order.
My first pass at this was to score based on position, so the targets would be:
| col1 | col2 | col3 |
|---|---|---|
| e | l | k |
| m | o | p |
| b | a | t |
| t | o | e |
| s | p | y |
An example scoring, if doing this by hand:
| Subject | WRRsp | 1st | 2nd | 3rd | Total |
|---|---|---|---|---|---|
| 100101 | Elk,Mop,Bat,Sop,Car | 4 | 4 | 3 | 11 |
| 100108 | Ely,Mop,Bat,Toe,Spy | 5 | 5 | 4 | 14 |
| 100110 | Bat,Mop,Spy,Elk,Top | 5 | 5 | 4 | 14 |
| 100114 | Spy,Bad,Toe | 3 | 3 | 2 | 8 |
| 100119 | Elk,Mop,Toe | 3 | 3 | 3 | 9 |
| 100123 | Eelk,Spy | 2 | 1 | 1 | 4 |
| 100133 | Elk,Toe,Mop,Box,Car | 4 | 4 | 3 | 11 |
| 100148 | Mope,Eik | 2 | 1 | 2 | 5 |
| 100155 | Ee,Elk,Mop,Bat,Fox | 3 | 4 | 3 | 10 |
| 100159 | E,L,K,Mop,Spy | 3 | 2 | 2 | 7 |
Some context: this is only one question in a large questionnaire designed to test for signs of cognitive impairment.
I can get really close with way too much code:
target1 <- c("e", "m", "b", "t", "s")
target2 <- c("l", "o", "a", "o", "p")
target3 <- c("k", "p", "t", "e", "y")
wrrsp_score <- WRRsp %>%
separate_wider_delim(
"WRRsp",
delim = ",",
names = c("w1", "w2", "w3", "w4", "w5"),
too_few = "align_start"
) %>%
mutate(
col1 = paste(
str_sub(w1, 1, 1),
str_sub(w2, 1, 1),
str_sub(w3, 1, 1),
str_sub(w4, 1, 1),
str_sub(w5, 1, 1)
),
col1 = str_remove_all(col1, "NA"),
# this stuff isn't really necessary
col1 = str_trim(col1),
col1 = tolower(col1),
col2 = paste(
str_sub(w1, 2, 2),
str_sub(w2, 2, 2),
str_sub(w3, 2, 2),
str_sub(w4, 2, 2),
str_sub(w5, 2, 2)
),
col2 = str_remove_all(col2, "NA"),
col2 = str_trim(col2),
col2 = tolower(col2),
col3 = paste(
str_sub(w1, 3, 3),
str_sub(w2, 3, 3),
str_sub(w3, 3, 3),
str_sub(w4, 3, 3),
str_sub(w5, 3, 3)
),
col3 = str_remove_all(col3, "NA"),
col3 = str_trim(col3),
col3 = tolower(col3)
) %>%
mutate(
col1_e = ifelse(grepl(target1[1], col1), 1, 0),
col1_m = ifelse(grepl(target1[2], col1), 1, 0),
col1_b = ifelse(grepl(target1[3], col1), 1, 0),
col1_t = ifelse(grepl(target1[4], col1), 1, 0),
col1_s = ifelse(grepl(target1[5], col1), 1, 0),
col1_score = rowSums(pick(col1_e:col1_s), na.rm = TRUE),
col2_l = ifelse(grepl(target2[1], col2), 1, 0),
col2_o = ifelse(grepl(target2[2], col2), 1, 0),
col2_a = ifelse(grepl(target2[3], col2), 1, 0),
col2_o2 = ifelse(grepl(target2[4], col2), 1, 0),
col2_s = ifelse(grepl(target2[5], col2), 1, 0),
col2_score = rowSums(pick(col2_l:col2_s), na.rm = TRUE),
col3_k = ifelse(grepl(target3[1], col3), 1, 0),
col3_p = ifelse(grepl(target3[2], col3), 1, 0),
col3_t = ifelse(grepl(target3[3], col3), 1, 0),
col3_e = ifelse(grepl(target3[4], col3), 1, 0),
col3_y = ifelse(grepl(target3[5], col3), 1, 0),
col3_score = rowSums(pick(col3_k:col3_y), na.rm = TRUE)
) %>%
mutate(WRRspPerCharAC = rowSums(pick(ends_with("_score")), na.rm = TRUE))
Where this fails, is with the second "o" in target2, it counts any "o" twice even if there are isn't anything to count (e.g., only two words entered). Two things: how to fix the duplicate issue, and how to make this more efficient/more streamlined, because what I have above seems overengineered and insane.
Here's one attempt that looks to return the correct result with your sample data. It relies on
pmatch()to match the correct vector against the vectors of answers split on letter positions.