Identifying and counting "correct" letters in a comma delimited string in R

Question

Identifying and counting "correct" letters in a comma delimited string in R

58 Views Asked by rrevans At 16 March 2023 at 05:19

In this problem, I am trying to "score" a task from a computer administered questionnaire, where each person is asked to type 5, 3-letter words into a text box. The result in the data.frame looks like this:

dput(head(WRRsp, 10))

structure(list(Subject = c(100101, 100108, 100110, 100114, 100119, 
100123, 100133, 100148, 100155, 100159), WRRsp = c("Elk,Mop,Bat,Sop,Car", 
"Ely,Mop,Bat,Toe,Spy", "Bat,Mop,Spy,Elk,Top", "Spy,Bad,Toe", 
"Elk,Mop,Toe", "Eelk,Spy", "Elk,Toe,Mop,Box,Car", "Mope,Eik", 
"Ee,Elk,Mop,Bat,Fox", "E,L,K,Mop,Spy")), row.names = c(NA, -10L
), class = c("tbl_df", "tbl", "data.frame"))

The correct responses are the following

elk
mop
bat
toe
spy

The idea is to create a 0-15 score taking into account points per letter for targets. For example, "Elk, Mop, Bat, Sky" would score as 11 out of 15 because "Sky" is not a correct word, but contains two of three appropriate letters. The major problem I'm encountering is that the order of the words doesn't matter, they could be entered in any order.

My first pass at this was to score based on position, so the targets would be:

col1	col2	col3
e	l	k
m	o	p
b	a	t
t	o	e
s	p	y

An example scoring, if doing this by hand:

Subject	WRRsp	1st	2nd	3rd	Total
100101	Elk,Mop,Bat,Sop,Car	4	4	3	11
100108	Ely,Mop,Bat,Toe,Spy	5	5	4	14
100110	Bat,Mop,Spy,Elk,Top	5	5	4	14
100114	Spy,Bad,Toe	3	3	2	8
100119	Elk,Mop,Toe	3	3	3	9
100123	Eelk,Spy	2	1	1	4
100133	Elk,Toe,Mop,Box,Car	4	4	3	11
100148	Mope,Eik	2	1	2	5
100155	Ee,Elk,Mop,Bat,Fox	3	4	3	10
100159	E,L,K,Mop,Spy	3	2	2	7

Some context: this is only one question in a large questionnaire designed to test for signs of cognitive impairment.

I can get really close with way too much code:

target1 <- c("e", "m", "b", "t", "s")
target2 <- c("l", "o", "a", "o", "p")
target3 <- c("k", "p", "t", "e", "y")

wrrsp_score <- WRRsp %>%
  separate_wider_delim(
    "WRRsp",
    delim = ",",
    names = c("w1", "w2", "w3", "w4", "w5"),
    too_few = "align_start"
  ) %>%
  mutate(
    col1 = paste(
      str_sub(w1, 1, 1),
      str_sub(w2, 1, 1),
      str_sub(w3, 1, 1),
      str_sub(w4, 1, 1),
      str_sub(w5, 1, 1)
    ),
    col1 = str_remove_all(col1, "NA"),
    # this stuff isn't really necessary
    col1 = str_trim(col1),
    col1 = tolower(col1),
    
    col2 = paste(
      str_sub(w1, 2, 2),
      str_sub(w2, 2, 2),
      str_sub(w3, 2, 2),
      str_sub(w4, 2, 2),
      str_sub(w5, 2, 2)
    ),
    col2 = str_remove_all(col2, "NA"),
    col2 = str_trim(col2),
    col2 = tolower(col2),
    
    col3 = paste(
      str_sub(w1, 3, 3),
      str_sub(w2, 3, 3),
      str_sub(w3, 3, 3),
      str_sub(w4, 3, 3),
      str_sub(w5, 3, 3)
    ),
    col3 = str_remove_all(col3, "NA"),
    col3 = str_trim(col3),
    col3 = tolower(col3)
  ) %>%
  mutate(
    col1_e = ifelse(grepl(target1[1], col1), 1, 0),
    col1_m = ifelse(grepl(target1[2], col1), 1, 0),
    col1_b = ifelse(grepl(target1[3], col1), 1, 0),
    col1_t = ifelse(grepl(target1[4], col1), 1, 0),
    col1_s = ifelse(grepl(target1[5], col1), 1, 0),
    col1_score = rowSums(pick(col1_e:col1_s), na.rm = TRUE),
    
    col2_l  = ifelse(grepl(target2[1], col2), 1, 0),
    col2_o  = ifelse(grepl(target2[2], col2), 1, 0),
    col2_a  = ifelse(grepl(target2[3], col2), 1, 0),
    col2_o2 = ifelse(grepl(target2[4], col2), 1, 0),
    col2_s  = ifelse(grepl(target2[5], col2), 1, 0),
    col2_score = rowSums(pick(col2_l:col2_s), na.rm = TRUE),
    
    col3_k = ifelse(grepl(target3[1], col3), 1, 0),
    col3_p = ifelse(grepl(target3[2], col3), 1, 0),
    col3_t = ifelse(grepl(target3[3], col3), 1, 0),
    col3_e = ifelse(grepl(target3[4], col3), 1, 0),
    col3_y = ifelse(grepl(target3[5], col3), 1, 0),
    col3_score = rowSums(pick(col3_k:col3_y), na.rm = TRUE)
  ) %>%
  mutate(WRRspPerCharAC = rowSums(pick(ends_with("_score")), na.rm = TRUE))

Where this fails, is with the second "o" in target2, it counts any "o" twice even if there are isn't anything to count (e.g., only two words entered). Two things: how to fix the duplicate issue, and how to make this more efficient/more streamlined, because what I have above seems overengineered and insane.

Original Q&A

There are 1 best solutions below

**lotus** · Accepted Answer · 2023-03-16T07:02:41.683000

Here's one attempt that looks to return the correct result with your sample data. It relies on pmatch() to match the correct vector against the vectors of answers split on letter positions.

library(stringr)
library(dplyr)
library(tidyr)
library(purrr)

targets <- list(c("e", "m", "b", "t", "s"),
                c("l", "o", "a", "o", "p"),
                c("k", "p", "t", "e", "y"))

dat |>
  mutate(score = tolower(WRRsp) |>
           str_split(",") |>
           map(
             ~ str_split_fixed(.x, "", n = Inf)[, 1:3] |>
               asplit(2) |>
               map2_dbl(targets,
                        ~ sum(pmatch(.y, .x, nomatch = 0) > 0))
           )) |>
  unnest_wider(score, names_sep = "_") |>
  mutate(Total = rowSums(pick(starts_with("score"))))

# A tibble: 10 × 6
   Subject WRRsp               score_1 score_2 score_3 Total
     <dbl> <chr>                 <dbl>   <dbl>   <dbl> <dbl>
 1  100101 Elk,Mop,Bat,Sop,Car       4       4       3    11
 2  100108 Ely,Mop,Bat,Toe,Spy       5       5       4    14
 3  100110 Bat,Mop,Spy,Elk,Top       5       5       4    14
 4  100114 Spy,Bad,Toe               3       3       2     8
 5  100119 Elk,Mop,Toe               3       3       3     9
 6  100123 Eelk,Spy                  2       1       1     4
 7  100133 Elk,Toe,Mop,Box,Car       4       4       3    11
 8  100148 Mope,Eik                  2       1       2     5
 9  100155 Ee,Elk,Mop,Bat,Fox        3       4       3    10
10  100159 E,L,K,Mop,Spy             3       2       2     7

Identifying and counting "correct" letters in a comma delimited string in R

There are 1 best solutions below

Related Questions in R

Related Questions in STRING

Related Questions in DUPLICATES

Related Questions in CODING-EFFICIENCY

Trending Questions

Popular # Hahtags

Popular Questions