plyr ddply across two dataframes

48 Views Asked by At

not sure if I'm using the right function, but I have two dataframes. I want to take the sequence value one by one from df2 and use a function (LCS_score) on it to compare it to each code in df1, then return an aggregate count value in the same row of df2 but a new column using some kind of vectorized approach as the real dataset is very large. I previously for another application had some success using the ldply function from plyr for iterating over a list. In this case I'm not quite sure how to set it up to have two dataframe inputs.

data

df1 <- structure(list(code = c("E050 H055 C058 K052 O050 E007", "E051 D052 K053 X050 H055 F054 E013 C057 O050 B030 H056 J053 C058 D030 Q078 Q076 A014 F030 F036 E030 C055 X030 M050 Q006 E007", 
                                "E050 F030 F036 B030 H058 E051 J032 J050 E013 E005 K052 H056 A014 K053 A051 D052 E030 E007 Q072", 
                                "J055 J050 Q006 Q076 Q074 J053 H051 H058 H056 E051 Q077 Q079 E030 D052 A050 Q072 Q003 E050 C058 B030 F030 A014 E007 A010", 
                                "B030 F054 E050 X030 Q006 A050 C058 E007 E051 H058 Q078 F030 J050 K053 D030", 
                                "D030 D052 E051 D051 C058 C055 H058 K053 E050 J054 A039 B030 E007", 
                                "A006 E030 Q076 X001 Q010 Q006 A014 Q072 E007 E051 A050 J032 A051 E050 B030 A010 D052 H056 H058 Q003 E013", 
                                "E050 H056 A050 C058 E013 Q078 E051 J055 D030 A030 D052 D051 K053 E030 E007 Q076", 
                                "J050 E050 H058 H056 C058 A050 D052 E051 Q006 D030 B030 E030 Q003 X030 Q072 Q008", 
                                "D052 E030 E051 J053 E013 H056 L050 D030 H030 C058 O030 F030 F052 E050 F036 D003 E007"
), id = 1:10), row.names = c(NA, -10L), class = c("tbl_df", "tbl", 
                                                  "data.frame"))

df2 <- structure(list(sequence = c("B030 D030 E013 A006 A050", "B030 D030 E013 A006 E007", 
                                    "B030 D030 E013 A014 A050", "B030 D030 E013 A014 E007", "C058 B030 E013 A006 A050", 
                                    "C058 B030 E013 A006 E007", "C058 B030 E013 A014 A050", "C058 B030 E013 A014 E007", 
                                    "C058 B030 D030 E013 A006", "C058 B030 D030 E013 A014")), class = c("tbl_df", 
                                                                                                        "tbl", "data.frame"), row.names = c(NA, -10L))

function

LCS_score <- function(code, sequence){
  seq_str <- unlist(strsplit(sequence, " "))
  code_str <- unlist(strsplit(code, " "))
  code_subset <- code_str[code_str %in% seq_str]
  sequence_subset <- seq_str[seq_str %in% code_str]
  overlap <- sequence_subset == code_subset
  if(length(overlap) == 0){
    score <- 0
  } 
  else{
    score <- sum(overlap) + 1
  }
  if(score/length(seq_str) > 0.79){
    count <- 1
  } else{
    count <- 0
  }
  return(count)
}

I'm not sure if the function will work within a vectorized approach. I've tested it with individual inputs (e.g. a single code and single sequence) and in that context it works. Based on putting together a for loop for this dataset, I'm expecting df2 to have the following output, with the sequence column being the actual sequence string not a row number.

| Sequence | Count |
| -------- | ----- |
|        1 | 0 |
|        2 | 1 |
|        3 | 1 |
|        4 | 2 |
|        5 | 0|
|        6 | 4|
|        7 | 1|
|        8 | 5|
|        9 | 0|
|       10 | 2|

Thank you for any advice/suggestions/solutions!

1

There are 1 best solutions below

0
PhDavey On

For completion, I solved this by creating an additional function so that I could use sapply multiple times.

Function 1

LCS_score <- function(code, sequence){
  seq_str <- unlist(strsplit(sequence, " "))
  code_str <- unlist(strsplit(code, " "))
  code_subset <- code_str[code_str %in% seq_str]
  sequence_subset <- seq_str[seq_str %in% code_str]
  overlap <- sequence_subset == code_subset
  if(length(overlap) == 0){
    score <- 0
  } 
  else{
    score <- sum(overlap) + 1
  }
  if(score/length(seq_str) > 0.79){
    count <- 1
  } else{
    count <- 0
  }
  return(count)
}

Function 2

LCS_tally <- function(sequence, code){
  scores <- sapply(code, LCS_score, sequence)
  total <- sum(unlist(scores))
}

Application

sequence_scores <- sapply(df2$sequence, LCS_tally, code = df1$code, simplify = FALSE)

Output

 sequence_scores
$`B030 D030 E013 A006 A050`
[1] 0

$`B030 D030 E013 A006 E007`
[1] 1

$`B030 D030 E013 A014 A050`
[1] 1

$`B030 D030 E013 A014 E007`
[1] 2

$`C058 B030 E013 A006 A050`
[1] 0

$`C058 B030 E013 A006 E007`
[1] 4

$`C058 B030 E013 A014 A050`
[1] 1

$`C058 B030 E013 A014 E007`
[1] 5

$`C058 B030 D030 E013 A006`
[1] 0

$`C058 B030 D030 E013 A014`
[1] 2

Still room in my opinion to optimize/confirm the function is producing the correct output. Will update as I go along.