How to a create recursive variable per group in data frame?

143 Views Asked by At

I have a data set with multiple observations of a clinical outcome for each patient. The length between these time points is variable. I want to create a "roving baseline score" variable for the clinical outcome, that changes every time an increase or a decrease of the score is confirmed at the first time point that is at least 90 days ahead.

I have created an example data frame with observations for two patients (patient 1 has 6 observations, patient 2 only has 2 observations):

library(tidyverse)

df <- tibble(
patID  = c("1", "1", "1", "1", "1", "1", "2", "2", "3", "3", "3", "3", "3", "3"),     
time_point = c("1", "2", "3", "4", "5", "6", "1", "2", "1", "2", "3", "4", "5", "6"),
date = as.Date(c("2020-01-01", "2020-05-01", "2020-06-01", "2020-09-01", "2021-01-01", "2021-05-01", "2020-01-01", "2020-05-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-06-01", "2020-10-01", "2021-04-01")),
score = c("300", "100", "100", "100", "200", "200", "600", "400", "300", "200", "200", "100", "400", "300"))

The vector with the baseline score should result in the following:

baseline = c("300", "300", "300", "100", "100", "200", "600", "600", "300", "300", "300", "200", "200", "300") 
  • Explanation: the score of patient 1 decreases at time point 2 from 300 to 100. This decrease can be confirmed at time point 4, since this is the closest time point that is at least 90 days ahead. Thus, the new baseline at time point 4 is 100. At time point 6, an increase in the score (with respect to the latest baseline) is confirmed and we obtain a new baseline of 200. Patient 2 only has two time points and therefore a change in the baseline cannot be confirmed.

I have tried creating the variable using a for loop, but I don't know how to make it such that it is done for each patient separately.

EDIT: I tried to write the code as following using sapply, but I get the error "argument is of length 0"

df_baseline <- df %>%
group_by(patID) %>%
mutate(baseline = lag(score, default = first(score)),
     closest_index_90 = sapply(1:n(), function(i) {
       valid_indices <- which(date[i]+90 <= date[(i+1):n()])
       if (length(valid_indices)>0) {return(first(valid_indices) + i)} 
       else {return(NA)}
     }),
     baseline = sapply(1:n(), function(i) {
       if (score[i:closest_index_90[i]] > baseline[i - 1] | score[i:closest_index_90[i]] < baseline[i - 1]) {return(score[which.min(abs(score[i:closest_index_90[i]]-baseline[i-1]))])}
     else {return(baseline[i-1])}}))

I am relatively new to R so my skills with for loops and sapply functions are limited. I have looked through many questions asked on here but none seemed to answer my question specifically. Any help would be greatly appreciated!

2

There are 2 best solutions below

0
Evy On BEST ANSWER

Found a solution using purrr::accumulate2

df %>%
group_by(patID) %>%
mutate(
closest_index_90 = sapply(1:n(), function(i) {
   valid_indices <- which(date[i]+90 <= date[(i+1):n()])
   if (length(valid_indices)>0) {return(first(valid_indices) + i)} 
   else {return(NA)}
 }),
baseline = accumulate2(
   time_point,
   replace_na(closest_index_90, max(i)),
   \(b, i, r) {
   if (b>max(score[i:r])) {max(score[i:r]}
   else if (b<min(score[i:r])) {min(score[i:r]}
   else b
   }, .init = score[[1]]
   )[-1])
3
Carl On

It's a different approach using a rolling join to find the closest match with a gap of at least 90 days. Needs testing with more data, but see if this helps.

I get 200 for the last row as neither the increase to 400 nor the decrease to 300 have been confirmed.

library(tidyverse)
library(slider)

df <- tribble(
  ~patID, ~tmp,         ~date, ~score, ~desired,
     1,       1, "2020-01-01",    300,      300,
     1,       2, "2020-05-01",    100,      300,
     1,       3, "2020-06-01",    100,      300,
     1,       4, "2020-09-01",    100,      100,
     1,       5, "2021-01-01",    200,      100,
     1,       6, "2021-05-01",    200,      200,
     2,       1, "2020-01-01",    600,      600,
     2,       2, "2020-05-01",    400,      600,
     3,       1, "2020-01-01",    300,      300,
     3,       2, "2020-02-01",    200,      300,
     3,       3, "2020-03-01",    200,      300,
     3,       4, "2020-06-01",    100,      200,
     3,       5, "2020-10-01",    400,      200,
     3,       6, "2021-04-01",    300,      300
  ) |> 
  mutate(date = ymd(date))

change_df <- df |>
  mutate(
    change = if_else((score < lag(score) &
      slide_dbl(lead(score), \(.x) min(.x, na.rm = TRUE), .after = Inf) <= score) |
      (score > lag(score) &
         slide_dbl(lead(score), \(.x) min(.x, na.rm = TRUE), .after = Inf) >= score),
    score, NA
    ),
    conf_after = if_else(score != lag(score), date + days(90), NA),
    .by = patID
  ) |>
  select(patID, conf_after, change) |>
  drop_na()

df |>
  left_join(change_df, join_by(patID, closest(date >= conf_after))) |>
  mutate(
    start = if_else(date == first(date), score, NA),
    baseline = coalesce(start, change),
    .by = patID
  ) |>
  group_by(patID) |>
  fill(baseline) |>
  ungroup() |>
  mutate(check = baseline == desired) |> # to check outcome
  relocate(desired, .before = baseline)  # to check outcome
#> # A tibble: 14 × 10
#>    patID   tmp date       score conf_after change start desired baseline check
#>    <dbl> <dbl> <date>     <dbl> <date>      <dbl> <dbl>   <dbl>    <dbl> <lgl>
#>  1     1     1 2020-01-01   300 NA             NA   300     300      300 TRUE 
#>  2     1     2 2020-05-01   100 NA             NA    NA     300      300 TRUE 
#>  3     1     3 2020-06-01   100 NA             NA    NA     300      300 TRUE 
#>  4     1     4 2020-09-01   100 2020-07-30    100    NA     100      100 TRUE 
#>  5     1     5 2021-01-01   200 2020-07-30    100    NA     100      100 TRUE 
#>  6     1     6 2021-05-01   200 2021-04-01    200    NA     200      200 TRUE 
#>  7     2     1 2020-01-01   600 NA             NA   600     600      600 TRUE 
#>  8     2     2 2020-05-01   400 NA             NA    NA     600      600 TRUE 
#>  9     3     1 2020-01-01   300 NA             NA   300     300      300 TRUE 
#> 10     3     2 2020-02-01   200 NA             NA    NA     300      300 TRUE 
#> 11     3     3 2020-03-01   200 NA             NA    NA     300      300 TRUE 
#> 12     3     4 2020-06-01   100 2020-05-01    200    NA     200      200 TRUE 
#> 13     3     5 2020-10-01   400 2020-05-01    200    NA     200      200 TRUE 
#> 14     3     6 2021-04-01   300 2020-05-01    200    NA     300      200 FALSE

Created on 2024-02-29 with reprex v2.1.0