R: Faster way to create a variable using earlier assigned values (SAS retain-type programming)

276 Views Asked by At

I have a data.frame in R, which consists of about 100k rows. I am using the following example data.frame to illustrate my problem. Imagine here that ID represents a patient, date is the date of testing for a disease, A indicates a positive test from test-type A, and B is a similar indicator but from test-type B. Every row will have at least one positive test and the dataset is sorted by ID and date.

data <- data.frame(ID=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
                   date=as.Date(c("2022-01-01","2022-01-05","2022-02-17","2022-05-13",
                                "2022-09-20","2021-02-02","2021-02-06","2021-04-18",
                                "2021-04-19","2022-08-21","2020-01-01","2020-03-29",
                                "2020-04-03","2020-04-04","2022-01-01")),
                   A=c(1,1,0,1,0,0,0,1,0,1,0,0,0,0,0),
                   B=c(0,0,1,0,1,1,1,0,1,0,1,1,1,1,1))

data

   ID       date A B
1   1 2022-01-01 1 0
2   1 2022-01-05 1 0
3   1 2022-02-17 0 1
4   1 2022-05-13 1 0
5   1 2022-09-20 0 1
6   2 2021-02-02 0 1
7   2 2021-02-06 0 1
8   2 2021-04-18 1 0
9   2 2021-04-19 0 1
10  2 2022-08-21 1 0
11  3 2020-01-01 0 1
12  3 2020-03-29 0 1
13  3 2020-04-03 0 1
14  3 2020-04-04 0 1
15  3 2022-01-01 0 1

I have the following rule to determine dates/counts (infec_date and n_infec) of consecutive infections for a patient:

  1. The first date value for a patient will be their initial infection (infec_date=date and n_infec=1).
  2. If A==1 and date is 45 days or more after a previous infection, that is a new infection (add 1 to n_infec and assign date as the new infec_date).
  3. Otherwise, if B==1 and date is 90 days or more after a previous infection, that is also a new infection (take the same actions as in 2. above).
  4. If no criteria for infection are met, then the most recent infec_date/n_infec are carried forward.

My output dataset using this rule will need to look like:

   ID       date A B infec_date n_infec
1   1 2022-01-01 1 0 2022-01-01       1
2   1 2022-01-05 1 0 2022-01-01       1
3   1 2022-02-17 0 1 2022-01-01       1
4   1 2022-05-13 1 0 2022-05-13       2
5   1 2022-09-20 0 1 2022-09-20       3
6   2 2021-02-02 0 1 2021-02-02       1
7   2 2021-02-06 0 1 2021-02-02       1
8   2 2021-04-18 1 0 2021-04-18       2
9   2 2021-04-19 0 1 2021-04-18       2
10  2 2022-08-21 1 0 2022-08-21       3
11  3 2020-01-01 0 1 2020-01-01       1
12  3 2020-03-29 0 1 2020-01-01       1
13  3 2020-04-03 0 1 2020-04-03       2
14  3 2020-04-04 0 1 2020-04-03       2
15  3 2022-01-01 0 1 2022-01-01       3

MY GOAL: Is to find a way to do this that is faster than having to iterate through each row and check a string of logic to set values manually. I would like to know if there is any functionality for ease/speed of this type of programming using packages dplyr, tidyr, data.table, or sqldf.

My current for loop I've been using in R is:

for(i in 1:nrow(data)){
  if(i==1){
    data[i,"infec_date"]=data[i,"date"]
    data[i,"n_infec"]=1
  }else if(data[i,"ID"]!=data[i-1,"ID"]){
    data[i,"infec_date"]=data[i,"date"]
    data[i,"n_infec"]=1
  }else{
    if(data[i,"A"]==1&data[i,"date"]>=data[i-1,"infec_date"]+45){
      data[i,"infec_date"]=data[i,"date"]
      data[i,"n_infec"]=data[i-1,"n_infec"]+1
    }else if(data[i,"B"]==1&data[i,"date"]>=(data[i-1,"infec_date"]+90)){
      data[i,"infec_date"]=data[i,"date"]
      data[i,"n_infec"]=data[i-1,"n_infec"]+1
    }else{
      data[i,"infec_date"]=data[i-1,"infec_date"]
      data[i,"n_infec"]=data[i-1,"n_infec"]
    }
  }
}

This gets slow and takes forever to run when dealing with 100k rows of data. I don't have access to SAS, but programming this in SAS would look like:

data new_data;
    set data;
    by id date;
    length infec_date n_infec 8.;
    format infec_date mmddyy10.;
    retain infec_date n_infec;
    if first.id then do;
        infec_date=date;
        n_infec=1;
        end;
    if A=1 and date>=infec_date+45 then do;
        infec_date=date;
        n_infec=n_infec+1;
        end;
    else if B=1 and date>=infec_date+90 then do;
        infec_date=date;
        n_infec=n_infec+1;
        end;
run;

Thanks in advance!

1

There are 1 best solutions below

1
Jonni On

Thanks for the additional information about data, helpful to know more of the limitations. Below answer is still essentially looping through each row of each participant, but it is a bit more optimized than the for-loop/nested if-else you have listed in the question. I optimized by

  1. Splitting the dataframe up into lists by participant; iterating through lists is faster and this removes the need to track if participant session is "first"/duplicate

  2. Vectorized and preallocated a column that the new date would go into rather than adding to and expanding with each iteration

  3. Used ifelse in a user-defined function that took vector inputs, this required some data modification of making a "group" that contained whether the date was either an A or a B.

More ways to optimize here

For the code, I first pivoted longer to make A and B be in the same column and dropped the rows where, eg., B = 0, as these weren't relevant

data <- data %>%  
    pivot_longer(cols = c("A","B")) %>% 
    filter(value != 0) %>% 
    rename("group" = "name") #renamed to group for ease in explanation

#Select only variables of interest
df_list <- data %>% 
    select(ID, date,group)

#Make this into list
df_list <- split(df_list,df_list$ID)

Next wrote a user-defined function that will take in objects of the date, group, and running_infec. These will be pieces of information available in the loop run below.

condition_code <- function(date,group,running_infec){
    if(group == "A"){
        ifelse(date - running_infec < 45, running_infec, date)
    }else{
        ifelse(date - running_infec < 90, running_infec, date)
    }
}

Next is to loop through each list. For each list, create or predefine the new date column as the first date listed. This takes care of when it is a new participant as there is only one participant per list.

For each row of the list, objects are saved and used in the function condition_code

# Loop through each list in your series of lists
for (i in seq_along(df_list)) {
    # Get the data frame within the current list
    df <- df_list[[i]]
    
    # Create a new column to store the coded dates
    df$infec_date <- df$date[1]
    # Creates initial
    running_infec <- df$date[1]
    # Loop through each row of the data frame
    for (j in 1:nrow(df)) {
        # Get the date and group for the current row
        date <- df$date[j]
        group <- df$group[j]
        
        #Assigns value to infec_date
        infec_date <- as.Date(condition_code(date, group, running_infec), origin = "1970-01-01")
        # Save the new date in the new column of the data frame
        df$infec_date[j] <- infec_date
        #update running_infec value
        running_infec <- infec_date
    }
    
    # Replace the original data frame in the current list with the updated data frame
    df_list[[i]] <- df
}

After loop, bind the list back together for your dataframe, group by ID, and perform similar count of infection/new infection as shown with first answer. Adding the pivot_wider back in to get data output to match what you have shown in picture above but that may not be necessary.

#Get running count, pivot wider back to requested output
df_list %>% 
    bind_rows() %>%
    group_by(ID) %>% 
    mutate(running_count = case_when(infec_date == date ~ 1,
                                     TRUE ~ 0)) %>% 
    mutate(running_count = cumsum(running_count)) %>% 
    pivot_wider(names_from = group, values_from = group, values_fn = ~1, values_fill = 0) %>% 
    select(ID, date, A, B, infec_date, running_count)

enter image description here

[PREVIOUS ANSWER BUT OP COMMENT CLARIFIED THAT BELOW SOLUTION INCOMPLETE--IGNORE BELOW]

data %>% 
    group_by(ID) %>% 
    mutate(dup = duplicated(ID)) %>% 
    mutate(infection = case_when(A == 1 & (date - lag(date)) >=45 ~ 1,
                                 B == 1 & (date - lag(date)) >=90 ~ 1,
                                 dup == FALSE ~ 1,
                                 TRUE ~ 0)) %>% 
    mutate(infec_date = case_when(infection == 1 ~ date,
                                      infection == 0 & lag(infection) == 1 ~ lag(date),
                                      infection == 0 & lag(infection) == 0 ~ lag(date,n=2), 
                                      TRUE ~ NA)) %>% 
    mutate(n_infec = cumsum(infection)) %>% 
    select(-dup,-infection) %>% 
    ungroup()