I have a data.frame in R, which consists of about 100k rows. I am using the following example data.frame to illustrate my problem. Imagine here that ID represents a patient, date is the date of testing for a disease, A indicates a positive test from test-type A, and B is a similar indicator but from test-type B. Every row will have at least one positive test and the dataset is sorted by ID and date.
data <- data.frame(ID=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
date=as.Date(c("2022-01-01","2022-01-05","2022-02-17","2022-05-13",
"2022-09-20","2021-02-02","2021-02-06","2021-04-18",
"2021-04-19","2022-08-21","2020-01-01","2020-03-29",
"2020-04-03","2020-04-04","2022-01-01")),
A=c(1,1,0,1,0,0,0,1,0,1,0,0,0,0,0),
B=c(0,0,1,0,1,1,1,0,1,0,1,1,1,1,1))
data
ID date A B
1 1 2022-01-01 1 0
2 1 2022-01-05 1 0
3 1 2022-02-17 0 1
4 1 2022-05-13 1 0
5 1 2022-09-20 0 1
6 2 2021-02-02 0 1
7 2 2021-02-06 0 1
8 2 2021-04-18 1 0
9 2 2021-04-19 0 1
10 2 2022-08-21 1 0
11 3 2020-01-01 0 1
12 3 2020-03-29 0 1
13 3 2020-04-03 0 1
14 3 2020-04-04 0 1
15 3 2022-01-01 0 1
I have the following rule to determine dates/counts (infec_date and n_infec) of consecutive infections for a patient:
- The first
datevalue for a patient will be their initial infection (infec_date=dateandn_infec=1). - If
A==1anddateis 45 days or more after a previous infection, that is a new infection (add 1 ton_infecand assigndateas the newinfec_date). - Otherwise, if
B==1anddateis 90 days or more after a previous infection, that is also a new infection (take the same actions as in 2. above). - If no criteria for infection are met, then the most recent
infec_date/n_infecare carried forward.
My output dataset using this rule will need to look like:
ID date A B infec_date n_infec
1 1 2022-01-01 1 0 2022-01-01 1
2 1 2022-01-05 1 0 2022-01-01 1
3 1 2022-02-17 0 1 2022-01-01 1
4 1 2022-05-13 1 0 2022-05-13 2
5 1 2022-09-20 0 1 2022-09-20 3
6 2 2021-02-02 0 1 2021-02-02 1
7 2 2021-02-06 0 1 2021-02-02 1
8 2 2021-04-18 1 0 2021-04-18 2
9 2 2021-04-19 0 1 2021-04-18 2
10 2 2022-08-21 1 0 2022-08-21 3
11 3 2020-01-01 0 1 2020-01-01 1
12 3 2020-03-29 0 1 2020-01-01 1
13 3 2020-04-03 0 1 2020-04-03 2
14 3 2020-04-04 0 1 2020-04-03 2
15 3 2022-01-01 0 1 2022-01-01 3
MY GOAL: Is to find a way to do this that is faster than having to iterate through each row and check a string of logic to set values manually. I would like to know if there is any functionality for ease/speed of this type of programming using packages dplyr, tidyr, data.table, or sqldf.
My current for loop I've been using in R is:
for(i in 1:nrow(data)){
if(i==1){
data[i,"infec_date"]=data[i,"date"]
data[i,"n_infec"]=1
}else if(data[i,"ID"]!=data[i-1,"ID"]){
data[i,"infec_date"]=data[i,"date"]
data[i,"n_infec"]=1
}else{
if(data[i,"A"]==1&data[i,"date"]>=data[i-1,"infec_date"]+45){
data[i,"infec_date"]=data[i,"date"]
data[i,"n_infec"]=data[i-1,"n_infec"]+1
}else if(data[i,"B"]==1&data[i,"date"]>=(data[i-1,"infec_date"]+90)){
data[i,"infec_date"]=data[i,"date"]
data[i,"n_infec"]=data[i-1,"n_infec"]+1
}else{
data[i,"infec_date"]=data[i-1,"infec_date"]
data[i,"n_infec"]=data[i-1,"n_infec"]
}
}
}
This gets slow and takes forever to run when dealing with 100k rows of data. I don't have access to SAS, but programming this in SAS would look like:
data new_data;
set data;
by id date;
length infec_date n_infec 8.;
format infec_date mmddyy10.;
retain infec_date n_infec;
if first.id then do;
infec_date=date;
n_infec=1;
end;
if A=1 and date>=infec_date+45 then do;
infec_date=date;
n_infec=n_infec+1;
end;
else if B=1 and date>=infec_date+90 then do;
infec_date=date;
n_infec=n_infec+1;
end;
run;
Thanks in advance!
Thanks for the additional information about data, helpful to know more of the limitations. Below answer is still essentially looping through each row of each participant, but it is a bit more optimized than the for-loop/nested if-else you have listed in the question. I optimized by
Splitting the dataframe up into lists by participant; iterating through lists is faster and this removes the need to track if participant session is "first"/duplicate
Vectorized and preallocated a column that the new date would go into rather than adding to and expanding with each iteration
Used ifelse in a user-defined function that took vector inputs, this required some data modification of making a "group" that contained whether the date was either an A or a B.
More ways to optimize here
For the code, I first pivoted longer to make A and B be in the same column and dropped the rows where, eg., B = 0, as these weren't relevant
Next wrote a user-defined function that will take in objects of the date, group, and running_infec. These will be pieces of information available in the loop run below.
Next is to loop through each list. For each list, create or predefine the new date column as the first date listed. This takes care of when it is a new participant as there is only one participant per list.
For each row of the list, objects are saved and used in the function
condition_codeAfter loop, bind the list back together for your dataframe, group by ID, and perform similar count of infection/new infection as shown with first answer. Adding the pivot_wider back in to get data output to match what you have shown in picture above but that may not be necessary.
[PREVIOUS ANSWER BUT OP COMMENT CLARIFIED THAT BELOW SOLUTION INCOMPLETE--IGNORE BELOW]