I have values of 2 biomarkers ('analyte'), measured with 2 different 'methods' over 2 different periods ('old' methods over 4 days from 2022-06-03 to 2022-06-06, and 'new' methods over 7 days from 2023-06-01 to 2023-06-07).
Question: How to count exactly the same (maximal) number of values for 'old' and 'new' methods in a new column, by analyte, starting from the earliest common period, considering month-day regardless of the year?
For example, in the desired output data below, the new 'count' column shows:
n=16 values for old and new methods of the first analyte 'chol', and
n=18 values for old and new methods of the second analyte 'ldh'.
Input data:
dat0 <-
structure(list(analyte = structure(c(3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L), .Label = c("amy", "auu", "chol", "ggt",
"iron", "ldh", "pal", "prot", "trig", "uree", "ureeu", "uric"
), class = "factor"), method = structure(c(2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("new", "old"), class = "factor"),
date = structure(c(1654214400, 1654214400, 1654214400, 1654214400,
1654214400, 1654300800, 1654300800, 1654300800, 1654300800,
1654387200, 1654387200, 1654387200, 1654473600, 1654473600,
1654473600, 1654473600, 1685577600, 1685577600, 1685577600,
1685577600, 1685577600, 1685577600, 1685664000, 1685664000,
1685664000, 1685664000, 1685664000, 1685664000, 1685750400,
1685750400, 1685750400, 1685750400, 1685750400, 1685836800,
1685836800, 1685836800, 1685836800, 1685923200, 1685923200,
1685923200, 1685923200, 1685923200, 1685923200, 1686009600,
1686009600, 1686009600, 1686009600, 1686009600, 1686009600,
1686096000, 1686096000, 1686096000, 1686096000, 1686096000,
1686096000, 1654214400, 1654214400, 1654214400, 1654214400,
1654214400, 1654300800, 1654300800, 1654300800, 1654300800,
1654300800, 1654300800, 1654387200, 1654387200, 1654387200,
1654387200, 1654387200, 1654473600, 1654473600, 1654473600,
1654473600, 1654473600, 1654473600, 1654473600, 1654473600,
1685577600, 1685577600, 1685577600, 1685577600, 1685577600,
1685664000, 1685664000, 1685664000, 1685664000, 1685664000,
1685750400, 1685750400, 1685750400, 1685836800, 1685836800,
1685836800, 1685923200, 1685923200, 1685923200, 1685923200,
1686009600, 1686009600, 1686009600, 1686009600, 1686096000,
1686096000, 1686096000, 1686096000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), value = c(2.84, 2.79, 2.82, 2.84, 2.84, 2.81,
2.8, 2.83, 2.82, 2.82, 2.8, 2.82, 2.81, 2.79, 2.82, 2.84,
2.74, 2.76, 2.77, 2.79, 2.76, 2.81, 2.73, 2.75, 2.77, 2.78,
2.76, 2.86, 2.73, 2.78, 2.84, 2.82, 2.8, 2.74, 2.76, 2.79,
2.82, 2.8, 2.76, 2.77, 2.83, 2.85, 2.83, 2.78, 2.74, 2.8,
2.8, 2.79, 2.85, 2.75, 2.76, 2.78, 2.82, 2.76, 2.86, 121,
119, 113, 117, 119, 123, 120, 117, 118, 125, 121, 121, 124,
123, 117, 115, 113, 120, 113, 113, 115, 117, 119, 115, 111,
111, 113, 116, 115, 111, 115, 112, 118, 115, 113, 106, 118,
113, 113, 113, 110, 114, 113, 117, 111, 111, 115, 120, 112,
113, 114, 108)), row.names = c(NA, -107L), class = c("tbl_df",
"tbl", "data.frame"))
Desired output data:
dat1 <-
structure(list(analyte = c("chol", "chol", "chol", "chol", "chol",
"chol", "chol", "chol", "chol", "chol", "chol", "chol", "chol",
"chol", "chol", "chol", "chol", "chol", "chol", "chol", "chol",
"chol", "chol", "chol", "chol", "chol", "chol", "chol", "chol",
"chol", "chol", "chol", "chol", "chol", "chol", "chol", "chol",
"chol", "chol", "chol", "chol", "chol", "chol", "chol", "chol",
"chol", "chol", "chol", "chol", "chol", "chol", "chol", "chol",
"chol", "chol", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh",
"ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh",
"ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh",
"ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh",
"ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh",
"ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh", "ldh"
), method = c("old", "old", "old", "old", "old", "old", "old",
"old", "old", "old", "old", "old", "old", "old", "old", "old",
"new", "new", "new", "new", "new", "new", "new", "new", "new",
"new", "new", "new", "new", "new", "new", "new", "new", "new",
"new", "new", "new", "new", "new", "new", "new", "new", "new",
"new", "new", "new", "new", "new", "new", "new", "new", "new",
"new", "new", "new", "old", "old", "old", "old", "old", "old",
"old", "old", "old", "old", "old", "old", "old", "old", "old",
"old", "old", "old", "old", "old", "old", "old", "old", "old",
"new", "new", "new", "new", "new", "new", "new", "new", "new",
"new", "new", "new", "new", "new", "new", "new", "new", "new",
"new", "new", "new", "new", "new", "new", "new", "new", "new",
"new"), date = structure(c(1654214400, 1654214400, 1654214400,
1654214400, 1654214400, 1654300800, 1654300800, 1654300800, 1654300800,
1654387200, 1654387200, 1654387200, 1654473600, 1654473600, 1654473600,
1654473600, 1685577600, 1685577600, 1685577600, 1685577600, 1685577600,
1685577600, 1685664000, 1685664000, 1685664000, 1685664000, 1685664000,
1685664000, 1685750400, 1685750400, 1685750400, 1685750400, 1685750400,
1685836800, 1685836800, 1685836800, 1685836800, 1685923200, 1685923200,
1685923200, 1685923200, 1685923200, 1685923200, 1686009600, 1686009600,
1686009600, 1686009600, 1686009600, 1686009600, 1686096000, 1686096000,
1686096000, 1686096000, 1686096000, 1686096000, 1654214400, 1654214400,
1654214400, 1654214400, 1654214400, 1654300800, 1654300800, 1654300800,
1654300800, 1654300800, 1654300800, 1654387200, 1654387200, 1654387200,
1654387200, 1654387200, 1654473600, 1654473600, 1654473600, 1654473600,
1654473600, 1654473600, 1654473600, 1654473600, 1685577600, 1685577600,
1685577600, 1685577600, 1685577600, 1685664000, 1685664000, 1685664000,
1685664000, 1685664000, 1685750400, 1685750400, 1685750400, 1685836800,
1685836800, 1685836800, 1685923200, 1685923200, 1685923200, 1685923200,
1686009600, 1686009600, 1686009600, 1686009600, 1686096000, 1686096000,
1686096000, 1686096000), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
value = c(2.84, 2.79, 2.82, 2.84, 2.84, 2.81, 2.8, 2.83,
2.82, 2.82, 2.8, 2.82, 2.81, 2.79, 2.82, 2.84, 2.74, 2.76,
2.77, 2.79, 2.76, 2.81, 2.73, 2.75, 2.77, 2.78, 2.76, 2.86,
2.73, 2.78, 2.84, 2.82, 2.8, 2.74, 2.76, 2.79, 2.82, 2.8,
2.76, 2.77, 2.83, 2.85, 2.83, 2.78, 2.74, 2.8, 2.8, 2.79,
2.85, 2.75, 2.76, 2.78, 2.82, 2.76, 2.86, 121, 119, 113,
117, 119, 123, 120, 117, 118, 125, 121, 121, 124, 123, 117,
115, 113, 120, 113, 113, 115, 117, 119, 115, 111, 111, 113,
116, 115, 111, 115, 112, 118, 115, 113, 106, 118, 113, 113,
113, 110, 114, 113, 117, 111, 111, 115, 120, 112, 113, 114,
108), count = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -107L))
Thanks for help
I'm not 100% sure, but I think this might help in moving forward. I get a bit of a difference in output, but it may be justifiable. This assumes that sequences will not wrap around years or have more complex occurrences.
First, you will need to separate your date to consider
monthanddaywithoutyear. You can sort orarrangethe data for eachanalyteusingmonthandday. The first day of the year where there is overlap (both "new" and "old" available for a given day andanalyte) will be indicated withTRUEin a new column calledoverlap.A
countwill be computed for eachmethodandanalytewhere there isoverlap.Next part, you can
arrangebyanalyteandcount, andfilteror keep rows where there are both methods available for a givencountandanalyte.I believe the difference in output is that the following only includes exactly 14 tests for "ldh" for both methods, even though one ends on June 5 and the other June 6. If you wanted them to continue with the same end date, this could be modified.
Output