creating new columns for unique events and then counting events in R by group

Question

creating new columns for unique events and then counting events in R by group

89 Views Asked by Nick Masto At 09 February 2023 at 19:42

A subset of my data looks like this but with many more groupings (IDs):

 ID                          time                class    
   <chr>                       <dttm>              <fct>    
 1 BBR-b172021-M_fall_winter_4 2022-11-01 19:03:31 migrating
 2 BBR-b172021-M_fall_winter_4 2022-11-04 22:03:33 migrating 
 3 BBR-b172021-M_fall_winter_4 2022-11-07 18:03:34 migrating 
 4 BBR-b172021-M_fall_winter_4 2022-11-08 21:03:34 stopover 
 5 BBR-b172021-M_fall_winter_4 2022-11-10 21:03:39 stopover 
 6 BBR-b172021-M_fall_winter_4 2022-11-14 18:03:37 migrating 
 7 BBR-b172021-M_fall_winter_4 2022-11-17 06:04:08 migrating 
 8 BBR-b172021-M_fall_winter_4 2022-11-18 06:04:08 stopover 
 9 BBR-b172021-M_fall_winter_4 2022-11-19 00:03:41 winter 
10 BBR-b172021-M_fall_winter_4 2022-11-27 00:03:51 winter 
11 LINWR-b1282020-M_fall_winter_3 2022-01-14 11:00:08 migrating
12 LINWR-b1282020-M_fall_winter_3 2022-01-15 13:59:45 stopover
13 LINWR-b1282020-M_fall_winter_3 2022-01-20 02:59:54 stopover
14 LINWR-b1282020-M_fall_winter_3 2022-01-21 03:00:14 migrating
15 LINWR-b1282020-M_fall_winter_3 2022-01-21 16:59:47 stopover
16 LINWR-b1282020-M_fall_winter_3 2022-01-22 16:59:45 winter

I am trying to create unique columns either through mapping or group_by and mutate but I don't know where to begin. I would like several new columns describing unique sequential events, their sum, and their duration. New columns added to the dataframe I would suspect would look something like this:


newcols <- data.frame(unique_class = c("migrating1", "migrating1", "migrating1", "stopover1", 
                                       "stopover1", "migrating2", "migrating2", "stopover2", 
                                       "winter1", "winter1", "migrating1", "stopover1", 
                                       "stopover1", "migrating2", "stopover2", "winter1"),
                      migrate_sum = c(2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
                      stopover_sum = c(2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
                      winter_sum = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
                      event_duration = c(6,6,6,2,2,3,3,0,8,8,0,5,5,0,0,0))

...where event_duration column would equate to time in days or hours. I know I need to group_by(ID) but and mutate() but not sure how to get the unique classes or lagged duration of times for each class. Any help appreciated.

NOT SURE WHERE TO PUT THIS SO EDITING MY QUESTION: I TRIED @AKRUN SOLUTION BUT IT DIDN'T QUITE WORK. IT PRODUCED THE UNIQUE_CLASS WELL BUT SUMMARIES ARE NOT INCORRECT. HERE'S AN EXAMPLE OF A DATAFRAME PRODUCED USING SOLUTION BELOW AND SUBSET BY UNIQUE ID: fall_mig2 %>% filter(BirdsID_season == "BBR-b432021-M_fall_winter_4") %>% select(BirdsID_season, x, y, time, unique_class, class, stopover_sum) slice_head <- fall_mig2 %>% filter(BirdsID_season == "BBR-b432021-M_fall_winter_4") %>% slice_head(n = 10) slice_tail <- fall_mig2 %>% filter(BirdsID_season == "BBR-b432021-M_fall_winter_4") %>% slice_tail(n = 10) bind_rows(slice_head, slice_tail) %>% select(BirdsID_season, x, y, time, stopover_sum) and the result:

 BirdsID_season                  x     y time                unique_class class     stopover_sum
   <chr>                       <dbl> <dbl> <dttm>              <chr>        <chr>            <int>
 1 BBR-b432021-M_fall_winter_4 -99.2  48.1 2022-11-09 19:09:01 migrating1   migrating            3
 2 BBR-b432021-M_fall_winter_4 -99.2  48.1 2022-11-09 21:08:36 migrating1   migrating            3
 3 BBR-b432021-M_fall_winter_4 -99.2  48.1 2022-11-09 23:08:55 migrating1   migrating            3
 4 BBR-b432021-M_fall_winter_4 -99.2  48.1 2022-11-10 01:09:11 migrating1   migrating            3
 5 BBR-b432021-M_fall_winter_4 -99.2  48.1 2022-11-10 03:08:50 migrating1   migrating            3
 6 BBR-b432021-M_fall_winter_4 -99.2  48.1 2022-11-10 05:09:06 migrating1   migrating            3
 7 BBR-b432021-M_fall_winter_4 -99.2  48.1 2022-11-10 07:08:43 migrating1   migrating            3
 8 BBR-b432021-M_fall_winter_4 -99.2  48.1 2022-11-10 09:08:54 migrating1   migrating            3
 9 BBR-b432021-M_fall_winter_4 -99.2  48.1 2022-11-10 11:09:07 migrating1   migrating            3
10 BBR-b432021-M_fall_winter_4 -99.2  48.1 2022-11-10 13:08:39 migrating1   migrating            3
11 BBR-b432021-M_fall_winter_4 -89.3  36.7 2022-12-13 23:08:30 winter1      winter               1
12 BBR-b432021-M_fall_winter_4 -89.3  36.7 2022-12-14 01:08:45 winter1      winter               1
13 BBR-b432021-M_fall_winter_4 -89.3  36.7 2022-12-14 03:08:45 winter1      winter               1
14 BBR-b432021-M_fall_winter_4 -89.3  36.7 2022-12-14 05:08:26 winter1      winter               1
15 BBR-b432021-M_fall_winter_4 -89.3  36.7 2022-12-14 07:08:22 winter1      winter               1
16 BBR-b432021-M_fall_winter_4 -89.3  36.7 2022-12-14 09:08:45 winter1      winter               1
17 BBR-b432021-M_fall_winter_4 -89.3  36.7 2022-12-14 11:08:54 winter1      winter               1
18 BBR-b432021-M_fall_winter_4 -89.3  36.7 2022-12-14 13:08:19 winter1      winter               1
19 BBR-b432021-M_fall_winter_4 -89.3  36.7 2022-12-14 15:08:47 winter1      winter               1
20 BBR-b432021-M_fall_winter_4 -89.4  36.7 2022-12-14 17:08:19 winter1      winter               1

stopover_sum should be 1 (which is in the middle of the subsetted df) . I'm not sure where the 3 is coming from. Trying to dissect the solution now.

Original Q&A

There are 2 best solutions below

Nick Masto On 10 February 2023 at 22:56

Again thank you @akrun. My question/intended output was a poorly worded; you're solution did exactly what I asked. I should have specified that I wanted to retain the entire dataset. To do so, I adjusted your solution with mutate instead of transmute and added duration columns, using the same pivot_wider procedure. I duplicated class and event_duration columns so I could retain them. Admittedly, quite clumsy but effective nonetheless. Thanks again. Below should be the final solution:

df <- df %>%
  mutate(grp = data.table::rleid(class), 
         date = as.Date(ymd_hms(time))) %>% 
  group_by(ID, class) %>% 
  mutate(count = n_distinct(grp), 
         unique_class = str_c(class, match(grp, unique(grp)))) %>% 
  group_by(ID, unique_class) %>% 
  mutate(event_duration = difftime(max(time), min(time), units = "days")) %>% 
  ungroup() %>% 
  mutate(class = str_c(class, '_sum')) %>% 
  pivot_wider(names_from = class, values_from = count) %>% 
  relocate(event_duration, .after = last_col()) %>%
  mutate(class = str_sub(unique_class, start = 1, end = -2),
         class2 = class,
         event_duration2 = event_duration) %>%
  pivot_wider(names_from = class2, values_from = event_duration2) %>%
  mutate(mig_dur = migrating,
         stop_dur = stopover,
         winter_dur = winter) %>%
  dplyr::select(-migrating, -stopover, -winter)

**akrun** · Accepted Answer · 2023-02-09T21:00:07.347000

We may create a run-length-id column grouped by 'Class', convert the 'time' to Date class, then grouped by 'ID', 'class', get the number of distinct (n_distinct) elements in 'grp', as well as the unique_class is created by pasteing the 'class' with the unique 'grp' indexes. Do a second grouping by 'ID', 'unique_class' to calculate the 'event_duration' ie. the number of days between the max/min 'date' values, select the columns of interest, reshape to 'wide' with pivot_wider and fill the values in the _sum to previous non-NA values

library(dplyr)
library(lubridate)
library(tidyr)
library(stringr)
library(data.table)
df1 %>% 
   mutate(grp = rleid(class), date = as.Date(ymd_hms(time))) %>% 
   group_by(ID, class) %>% 
   mutate(Count = n_distinct(grp), 
     unique_class = str_c(class, match(grp, unique(grp)))) %>% 
   group_by(ID, unique_class) %>% 
   mutate(event_duration = as.integer(max(date) - min(date))) %>% 
   ungroup %>% 
   transmute(rn = row_number(), class = str_c(class, '_sum'),
    Count, unique_class, event_duration) %>% 
   pivot_wider(names_from = class, values_from = Count) %>% 
   fill(ends_with("_sum"), .direction = "downup") %>% 
   select(-rn) %>%
   relocate(event_duration, .after = last_col())

-output

# A tibble: 16 × 5
   unique_class migrating_sum stopover_sum winter_sum event_duration
   <chr>                <int>        <int>      <int>          <int>
 1 migrating1               2            2          1              6
 2 migrating1               2            2          1              6
 3 migrating1               2            2          1              6
 4 stopover1                2            2          1              2
 5 stopover1                2            2          1              2
 6 migrating2               2            2          1              3
 7 migrating2               2            2          1              3
 8 stopover2                2            2          1              0
 9 winter1                  2            2          1              8
10 winter1                  2            2          1              8
11 migrating1               2            2          1              0
12 stopover1                2            2          1              5
13 stopover1                2            2          1              5
14 migrating2               2            2          1              0
15 stopover2                2            2          1              0
16 winter1                  2            2          1              0

data

df1 <- structure(list(ID = c("BBR-b172021-M_fall_winter_4",
 "BBR-b172021-M_fall_winter_4", 
"BBR-b172021-M_fall_winter_4", "BBR-b172021-M_fall_winter_4", 
"BBR-b172021-M_fall_winter_4", "BBR-b172021-M_fall_winter_4", 
"BBR-b172021-M_fall_winter_4", "BBR-b172021-M_fall_winter_4", 
"BBR-b172021-M_fall_winter_4", "BBR-b172021-M_fall_winter_4", 
"LINWR-b1282020-M_fall_winter_3", "LINWR-b1282020-M_fall_winter_3", 
"LINWR-b1282020-M_fall_winter_3", "LINWR-b1282020-M_fall_winter_3", 
"LINWR-b1282020-M_fall_winter_3", "LINWR-b1282020-M_fall_winter_3"
), time = c("2022-11-01 19:03:31", "2022-11-04 22:03:33", "2022-11-07 18:03:34", 
"2022-11-08 21:03:34", "2022-11-10 21:03:39", "2022-11-14 18:03:37", 
"2022-11-17 06:04:08", "2022-11-18 06:04:08", "2022-11-19 00:03:41", 
"2022-11-27 00:03:51", "2022-01-14 11:00:08", "2022-01-15 13:59:45", 
"2022-01-20 02:59:54", "2022-01-21 03:00:14", "2022-01-21 16:59:47", 
"2022-01-22 16:59:45"), class = c("migrating", "migrating", "migrating", 
"stopover", "stopover", "migrating", "migrating", "stopover", 
"winter", "winter", "migrating", "stopover", "stopover", "migrating", 
"stopover", "winter")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16"))

creating new columns for unique events and then counting events in R by group

There are 2 best solutions below

data

Related Questions in R

Related Questions in TIME-SERIES

Related Questions in UNIQUEIDENTIFIER

Related Questions in DATA-MANAGEMENT

Trending Questions

Popular # Hahtags

Popular Questions