R, Assign a group ID to combinations of values in another observation ID variable

Question

R, Assign a group ID to combinations of values in another observation ID variable

148 Views Asked by a123 At 19 January 2024 at 10:55

Assume that I have several years of personal data, in which a Personal ID (indiv) identifies individuals across years, but the household ID (househ) only assigns household relation within a year.

Example:

libary(dplyr)
test_01 <- data.frame(indiv=c(1,2,3,4,5,6),househ=c(1,1,2,3,4,4),time=rep(1,6)) #1&2 form a household, 3 and 4 are single, 5&6 form a household
test_02 <- data.frame(indiv=c(2,3,4,5,6,7),househ=c(1,2,2,3,3,4),time=rep(2,6)) #1 exits, so 2 is now a new household, 3&4 now form a new household, 5&6 still do, 7 enters
test_03 <- data.frame(indiv=c(2,3,4,5,7,8,9,10),househ=c(1,2,2,3,4,5,5,6),time=rep(3,8)) #according to logic above
data_test_panel <- bind_rows(test_01,test_02,test_03)

The desired time-consistent household variable would be:

data_test_panel$true_household <- c(1,1,2,3,4,4,5,6,6,4,4,7,5,6,6,8,7,9,9,10)

So far I tried:

library(data.table)
setDT(data_test_panel)[,cons_household := .GRP,.(time,househ)] # where cons_household is the new household ID. However, this doesn't give the same household ID across time but assigns new values for every appearance of a household.

Thank you kindly for all your help! /Severin

Original Q&A

There are 2 best solutions below

Tobo On 20 January 2024 at 08:11

A `data.table` approach that is reasonably nice

As the previous answer shows, data.table's handling of list columns doesn't lend itself to the dplyr-style solution in which the list column of members per annual household is broadcast to all rows of the data, and then grouped-by to create the IDs.

However, we can use the following: (1) Collapse (instead of broadcasting) to a list column with a single row per household/year. (2) Create the ID as the index of each annual household in a list of unique households from all years. (3) Uncollapse the list column.

(data_test_panel
  [, .(hh_indivs = list(indiv)), by=.(househ, time)                # collapse
   ][, hh_id := match(hh_indivs, unique(hh_indivs))                # index
    ][, .(indiv = unlist(hh_indivs), hh_id), by=.(househ, time)])  # uncollapse

This avoids the complicated code to mimic the broadcasted list column, and the problematic use of toString() to enable a groupby.

**SamR** · Accepted Answer · 2024-01-19T11:20:24.803000

Create a list column of unique individuals in a given household at each time period. We can then take the group number by that column.

`dplyr` approach

This is much nicer than the data.table approach:

library(dplyr)
data_test_panel |>
    mutate(
        indiv_in_household = list(indiv),
        .by = c(househ, time)
    ) |>
    mutate(
        cons_household = cur_group_id(),
        .by = indiv_in_household
    )

Note this uses per-operation grouping with the .by parameter which requires at least dplyr v1.1.0.

`data.table` approach

The syntax for this is much more complicated because data.table will automatically unroll your list column into a vector if you try to make it list(indiv) by group.

data_test_panel[, indiv_in_household := .SD[
    , .(split(
        rep(indiv, length(indiv)),
        rep(seq(length(indiv)), each = length(indiv))
    )),
    .(househ, time)
]$V1]

head(data_test_panel)
#    indiv househ  time true_household indiv_in_household
#    <num>  <num> <num>          <num>             <list>
# 1:     1      1     1              1                1,2
# 2:     2      1     1              1                1,2
# 3:     3      2     1              2                  3
# 4:     4      3     1              3                  4
# 5:     5      4     1              4                5,6
# 6:     6      4     1              4                5,6

Also data.table doesn't support grouping by list columns so we need to use toString(), which is OK in this case but in general is not ideal as you have to think through the format of the column to ensure it will not somehow cause ambiguity leading two distinct groups to be merged.

data_test_panel[, cons_household := .GRP, sapply(indiv_in_household, toString)]

data_test_panel[cons_household != true_household]
# Empty data.table (0 rows and 6 cols): indiv,househ,time,true_household,cons_household,indiv_in_household

R, Assign a group ID to combinations of values in another observation ID variable

There are 2 best solutions below

`dplyr` approach

`data.table` approach

A `data.table` approach that is reasonably nice

Related Questions in R

Related Questions in GROUPING

Related Questions in IDS

Trending Questions

Popular # Hahtags

Popular Questions

R, Assign a group ID to combinations of values in another observation ID variable

There are 2 best solutions below

dplyr approach

data.table approach

A data.table approach that is reasonably nice

Related Questions in R

Related Questions in GROUPING

Related Questions in IDS

Trending Questions

Popular # Hahtags

Popular Questions

`dplyr` approach

`data.table` approach

A `data.table` approach that is reasonably nice