Comparing groups when variable has multiple occurences per case

76 Views Asked by At

First, I admit, I've asked this elsewhere (Crossvalidated), but I guess it is not necessarily the same people reading both forums.

I'm finishing a manuscript, but have one nagging problem which I'm unable to find an answer.

Study has two groups which receive different procedure for a disease. After the procedure, one patient may face multiple different adverse events (complications). I want to test if there is a difference in total number of adverse events between groups.

Obviously it would be all too easy with a simple single categorical variable, but I'm in doubt if using chi-squared or Fisher's test is the right approach. If I'm not mistaken, their presumptions differ as this is not simple on/off per case-situation.

One idea I got was to create a continuous dummy variable for total number of complications per case and compare those with t-test or something, but it still answers a bit different question imho.

The exact problem without reported answer can be found for example from the CHOCOLATE-trial (https://www.bmj.com/content/363/bmj.k3965.long) table 3, last 3 rows.

Here is a reproducible code of the end result.

# Values are strored in vectors where each level means if a complication occured. Level 0 means there was none
# All variables have similar levels
# If variable compl_1 would have level 0 (there was no complication on that case), 
# variables compl_2 and compl_3 would also have in real dataset.
# This is not true for the following reproducible example,
# but it hopefully still delineates the problem.
# Here we have a sample size of 200 cases, 100 for each group

data <- data.frame(group = as.factor(ifelse(runif(200) > 0.5, 0, 1)),
                   compl_1 = c(rep(0, 140), rep(1, 40), rep(2, 14), rep(3,4), rep(4,2)), 
                   compl_2 = c(rep(0, 182), rep(1, 10), rep(2, 7), rep(3,1), rep(4,0)), 
                   compl_3 = c(rep(0, 187), rep(1, 5), rep(2, 7), rep(3,1), rep(4,0)))

# Then we count totals for each complication type by pivoting long, doing the count and pivoting back
table <- data |> 
  pivot_longer(compl_1:compl_3) |> 
  count(group, value) |> 
  pivot_wider(names_from = group, values_from = n, names_sort = T, values_fill = 0)

# Of course count for level 0 is bonkers as there were 100 cases per group
# So this is fixed by replacing it with the count of 0-levels in compl_1
no_complication <- data |> 
  dplyr::select(group, compl_1) |> 
  count(group, compl_1) |> 
  pivot_wider(names_from = group, values_from = n, names_sort = T, values_fill = 0) |> 
  slice_head(n=1)
table[1,] <- no_complication[1,]

# Next we add a row that contains the total number of complication (occurences of levels 1-4 for compl-variables)
complications_total <- table |> 
  slice(-1) |> 
  mutate(`0` = sum(`0`, na.rm = T),
         `1` = sum(`1`, na.rm = T)) |> 
  slice(1)

#To make end result clearer, let's add some names to cases
complications_total$value <- factor(complications_total$value, labels = "Total complications")
table$value <- factor(table$value, labels = c("No complication",
                                              "Head cut",
                                              "Foot off",
                                              "Hand torn",
                                              "Eyes blown"))

#Then we add the row of totals and give names to operations
table <- table |> 
  add_row(complications_total[1,]) |> 
  arrange(desc(`0`)) |> 
  rename(Henry = `0`,
         Martin = `1`)

So in other words, I want to add a new column with p-values and with the correct statistical test. Comparing group "No complications" is obviously easy, but for others I can't come up with anything.

Edit: After some extra thought, I think I've found a solution. Was probably overthinking this. I'll just take case based mean of each complication and compare those. I guess this is just a variant of what Brian suggested, so I'll flag that as the solution. Thanks for both!

2

There are 2 best solutions below

2
Brian Syzdek On BEST ANSWER

As done in the paper, you could calculate the relative risk of a complication in the groups if you just wish to collapse all complications.

library(tidyverse)
library(epitools)
rr_res <- table %>% # From your final 'table' data above
  # Create contingency table, collapsing all complications into one group
  mutate(value = factor(ifelse(value == "No complication", "No complication", "Complication"),
                        # Order for risk of complication
                        levels = c("No complication", "Complication"))) %>% 
  group_by(value) %>% 
  summarise(Henry = sum(Henry), Martin = sum(Martin)) %>% 
  select(-value) %>% 
  as.matrix() %>% 
  t() %>% # Conditions in rows, outcome in columns
  riskratio.wald()
rr_res$measure[2,] %>% round(2)

produces risk ratio with 95%CI of being in Martin group and developing complication.

estimate    lower    upper 
    0.82     0.67     0.99 

You might also produce odds-ratio with test.

1
Rui Barradas On

First, I will rewrite the code to table the values. It seems that you are still getting values that add up to greater than 200 cases.

suppressPackageStartupMessages({
  library(tidyverse)
  library(lme4)
})

set.seed(2023)

data <- data.frame(
  group = as.integer(runif(200) > 0.5),
  compl_1 = c(rep(0, 140), rep(1, 40), rep(2, 14), rep(3,4), rep(4,2)), 
  compl_2 = c(rep(0, 182), rep(1, 10), rep(2, 7), rep(3,1), rep(4,0)), 
  compl_3 = c(rep(0, 187), rep(1, 5), rep(2, 7), rep(3,1), rep(4,0))
)

lbl_group <- c("Henry", "Martin")
lbl_complication <- c(
  "No complication", "Head cut", "Foot off", "Hand torn", "Eyes blown"
  # , "Total complications"
)

tbl <- data %>%
  mutate(group = factor(group, labels = lbl_group)) %>%
  table() %>%
  apply(1:2, sum) %>%
  t() %>%
  as.data.frame() %>%
  rownames_to_column(var = "Complication") %>%
  mutate(Complication = factor(Complication, labels = lbl_complication)) %>%
  arrange(desc(Henry))

tbl
#>      Complication Henry Martin
#> 1 No complication    66     74
#> 2        Head cut    26     14
#> 3        Foot off     7      7
#> 4       Hand torn     4      0
#> 5      Eyes blown     2      0

colSums(tbl[-1L])
#> Henry Martin 
#>   105     95 

Created on 2023-12-07


As for the question, if you want to analyze differences in complications by individual, an option is a mixed effects model with complication as a regressor and individuals, Person in the code below, as a random effect.
The result of ranef tells you that there are no such differences.
I also include confidence intervals for the fixed effect Complication.

tbl2 <- tbl %>%
  pivot_longer(-Complication, names_to = "Person", values_to = "Count")

mm_person <- lmer(Count ~ Complication + (1|Person), data = tbl2)
#> boundary (singular) fit: see help('isSingular')

# commented out, would clutter the answer's output
# summary(mm_person)
# the following instruction is off-topic
# fixef(mm_person)

# there are no random effects by person
ranef(mm_person)
#> $Person
#>        (Intercept)
#> Henry            0
#> Martin           0
#> 
#> with conditional variances for "Person"

# given the above result, the first 2 rows are
# not interesting, keep the others only
confint(mm_person)
#> Computing profile confidence intervals ...
#>                             2.5 %     97.5 %
#> .sig01                   0.000000   4.627884
#> .sigma                   2.303055   5.638249
#> (Intercept)             64.828860  75.170921
#> ComplicationHead cut   -57.307046 -42.692953
#> ComplicationFoot off   -70.307046 -55.692953
#> ComplicationHand torn  -75.307046 -60.692953
#> ComplicationEyes blown -76.307046 -61.692953

Created on 2023-12-07