I have a longitudinal dataset for a randomised controlled trial, where there are two arms (arm), and an outcome (Y) assessed at three timepoints (period). The first period is 0, and is the baseline score, recorded before randomisation.
I have simulated a dataset with:
library(simstudy)
set.seed(420)
data <- defData(varname = "arm", dist = "binary", formula = 0.5)
data <- defData(data, varname = "Y0", dist = "normal", formula = 10, variance = 2)
data <- defData(data, varname = "Y1", dist = "normal", formula = "Y0 + 5 + 5 * arm", variance = 2)
data <- defData(data, varname = "Y2", dist = "normal", formula = "Y0 + 10 + 5 * arm", variance = 2)
data <- genData(500, data)
data <- addPeriods(data, nPeriods = 3, idvars = "id", timevars = c("Y0", "Y1", "Y2"), timevarName = "Y")
data <- data %>%
mutate(
arm = as.factor(arm),
period = as.factor(period)
)
The summary stats for the score at each timepoint and within each arm are:
data %>%
group_by(arm, period) %>%
summarise(mean = mean(Y))
# A tibble: 6 × 3
# Groups: arm [2]
arm period mean
<fct> <fct> <dbl>
1 0 0 10.1
2 0 1 14.9
3 0 2 20.2
4 1 0 10.2
5 1 1 20.2
6 1 2 25.2
I wish to model the mean difference at each timepoint using a constrained baseline model (https://bmjopen.bmj.com/content/6/12/e013096). Simply, this fits a regression model with a fixed effect for time period and the interaction between time period and treatment group (no main effect for treatment). This has the effect of 'constraining' the baseline, or assuming that it is from the same distribution (a fair assumption in most RCTs). Including the main effect allows the baseline to differ between the groups, which is not desired in this case.
I have attempted to fit this model using the lme4 package:
model1 <- lmer(Y ~ period + period:arm + (1 | id), data = data)
However, when fitting the emmeans, I have noticed that the baseline scored (which should be identical) are instead equal to the group specific means:
emmeans(model1, ~ arm | period)
period = 0:
arm emmean SE df lower.CL upper.CL
0 10.1 0.113 943 9.83 10.3
1 10.2 0.113 943 9.97 10.4
period = 1:
arm emmean SE df lower.CL upper.CL
0 14.9 0.113 943 14.71 15.1
1 20.2 0.113 943 20.01 20.5
period = 2:
arm emmean SE df lower.CL upper.CL
0 20.2 0.113 943 19.94 20.4
1 25.2 0.113 943 25.01 25.5
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
This has happened because the formula that was specified in the model has expanded out to the interactions at each timepoint (including baseline):
colnames(model.matrix(model1))
[1] "(Intercept)" "period1" "period2" "period0:arm1" "period1:arm1" "period2:arm1"
I was expecting to see:
[1] "(Intercept)" "period1" "period2" "period1:arm1" "period2:arm1"
(with no period0:arm1 term).
My question is: how can I modify this model so that it is actually a constrained baseline model?
Your
periodvariable is a factor, so when you model it inlme4it is dummy-coded, thus why interactions are included forperiod0in addition toperiod1andperiod2. As far as I'm aware, you can't asklme4to only include certain values of a variable. You would have to cull your data to exclude period0 and run on just that data for your baseline model.In this case, when you run
colnames(model.matrix(baseline_model))you won't haveperiod1in the list because the intercept isperiod1(where before the intercept wasperiod0).