I'm using the R package mice for multiple imputation. I only need to impute a subset of my variables, but I'd like to use the full data set as predictors when imputing them. So I'm using the where argument to specify which values (and variables) to impute. However, when I set where, many (sometimes all) of the values flagged in the where matrix remain NA, whereas if I leave where unset, all missing values are imputed. I don't understand why.
I'm using R 4.2.3 and mice 3.16.0. First, generate random data with missing values:
library(mice)
set.seed(2024)
df <- matrix(data = sample(100, 100, replace = TRUE), ncol = 5)
df[df > 80] <- NA
> df
      [,1] [,2] [,3] [,4] [,5]
 [1,]   66    1   80   43   78
 [2,]   37   75   NA   NA   16
 [3,]   45   35   67   58   36
 [4,]   60   NA   25   49   46
 [5,]   17   28   34    5   NA
 [6,]   32   48   NA   80   72
 [7,]   NA   NA   57    6   73
 [8,]   29   NA   49    3   20
 [9,]   11   NA   54   25   18
[10,]   16   NA   NA   33   34
[11,]   29   43   49   75   19
[12,]   62   58    3   NA   12
[13,]   14   20   25   30   NA
[14,]   34   55   65   NA   66
[15,]   26   NA    6   73   16
[16,]   44   60   20   61   45
[17,]   50    4    9    3   62
[18,]   26    7   79   29   70
[19,]   32   52   77   20   68
[20,]   NA   57   80   61   31
The where matrix specifies that I only want to impute the missing values in the second variable:
where.mat <- matrix(FALSE, nrow = nrow(df), ncol = ncol(df))
where.mat[, 2] <- is.na(df[, 2])
> where.mat
       [,1]  [,2]  [,3]  [,4]  [,5]
 [1,] FALSE FALSE FALSE FALSE FALSE
 [2,] FALSE FALSE FALSE FALSE FALSE
 [3,] FALSE FALSE FALSE FALSE FALSE
 [4,] FALSE  TRUE FALSE FALSE FALSE
 [5,] FALSE FALSE FALSE FALSE FALSE
 [6,] FALSE FALSE FALSE FALSE FALSE
 [7,] FALSE  TRUE FALSE FALSE FALSE
 [8,] FALSE  TRUE FALSE FALSE FALSE
 [9,] FALSE  TRUE FALSE FALSE FALSE
[10,] FALSE  TRUE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE FALSE
[15,] FALSE  TRUE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE FALSE
[19,] FALSE FALSE FALSE FALSE FALSE
[20,] FALSE FALSE FALSE FALSE FALSE
Imputation using the where matrix:
test1 <- mice(data = df,
              m = 1,
              ignore = NULL,
              maxit = 1,
              where = where.mat,
              method = "pmm")
imp_dat <- complete(test1, action = 1, include = FALSE)
Not all of the flagged values are imputed; two NAs remain in the second column:
> sum(is.na(imp_dat[,2]))
[1] 2
> imp_dat[,2]
[1] 1 75 35 43 28 48 NA 43 60 NA 43 58 20 55 43 60 4 7 52 57
Run again, now without the where matrix:
test2 <- mice(data = df,
              m = 1,
              ignore = NULL,
              maxit = 1,
              # where = where.mat,
              method = "pmm")
imp_dat2 <- complete(test2, action = 1, include = FALSE)
Now every missing value in the second column is imputed:
> sum(is.na(imp_dat2[,2]))
[1] 0
> imp_dat2[,2]
[1] 1 75 35 7 28 48 7 43 43 43 43 58 20 55 20 60 4 7 52 57
Why doesn't it work with the where matrix? I've checked that the predictorMatrix looks correct, and it doesn't appear to be a collinearity issue, since the same data is imputed without problems in the second example.
> test1$predictorMatrix
   V1 V2 V3 V4 V5
V1  0  1  1  1  1
V2  1  0  1  1  1
V3  1  1  0  1  1
V4  1  1  1  0  1
V5  1  1  1  1  0
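One further check that might help narrow this down (a sketch in base R only; target_na, predictor_na and suspect_rows are names made up for illustration): see whether the rows that stay NA in column 2 are the ones that also have a missing value in one of the other columns. Those other columns are used as predictors for column 2, but with this where matrix they are not imputed themselves, so this could plausibly be related. That is a guess, not something I have confirmed.

```r
# Rebuild the example data (same seed and steps as above)
set.seed(2024)
df <- matrix(sample(100, 100, replace = TRUE), ncol = 5)
df[df > 80] <- NA

# Rows where the target column (2) is missing
target_na <- is.na(df[, 2])

# Rows where at least one of the other columns (the predictors) is missing
predictor_na <- rowSums(is.na(df[, -2, drop = FALSE])) > 0

# Rows flagged for imputation whose predictors are themselves incomplete
suspect_rows <- which(target_na & predictor_na)
suspect_rows
```

For the matrix printed above, this picks out rows 7 and 10 (row 7 is also missing column 1, row 10 is also missing column 3), which are the same two positions that remain NA in imp_dat[,2].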