Does the "MIA" option in partykit::ctree() do anything, and--if so--how can we tell from the output?

Recently I've been trying to use the "partykit" R package and have found it difficult to understand how the ctree() function handles missing values. There is an option named MIA with somewhat cryptic documentation.

The documentation for ctree_control() describes it as follows:

MIA: a logical determining the treatment of NA as a category in split, see Twala et al. (2008).

The method is described pretty clearly in Section 2 of the reference paper. Basically, when determining the splits that can be made on a variable X by considering a subset Y of its values, the candidate splits all take one of the following forms:

  1. (X in Y, or X missing) vs. (X not in Y)
  2. (X in Y) vs. (X not in Y, or X missing)
  3. (X missing) vs. (X not missing)
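
To make the three forms concrete, here is a minimal base R sketch of how I understand them. The helper mia_candidates() is hypothetical (it is my own illustration, not a partykit function), and it only builds the three partitions for one given subset Y; it does not evaluate or choose among them.

```r
# Hypothetical helper (not part of partykit): builds the three MIA
# candidate partitions for a predictor x and a subset Y of its values.
# TRUE means "goes to the left branch".
mia_candidates <- function(x, Y) {
  in_Y <- x %in% Y   # FALSE for NA, because NA never matches a value in Y
  miss <- is.na(x)
  list(
    na_with_Y     = in_Y | miss,  # 1. (X in Y, or X missing) vs. (X not in Y)
    na_with_not_Y = in_Y,         # 2. (X in Y) vs. (X not in Y, or X missing)
    na_alone      = miss          # 3. (X missing) vs. (X not missing)
  )
}

x_demo <- factor(c("A", "B", NA, "A", NA, "B"))
cands <- mia_candidates(x_demo, Y = "A")

# Cross-tabulate each candidate split against x, keeping NA visible
lapply(cands, function(left) table(x = x_demo, left = left, useNA = "ifany"))
```

In each table the NA row shows which side of the candidate split the missing values would fall on.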

However, when I run ctree() for data where the predictor has missing values, the tree output doesn't seem to reflect this at all. The split descriptions don't mention anything about missing values. To illustrate, I created example data where we predict an outcome variable y using a binary predictor, x, with values "A" and "B". Some of the observations have missing values for x.

Below is a small reproducible example:

library(partykit)

# Generate example data
# First 50 observations have high outcome values,
# last 50 observations have low outcome values
  y <- c(rep(11, times = 49), 6,
         rep(1,  times = 49), 6)
  x <- c(rep('A', times = 48), c('B', 'A'),
         rep('B', times = 48), c('A', 'B'))

  example_data <- data.frame(x = factor(x), y)

# Set some values missing
  example_data[['x']][c(51, 52)] <- NA
  
# Fit the tree
  ctree(
    formula = y ~ x,
    data = example_data,
    control = ctree_control(
      MIA = TRUE, majority = FALSE,
      maxsurrogate = 0
    )
  )
#> 
#> Model formula:
#> y ~ x
#> 
#> Fitted party:
#> [1] root
#> |   [2] x in B: 1.300 (n = 50, err = 120.5)
#> |   [3] x in A: 10.700 (n = 50, err = 120.5)
#> 
#> Number of inner nodes:    1
#> Number of terminal nodes: 2

From this output, I can't tell that the MIA = TRUE option did anything. It's not clear if missingness was taken into account when forming splits and--if so--which part of the split gets assigned the missing values.
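
The only check I can think of is to look at node membership directly instead of relying on the printed tree. The sketch below refits the same model, stores it as fit (a name I introduce here), and uses predict() with type = "node" to see which terminal node the observations with missing x end up in. I am assuming this is a reasonable way to observe the routing of NA values; it does not by itself show whether MIA or some other rule made the assignment.

```r
library(partykit)

# Same data as in the example above
y <- c(rep(11, times = 49), 6,
       rep(1,  times = 49), 6)
x <- c(rep("A", times = 48), "B", "A",
       rep("B", times = 48), "A", "B")
example_data <- data.frame(x = factor(x), y)
example_data[["x"]][c(51, 52)] <- NA

fit <- ctree(
  formula = y ~ x,
  data = example_data,
  control = ctree_control(MIA = TRUE, majority = FALSE, maxsurrogate = 0)
)

# Terminal node id for each training observation
node_id <- predict(fit, type = "node")

# Cross-tabulate node membership against x, keeping NA as its own column
table(node = node_id, x = example_data$x, useNA = "ifany")

# Terminal node assigned to a new observation whose x is missing
new_obs <- data.frame(x = factor(NA, levels = levels(example_data$x)))
predict(fit, newdata = new_obs, type = "node")
```

If the missing values are being routed at random rather than by the split itself, I would expect repeated calls to the last line to give different nodes, so it may be worth running it several times.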

Should I be doing something differently to get the missingness included in the splits and reported on in the ctree() output? Is this a bug? What's going on?
