Why does lm() with the subset argument give a different answer than subsetting in advance?

Question

Why does lm() with the subset argument give a different answer than subsetting in advance?

300 Views Asked by itsMeInMiami At 26 December 2021 at 23:43

I am using lm() on a training set of data that includes a polynomial. When I subset in advance with [ ] I get different coefficients compared to using the subset argument in the lm() function call. Why?

library(ISLR2)

set.seed (1)
train <- sample(392, 196)

auto_train <- Auto[train,]


lm.fit.data <- lm(mpg ~ poly(horsepower, 2), data = auto_train)
summary(lm.fit.data)
#> 
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2), data = auto_train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -12.8711  -2.6655  -0.0096   2.0806  16.1063 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)           23.8745     0.3171  75.298  < 2e-16 ***
#> poly(horsepower, 2)1 -89.3337     4.4389 -20.125  < 2e-16 ***
#> poly(horsepower, 2)2  33.2985     4.4389   7.501 2.25e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.439 on 193 degrees of freedom
#> Multiple R-squared:  0.705,  Adjusted R-squared:  0.702 
#> F-statistic: 230.6 on 2 and 193 DF,  p-value: < 2.2e-16


lm.fit.subset <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
summary(lm.fit.subset)
#> 
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2), data = Auto, subset = train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -12.8711  -2.6655  -0.0096   2.0806  16.1063 
#> 
#> Coefficients:
#>                       Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)            23.5496     0.3175  74.182  < 2e-16 ***
#> poly(horsepower, 2)1 -123.5881     6.4587 -19.135  < 2e-16 ***
#> poly(horsepower, 2)2   47.7189     6.3613   7.501 2.25e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.439 on 193 degrees of freedom
#> Multiple R-squared:  0.705,  Adjusted R-squared:  0.702 
#> F-statistic: 230.6 on 2 and 193 DF,  p-value: < 2.2e-16

^{Created on 2021-12-26 by the reprex package (v2.0.1)}

Original Q&A

There are 2 best solutions below

Jon Manese On 27 December 2021 at 00:24

In your second call it looks like poly() is computed first before subsetting. Compare the outputs of model.frame() below:

# first call
model.frame(mpg ~ poly(horsepower, 2), data = auto_train)[1:5,]
#>      mpg poly(horsepower, 2).1 poly(horsepower, 2).2
#> 326 44.3         -0.1037171808          0.1498371034
#> 169 23.0         -0.0372467155         -0.0099055358
#> 131 26.0         -0.0429441840         -0.0000530004
#> 301 23.9         -0.0239526225         -0.0300950106
#> 272 23.2          0.0045347198         -0.0601592336

# second call
model.frame(mpg ~ poly(horsepower, 2), data = Auto, subset = train)[1:5,]
#>      mpg poly(horsepower, 2).1 poly(horsepower, 2).2
#> 326 44.3         -0.0741931315          0.1133792778
#> 169 23.0         -0.0282078693         -0.0034299423
#> 131 26.0         -0.0321494632          0.0039029222
#> 301 23.9         -0.0190108168         -0.0185862638
#> 272 23.2          0.0006971527         -0.0418538153

# same as
model.frame(mpg ~ poly(horsepower, 2), data = Auto)[train,][1:5,]
#>      mpg poly(horsepower, 2).1 poly(horsepower, 2).2
#> 326 44.3         -0.0741931315          0.1133792778
#> 169 23.0         -0.0282078693         -0.0034299423
#> 131 26.0         -0.0321494632          0.0039029222
#> 301 23.9         -0.0190108168         -0.0185862638
#> 272 23.2          0.0006971527         -0.0418538153

**Ben Bolker** · Accepted Answer · 2021-12-27T00:45:03.017000

tl;dr As suggested in other comments and answers, the characteristics of the orthogonal polynomial basis are computed before the subsetting is taken into account.

To add more technical detail to @JonManes's answer, let's look at lines 545-553 of the R code where 'model.frame' is defined.

First we have (lines 545-549)

 if(is.null(attr(formula, "predvars"))) {
        for (i in seq_along(varnames))
            predvars[[i+1L]] <- makepredictcall(variables[[i]], vars[[i+1L]])
        attr(formula, "predvars") <- predvars
    }

At this point in the code, formula will not be an actual formula (that would be too easy!), but rather a terms object that contains various useful-to-developers info about model structures ...
predvars is the attribute that defines the information needed to properly reconstruct data-dependent bases like orthogonal polynomials and splines (see ?makepredictcall for a little bit more information, or here, although in general this stuff is really poorly documented; I'd expect it to be documented here but it isn't ...). For example,

attr(terms(model.frame(mpg ~ poly(horsepower, 2), data = auto_train)),  "predvars")

gives

list(mpg, poly(horsepower, 2, coefs = list(alpha = c(102.612244897959, 
142.498828460405), norm2 = c(1, 196, 277254.530612245, 625100662.205702
))))

These are the coefficients for the polynomial, which depend on the distribution of the input data.

Only after this information has been established, on line 553, do we get

subset <- eval(substitute(subset), data, env)

In other words, the subsetting argument doesn't even get evaluated until after the polynomial characteristics are determined (all of this information is then passed to the internal C_modelframe function, which you really don't want to look at ...)

Note that this issue does not result in an information leak between training and testing sets in a statistical learning context: the parameterization of the polynomial doesn't affect the predictions of the model at all (in theory, although as usual with floating point the results are unlikely to be exactly identical). At worst (if the training and full sets were very different) it could reduce numerical stability a bit.

FWIW this is all surprising (to me) and seems worth raising on the [email protected] mailing list (at least a note in the documentation seems in order).

Why does lm() with the subset argument give a different answer than subsetting in advance?

There are 2 best solutions below

Related Questions in R

Related Questions in LM

Related Questions in POLY

Trending Questions

Popular # Hahtags

Popular Questions