Restricting k in generalized additive models


I am using mgcv::gam to fit generalized additive models to ecological data. Following advice in various books, tutorials, and posts here, I have been setting k as large as possible. However, some of the resulting smooths are extremely wiggly, which makes them very difficult to interpret in a biological/ecological context.

I read a paper that uses a similar approach and states that because "the resulting smooth is completely data-driven and may not be biologically/ecologically sensible (i.e., excessive wiggliness without a reasonable mechanistic interpretation)", they restricted k to 5 "as an additional step to avoid overfitting."

Is setting k to a low value (e.g. 3-5) in this way to limit wiggliness a valid/defensible approach? For example, if biologically we expect relationships to look either linear or like second-order polynomials, would manually selecting a low k across all covariates in a model be defensible? And what about cases where we have no expectation of how a predictor might relate to the response variable?

Answer by Gavin Simpson:

This is really a statistical question and better suited to CrossValidated, but...

I don't know where you got the impression that setting k to be "as large as possible" was recommended, but that is not a good strategy for selecting k in general.

The general strategy should be to set k to be as large as needed to create a basis rich enough that the true (but unknown) function or a close approximation to that true function is representable by the basis.

If you know the effect is linear (I'd then ask "How do you know it is linear?", but that's a separate issue), then you shouldn't be fitting a smooth function for that effect; use your knowledge and fit the linear term. If you think the effect of a covariate on the response is not linear, but is smooth and of low degree (low EDF), then use that knowledge; many effects in ecology will be representable by smooths with low EDF. Even something as simple as a logistic curve, however, needs more EDF than you might think (we have an example in our training materials where k needs to be ≈ 10 to pass the heuristic check in mgcv::k.check()).
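A minimal sketch of this point, using simulated data (the variable names and the steepness of the curve are illustrative, not from the original example): even a smooth, simple-looking logistic curve can use more basis functions than a k = 5 basis provides.

```r
library(mgcv)

set.seed(42)
n <- 500
x <- runif(n)
# True effect: a logistic (sigmoid) curve; simple-looking but not low-EDF
f <- 5 * plogis(20 * (x - 0.5))
y <- f + rnorm(n, sd = 0.5)

# Fit the same smooth with a small and a larger basis
m_small <- gam(y ~ s(x, k = 5),  method = "REML")
m_large <- gam(y ~ s(x, k = 10), method = "REML")

# Compare the effective degrees of freedom each fit actually uses;
# with k = 5 the EDF is capped at 4, so the fit cannot use more even
# if the data call for it
summary(m_small)$edf
summary(m_large)$edf
```

If the EDF with the larger basis comes out well above what the small basis can supply, that is a sign the k = 5 fit was constrained rather than penalized to that smoothness.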

This is why blanket statements such as the one you quote about ecological realism are less than useful. The general point is well made; biological/ecological plausibility should be considered when fitting smooths. But a basis with 4 basis functions is not very rich at all, and is prone to underfitting (bias), which can then cause problems for some of the theory used to derive the credible intervals we place around estimated functions, the p-values shown in the summary() output, etc.

If you choose to set k low in the manner you describe, then you do need to check the assumption that your basis was sufficiently large, via k.check() or by modelling the deviance residuals.
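A short sketch of that check, again on simulated data (mgcv's gamSim() is used here only to generate an example data set; the choice of x2 and k = 5 is illustrative):

```r
library(mgcv)

set.seed(1)
# Simulate a standard mgcv test data set with nonlinear effects
dat <- gamSim(1, n = 400, dist = "normal", verbose = FALSE)

# Deliberately fit with a small basis
m <- gam(y ~ s(x2, k = 5), data = dat, method = "REML")

# k.check() reports k' (the maximum possible EDF), the EDF used,
# the k-index, and a p-value from a residual randomisation test;
# a low p-value with EDF close to k' suggests k should be increased
k.check(m)
```

The heuristic is exactly that, a heuristic; a clean k.check() result does not prove k was large enough, but a failing one is a clear prompt to refit with a larger k.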

You can also use the derivatives of the estimated smooths (with a simultaneous credible interval) to investigate whether wiggles in the estimated functions are interesting or not; if the simultaneous interval of the derivative at a particular value of the covariate includes zero (0), that would indicate a potentially uninteresting wiggle: a feature of the estimated function that is not sufficiently large to be discernible given the uncertainty in the estimated smooth itself.
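One way to do this is with the derivatives() function from the gratia package (a sketch, assuming gratia is installed; the data are simulated as before):

```r
library(mgcv)
library(gratia)

set.seed(2)
# Simulated example data with a nonlinear effect of x2
dat <- gamSim(1, n = 400, dist = "normal", verbose = FALSE)
m <- gam(y ~ s(x2), data = dat, method = "REML")

# First derivative of the estimated smooth with a simultaneous interval;
# where the interval includes zero, the local change in the function is
# not discernible given the uncertainty in the estimated smooth
d <- derivatives(m, interval = "simultaneous")
head(d)
```

Runs of covariate values where the simultaneous interval excludes zero are the features of the smooth worth interpreting; the rest may simply be wiggliness the data cannot distinguish from a flat stretch.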

In short, yes, you can set k low if you expect or know that the true relationship should be both smooth and of low degree, but you do need to check that assumption (if the data are telling you something different to your prior expectations, why is that?), and you should be aware that many simple-looking functions need more EDF than you might expect. For the inference to remain valid you need to avoid biased estimates of the function, which is what you would get if you underfit (oversmooth).

If you need to build in additional constraints for biological/ecological plausibility, such as monotonicity constraints, then see the scam package.
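For instance, a monotone increasing smooth can be requested via a shape-constrained basis (a sketch, assuming scam is installed; data simulated, names illustrative):

```r
library(scam)

set.seed(3)
n <- 200
x <- runif(n)
# Monotone increasing true relationship with noise
y <- 3 * plogis(10 * (x - 0.5)) + rnorm(n, sd = 0.3)

# bs = "mpi" requests a monotone increasing P-spline basis, so the
# fitted smooth is constrained to be non-decreasing in x
m_mono <- scam(y ~ s(x, bs = "mpi", k = 10))
summary(m_mono)
```

Such a constraint encodes the biological expectation directly, rather than hoping a small k happens to rule out implausible shapes.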