What does "~." mean in R?


This is the first time I've used the model.matrix function in R. Here is the code from an example I found in a book:

model.matrix(~., data)

Looking on the internet, I saw that the first argument should be "an object of an appropriate class", but I don't understand what ~. is.

Robert Schauner

The dot means all columns in your data.frame that are not otherwise mentioned in the formula.

Running a few examples with the iris dataset shows this pretty well.

library(tidyverse)

model.matrix( ~ ., data = iris) %>% colnames()
# [1] "(Intercept)"       "Sepal.Length"      "Sepal.Width"      
# [4] "Petal.Length"      "Petal.Width"       "Speciesversicolor"
# [7] "Speciesvirginica" 

model.matrix( ~ Petal.Length, data = iris) %>% colnames()
# [1] "(Intercept)"  "Petal.Length"

model.matrix(Petal.Length ~ ., data = iris) %>% colnames()
# [1] "(Intercept)"       "Sepal.Length"      "Sepal.Width"      
# [4] "Petal.Width"       "Speciesversicolor" "Speciesvirginica" 
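Because the dot expands to every column not already named, it can also be combined with other formula operators such as `-`. Here is a small sketch using only base R and the built-in iris data; the variable names `cols_all` and `cols_no_species` are just for illustration:

```r
# "." expands to every column of `data` that is not already in the formula.
cols_all <- colnames(model.matrix(~ ., data = iris))

# Subtracting a term removes it from the expansion.
cols_no_species <- colnames(model.matrix(~ . - Species, data = iris))

# Columns present in the first design matrix but not in the second:
# the dummy variables generated from the Species factor.
print(setdiff(cols_all, cols_no_species))
```

This should print only the two Species dummy columns, since everything else is shared between the two matrices.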

You could forgo the . syntax and spell the columns out. I think this is the better approach: a reader can tell what the model includes without knowing what the columns in iris are, and you cannot include a column by accident.

model.matrix( ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width + Species, data = iris) %>% colnames()
# [1] "(Intercept)"       "Sepal.Length"      "Sepal.Width"      
# [4] "Petal.Length"      "Petal.Width"       "Speciesversicolor"
# [7] "Speciesvirginica" 

Another resource on building these model matrices that I have used is this article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7873980/. It's specific to RNA-seq, but the model matrices can be used for any similar style of comparison/model.

Arthur

R has a formula data type. A formula specifies which columns of a data frame to select, and how to transform them, as input to a modeling or other mathematical function.

For example,

lm(mpg ~ wt + hp, data = mtcars)

will fit a linear model with mpg as the response (y) variable and wt and hp as the predictor (x) variables, all taken from the mtcars data frame.

Why is this useful? Well, without it, you have to prepare x and y matrix arguments every time you fit a model. This is how most other mathematical languages work and it's kind of a pain. You'd have to write something like

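# the same fit without a formula, using base R's lower-level lm.fit()
x <- cbind(Intercept = 1, as.matrix(mtcars[, c("wt", "hp")]))  # design matrix
y <- mtcars[, "mpg"]                                            # response vector
fit <- lm.fit(x, y)
fit$coefficients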

The formula type makes it easy and agile to try models with various terms or transformations (e.g. log(mpg) ~ wt + hp) without having to define x and y inputs every...single...time.

The formula type also saves you from typing a lot of indexing with $ or [] and quoted names like "hp". Elegance is prized in R, and it's nice to have clean code without lots of quotes and brackets.

The ~ symbol is part of a formula. It separates the response from the predictor variables in the specification of a model. Sometimes the formula is one-sided, because the math has no equivalent of a response variable, and it's just ~x1 + x2.
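To make the one-sided versus two-sided distinction concrete, here is a short sketch in base R. The names `f_two` and `f_one` are illustrative only. It also shows that either kind of formula is an object of class "formula", which is what the documentation's phrase "an object of an appropriate class" refers to in this case.

```r
# A two-sided formula has a response on the left of "~".
f_two <- mpg ~ wt + hp

# A one-sided formula has nothing on the left.
f_one <- ~ wt + hp

class(f_two)      # both objects have class "formula"
class(f_one)
length(f_two)     # 3 parts: the "~" operator, the left side, the right side
length(f_one)     # 2 parts: the "~" operator and the right side
all.vars(f_two)   # names of the variables used anywhere in the formula
```

Note that creating a formula does not look up any data; the column names are only resolved when a function such as lm or model.matrix evaluates the formula against a data frame.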

The . is shorthand for "include all other columns." So, with data = mtcars, mpg ~ . is shorthand for mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb.
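One way to check this expansion yourself is to ask R to resolve the dot against the data, then compare the fits from the two spellings. This is a minimal sketch with base R and the built-in mtcars data; `expanded`, `fit_dot` and `fit_long` are illustrative names.

```r
# Resolve the dot against the columns of mtcars and list the resulting terms.
expanded <- terms(mpg ~ ., data = mtcars)
attr(expanded, "term.labels")

# Fit the same model with the shorthand and with every column spelled out.
fit_dot  <- lm(mpg ~ ., data = mtcars)
fit_long <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
               data = mtcars)

# The two spellings should give the same coefficients.
all.equal(coef(fit_dot), coef(fit_long))
```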

model.matrix is a function that creates the matrix input to a regression. It will expand factor variables into dummy variables. It will keep numeric variables as-is. If there is a transformation defined in the formula, it will perform it. By default it also adds an intercept column. You can give it the set of variables to the right of the ~ and it will do all this preparation for you.
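The three behaviours described above (dummy expansion, numeric pass-through, and transformations) can be seen in one call. This is a sketch with the built-in mtcars data. Wrapping cyl in factor() is my own choice for illustration, since cyl is stored as a number in mtcars and would otherwise be kept as a single numeric column.

```r
# log(wt) is a transformation, hp is a plain numeric column, and
# factor(cyl) is a factor that gets expanded into dummy (indicator) columns.
mm <- model.matrix(~ log(wt) + hp + factor(cyl), data = mtcars)

colnames(mm)  # intercept, transformed wt, hp as-is, dummies for cyl
dim(mm)       # one row per observation in mtcars
head(mm, 3)   # first few rows of the design matrix
```

The factor has three levels (4, 6 and 8 cylinders) but only two dummy columns appear, because with the default treatment contrasts the first level is absorbed into the intercept.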