The overall goal of the exercise is to find perfectly collinear columns in a very large but sparse matrix X in R, in the context of a linear regression problem. One approach that pops up from time to time is to use the QR decomposition underlying lm() to extract the OLS coefficients and drop all variables whose estimate is NA. While base::qr() is much too slow to be a viable option, Matrix::qr() handles the sparse matrix well and is quite fast. Unfortunately, the results of the two implementations are quite different.
Consider the toy data set
# random design matrix
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2
X <- cbind(1, x1, x2, x3)
# random outcome variable
y <- rnorm(100)
where variable x3 is a linear combination of x1 and x2, so the goal is to drop x3 from X. We can quickly estimate the OLS coefficients using base R as follows
# base implementation
qr.base <- base::qr(X)
coef.base <- base::qr.coef(qr.base, y)
coef.base
                      x1          x2          x3 
-0.04330410 -0.04889519  0.07719935          NA 
Here, the coefficient of x3 is automatically set to NA, as desired. Similar results can be obtained by using the dense matrix X as input for Matrix::qr(). However, things change when working with a sparse matrix class:
# Matrix implementation (sparse)
X.sparse <- Matrix::sparse.model.matrix(~ x1 + x2 + x3)
qr.matrix <- Matrix::qr(X.sparse)
coef.matrix <- Matrix::qr.coef(qr.matrix, y)
coef.matrix
  (Intercept)            x1            x2            x3 
-3.125000e-02 -2.088811e+14 -2.088811e+14  2.088811e+14 
where something clearly goes wrong because X is not of full rank. Is there a way to obtain results similar to those of base::qr() when using Matrix::qr()?
It looks like Matrix::qr.coef(qr.matrix, y) is giving a different answer because it never drops columns/pivots to deal with collinearity. It thus seems like you would need to deal with this prior to computing the QR decomposition of X.sparse. If you wanted to implement this, one way to determine which columns need to be dropped uses a trick based on the reduced row echelon form of t(X) %*% X, which works as follows.
Consider a slight variation on your toy data set:
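A sketch of what that variation might look like; the seed, the specific draws, and the factor of 2 in x5 are illustrative assumptions, but the structure follows the description below:
# a slight variation on the toy data set
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
x4 <- rnorm(100)
x6 <- rnorm(100)
x3 <- x1 + x2     # collinear: linear combination of two other predictors
x5 <- 2 * x4      # collinear: scalar multiple of a single predictor
X <- cbind(intercept = 1, x1, x2, x3, x4, x5, x6)
y <- rnorm(100)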
I've added two "flavors" of collinearity because I think it's helpful to see the distinction in what follows. I also added an additional independent predictor x6 to show how the pattern showcased continues with additional predictors.
We can extract information about collinearity from the matrix t(X) %*% X. This is because it's actually the inability to invert this singular matrix that causes the key problems associated with collinearity in the first place. In ordinary least squares, it is also what prevents us from simply calculating the coefficient estimates directly as solve(t(X) %*% X) %*% t(X) %*% y.
The process works as follows:
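One way to implement this is sketched below; the small rref() helper and its tolerance are assumptions (pracma::rref() would be an alternative), as is the exact construction of collinear_sets_matrix from the nonzero pattern of rref_tX_X:
# Gauss-Jordan reduction to reduced row echelon form; the tolerance decides
# when a (near-)zero pivot is treated as exactly zero
rref <- function(A, tol = 1e-8 * max(abs(A))) {
  r <- 1
  for (j in seq_len(ncol(A))) {
    if (r > nrow(A)) break
    p <- which.max(abs(A[r:nrow(A), j])) + r - 1      # partial pivoting
    if (abs(A[p, j]) <= tol) {
      A[r:nrow(A), j] <- 0                            # no pivot in this column
      next
    }
    A[c(r, p), ] <- A[c(p, r), ]                      # move pivot row up
    A[r, ] <- A[r, ] / A[r, j]                        # scale pivot to 1
    others <- setdiff(seq_len(nrow(A)), r)
    A[others, ] <- A[others, ] - outer(A[others, j], A[r, ])  # clear column j
    r <- r + 1
  }
  A
}

tX_X <- t(X) %*% X                 # singular cross-product matrix
rref_tX_X <- rref(tX_X)

# columns of X that load on the same pivot row of the RREF belong to the same
# collinear set; entry (i, j) is 1 if columns i and j share a nonzero row
nonzero <- abs(rref_tX_X) > 1e-8
collinear_sets_matrix <- 1 * (crossprod(nonzero) > 0)
With the sketch data above, the x3 column of collinear_sets_matrix picks up x1 and x2, the x5 column picks up x4, and only the intercept and x6 columns reduce to a single 1 on the diagonal.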
In collinear_sets_matrix, any row/column i with a 1 in position i and 0s elsewhere does not belong to a collinear set. In the example above, this only includes the intercept and the predictor x6.
Given rref_tX_X, one could also work through the matrix columns left to right to determine predictors to keep and predictors to drop. Keep a counter k of columns to keep that starts at 1. When a column in rref_tX_X has a 1 in position k and 0s elsewhere, keep that column and increment k. Continue checking all columns.
In the example above, we would keep intercept, x1, x2, x4, and x6, which also matches the values that are not NA following the base::qr() and base::qr.coef() process.
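A minimal sketch of that left-to-right check, assuming the X and rref_tX_X objects from the sketches above (the tolerance is again an assumption):
# keep a column only when it equals the k-th standard basis vector, i.e. a 1
# in position k and 0s elsewhere; increment k after every keeper
keep <- logical(ncol(rref_tX_X))
k <- 1
for (j in seq_len(ncol(rref_tX_X))) {
  target <- replace(numeric(nrow(rref_tX_X)), k, 1)   # k-th basis vector
  if (all(abs(rref_tX_X[, j] - target) < 1e-8)) {
    keep[j] <- TRUE
    k <- k + 1
  }
}
colnames(X)[keep]
# with the sketch data above: "intercept" "x1" "x2" "x4" "x6"

# the reduced design matrix is then of full column rank
X_reduced <- X[, keep, drop = FALSE]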