I'm trying to estimate a model using speedglm in R. The dataset is large (~69.88 million rows and 38 columns). Multiplying the number of rows and columns results in ~2.7 billion which is outside the integer limit. I can't provide the data, but the following examples recreate the issue.
library(speedglm)
# large example that works
require(biglm)
n <- 500000
k <- 500
y <- rgamma(n, 1.5, 1)
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
working.example <- speedglm(fo, data = da, family = Gamma(log))
# repeat with large enough size to break
k <- 5000 # 10 times larger than above
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
failed.example <- speedglm(fo, data = da, family = Gamma(log))
# attempting to resolve error with chunksize
attempted.fixed.example <- speedglm(fo, data = da, family = Gamma(log), chunksize = 10^6)
This causes an error and integer overflow warning.
Error in if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) .Internal(sample2(n, :
missing value where TRUE/FALSE needed
In addition: Warning message:
In nrow(X) * ncol(X) : NAs produced by integer overflow
I understand the warning, but I do not understand the error. They seem to be related in this case as they appear together after each attempt.
Removing columns allows the estimation to complete. It does not seem to matter which columns are removed; removing interacted or non-interacted variables will both result in a completed estimation. The chunksize option was added after receiving the error initially, but has not helped.
My questions are: (1) what causes the first error? (2) is there a way to estimate models using data such that the number of rows by the number of columns is larger than the integer limit? (3) is there a better na.action to use in this case?
Thanks,
JP.
Running: R version 3.3.3 (2017-03-06)
Actual code below:
dft_var <- c("cltvV0", "cltvV60", "cltvV120", "VCFLBRQ", "ageV0",
"ageV1", "ageV8", "ageV80", "FICOV300", "FICOV650",
"FICOV900", "SingleHouse", "Apt", "Mobile", "Duplex",
"Row", "Modular", "Rural", "FirstTimeBuyer",
"FirstTimeBuyerMissing", "brwtotinMissing", "IncomeRatio",
"VintageBefore2001", "NFLD", "yoy.fcpwti:province_n")
logit1 <- speedglm(formula = paste("DefaultFlag ~ ",
paste(dft_var, collapse = "+"),
sep = ""),
family = binomial(logit),
na.action = na.exclude,
data = default.data,
chunksize = 1*10^7)
Update:
Based on my investigation below, @James figured out that the problem can be avoided by providing non-
NULLvalue for the parametersparsein the call of thespeedglmfunction, as it prevents the internal call of theis.sparsefunction.Using the example above, the following should now work:
My original answer:
Both the warning and the error come from the same line in the function
is.sparsein the packagespeedglm.The line is:
The warning happens because of the use of
nrow(X)*ncol(X)for a large matrix. Thenrowandncolfunctions returnintegervalues, which can overflow. here is an illustration.The error happens because the
samplefunction is confused when X is a large matrix andsize = NA. Here is an illustration: