Fitting a truncated lognormal distribution with fitdistrplus

392 Views Asked by At

In trying to fit a lognormal distribution to my truncated data, I found the following two Stackoverflow posts and followed them:

Fitting a lognormal distribution to truncated data in R Fitting a truncated lognormal distribution in R

However, it appears this solution no longer works, as the dtrunc and ptrunc functions from the truncdist package now fail to pass the test function of fitdistrplus.

dtruncated_log_normal <- function(x, a,b, meanlog, sdlog)
  dtrunc(x, "lnorm", a=a, b=b, meanlog=meanlog, sdlog=sdlog)

ptruncated_log_normal <- function(q, a,b, meanlog, sdlog)
  ptrunc(q, "lnorm", a=a, b=b, meanlog=meanlog, sdlog=sdlog)

fit <- fitdist(s, "truncated_log_normal", start=list(a=0.001, b=90, meanlog=mean(log(s)), sdlog=sd(log(s))))

We run into basically all the errors of the test function, returning

Error in fitdist(s, "truncated_log_normal", start = list(a = 0.001, b = 90,  : 
  the function mle failed to estimate the parameters, 
                with the error code 100
In addition: Warning messages:
1: In fitdist(s, "truncated_log_normal", start = list(a = 0.001, b = 90,  :
  The dtruncated_log_normal function should return a vector of with NaN values when input has inconsistent values and not raise an error
2: In fitdist(s, "truncated_log_normal", start = list(a = 0.001, b = 90,  :
  The ptruncated_log_normal function should return a vector of with NaN values when input has inconsistent parameters and not raise an error

Sample of my vector with over 2000 elements:

> dput(head(s,150))
c(88.443, 89.296, 89.327, 87.776, 89.405, 89.824, 89.997, 87.678, 
89.665, 88.814, 88.841, 89.728, 89.365, 89.476, 89.189, 88.251, 
88.939, 89.945, 89.567, 89.613, 89.317, 89.622, 87.674, 89.19, 
89.782, 89.891, 89.954, 89.556, 89.093, 89.637, 89.052, 87.395, 
87.835, 89.357, 87.733, 89.459, 88.197, 88.539, 88.564, 87.857, 
88.74, 88.955, 89.691, 88.102, 89.635, 89.116, 89.584, 88.288, 
86.95, 89.182, 89.435, 88.93, 87.567, 89.083, 88.52, 88.897, 
89.54, 88.557, 89.269, 89.854, 89.31, 88.274, 89.126, 89.431, 
88.257, 88.872, 88.978, 89.03, 87.434, 88.305, 89.656, 87.556, 
89.209, 89.508, 87.781, 88.068, 89.933, 87.256, 88.906, 89.067, 
88.92, 87.947, 88.196, 88.951, 89.594, 88.378, 87.482, 88.817, 
89.65, 89.392, 89.932, 87.896, 89.909, 89.265, 89.954, 89.827, 
87.49, 87.786, 89.208, 89.728, 88.905, 87.566, 86.612, 88.363, 
87.457, 87.639, 88.907, 88.425, 87.244, 88.546, 88.221, 89.293, 
87.469, 87.31, 89.107, 88.442, 89.133, 88.812, 88.418, 89.456, 
88.512, 89.514, 87.446, 88.374, 89.282, 87.415, 89.004, 87.627, 
89.107, 89.168, 89.589, 89.288, 88.496, 89.807, 87.518, 88.796, 
88.001, 87.322, 87.353, 88.055, 88.81, 88.456, 87.876, 87.7, 
88.675, 88.996, 89.479, 86.781, 86.928, 87.356)

Is there a workaround for this in 2023? My data is already cleaned and contains no inconsistent, empty, NA or any other strange values.

1

There are 1 best solutions below

3
user2554330 On

The first problem is the restriction on the parameters: 0 < a < min(s) < max(s) < b and 0 < sdlog are requirements of the model. The optimizer doesn't know that, and runs into trouble because values that violate those restrictions generate lots of errors.

One way to deal with problems like this is to modify the parameters so they are unrestricted. For example, you could use log(a) as a parameter, because it's guaranteed to be positive, and use log(b-a) as another parameter, to guarantee b > a, and log(sdlog) as another.

The next problem is harder. For some of the values tried by the optimizer, the probability of the original distribution falling in the truncation interval evaluates to zero. Specifically, I saw this in debugging when a = 0.001, b = 90, sdlog = 0.009529981 (your starting values), and meanlog = 5.176146 (somewhat larger than the starting value).

The probability isn't really zero, it's some small value that rounds down to zero. The solution to this is to work with log probabilities instead of probabilities, but I don't think that option is available to you --- the code where this problem comes up is in truncdist, but the parameters are coming from fitdistrplus. Those two packages would need to work together to solve this, or perhaps you could write specialized versions of dtrunc and ptrunc to do it.

EDITED TO ADD:

The basic idea used by dtrunc is that the truncated density is equal to the regular density divided by the probability of being in the truncation interval, i.e. d/(pb - pa) where d is the full density, pb and pa are the CDF values at the endpoints. The numerical problem is that pb == pa due to rounding.

The solution to this is to rescale everything, and work on the log scale. That is, use (d/pb)/(1 - pa/pb) = exp(log(d) - log(pb) - log1p(-exp(log(pa) - log(pb)))).

The code below does this as well as solving the first problem. It doesn't use the truncdist package, it does the calculations using base functions.

And that's still not enough! Now the problem is that fitdist forces optim to compute a Hessian matrix, and numerical issues cause that to fail. There's a way around this: I'll define a "custom" optimization function, that is just the regular optim() with the hessian argument forced to be FALSE.

And here's the result. Sorry, no standard errors.

library(fitdistrplus) 
#> Loading required package: MASS
#> Loading required package: survival

s <- c(88.443, 89.296, 89.327, 87.776, 89.405, 89.824, 89.997, 87.678, 
       89.665, 88.814, 88.841, 89.728, 89.365, 89.476, 89.189, 88.251, 
       88.939, 89.945, 89.567, 89.613, 89.317, 89.622, 87.674, 89.19, 
       89.782, 89.891, 89.954, 89.556, 89.093, 89.637, 89.052, 87.395, 
       87.835, 89.357, 87.733, 89.459, 88.197, 88.539, 88.564, 87.857, 
       88.74, 88.955, 89.691, 88.102, 89.635, 89.116, 89.584, 88.288, 
       86.95, 89.182, 89.435, 88.93, 87.567, 89.083, 88.52, 88.897, 
       89.54, 88.557, 89.269, 89.854, 89.31, 88.274, 89.126, 89.431, 
       88.257, 88.872, 88.978, 89.03, 87.434, 88.305, 89.656, 87.556, 
       89.209, 89.508, 87.781, 88.068, 89.933, 87.256, 88.906, 89.067, 
       88.92, 87.947, 88.196, 88.951, 89.594, 88.378, 87.482, 88.817, 
       89.65, 89.392, 89.932, 87.896, 89.909, 89.265, 89.954, 89.827, 
       87.49, 87.786, 89.208, 89.728, 88.905, 87.566, 86.612, 88.363, 
       87.457, 87.639, 88.907, 88.425, 87.244, 88.546, 88.221, 89.293, 
       87.469, 87.31, 89.107, 88.442, 89.133, 88.812, 88.418, 89.456, 
       88.512, 89.514, 87.446, 88.374, 89.282, 87.415, 89.004, 87.627, 
       89.107, 89.168, 89.589, 89.288, 88.496, 89.807, 87.518, 88.796, 
       88.001, 87.322, 87.353, 88.055, 88.81, 88.456, 87.876, 87.7, 
       88.675, 88.996, 89.479, 86.781, 86.928, 87.356)

dtruncated_log_normal <- function(x, loga, logbminusa, meanlog, logsdlog) {

  a <- exp(loga)
  b <- exp(logbminusa) + a
  goodvals <- is.finite(x) & a <= x & x <= b
  sdlog <- exp(logsdlog)
  # cat("parms:", c(min(x[goodvals]), max(x[goodvals]), loga, a, logbminusa, b, meanlog, sdlog), "\n")
  logpab <- plnorm(c(a,b), meanlog, sdlog, log = TRUE)
  result <- rep(0, length(x))
  logd <- dlnorm(x[goodvals], meanlog, sdlog, log = TRUE)
  result[goodvals] <- exp(logd - logpab[2] - log1p(-exp(logpab[1]- logpab[2])))
  result
}

ptruncated_log_normal <- function(q, loga, logbminusa, meanlog, logsdlog) rep(NaN, length(q))

myoptim <- function(..., hessian) {
  optim(..., hessian = FALSE)
}

fit <- fitdist(s, "truncated_log_normal", start=list(loga=log(min(s) - 1), logbminusa=log(max(s) - min(s) + 2), meanlog=mean(log(s)), logsdlog=log(sd(log(s)))), custom.optim = myoptim)

fit
#> Fitting of the distribution ' truncated_log_normal ' by maximum likelihood 
#> Parameters:
#>             estimate Std. Error
#> loga        4.439900         NA
#> logbminusa  1.654508         NA
#> meanlog     4.490567         NA
#> logsdlog   -4.354114         NA

# Convert back to the original scale:
ests <- as.list(fit$estimate)
with(ests, {   
  a <- exp(loga)
  b <- exp(logbminusa) + a
  sdlog <- exp(logsdlog)
  c(a = a, b = b, meanlog = meanlog, sdlog = sdlog)
  })
#>           a           b     meanlog       sdlog 
#> 84.76649227 89.99700006  4.49056699  0.01285382

Created on 2023-03-19 with reprex v2.0.2