In trying to fit a lognormal distribution to my truncated data, I found the following two Stack Overflow posts and followed them:
Fitting a lognormal distribution to truncated data in R
Fitting a truncated lognormal distribution in R
However, it appears this solution no longer works, as the dtrunc and ptrunc functions from the truncdist package now fail to pass the test function of fitdistrplus.
library(truncdist)
library(fitdistrplus)
dtruncated_log_normal <- function(x, a, b, meanlog, sdlog)
  dtrunc(x, "lnorm", a = a, b = b, meanlog = meanlog, sdlog = sdlog)
ptruncated_log_normal <- function(q, a, b, meanlog, sdlog)
  ptrunc(q, "lnorm", a = a, b = b, meanlog = meanlog, sdlog = sdlog)
fit <- fitdist(s, "truncated_log_normal",
               start = list(a = 0.001, b = 90, meanlog = mean(log(s)), sdlog = sd(log(s))))
This runs into essentially all of the errors from that test function, returning:
Error in fitdist(s, "truncated_log_normal", start = list(a = 0.001, b = 90, :
the function mle failed to estimate the parameters,
with the error code 100
In addition: Warning messages:
1: In fitdist(s, "truncated_log_normal", start = list(a = 0.001, b = 90, :
The dtruncated_log_normal function should return a vector of with NaN values when input has inconsistent values and not raise an error
2: In fitdist(s, "truncated_log_normal", start = list(a = 0.001, b = 90, :
The ptruncated_log_normal function should return a vector of with NaN values when input has inconsistent parameters and not raise an error
Sample of my vector with over 2000 elements:
> dput(head(s,150))
c(88.443, 89.296, 89.327, 87.776, 89.405, 89.824, 89.997, 87.678,
89.665, 88.814, 88.841, 89.728, 89.365, 89.476, 89.189, 88.251,
88.939, 89.945, 89.567, 89.613, 89.317, 89.622, 87.674, 89.19,
89.782, 89.891, 89.954, 89.556, 89.093, 89.637, 89.052, 87.395,
87.835, 89.357, 87.733, 89.459, 88.197, 88.539, 88.564, 87.857,
88.74, 88.955, 89.691, 88.102, 89.635, 89.116, 89.584, 88.288,
86.95, 89.182, 89.435, 88.93, 87.567, 89.083, 88.52, 88.897,
89.54, 88.557, 89.269, 89.854, 89.31, 88.274, 89.126, 89.431,
88.257, 88.872, 88.978, 89.03, 87.434, 88.305, 89.656, 87.556,
89.209, 89.508, 87.781, 88.068, 89.933, 87.256, 88.906, 89.067,
88.92, 87.947, 88.196, 88.951, 89.594, 88.378, 87.482, 88.817,
89.65, 89.392, 89.932, 87.896, 89.909, 89.265, 89.954, 89.827,
87.49, 87.786, 89.208, 89.728, 88.905, 87.566, 86.612, 88.363,
87.457, 87.639, 88.907, 88.425, 87.244, 88.546, 88.221, 89.293,
87.469, 87.31, 89.107, 88.442, 89.133, 88.812, 88.418, 89.456,
88.512, 89.514, 87.446, 88.374, 89.282, 87.415, 89.004, 87.627,
89.107, 89.168, 89.589, 89.288, 88.496, 89.807, 87.518, 88.796,
88.001, 87.322, 87.353, 88.055, 88.81, 88.456, 87.876, 87.7,
88.675, 88.996, 89.479, 86.781, 86.928, 87.356)
Is there a workaround for this in 2023? My data is already cleaned and contains no inconsistent, empty, NA, or otherwise strange values.
The first problem is the restriction on the parameters: 0 < a < min(s) < max(s) < b and 0 < sdlog are requirements of the model. The optimizer doesn't know that, and it runs into trouble because values that violate those restrictions generate lots of errors.

One way to deal with problems like this is to modify the parameters so they are unrestricted. For example, you could use log(a) as a parameter, because a = exp(log(a)) is then guaranteed to be positive; use log(b - a) as another parameter, to guarantee b > a; and use log(sdlog) as a third, as sketched below.
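To make that concrete, here is a minimal sketch of such a mapping (the function and argument names are my own, not taken from the original answer): the optimizer works with unrestricted values, and the distribution functions translate them back into valid a, b, and sdlog.

# Sketch: map unrestricted optimizer parameters back to the restricted ones.
# (Illustrative names; meanlog needs no transformation.)
to_restricted <- function(loga, logdiff, meanlog, logsdlog) {
  a     <- exp(loga)          # a > 0
  b     <- a + exp(logdiff)   # b > a
  sdlog <- exp(logsdlog)      # sdlog > 0
  list(a = a, b = b, meanlog = meanlog, sdlog = sdlog)
}
# For example, the question's starting values on the unrestricted scale:
# to_restricted(log(0.001), log(90 - 0.001), mean(log(s)), log(sd(log(s))))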
The next problem is harder. For some of the values tried by the optimizer, the probability of the original distribution falling in the truncation interval evaluates to zero. Specifically, I saw this in debugging when a = 0.001, b = 90, sdlog = 0.009529981 (your starting values) and meanlog = 5.176146 (somewhat larger than the starting value). The probability isn't really zero; it's some small value that rounds down to zero. The solution to this is to work with log probabilities instead of probabilities, but I don't think that option is available to you: the code where this problem comes up is in truncdist, while the parameters are coming from fitdistrplus. Those two packages would need to work together to solve this, or perhaps you could write specialized versions of dtrunc and ptrunc to do it.

EDITED TO ADD:
The basic idea used by dtrunc is that the truncated density is equal to the regular density divided by the probability of being in the truncation interval, i.e. d/(pb - pa), where d is the full density and pb and pa are the CDF values at the endpoints. The numerical problem is that pb == pa due to rounding.

The solution to this is to rescale everything and work on the log scale. That is, use

(d/pb)/(1 - pa/pb) = exp(log(d) - log(pb) - log1p(-exp(log(pa) - log(pb)))).
The code below does this, as well as solving the first problem. It doesn't use the truncdist package; it does the calculations with base functions.
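For illustration, here is a minimal sketch of that kind of calculation (my own reconstruction of the approach described above, not necessarily the original code): truncated-lognormal d and p functions that work on the log scale, use only base dlnorm/plnorm, and take the unrestricted parameters loga, logdiff and logsdlog from the earlier sketch.

# Sketch: truncated lognormal density and CDF computed on the log scale,
# parameterized by unrestricted loga, logdiff, logsdlog (illustrative names).
dtlnorm <- function(x, loga, logdiff, meanlog, logsdlog) {
  a <- exp(loga); b <- a + exp(logdiff); sdlog <- exp(logsdlog)
  logpa <- plnorm(a, meanlog, sdlog, log.p = TRUE)
  logpb <- plnorm(b, meanlog, sdlog, log.p = TRUE)
  # log(pb - pa) rewritten as log(pb) + log1p(-exp(log(pa) - log(pb)))
  logdenom <- logpb + log1p(-exp(logpa - logpb))
  logd <- dlnorm(x, meanlog, sdlog, log = TRUE) - logdenom
  ifelse(x >= a & x <= b, exp(logd), 0)
}
ptlnorm <- function(q, loga, logdiff, meanlog, logsdlog) {
  a <- exp(loga); b <- a + exp(logdiff); sdlog <- exp(logsdlog)
  qc <- pmin(pmax(q, a), b)                       # clamp q into [a, b]
  logpa <- plnorm(a, meanlog, sdlog, log.p = TRUE)
  logpb <- plnorm(b, meanlog, sdlog, log.p = TRUE)
  logpq <- plnorm(qc, meanlog, sdlog, log.p = TRUE)
  # (pq - pa)/(pb - pa), with each difference evaluated via log1p(-exp(.))
  exp(logpq + log1p(-exp(logpa - logpq)) - logpb - log1p(-exp(logpa - logpb)))
}

With this parameterization there are no invalid parameter combinations for the optimizer to stumble into, and the log1p(-exp(.)) form keeps pb - pa from rounding to zero.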
And that's still not enough! Now the problem is that fitdist forces optim to compute a Hessian matrix, and numerical issues cause that to fail. There's a way around this: I'll define a "custom" optimization function that is just the regular optim() with the hessian argument forced to be FALSE. Sorry, no standard errors, since there is then no Hessian to derive them from.
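Putting the pieces together might look roughly like this (again a sketch, not the original answer's code: custom.optim is a documented argument of fitdist()/mledist(), but the exact set of arguments it forwards isn't verified here, so the wrapper simply strips any hessian request and passes everything else through to optim()).

# Sketch: a "custom" optimizer that is just optim() with the Hessian turned off.
optim_no_hessian <- function(fn, par, ...) {
  args <- list(...)
  args$hessian <- NULL   # drop any hessian request coming from fitdist()
  do.call(optim, c(list(par = par, fn = fn, hessian = FALSE), args))
}

# Fit on the unrestricted scale, using the dtlnorm/ptlnorm sketch above.
fit <- fitdist(s, "tlnorm",
               start = list(loga = log(0.001), logdiff = log(90 - 0.001),
                            meanlog = mean(log(s)), logsdlog = log(sd(log(s)))),
               custom.optim = optim_no_hessian)

# Back-transform the estimates to the original parameterization.
est <- as.list(fit$estimate)
with(est, c(a = exp(loga), b = exp(loga) + exp(logdiff),
            meanlog = meanlog, sdlog = exp(logsdlog)))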
Created on 2023-03-19 with reprex v2.0.2