In R, how to impute right-censored missing data to follow the assumed distribution?

243 Views Asked by At

Earlier on, random imputation of left-censored data to follow the assumed distribution has been explained here in Stack Overflow. Than can be easily achieved with the censlm package.

But what if I wanted to do something similar for right-censored data?

First, imputation of left-censored missing data (image annotation code omitted):

library(ggplot2)
library(dplyr)
set.seed(123)

# Simulate log-normally distributed biomarker data
original_data <- rlnorm(10000, 2.3, 0.4)

# Display minimum value
min(original_data)
#> [1] 2.142283

# Set the lower limit of quantification (= LLOQ)
lloq <- 8

# Erase values below lloq for plotting purposes
left_censored_data <- replace(original_data, original_data < lloq, NA) %>% na.omit()

# Display left-censored data (values below LLOQ erased)
ggplot() +
  geom_histogram(aes(left_censored_data), binwidth = 0.3) 


Created on 2023-08-26 with reprex v2.0.2

The missing left-censored values can be easily imputed randomly with the censlm package, with log-normal distribution here assumed (image annotation code omitted):

library(censlm)

# For imputation, replace values below LLOQ with the LLOQ value
obs <- replace(original_data, original_data < lloq, lloq)

# Compute (random) imputations with {censlm} to fill back the distribution below LLOQ
fit <- clm(log(obs) ~ 1, left = log(lloq))
imputed_data <- exp(imputed(fit))

# Combine the original and imputed data in long format for plotting purposes
overlayed <- data.frame(original_data, imputed_data)
overlayed <- stack(overlayed[, c(1,2)])
names(overlayed) <- c("values", "data_frame")

# Create an overlayed plot 
ggplot(overlayed) +
geom_density(aes(x=values, lty=data_frame, size=data_frame, color=data_frame), alpha=.5, bw = 1.0) +
scale_size_manual("type", values = c(0.5, 3), guide = "none")

Created on 2023-08-26 with reprex v2.0.2

But how about right-censored data? How can random values above the upper limit of quantification (= ULOQ) be calculated, if log-normal distribution, again, is assumed?

# Right-censored data

# Set the UPPER limit of quantification (= ULOQ)
uloq <- 15

# Erase values above ULOQ for plotting purposes
right_censored_data <- replace(original_data, original_data > uloq, NA) %>% na.omit()

# Display right-censored data (values above ULOQ erased)
ggplot() +
  geom_vline(xintercept = uloq, linetype="dotted") +
  geom_histogram(aes(original_data), binwidth = 0.3, alpha = 0.0) +
  geom_histogram(aes(right_censored_data), binwidth = 0.3) +
  geom_text(aes(x = 16, y = 12,
    label = "Values above ULOQ (15.0) right-censored away"),
      hjust = 0, color="blue")

Created on 2023-08-26 with reprex v2.0.2

One, very painstaking way would be to flip the data horizontally and use censlm again, but surely a more elegant way exists?

2

There are 2 best solutions below

3
Severin Pappadeux On BEST ANSWER

Ok, lets make a bit more clear what I propose. You need log-normal, therefore you have to get μ, σ from your incomplete data. We need two estimated values.

So, mean won't work because we don't have right tale. Stddev won't won't work because we don't have right tale. But from what you have, there are two values you could estimate: mode of the distribution and Full Width at Half Max (FWHM). Those two you COULD get from incomplete data

I roughly guessed mode to be 8 and FWHM to be 14-5 (right side minus left side). Then I took my Python code and just replaced two non-linear equation for mean and q25 with equations for mode and FWHM. And I got some estimate for μ, σ as well as distribution shape and sampling. You just have to find good estimators for mode and FWHM and put them in and it should work

Code, Python 3.10 Windows x64

import numpy as np
import scipy.optimize as opt

import matplotlib.pyplot as plt

LOG_2 = np.log(2.0)

def ff(variables):
    mode= 8. # taken from graph, to be replaced with proper estimator
    fwhm = 14.-5. # taken from graph, to be replaced with proper estimator

    μ, σ = variables

    # taken from wiki https://en.wikipedia.org/wiki/Log-normal_distribution
    mode_eq = np.exp(μ - σ*σ) - mode
    # taked from http://openafox.com/science/peak-function-derivations.html#lognormal
    fwhm_eq = np.exp((μ - σ*σ) + np.sqrt(2.0*σ*σ*LOG_2)) - np.exp((μ - σ*σ) - np.sqrt(2.0*σ*σ*LOG_2)) - fwhm

    return [mode_eq, fwhm_eq]

μ, σ = opt.fsolve(ff, (5,1) )
print(μ, σ)

rng = np.random.default_rng(135797537)
data = rng.lognormal(μ, σ, 200000)

fig, ax = plt.subplots()
ax.hist(data, bins=100)

print('mean: ' + str(np.mean(data)))
print('stdev: ' + str(np.std(data)))

and code produced output

2.286994459923708 0.45557976057161764
mean: 10.930206083816197
stdev: 5.249043744738874

and picture

enter image description here

3
Mikko Marttila On

You can use maximum likelihood estimation like in the left censored case. For convenience, I’ve updated censlm with a right argument to do that:

obs <- replace(original_data, original_data > uloq, uloq)
fit <- clm(log(obs) ~ 1, right = log(uloq))
summary(fit)
#> 
#> Call:
#> survreg(formula = Surv(log(obs), log(obs) < log(uloq), type = "right") ~ 
#>     1, dist = "gaussian")
#>                Value Std. Error    z      p
#> (Intercept)  2.29921    0.00407  565 <2e-16
#> Log(scale)  -0.91770    0.00799 -115 <2e-16
#> 
#> Scale= 0.399 
#> 
#> Gaussian distribution
#> Loglik(model)= -5909.1   Loglik(intercept only)= -5909.1
#> Number of Newton-Raphson Iterations: 5 
#> n= 10000

imputed_data <- exp(imputed(fit))
plot(density(original_data))
lines(density(imputed_data), col = 2, lty = 2, lwd = 2)