What is the fastest R way to generate random variates of the Bates distribution?

92 Views Asked by At

It is a simple task to get random numbers of the Bates distribution. I need 1 million averages to run 10k times:

bates1 = replicate(10000, mean(runif(1e+06,1,5)))
summary(bates1)

I waited forever to complete its calcs. I tried for loops also with no avail (infinitesimally slow).

Any way out of this?

I tried the for loop,

set.seed(999)
for (i in 1:10000) {
x <- randomLHS(1e+6,1)
x <- 1 + 4*x
y[i] <- mean(x)
}
summary(y)

And before the code, allocating space for x and y (using length() ).

4

There are 4 best solutions below

0
neilfws On BEST ANSWER

There are lots of ways to do parallel computation in R. You could look at:

As an example using the doParallel library on my work machine (a modest Surface Book 2):

library(doParallel)

registerDoParallel(7)

# original version
system.time ( { replicate(10000, mean(runif(1e+06,1,5))) } )
 user  system elapsed 
 319.70   20.36  340.39

# parallel version 7 cores
 system.time( { times(10000) %dopar% mean(runif(1e+06,1,5)) } )
  user  system elapsed 
  6.06    1.14  125.75 

So around 2 minutes, as opposed to a bit over 5 minutes (not exactly "forever" but long enough).

Some of these other answers may also help.

2
Gregor Thomas On

Without parallelizing, we can go faster by finding high-performance versions of the needed functions. The dqrng package has a version of runif that is about 3x faster than base, and on long vectors sum(x) / length(x) is a little faster than mean(x).

library(dqrng)
nn = 1e6
bench::mark(
  mean(runif(nn, 1, 5)),
  mean(dqrunif(nn, 1, 5)),
  sum(dqrunif(nn, 1, 5)) / nn,
  check = FALSE
)
# # A tibble: 3 × 13
#   expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
# 1 mean(runif(nn, 1, 5))      37.09ms     40ms      25.1    7.63MB     5.02    10     2
# 2 mean(dqrunif(nn, 1, 5))     8.23ms     12ms      85.2    7.63MB    11.4     30     4
# 3 sum(dqrunif(nn, 1, 5))/nn   7.78ms    8.5ms     117.     7.63MB    13.3     44     5
# # ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>, time <list>,
# #   gc <list>

We get a > 3x speed-up combining these two approaches, which would be even faster if combined with parallelization as in neilfws's answer:

library(doParallel)
registerDoParallel(7)
system.time( { times(n) %dopar% sum(dqrunif(nn, 1, 5)) / nn } )
#    user  system elapsed 
# 178.940  31.608  86.432 

Under 1.5 minutes (compared to about 6.5 minutes for your original code on my laptop).

0
Stéphane Laurent On

You don't simulate the Bates distribution here. The Bates distribution with integer paramater n is the mean of n uniform random variables on (0,1). A fast way to get it is:

n <- 6
nsims <- 100000
usims <- matrix(runif(n*nsims), nrow = n, ncol = nsims)
bates_sims <- colMeans(usims)
0
jblood94 On

The Bates distribution can be seen as a polynomial approximation to the normal distribution (as n approaches Inf, it is the normal distribution). With n = 1e6, the normal distribution is a very good approximation. Demonstrating with 1e5 samples:

library(parallel)

# Direct computation of Bates r.v.
cl <- makeCluster(parallel::detectCores() - 1L, type = "PSOCK")
clusterEvalQ(cl, library(dqrng))
system.time(x1 <- unlist(parLapply(cl, 1:1e3, \(i) replicate(100, sum(dqrunif(1e6, 1, 5)))))/1e6)
#>    user  system elapsed 
#>    0.02    0.00  103.25

# normal approximation
x2 <- rnorm(1e5, 3, sqrt(16/12/1e6))

# Kolmogorov-Smirnov test for a difference between the two distributions
ks.test(x1, x2)
#> 
#>  Asymptotic two-sample Kolmogorov-Smirnov test
#> 
#> data:  x1 and x2
#> D = 0.00312, p-value = 0.7151
#> alternative hypothesis: two-sided

# plot the empirical CDFs
plot(ecdf(x1), col = "blue")
plot(ecdf(x2), col = "orange", add = TRUE)

enter image description here

The two sets of samples are essentially numerically indistinguishable. For comparison, plot the empirical CDF for two different samples from a normal distribution.

plot(ecdf(rnorm(1e5, 3, sqrt(4/3/1e6))), col = "blue")
plot(ecdf(rnorm(1e5, 3, sqrt(4/3/1e6))), col = "orange", add = TRUE)

enter image description here