R Function For Coupon Collector's Function with Finite Number of Trials

Question

R Function For Coupon Collector's Function with Finite Number of Trials

95 Views Asked by welijahclark At 22 November 2023 at 04:45

I'm trying to write up a function that runs the Coupon Collector's Problem in R. I'm trying to make a function that will spit out the number of unique coupons collected from sample size n with a finite number of trials t. (For example, having 500 unique coupons in a sample space, and picking a coupon from it 1000 times with replacement.)

I was able to write a similar function that returns how many times you have to sample it to collect ALL tickets. However, that function only requires specifying n, and does not account for the finite number of trials.

So far, this is how I have attempted to go about writing this function, with the function calling for n and t.

I assumed that if the length of the trials was less than t for the while loop, I could keep counting the unique coupons until trials == t, then return the number x out of n tickets collected after t trials. So far, all R has done is stall indefinitely.

Collectathon2 <- function(n, t){
      Coupons <- 1:n
      Collected_Coupons <- c()
      Trials <- 0
      
      while (length(Trials) < t) {
        New_Coupon <- sample(n, 1)
        Collected_Coupons <- unique(c(Collected_Coupons, New_Coupon))
        Unique_Coupons <- length(Collected_Coupons)
        Trials <- Trials + 1
      }
      return(Unique_Coupons)
    }

I assumed I would only need to modify this function below, which will run until it gets all of the tickets. This one works and spits a number out pretty quickly.

Collectathon <- function(n){
      Collected_Coupons <- c()
      Trials <- 0
      
      while (length(Collected_Coupons) < n) {
        New_Coupon <- sample(n, 1)
        Collected_Coupons <- unique(c(Collected_Coupons, New_Coupon))
        Trials <- Trials + 1
        }
      return(Trials)
    }

Original Q&A

There are 3 best solutions below

Rui Barradas On 22 November 2023 at 09:13

The following function solves the problem though its output is not exactly what is asked for.

Collectathon <- function(n, total){
  Collected_Coupons <- logical(n)
  Trials <- 0L
  
  while (!all(Collected_Coupons) && Trials < total) {
    Trials <- Trials + 1L
    New_Coupon <- sample(n, 1)
    Collected_Coupons[New_Coupon] <- TRUE
  }
  
  list(
    All = (Trials * all(Collected_Coupons)) |> as.logical(),
    Collected = sum(Collected_Coupons),
    Trials = Trials
  )
}

Collectathon(10, 30)
#> $All
#> [1] TRUE
#> 
#> $Collected
#> [1] 10
#> 
#> $Trials
#> [1] 20
Collectathon(10, 50)
#> $All
#> [1] TRUE
#> 
#> $Collected
#> [1] 10
#> 
#> $Trials
#> [1] 23
Collectathon(100, 200)
#> $All
#> [1] FALSE
#> 
#> $Collected
#> [1] 88
#> 
#> $Trials
#> [1] 200
Collectathon(1000, 5000)
#> $All
#> [1] FALSE
#> 
#> $Collected
#> [1] 992
#> 
#> $Trials
#> [1] 5000

^{Created on 2023-11-22 with reprex v2.0.2}

ThomasIsCoding On 22 November 2023 at 15:28

You can try a recursive approach if you are interested

f <- function(n, s = c(), k = 0) {
  if (length(s) == n) {
    return(list(coupon = s, iter = k))
  }
  x <- sample.int(n, 1)
  Recall(n, c(s, x[!x %in% s]), k + 1)
}

such that

> f(5)
$coupon
[1] 3 1 4 2 5

$iter
[1] 8


> f(10)
$coupon
 [1]  5  4  7  8  2  1  3 10  6  9

$iter
[1] 30

where $coupon records the valid unique coupons in the collected order, and $iter shows the number of trials.

**jblood94** · Accepted Answer · 2023-11-22T14:30:15.643000

If the coupons all have an equal probability of being drawn on each draw, then they are indistinguishable, and the only interesting thing is the number of unique coupons (k) after t draws. Once that number is known, one can easily simulate which coupons were collected with sample(k, t).

A naive function to return the result of n simulations of k (the number of unique coupons collected from a set of N unique coupons after t random uniform draws):

library(data.table) # for `uniqueN`

rcoupon1 <- function(n, N, t) replicate(n, uniqueN(sample(N, t, 1)))

If many samples of k are desired for the same N and t, it will be more efficient to first calculate the full probability mass function of k and use that to sample k. This CV answer provides a way to efficiently calculate the PMF of k.

library(copula)

Rcpp::cppFunction("
  NumericVector dcoupon_exact(const int& N, const int& t) {
    int maxk;
    int n1 = N - 1;
    if (N > t) {
      maxk = t;
    } else {
      maxk = N;
    }
    
    NumericVector k (maxk);
    k(0) = 1;
    
    for (int i = 1; i < maxk; i++) {
      for (int j = i; j > 0; j--) {
        k(j) = (k(j - 1)*(N - j) + k(j)*(j + 1))/N;
      }
      k(0) = k(0)/N;
    }
    
    for (int i = maxk; i < t; i++) {
      for (int j = n1; j > 0; j--) {
        k(j) = (k(j - 1)*(N - j) + k(j)*(j + 1))/N;
      }
      k(0) = k(0)/N;
    }
    
    return k;
  }
")

dcoupon <- function(N, t, exact = TRUE) {
  l <- min(t, N)
  k <- 1:l
  if (t < 200) {
    logS <- log(Stirling2.all(t)[k])
  } else {
    if (exact) return(dcoupon_exact(N, t))
    # estimate the log Stirling numbers
    v <- t/k
    G <- 1/v
    vexpv <- v/exp(v)
    # Newton-Raphson
    for (i in 1:5) G <- G - (G - (vexpG <- vexpv*exp(G)))/(1 - vexpG)
    logS <- (log(v - 1) - log(v*(1 - G)))/2 +
      (t - k)*(log(v - 1) - log(v - G)) + t*log(k) - k*log(t) + k*(1 - G) +
      lgamma(t + 1) - lgamma(k + 1) - lgamma(t - k + 1)
    if (l == t) logS[t] <- 0
  }
  pmf <- exp(logS + lgamma(N + 1) - lgamma(N - k + 1) - t*log(N))
  pmf/sum(pmf)
}

rcoupon2 <- function(n, N, t, exact = TRUE) sample(N, n, 1, dcoupon(N, t, exact))

Benchmarking n = 1e4, N = 500, t = 1000:

set.seed(39634386)

microbenchmark::microbenchmark(
  rcoupon1 = {k1 <- rcoupon1(1e4, 500L, 1e3L)},
  rcoupon2.exact = {k2 <- rcoupon2(1e4, 500L, 1e3L)},
  rcoupon2.approx = {k3 <- rcoupon2(1e4, 500L, 1e3L, FALSE)}
)
#> Unit: microseconds
#>             expr       min         lq        mean     median        uq        max neval
#>         rcoupon1 1141546.6 1208177.20 1390417.264 1248621.90 1378973.1  4027339.8   100
#>   rcoupon2.exact    4246.5    4340.90    5442.147    4411.35    5312.8    21677.0   100
#>  rcoupon2.approx     653.0     689.35    1755.478     712.10     770.2    97984.3   100

Quick check that the distributions of k are the same:

ks.test(k1, k2)
#> Warning in ks.test.default(k1, k2): p-value will be approximate in the presence
#> of ties
#> 
#>  Asymptotic two-sample Kolmogorov-Smirnov test
#> 
#> data:  k1 and k2
#> D = 0.0098, p-value = 0.7229
#> alternative hypothesis: two-sided
ks.test(k1, k3)
#> Warning in ks.test.default(k1, k3): p-value will be approximate in the presence
#> of ties
#> 
#>  Asymptotic two-sample Kolmogorov-Smirnov test
#> 
#> data:  k1 and k3
#> D = 0.0072, p-value = 0.9578
#> alternative hypothesis: two-sided

Timing a larger problem (n = N = t = 1e4):

system.time(rcoupon1(1e4, 1e4L, 1e4L))
#>    user  system elapsed 
#>   10.89    0.51   12.11
system.time(rcoupon2(1e4, 1e4L, 1e4L))
#>    user  system elapsed 
#>    0.61    0.00    0.63
system.time(rcoupon2(1e4, 1e4L, 1e4L, FALSE))
#>    user  system elapsed 
#>    0.02    0.00    0.02

The approximate PMF (which is a very good approximation for large N and t) can easily handle a very large sample of k for a very large coupon collection problem. For example, let's take 1M samples of k for 10M draws from a set of 1M coupons (n = 1e6, N = 1e6, t = 1e7):

system.time(rcoupon2(1e6, 1e6, 1e7, FALSE))
#>    user  system elapsed 
#>    1.18    0.02    1.21

Performing the same calculation using the naive function would take about a week and a half on my (very old) machine:

system.time(rcoupon1(10, 1e6, 1e7))
#>    user  system elapsed 
#>   12.07    0.43    9.71

R Function For Coupon Collector's Function with Finite Number of Trials

There are 3 best solutions below

Related Questions in R

Related Questions in FUNCTION

Related Questions in STATISTICS

Related Questions in STATISTICS-BOOTSTRAP

Trending Questions

Popular # Hahtags

Popular Questions