I'm trying to write up a function that runs the Coupon Collector's Problem in R. I'm trying to make a function that will spit out the number of unique coupons collected from sample size n with a finite number of trials t. (For example, having 500 unique coupons in a sample space, and picking a coupon from it 1000 times with replacement.)
I was able to write a similar function that returns how many times you have to sample it to collect ALL tickets. However, that function only requires specifying n, and does not account for the finite number of trials.
So far, this is how I have attempted to go about writing this function, with the function calling for n and t.
I assumed that if the length of the trials was less than t for the while loop, I could keep counting the unique coupons until trials == t, then return the number x out of n tickets collected after t trials. So far, all R has done is stall indefinitely.
Collectathon2 <- function(n, t){
Coupons <- 1:n
Collected_Coupons <- c()
Trials <- 0
while (length(Trials) < t) {
New_Coupon <- sample(n, 1)
Collected_Coupons <- unique(c(Collected_Coupons, New_Coupon))
Unique_Coupons <- length(Collected_Coupons)
Trials <- Trials + 1
}
return(Unique_Coupons)
}
I assumed I would only need to modify this function below, which will run until it gets all of the tickets. This one works and spits a number out pretty quickly.
Collectathon <- function(n){
Collected_Coupons <- c()
Trials <- 0
while (length(Collected_Coupons) < n) {
New_Coupon <- sample(n, 1)
Collected_Coupons <- unique(c(Collected_Coupons, New_Coupon))
Trials <- Trials + 1
}
return(Trials)
}
If the coupons all have an equal probability of being drawn on each draw, then they are indistinguishable, and the only interesting thing is the number of unique coupons (
k) aftertdraws. Once that number is known, one can easily simulate which coupons were collected withsample(k, t).A naive function to return the result of
nsimulations ofk(the number of unique coupons collected from a set ofNunique coupons aftertrandom uniform draws):If many samples of
kare desired for the sameNandt, it will be more efficient to first calculate the full probability mass function ofkand use that to samplek. This CV answer provides a way to efficiently calculate the PMF ofk.Benchmarking
n = 1e4,N = 500,t = 1000:Quick check that the distributions of
kare the same:Timing a larger problem (
n = N = t = 1e4):The approximate PMF (which is a very good approximation for large
Nandt) can easily handle a very large sample ofkfor a very large coupon collection problem. For example, let's take 1M samples ofkfor 10M draws from a set of 1M coupons (n = 1e6,N = 1e6,t = 1e7):Performing the same calculation using the naive function would take about a week and a half on my (very old) machine: