How do I simulate Bayes' theorem in R to get a sense of uncertainty?

Okay, so Bayes' theorem is p(A|B) = p(A) * p(B|A) / p(B).

I want to simulate p(A|B) using all the uncertainty surrounding p(A), p(B|A) and p(B).

rbeta seems like a good choice.

The data: imagine I just spent 20 days in a new location and have a sense of the daily weather.

library(tidyverse)
set.seed(4434)
sims <- 1000 #let's simulate each of these proportions 1000 times
p_rain <- rbeta(sims, 5, 15)         #p(rain): it rained 5 times in the 20 days
p_overcast_rain <- rbeta(sims, 4, 1) #p(overcast | rain): of the 5 times it rained, it wasn't completely overcast once
p_overcast <- rbeta(sims, 10, 10)    #p(overcast): half of the days I was there were overcast
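
A note on the shape parameters: they're just the raw counts of successes and failures, so each Beta centres on the observed proportion, e.g. Beta(5, 15) has mean 5 / (5 + 15) = 0.25. A quick sanity check:

#the Beta(5, 15) draws should centre near 5/20 = 0.25
mean(rbeta(1e5, 5, 15))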

That should give me everything I need to calculate p(A|B), here p(rain | overcast). Let's calculate it and take a quick look:

data <- tibble(p_rain, p_overcast, p_overcast_rain) %>% 
  mutate(numerator = p_rain * p_overcast_rain,
         p_rain_overcast = numerator / p_overcast)


summary(data$p_rain_overcast)

table(data$p_rain_overcast > 1)
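
For reference, the plug-in point estimate from the raw proportions, which the simulated distribution should centre near:

#point estimate using the observed proportions directly
(5/20) * (4/5) / (10/20) #0.4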

The problem: I am getting results outside of the 0 to 1 bounds.

So, in 22 of the 1000 simulations the numerator > denominator, which would suggest the impossible: A and B happening together being more likely than B happening at all. With independent draws, this is clearly a possible result of the simulation approach.
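
One way to see why: the three quantities are drawn independently, but coherence requires p(overcast) >= p(rain) * p(overcast | rain), since p(overcast) = p(overcast | rain)p(rain) + p(overcast | no rain)p(no rain). Nothing in the simulation enforces that bound:

#count the draws that violate the coherence bound
sum(p_overcast < p_rain * p_overcast_rain) #should match the 22 impossible draws above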

The question: assuming this approach isn't completely silly (is it?), what do I do with values above 1 (or, more generally, outside of 0 to 1)?

**If you're curious what I did:** I can see three options, but I am not sure about their validity or pitfalls:

**1. Just set p_rain_overcast values > 1 to 1**

data %>% 
  mutate(p_rain_overcast = pmin(p_rain_overcast, 1)) %>% #truncate at 1
  reframe(
    quantiles = c(0, .1, .25, .5, .75, .9, 1),
    values = quantile(p_rain_overcast, quantiles))

[output: quantiles after truncation]

**2. Resimulate until the numerator is less than or equal to the denominator**

SimNumerator <- function(limit, p_a, p_ba){
  
  numerator <- p_a * p_ba
  
  #container for the accepted values
  fillThis <- vector("numeric", length = length(numerator))
  
  for (i in seq_along(numerator)) {
    #one draw at a time
    result <- numerator[i]
    
    #keep redrawing until the numerator is less than or equal to the denominator
    while (result > limit[i]) {
      prior <- rbeta(1, 5, 15)     #it rained 5 times in the 20 days
      likelihood <- rbeta(1, 4, 1) #of the 5 times it rained, it wasn't completely overcast once
      result <- prior * likelihood
    }
    fillThis[i] <- result
  }
  return(fillThis)
}


new_numerator <- SimNumerator(p_a = p_rain,
                              p_ba = p_overcast_rain,
                              limit = p_overcast)


data %>% 
  mutate(p_rain_overcast = new_numerator / p_overcast) %>%
  reframe(
    quantiles = c(0, .1, .25, .5, .75, .9, 1),
    values = quantile(p_rain_overcast, quantiles))


[output: quantiles after resimulating the numerator]

**3. Resimulate the denominator instead.** This is pretty much #2, but I resimulate until the denominator is greater than or equal to the numerator; it just changes which quantity gets redrawn with rbeta.
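
A sketch of what I mean, mirroring SimNumerator but redrawing p_overcast (the same Beta(10, 10) as above):

SimDenominator <- function(numerator, p_b){
  for (i in seq_along(p_b)) {
    #keep redrawing the denominator until it is at least the numerator
    while (p_b[i] < numerator[i]) {
      p_b[i] <- rbeta(1, 10, 10) #half of the days I was there were overcast
    }
  }
  return(p_b)
}

new_denominator <- SimDenominator(numerator = p_rain * p_overcast_rain,
                                  p_b = p_overcast)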

**Comparison**

[plot: p(rain | overcast) under the three approaches]

The interpretations are similar (after all, only 22 of the 1000 values changed). The truncation approach creates a spike at 1. The resampling approach has a downside: draws with a low denominator (p_overcast) or a high numerator are the ones most likely to get resampled, which might misrepresent some of the sampling distributions.
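
One quick way to see that distortion, comparing the unconstrained numerator draws with the ones accepted by the resampler (the accepted set should be shifted downward):

#unconstrained numerator draws vs. those accepted by resimulation
summary(p_rain * p_overcast_rain)
summary(new_numerator)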
