I'm trying to understand something about how the prior distribution is combined with the likelihood to get the posterior distribution in Bayesian linear regression.
For context, I fit stan_glm() to the mtcars dataset and got these posterior medians for the intercept and slope: 37.3 and -5.3, respectively.
library(rstanarm)

data(mtcars)
fit <- stan_glm(mpg ~ wt, data = mtcars)
print(fit)
And here is the output.
stan_glm
 family:       gaussian [identity]
 formula:      mpg ~ wt
 observations: 32
 predictors:   2
------
            Median MAD_SD
(Intercept) 37.3    2.0
wt          -5.3    0.6

Auxiliary parameter(s):
      Median MAD_SD
sigma 3.1    0.4
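For reference, I can reproduce that Median/MAD_SD table from the posterior draws myself; as far as I understand, as.matrix() on the fitted object returns one column of draws per parameter.
draws <- as.matrix(fit)   # columns: (Intercept), wt, sigma
apply(draws, 2, median)   # the Median column
apply(draws, 2, mad)      # the MAD_SD column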
Now I created a new data frame, which I'm calling new_cars_1990s.
new_cars_1990s <- data.frame(
car = c("Toyota Camry", "Honda Accord", "Ford Taurus", "Chevrolet Lumina", "Nissan Maxima",
"Mitsubishi Galant", "Subaru Legacy", "Volkswagen Passat", "Buick Regal", "Mercedes-Benz E-Class",
"Acura Legend", "BMW 3 Series", "Lexus LS 400", "Ford Mustang", "Chevrolet Corvette",
"Mazda RX-7", "Toyota Supra", "Nissan 240SX", "Honda Prelude", "Volkswagen Golf",
"Volvo 850", "Audi 100", "Mitsubishi Eclipse", "Pontiac Grand Am", "Ford Explorer",
"Jeep Cherokee", "Toyota 4Runner", "Chevrolet Blazer", "GMC Jimmy", "Nissan Pathfinder"),
wt = c(3.2, 2.9, 3.5, 3.8, 3.2, 3.1, 3.0, 3.2, 3.6, 3.9,
3.3, 3.0, 3.7, 3.5, 3.4, 2.8, 3.1, 3.2, 2.9, 2.8,
       2.5, 3.4, 3.0, 2.7, 3.1, 4.0, 4.2, 3.8, 4.5, 4.3), # hypothetical weights in 1,000 lbs (same units as mtcars$wt)
mpg = c(28, 30, 25, 23, 27, 26, 29, 28, 24, 22,
26, 28, 24, 18, 15, 19, 22, 21, 25, 26,
32, 23, 26, 28, 24, 17, 16, 18, 15, 16))
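Before bringing in any priors, I also fit plain least squares on this data, just to see what the data alone say about the slope (I'll reuse this fit further down):
ols <- lm(mpg ~ wt, data = new_cars_1990s)
coef(ols)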
And I'm wondering if it's possible to build up the likelihood and the priors by myself, and then combine them, using the fact that the posterior is proportional to the likelihood times the prior (on the log scale, log posterior = log likelihood + log prior, up to a constant), to get the two posterior distributions: one for the intercept and one for the slope.
I know the approximate answer I'm looking for, because I can run stan_glm() on the new_cars_1990s dataset and set my priors with the prior_intercept and prior arguments. The posterior should be centered near -5.6 for the wt coefficient and 43.6 for the intercept.
fit2 <- stan_glm(mpg ~ wt, data = new_cars_1990s,
                 prior_intercept = normal(location = 37.3, scale = 2),
                 prior = normal(location = -5.3, scale = 0.6))
print(fit2)
stan_glm
 family:       gaussian [identity]
 formula:      mpg ~ wt
 observations: 30
 predictors:   2
------
            Median MAD_SD
(Intercept) 43.6    2.0
wt          -5.6    0.6

Auxiliary parameter(s):
      Median MAD_SD
sigma 3.7    0.6
Here is the part where I have a question and get lost. I build a regression line from the parameters of my priors to predict mpg, and then I try to evaluate the likelihood of each observed mpg given that prediction, together with the prior density for the slope.
library(dplyr)

prior_intercept         <- 37.3
prior_slope             <- -5.3
prior_sd_for_slope      <- 0.8895613  # SD converted from MAD_SD = 0.6 (see the final note below)
prior_sd_for_regression <- 3.1        # sigma (posterior median) from the mtcars fit

new_cars_1990s %>%
  # predict mpg from the prior's point estimates
  mutate(predicted_mpg = prior_intercept + prior_slope * wt) %>%
  # likelihood of each observed mpg given that prediction
  mutate(likelihood = dnorm(mpg, mean = predicted_mpg, sd = prior_sd_for_regression)) %>%
  mutate(log_likelihood = log(likelihood)) %>%
  # prior density for the slope, evaluated at -5.6 (the answer I expect)
  mutate(prior_for_slope_as_prob = dnorm(-5.6, mean = prior_slope, sd = prior_sd_for_slope)) %>%
  mutate(log_prior_for_slope_as_prob = log(prior_for_slope_as_prob)) %>%
  mutate(log_posterior_for_slope = log_likelihood + log_prior_for_slope_as_prob) %>%
  mutate(posterior_for_slope = exp(log_posterior_for_slope)) %>%
  summarize(median_posterior_for_slope = median(posterior_for_slope))
So far this output is giving me a number that is way off. I'm expecting something near -5.6, but I'm getting a small number close to zero, which looks more like a density value than a slope.
  median_posterior_for_slope
1                 0.01645194
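As a rough cross-check on where the slope posterior should land, I also tried the textbook normal-normal conjugate update, treating the least-squares slope and its standard error (from the ols fit above) as the data and ignoring the correlation with the intercept, so I only expect the right ballpark rather than the stan_glm() number:
# least-squares slope and its standard error
b_hat <- coef(ols)["wt"]
se_b  <- summary(ols)$coefficients["wt", "Std. Error"]

# precision-weighted combination of the prior and the data
post_prec <- 1 / prior_sd_for_slope^2 + 1 / se_b^2
post_mean <- (prior_slope / prior_sd_for_slope^2 + b_hat / se_b^2) / post_prec
post_sd   <- sqrt(1 / post_prec)
c(post_mean, post_sd)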
I'm not expecting to get exactly the same number as stan_glm(), since what I'm doing is presumably not the most advanced technique, and I'm not trying to re-create whatever stan_glm() does under the hood. But I am trying to develop some confidence that I understand what is happening here.
Is there a way to change this final section of code so that I get an accurate likelihood for each point given candidate values of the slope and intercept, combine that with the density from my prior, and end up with the posterior, all within one chunk of dplyr code? I know that, practically speaking, this is no man's land, but I'm doing it for learning!
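For concreteness, here is the shape of the computation I think I'm after: a grid of candidate slopes, the log-likelihood of all 30 observations at each candidate, plus the log-prior for that candidate, normalized at the end. This sketch is my own attempt; it pins the intercept and sigma at their prior medians for simplicity, so I wouldn't expect it to match the joint stan_glm() fit, but is this the right general idea?
tibble(slope = seq(-8, -3, by = 0.01)) %>%
  rowwise() %>%
  # log-likelihood of all 30 observations at this candidate slope
  mutate(log_lik = sum(dnorm(new_cars_1990s$mpg,
                             mean = prior_intercept + slope * new_cars_1990s$wt,
                             sd = prior_sd_for_regression,
                             log = TRUE))) %>%
  ungroup() %>%
  # log-prior density of each candidate slope
  mutate(log_prior = dnorm(slope, mean = prior_slope, sd = prior_sd_for_slope, log = TRUE),
         log_post  = log_lik + log_prior,
         # normalize over the grid (subtract the max first for numerical stability)
         post = exp(log_post - max(log_post)),
         post = post / sum(post)) %>%
  summarize(posterior_mode = slope[which.max(post)],
            posterior_mean = sum(slope * post))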
As a final note, you can see that I'm passing an SD to dnorm() rather than the MAD, because a normal distribution is parameterized by its mean and standard deviation; I derive the SD from the MAD by dividing by qnorm(0.75). It's interesting that stan_glm() uses a normal distribution for the prior but reports its statistics as a median and MAD_SD.
median_value <- -5.3
mad_value    <- 0.6

# for a normal distribution, SD = MAD / qnorm(0.75)
z_score_75     <- qnorm(0.75)
sigma_estimate <- mad_value / z_score_75
mean_estimate  <- median_value  # mean equals median for a symmetric distribution

cat("Estimated Mean:", mean_estimate, "\n")
cat("Estimated Standard Deviation:", sigma_estimate, "\n")
Estimated Mean: -5.3
Estimated Standard Deviation: 0.8895613