I'm trying to understand something about how the prior distribution is combined with the likelihood to get the posterior distribution in Bayesian linear regression.
For context, I fit stan_glm() to the mtcars dataset and got these posterior medians for the intercept and slope: 37.3 and -5.3, respectively.
library(rstanarm)

data(mtcars)
fit <- stan_glm(mpg ~ wt, data = mtcars)
print(fit)
And here is the output.
stan_glm
 family:       gaussian [identity]
 formula:      mpg ~ wt
 observations: 32
 predictors:   2
------
            Median MAD_SD
(Intercept) 37.3    2.0
wt          -5.3    0.6

Auxiliary parameter(s):
      Median MAD_SD
sigma 3.1    0.4
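For reference, I can reproduce that Median/MAD_SD table from the posterior draws myself; as far as I understand, as.matrix() on the fitted object returns one column of draws per parameter.
draws <- as.matrix(fit)   # columns: (Intercept), wt, sigma
apply(draws, 2, median)   # the Median column
apply(draws, 2, mad)      # the MAD_SD column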
Now I created a new data frame, which I'm calling new_cars_1990s.
new_cars_1990s <- data.frame(
car = c("Toyota Camry", "Honda Accord", "Ford Taurus", "Chevrolet Lumina", "Nissan Maxima",
"Mitsubishi Galant", "Subaru Legacy", "Volkswagen Passat", "Buick Regal", "Mercedes-Benz E-Class",
"Acura Legend", "BMW 3 Series", "Lexus LS 400", "Ford Mustang", "Chevrolet Corvette",
"Mazda RX-7", "Toyota Supra", "Nissan 240SX", "Honda Prelude", "Volkswagen Golf",
"Volvo 850", "Audi 100", "Mitsubishi Eclipse", "Pontiac Grand Am", "Ford Explorer",
"Jeep Cherokee", "Toyota 4Runner", "Chevrolet Blazer", "GMC Jimmy", "Nissan Pathfinder"),
wt = c(3.2, 2.9, 3.5, 3.8, 3.2, 3.1, 3.0, 3.2, 3.6, 3.9,
3.3, 3.0, 3.7, 3.5, 3.4, 2.8, 3.1, 3.2, 2.9, 2.8,
       2.5, 3.4, 3.0, 2.7, 3.1, 4.0, 4.2, 3.8, 4.5, 4.3), # hypothetical weights in 1,000 lbs (same units as mtcars$wt)
mpg = c(28, 30, 25, 23, 27, 26, 29, 28, 24, 22,
26, 28, 24, 18, 15, 19, 22, 21, 25, 26,
32, 23, 26, 28, 24, 17, 16, 18, 15, 16))
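Before bringing in any priors, I also fit plain least squares on this data, just to see what the data alone say about the slope (I'll reuse this fit further down):
ols <- lm(mpg ~ wt, data = new_cars_1990s)
coef(ols)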
And I'm wondering if it's possible to build up the likelihood and the priors by myself, and then combine them, using the fact that the posterior is proportional to the likelihood times the prior (on the log scale, log posterior = log likelihood + log prior, up to a constant), to get the two posterior distributions: one for the intercept and one for the slope.
I know the approximate answer I'm looking for, because I can run stan_glm() on the new_cars_1990s dataset and set my priors with the prior_intercept and prior arguments. The posterior should be centered near -5.6 for the wt coefficient and 43.6 for the intercept.
fit2 <- stan_glm(mpg ~ wt, data = new_cars_1990s,
                 prior_intercept = normal(location = 37.3, scale = 2),
                 prior = normal(location = -5.3, scale = 0.6))
print(fit2)
stan_glm
 family:       gaussian [identity]
 formula:      mpg ~ wt
 observations: 30
 predictors:   2
------
            Median MAD_SD
(Intercept) 43.6    2.0
wt          -5.6    0.6

Auxiliary parameter(s):
      Median MAD_SD
sigma 3.7    0.6
Here is the part where I have a question and get lost. I build a regression line from the parameters of my priors to predict mpg, and then I try to evaluate the likelihood of each observed mpg given that prediction, together with the prior density for the slope.
library(dplyr)

prior_intercept         <- 37.3
prior_slope             <- -5.3
prior_sd_for_slope      <- 0.8895613  # SD converted from MAD_SD = 0.6 (see the final note below)
prior_sd_for_regression <- 3.1        # sigma (posterior median) from the mtcars fit

new_cars_1990s %>%
  # predict mpg from the prior's point estimates
  mutate(predicted_mpg = prior_intercept + prior_slope * wt) %>%
  # likelihood of each observed mpg given that prediction
  mutate(likelihood = dnorm(mpg, mean = predicted_mpg, sd = prior_sd_for_regression)) %>%
  mutate(log_likelihood = log(likelihood)) %>%
  # prior density for the slope, evaluated at -5.6 (the answer I expect)
  mutate(prior_for_slope_as_prob = dnorm(-5.6, mean = prior_slope, sd = prior_sd_for_slope)) %>%
  mutate(log_prior_for_slope_as_prob = log(prior_for_slope_as_prob)) %>%
  mutate(log_posterior_for_slope = log_likelihood + log_prior_for_slope_as_prob) %>%
  mutate(posterior_for_slope = exp(log_posterior_for_slope)) %>%
  summarize(median_posterior_for_slope = median(posterior_for_slope))
So far this output is giving me a number that is way off. I'm expecting something near -5.6, but I'm getting a small number close to zero, which looks more like a density value than a slope.
  median_posterior_for_slope
1                 0.01645194
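As a rough cross-check on where the slope posterior should land, I also tried the textbook normal-normal conjugate update, treating the least-squares slope and its standard error (from the ols fit above) as the data and ignoring the correlation with the intercept, so I only expect the right ballpark rather than the stan_glm() number:
# least-squares slope and its standard error
b_hat <- coef(ols)["wt"]
se_b  <- summary(ols)$coefficients["wt", "Std. Error"]

# precision-weighted combination of the prior and the data
post_prec <- 1 / prior_sd_for_slope^2 + 1 / se_b^2
post_mean <- (prior_slope / prior_sd_for_slope^2 + b_hat / se_b^2) / post_prec
post_sd   <- sqrt(1 / post_prec)
c(post_mean, post_sd)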
I'm not expecting to get exactly the same number as stan_glm(), since what I'm doing is presumably not the most advanced technique, and I'm not trying to re-create whatever stan_glm() does under the hood. But I am trying to develop some confidence that I understand what is happening here.
Is there a way to change this final section of code so that I get an accurate likelihood for each point given candidate values of the slope and intercept, combine that with the density from my prior, and end up with the posterior, all within one chunk of dplyr code? I know that, practically speaking, this is no man's land, but I'm doing it for learning!
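For concreteness, here is the shape of the computation I think I'm after: a grid of candidate slopes, the log-likelihood of all 30 observations at each candidate, plus the log-prior for that candidate, normalized at the end. This sketch is my own attempt; it pins the intercept and sigma at their prior medians for simplicity, so I wouldn't expect it to match the joint stan_glm() fit, but is this the right general idea?
tibble(slope = seq(-8, -3, by = 0.01)) %>%
  rowwise() %>%
  # log-likelihood of all 30 observations at this candidate slope
  mutate(log_lik = sum(dnorm(new_cars_1990s$mpg,
                             mean = prior_intercept + slope * new_cars_1990s$wt,
                             sd = prior_sd_for_regression,
                             log = TRUE))) %>%
  ungroup() %>%
  # log-prior density of each candidate slope
  mutate(log_prior = dnorm(slope, mean = prior_slope, sd = prior_sd_for_slope, log = TRUE),
         log_post  = log_lik + log_prior,
         # normalize over the grid (subtract the max first for numerical stability)
         post = exp(log_post - max(log_post)),
         post = post / sum(post)) %>%
  summarize(posterior_mode = slope[which.max(post)],
            posterior_mean = sum(slope * post))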
As a final note, you can see that I'm passing an SD to dnorm() rather than the MAD, because a normal distribution is parameterized by its mean and standard deviation; I derive the SD from the MAD by dividing by qnorm(0.75). It's interesting that stan_glm() uses a normal distribution for the prior but reports its statistics as a median and MAD_SD.
median_value <- -5.3
mad_value    <- 0.6

# for a normal distribution, SD = MAD / qnorm(0.75)
z_score_75     <- qnorm(0.75)
sigma_estimate <- mad_value / z_score_75
mean_estimate  <- median_value  # mean equals median for a symmetric distribution

cat("Estimated Mean:", mean_estimate, "\n")
cat("Estimated Standard Deviation:", sigma_estimate, "\n")
Estimated Mean: -5.3
Estimated Standard Deviation: 0.8895613