I am trying to reproduce this excellent data visualisation from the Economist https://www.economist.com/interactive/uk-general-election/build-a-voter
They have used multinomial logistic regression, and have created a "feature contribution" water-fall style chart to build-up voting intentions of defined demographic groups. I have a similar set of data (unrelated to politics!), and am trying to perform a similar analysis. The example below is a made-up, reproducible example in similar vein to the economist version. (Big thanks to chatgpt for writing the example quickly!)
Can you help me interpret the co-efficients from the model output so that I can calculate the "waterfall values" of each level of the predictor variables?
(NB. My understanding is that this is done in comparison with a reference class, which is what the model has done here. What I like about the economist version is that it presents as agnostic to a reference class, at least in the viz. I expect there is some dummy variable shenanigans going on, but I can't figure it out, and would love someone smarter to help!)
###### Load packages ######
require(tidyverse)
require(nnet)
###### Create example data set ######
set.seed(123) # Set seed for reproducibility
# Function to generate voting intention with different probabilities for each party
generate_voting_intention <- function(age_bracket, gender) {
# Define base probabilities for Labour by age bracket
base_prob_labour <- c("18-29"=0.45, "30-39"=0.40, "40-49"=0.38, "50-64"=0.36, "65+"=0.35)
# Adjust Labour probability by gender
gender_mod_labour <- ifelse(gender == "Female", 0.05, 0)
prob_labour <- base_prob_labour[age_bracket] + gender_mod_labour
# Conservative base probabilities (smaller than Labour's to ensure Labour majority)
base_prob_conservative <- c("18-29"=0.25, "30-39"=0.30, "40-49"=0.32, "50-64"=0.34, "65+"=0.35)
# Adjust Conservative probability by gender (no gender modification in this example)
prob_conservative <- base_prob_conservative[age_bracket]
# The remaining probability is for Lib Dems
prob_lib_dems <- 1 - prob_labour - prob_conservative
# Sample voting intention based on calculated probabilities
sample(c("Labour", "Conservative", "Lib Dems"), size = 1, prob = c(prob_labour, prob_conservative, prob_lib_dems))
}
# Create a data frame with the specified variables
create_survey_data <- function(n) {
age_brackets <- c("18-29", "30-39", "40-49", "50-64", "65+")
genders <- c("Male", "Female")
# Sample age and gender for each respondent
ages <- sample(age_brackets, n, replace = TRUE)
genders <- sample(genders, n, replace = TRUE)
# Generate voting intention for each respondent
voting_intention <- mapply(generate_voting_intention, age_bracket = ages, gender = genders)
# Combine into a data frame
data.frame(
Age = ages,
Gender = genders,
VotingIntention = voting_intention,
stringsAsFactors = FALSE
)
}
# Generate survey data for 1000 respondents
survey_data <- create_survey_data(1000)
###### Build Voting Intention model ######
vote_mod<-multinom(VotingIntention~Age + Gender, data=survey_data)
tidy_vote_mod<-tidy(vote_mod,exponentiate = TRUE)
I suspect this will be closed as it is mainly a programming question, but here is a quick answer.
That plot on the left side of that article is showing the actual model predictions, not just the coefficients. In order to get those, you need to do two steps. First you need to create a grid of data on which to evaluate your predictions.
You can do that with
expand.grid()(or the slightly more versatiletidyr::expand_grid()if you usetidyverse). This is like a data frame, but it "crosses" all of the inputs to create a data frame with all the combinations. For example,expand.grid(a = 1:2, b = 1:2)will give you a data frame with four rows that have all possible combinations of the values ofaandb. So for your problem, you would want something like this:and so on. Then you pass this as the
newdataargument topredict()like so:predict(vote_mod, newdata = prediction_grid, type = "response"). The argumenttype = responseis necessary to get outcomes that are probabilities so you don't have to backtransform them. This will give you a numeric matrix with one row for each row ofprediction_grid(this would be equal to the number of points in the triangle diagram on the right side of that article) and one column for each level of the outcome, where thei, jvalue is the probability of a person with the characteristics of theith row ofprediction_gridto get outcomej.You could also use
broom::augment()to get the predictions in a tidy format (rather thantidy()giving you the coefficients in a tidy format).