A neat way to visualize three categorical variables (each with more than 10 levels) is to draw, for every combination of var2 and var3, a stacked bar showing the (weighted) proportions of the levels of var1. The result is a grid of length(levels(var2)) x length(levels(var3)) cells, with as many colors as length(levels(var1)).
Let's call these variables fct1, fct2, fct3. An easy solution would be something like:
library(tidyverse)  # tibble, dplyr, tidyr (crossing, uncount) and ggplot2

# Toy data: 4 levels of fct1, 10 of fct2 and 20 of fct3
data <- tibble(
  a = c(5, 6, 7, 12, 5, 6, 7),
  fct1 = paste0("type", c("a", "b", "c", "d", "a", "b", "c")),
  fct2 = paste0("lvl", c(1, 1, 1, 1, 2, 2, 2)),
  fct3 = paste0("system", c(1, 2, 2, 2, 1, 2, 2))
) %>%
  crossing(fct2_suffix = 0:4, fct3_suffix = 0:9) %>%
  mutate(
    fct2 = paste0(fct2, fct2_suffix),
    fct3 = paste0(fct3, fct3_suffix)
  ) %>%
  select(-c(fct2_suffix, fct3_suffix)) %>%
  uncount(a)  # one row per observation

# One stacked bar per facet, filled to 100%
data %>%
  ggplot() +
  geom_bar(aes(y = 0, fill = fct1), position = "fill") +
  facet_grid(fct3 ~ fct2)
However, with many levels (here 10 x 20 = 200 panels), faceting is slow. I would like to produce such a chart without using faceting at all, which would also leave facets free for a potential 4th and 5th categorical variable.
I would like to have a geom_*() function so that this is more flexible, but I don't know where to start.
Ideally, it would look like:
# geom_col_grid() does not exist yet; this is the interface I have in mind
data %>%
  ggplot() +
  geom_col_grid(aes(x = fct2, y = fct3, fill = fct1), position = "fill") # +
  # facet_grid(fct4 ~ fct5) # potentially 4th and 5th var
I wrote a function that manually computes where each bar starts and ends and passes the result to geom_rect(). This works, but it is not as flexible as a geom. Here is the code. The padding argument sets the distance between bars; var4 and var5 are for faceting and can be omitted.
plot_crosstab <- function(data, var1, var2, var3, var4, var5, padding = 0.1) {
  # Fall back to equal weights when the data has no `weight` column
  if (!("weight" %in% names(data))) {
    data <- data %>% mutate(weight = 1)
    cli::cli_alert_info("No variable 'weight' in data: assumed equal weights")
  }
  # Missing facet variables become constant dummy columns
  if (missing(var4)) {
    var4 <- quo(var4)
    data <- data %>% mutate(var4 = "total")
  }
  if (missing(var5)) {
    var5 <- quo(var5)
    data <- data %>% mutate(var5 = "total")
  }

  build_data <- data %>%
    mutate(across(c({{ var5 }}, {{ var4 }}, {{ var3 }}, {{ var2 }}, {{ var1 }}), as.factor)) %>%
    group_by(across(c({{ var5 }}, {{ var4 }}, {{ var3 }}, {{ var2 }}, {{ var1 }}))) %>%
    # Weighted count per level of var1 within each cell
    summarise(n = sum(weight, na.rm = TRUE), .groups = "drop_last") %>%
    # Share within each cell, scaled so that a full cell spans 1 - padding
    mutate(frac = n / sum(n, na.rm = TRUE) * (1 - padding)) %>%
    arrange(desc(frac)) %>%
    ungroup() %>%
    # Empty cells get a single full-width row
    complete({{ var5 }}, {{ var4 }}, {{ var3 }}, {{ var2 }}, fill = list(n = 0, frac = 1 - padding)) %>%
    group_by(across(c({{ var5 }}, {{ var4 }}, {{ var3 }}, {{ var2 }}))) %>%
    # Add the padding once per cell, on its first bar
    mutate(v_padding = if_else(row_number() == 1, padding, 0)) %>%
    group_by(across(c({{ var5 }}, {{ var4 }}, {{ var3 }}))) %>%
    # Accumulate along each grid row so that cell k is centred on x = k - 1
    mutate(
      pos_left = -0.5 - (padding / 2) + lag(cumsum(frac), default = 0) + cumsum(v_padding),
      pos_right = -0.5 - (padding / 2) + cumsum(frac) + cumsum(v_padding)
    ) %>%
    ungroup() %>%
    # Row k spans [k + padding / 2, k + 1 - padding / 2]
    mutate(
      pos_low = as.numeric(factor({{ var3 }})) + (padding / 2),
      pos_high = pos_low + (1 - padding)
    )

  x_levels <- levels(pull(build_data, {{ var2 }}))
  y_levels <- levels(pull(build_data, {{ var3 }}))

  build_data %>%
    ggplot() +
    geom_rect(aes(xmin = pos_left, xmax = pos_right, ymin = pos_low, ymax = pos_high, fill = {{ var1 }})) +
    facet_grid(rows = vars({{ var4 }}), cols = vars({{ var5 }})) +
    # Put the axis labels at the centre of each cell
    scale_x_continuous(breaks = seq_along(x_levels) - 1, labels = x_levels) +
    scale_y_continuous(breaks = seq_along(y_levels) + 0.5, labels = y_levels)
}
# var1 = fill, var2 = columns, var3 = rows, matching the faceted chart above
data %>%
  plot_crosstab(fct1, fct2, fct3)
This looks similar to the original chart while being much faster to render. However, it is not integrated into the ggplot2 workflow.
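To make the position arithmetic in plot_crosstab() concrete, here is a minimal base-R sketch for a single grid row. The proportions in cells are made up for illustration, and the names cells, gap and offset are hypothetical (they do not appear in the function); the formulas mirror pos_left and pos_right.

```r
# Hypothetical proportions for one grid row: two cells, each split by the fill variable
padding <- 0.1
cells <- list(c(0.5, 0.3, 0.2), c(0.6, 0.4))

# Scale each cell's proportions so that its bars span 1 - padding in total
frac <- unlist(lapply(cells, function(p) p * (1 - padding)))

# Add the padding once per cell, on that cell's first bar
gap <- unlist(lapply(cells, function(p) c(padding, rep(0, length(p) - 1))))

# Shift everything so that the first cell is centred on x = 0
offset <- -0.5 - padding / 2
xmax <- offset + cumsum(frac) + cumsum(gap)
xmin <- xmax - frac  # same as using lag(cumsum(frac), default = 0)

data.frame(cell = rep(seq_along(cells), lengths(cells)), xmin, xmax)
```

With these numbers the first cell runs from -0.45 to 0.45 and the second from 0.55 to 1.45, so each cell is centred on an integer, which is why the x breaks are placed at 0, 1, 2, and so on.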