Neat visualization of 3 categorical variables (up to even 5!)


A neat way to visualize 3 categorical variables (with more than 10 levels each) is to plot a stacked bar showing the (weighted) proportion of the levels in var1 for every combination of var2 and var3. You will have a grid with a number of cells equal to length(levels(var2)) x length(levels(var3)) and as many colors as length(levels(var1)).

Let's call these variables fct1, fct2, fct3. An easy solution would be something like:

library(tidyverse)

data <- tibble(
  a = c(5, 6, 7, 12, 5, 6, 7),
  fct1 = paste0('type',c("a","b","c","d", "a","b","c")),
  fct2 = paste0('lvl',c(1,1,1,1,2,2,2)),
  fct3 = paste0('system', c(1,2,2,2,1,2,2)),
) %>% 
  crossing(fct2_suffix = 0:4, fct3_suffix = 0:9) %>% 
  mutate(
    fct2 = paste0(fct2, fct2_suffix),
    fct3 = paste0(fct3, fct3_suffix)
  ) %>% 
  select(-c(fct2_suffix, fct3_suffix)) %>% 
  uncount(a)

data %>% 
  ggplot() +
  geom_bar(aes(y = 0, fill=fct1), position = "fill") +
  facet_grid(fct3~fct2)

bar chart grid using facets

However, with many levels, faceting is slow. I would like to produce such a chart without using faceting at all (this would also leave facets free for a potential 4th and 5th categorical variable).

I would like to have a geom_*() function so that the chart is more flexible and composes with other ggplot2 layers, but I don't know where to start.

Ideally, it would look like:

data %>% 
  ggplot() +
  geom_col_grid(aes(x=fct1, y=fct2, fill=fct3), position = "fill")# +
  #facet_grid(fct4~fct5) #potentially 4th and 5th var 

I wrote a function that manually calculates the position at which each bar starts and ends and then passes those positions to geom_rect(). This works, it's just not as flexible as a geom. Here is the code (note the padding argument, which sets the distance between bars). var4 and var5 are used for faceting and may be omitted.

plot_crosstab <- function(data, var1, var2, var3, var4, var5, padding = 0.1){
  
  if(!("weight" %in% names(data))){
    data <- data %>% mutate(weight = 1) 
    cli::cli_alert_info("No variable 'weight' in data: assumed equal weights")
  }
  if(missing(var4)) {
    var4 <- quo(var4)
    data <- data %>% mutate(var4 = "total") 
  }
  if(missing(var5)){
    var5 <- quo(var5)
    data <- data %>% mutate(var5 = "total") 
  } 

  build_data <- 
    data %>%
    mutate(across(c({{var5}}, {{var4}}, {{var3}}, {{var2}}, {{var1}}), as.factor)) %>% 
    group_by(across(c({{var5}}, {{var4}}, {{var3}}, {{var2}}, {{var1}}))) %>% 
    summarise(
      n = sum(weight, na.rm = TRUE),
      .groups = "drop_last" #keep grouping by var5..var2 so frac is computed per cell
    ) %>% 
    mutate(
      frac = n/sum(n, na.rm = TRUE)*(1-padding) #so that it spans the right amount
    ) %>% 
    arrange(desc(frac)) %>% 
    ungroup() %>% 
    complete({{var5}}, {{var4}}, {{var3}}, {{var2}}, fill = list(n=0, frac=1-padding)) %>% 
    group_by(across(c({{var5}}, {{var4}}, {{var3}}, {{var2}}))) %>%
    mutate(
      v_padding = if_else(row_number()==1, padding, 0)
    ) %>% 
    group_by(across(c({{var5}}, {{var4}}, {{var3}}))) %>% 
    mutate(
      pos_left = -0.5 -(padding/2) + lag(cumsum(frac), default = 0) + cumsum(v_padding),
      pos_right = -0.5 -(padding/2) + cumsum(frac) + cumsum(v_padding)
    ) %>% 
    ungroup() %>% 
    mutate(
      pos_low = as.numeric(factor({{var3}})) + (padding/2),
      pos_high = pos_low + (1-padding)
    )

  build_data %>% 
    ggplot() +
    geom_rect(aes(xmin = pos_left, xmax = pos_right, ymin = pos_low, ymax = pos_high, fill = {{var1}})) +
    facet_grid(rows = vars({{var4}}), cols = vars({{var5}})) +
    scale_x_continuous(breaks = 1:length(levels(build_data %>% pull({{var2}})))-1, labels = levels(build_data %>% pull({{var2}}))) +
    scale_y_continuous(breaks = 1:(length(levels(build_data %>% pull({{var3}}))))+0.5, labels = levels(build_data %>% pull({{var3}})))
}

data %>% 
  plot_crosstab(fct1, fct2, fct3)

This looks similar to the original chart, while being much faster. It is, however, not integrated in the ggplot2 workflow.

bar chart grid NOT using facets
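
To make the position arithmetic easier to follow, here is a minimal base-R sketch of the same idea for a single row of the grid. It is an illustration only, not the function above: the data frame `cells` and the columns `col`, `fill`, `n`, `xmin` and `xmax` are made-up names, and it assumes cell i is centred on the integer i (the function above instead centres the first cell on 0).

```r
# Toy counts for one row of the grid (hypothetical data).
cells <- data.frame(
  col  = c("lvl1", "lvl1", "lvl2", "lvl2"),  # category shown along the x axis
  fill = c("a", "b", "a", "b"),              # category shown as stacked colours
  n    = c(1, 3, 2, 2)                       # (weighted) counts
)
padding <- 0.1

# Share of each fill level within its cell, shrunk to leave room for padding.
cells$frac <- cells$n / ave(cells$n, cells$col, FUN = sum) * (1 - padding)

# Integer index of each cell; cell i spans [i - 0.5, i + 0.5] on the x axis.
idx <- as.integer(factor(cells$col))

# Running total of shares inside each cell gives the right edge of each segment.
cum <- ave(cells$frac, cells$col, FUN = cumsum)

cells$xmax <- idx - 0.5 + padding / 2 + cum
cells$xmin <- cells$xmax - cells$frac

print(cells)
```

Each cell then occupies a width of 1 - padding with half the padding on either side, and the resulting xmin/xmax columns can be mapped directly in geom_rect(), with ymin/ymax derived the same way from the row index.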
