I am seeing several related answers which discuss finding medians or the median bin from data frames where frequencies are available by row, such as:
I am working with Census data where observations are geographies (e.g. states, counties), and columns are numbers of households falling into 16 income buckets. I can find the median income bucket by manually creating cumulative sums of the number of households in those 16 columns (example below with 5 columns), but it's pretty ugly. The suggested solutions for calculating grouped means/medians over rows of observations don't really work, and I can't come up with a simpler way.
library(dplyr)
library(magrittr)
geoid = seq(100)
df_example = data.frame(
geoid = geoid,
inc1 = sample(0:200, 100),
inc2 = sample(0:200, 100),
inc3 = sample(0:200, 100),
inc4 = sample(0:200, 100),
inc5 = sample(0:200, 100)
)
df_example$total = df_example$inc1 + df_example$inc2 + df_example$inc3 + df_example$inc4 + df_example$inc5
df_example = df_example %>% mutate(
inc1_cumsum = inc1,
inc2_cumsum = inc1 + inc2,
inc3_cumsum = inc1 + inc2 + inc3,
inc4_cumsum = inc1 + inc2 + inc3 + inc4,
inc5_cumsum = inc1 + inc2 + inc3 + inc4 + inc5
)
df_example = df_example %>% mutate(
median_inc_bin = case_when(
total/2 < inc1_cumsum ~ "inc1",
total/2 < inc2_cumsum ~ "inc2",
total/2 < inc3_cumsum ~ "inc3",
total/2 < inc4_cumsum ~ "inc4",
total/2 < inc5_cumsum ~ "inc5"
)
)
If we pivot longer, then we can select the first bin per
geoidwhere the cumulative sum exceeds the median.When I use this fake data:
...I get this output:
...which corresponds with the output using your code: