How can I find median bin of data binned by column in R?

72 Views Asked by At

I am seeing several related answers which discuss finding medians or the median bin from data frames where frequencies are available by row, such as:

I am working with Census data where observations are geographies (e.g. states, counties), and columns are numbers of households falling into 16 income buckets. I can find the median income bucket by manually creating cumulative sums of the number of households in those 16 columns (example below with 5 columns), but it's pretty ugly. The suggested solutions for calculating grouped means/medians over rows of observations don't really work, and I can't come up with a simpler way.

library(dplyr)
library(magrittr)

geoid = seq(100)
df_example = data.frame(
  geoid = geoid,
  inc1 = sample(0:200, 100),
  inc2 = sample(0:200, 100),
  inc3 = sample(0:200, 100),
  inc4 = sample(0:200, 100),
  inc5 = sample(0:200, 100)
)
df_example$total = df_example$inc1 + df_example$inc2 + df_example$inc3 + df_example$inc4 + df_example$inc5

df_example = df_example %>% mutate(
  inc1_cumsum = inc1,
  inc2_cumsum = inc1 + inc2,
  inc3_cumsum = inc1 + inc2 + inc3,
  inc4_cumsum = inc1 + inc2 + inc3 + inc4,
  inc5_cumsum = inc1 + inc2 + inc3 + inc4 + inc5
)

df_example = df_example %>% mutate(
  median_inc_bin = case_when(
    total/2 < inc1_cumsum ~ "inc1",
    total/2 < inc2_cumsum ~ "inc2",
    total/2 < inc3_cumsum ~ "inc3",
    total/2 < inc4_cumsum ~ "inc4",
    total/2 < inc5_cumsum ~ "inc5"
  )
)
1

There are 1 best solutions below

0
Jon Spring On BEST ANSWER

If we pivot longer, then we can select the first bin per geoid where the cumulative sum exceeds the median.

df_example |>
  tidyr::pivot_longer(-geoid) |>
  filter(cumsum(value) > sum(value)/2, .by = geoid) |>
  slice_head(n = 1, by = geoid)

When I use this fake data:

set.seed(42)
geoid = seq(10)
df_example = data.frame(
  geoid = geoid,
  inc1 = sample(0:200, 10),
  inc2 = sample(0:200, 10),
  inc3 = sample(0:200, 10),
  inc4 = sample(0:200, 10),
  inc5 = sample(0:200, 10)
)

#   geoid inc1 inc2 inc3 inc4 inc5
#1      1   48   70   40   57   67
#2      2   64   99   88   41  145
#3      3  152   88   26   23  108
#4      4   73  164  163  157   91
#5      5  145  109  108   42  196
#6      6  121   19    4  142    3
#7      7  200  153  161  149  177
#8      8  127  113   91  200   98
#9      9   46  110  103  135  176
#10    10   23  130    2   35  153

...I get this output:

# A tibble: 10 × 3
   geoid name  value
   <int> <chr> <int>
 1     1 inc3     40
 2     2 inc3     88
 3     3 inc2     88
 4     4 inc3    163
 5     5 inc3    108
 6     6 inc4    142
 7     7 inc3    161
 8     8 inc3     91
 9     9 inc4    135
10    10 inc4     35

...which corresponds with the output using your code:

   geoid inc1 inc2 inc3 inc4 inc5 total inc1_cumsum inc2_cumsum inc3_cumsum inc4_cumsum inc5_cumsum median_inc_bin
1      1   48   70   40   57   67   282          48         118         158         215         282           inc3
2      2   64   99   88   41  145   437          64         163         251         292         437           inc3
3      3  152   88   26   23  108   397         152         240         266         289         397           inc2
4      4   73  164  163  157   91   648          73         237         400         557         648           inc3
5      5  145  109  108   42  196   600         145         254         362         404         600           inc3
6      6  121   19    4  142    3   289         121         140         144         286         289           inc4
7      7  200  153  161  149  177   840         200         353         514         663         840           inc3
8      8  127  113   91  200   98   629         127         240         331         531         629           inc3
9      9   46  110  103  135  176   570          46         156         259         394         570           inc4
10    10   23  130    2   35  153   343          23         153         155         190         343           inc4