Group by and sum/count of columns issues

71 Views Asked by At

I am trying to to group_by program names and count the total column amounts per program. There are NA's in the mix.

Taking a data frame like this: dataframe

and getting something like in return returned data

The following code only counts all non NA observations. It doesn't actually add up the numbers. Do I need to do some sort of ifelse here? I also wonder if the !is.na causes it to count all non NA observations, however, if I remove it, I get all NA's as my totals.

df %>%  
  group_by(ProgramName) %>%
  summarise(ED = sum(!is.na(HighSchool)), EMP = sum(!is.na(Employment)))

alternatively, is there is a way to group by program name and count the observations ONLY if they had a 1 in either column, not to tally up the total? That is closer to what I want anyway. Any support would be appreciated.

2

There are 2 best solutions below

0
L Tyrone On BEST ANSWER

To answer both questions:

library(dplyr)

df <- structure(list(ProgramName = c("Program A", "Program A", "Program A", 
"Program A", "Program B", "Program B", "Program B", "Program B", 
"Program C", "Program C", "Program C", "Program C"), HighSchool = c(1L, 
0L, 0L, 1L, 1L, 0L, 1L, NA, 1L, 1L, NA, 1L), Employment = c(0L, 
0L, 1L, 0L, 1L, 1L, 1L, NA, 0L, 1L, NA, 1L)), class = "data.frame", row.names = c(NA, 
-12L))


df %>%  
  group_by(ProgramName) %>%
  summarise(across(HighSchool:Employment, ~ if(all(is.na(.))) NA else sum(., na.rm = TRUE)))

# A tibble: 3 × 3
  ProgramName HighSchool Employment
  <chr>            <int>      <int>
1 Program A            2          1
2 Program B            2          3
3 Program C            3          2

# This is the one you state that you actually want
df %>%  
  group_by(ProgramName) %>%
  summarise(ED = sum(HighSchool == 1, na.rm = TRUE),
            EMP = sum(Employment == 1, na.rm = TRUE))

A tibble: 3 × 3
  ProgramName    ED   EMP
  <chr>       <int> <int>
1 Program A       2     1
2 Program B       2     3
3 Program C       3     2
0
Andy Baxter On

You can sum all non-NA values without needing to check for NA's or checking for values equalling 1:

library(tidyverse)

df <- tibble::tribble(
        ~ProgramName, ~School, ~Employed,
         "Program A",       1,         0,
         "Program A",       0,         0,
         "Program A",       0,         1,
         "Program A",       1,         0,
         "Program B",       1,         1,
         "Program B",       0,         1,
         "Program B",       1,         1,
         "Program B",      NA,        NA,
         "Program C",       1,         0,
         "Program C",       1,         1,
         "Program C",      NA,        NA,
         "Program C",       1,         1
        )
df |> 
  summarise(
    ED = sum(School, na.rm = TRUE),
    EMP = sum(Employed, na.rm = TRUE),
    .by = ProgramName
  )
#> # A tibble: 3 × 3
#>   ProgramName    ED   EMP
#>   <chr>       <dbl> <dbl>
#> 1 Program A       2     1
#> 2 Program B       2     3
#> 3 Program C       3     2