My (simplified) dataset consists of donor occupations and contribution amounts. I'm trying to determine the average contribution amount by occupation (note: occupations are often repeated in the column, so I use occupation as the grouping variable). Right now, I'm using two dplyr statements: one to get the sum of contribution amounts for each occupation and another to get a count of the number of donations from that occupation. I then bind the data frames with cbind and create a new column with mutate, where I divide the sum by the count.
Data example:
   contributor_occupation        contribution_receipt_amount
1  LISTING COORDINATOR                                  5.00
2  NOT EMPLOYED                                         2.70
3  TEACHER                                              2.70
4  ELECTRICAL DESIGNER                                  2.00
5  STUDENT                                             50.00
6  SOFTWARE ENGINEER                                   10.00
7  TRUCK DRIVER                                         2.70
8  NOT EMPLOYED                                        50.00
9  CONTRACTOR                                           5.00
10 ENGINEER                                             6.00
11 FARMER                                               2.70
12 ARTIST                                              50.00
13 CIRCUS ARTIST                                      100.00
14 CIRCUS ARTIST                                       27.00
15 INFORMATION SECURITY ANALYST                         2.00
16 LAWYER                                               5.00
occupation2 <- b %>%
  select(contributor_occupation, contribution_receipt_amount) %>%
  group_by(contributor_occupation) %>%
  summarise(total = sum(contribution_receipt_amount)) %>%
  arrange(desc(contributor_occupation))

occupation3 <- b %>%
  select(contributor_occupation) %>%
  count(contributor_occupation) %>%
  group_by(contributor_occupation) %>%
  arrange(desc(contributor_occupation))

final_occ <- cbind(occupation2, occupation3[, 2]) # remove duplicate column

occ_avg <- final_occ %>%
  select(contributor_occupation:n) %>%
  mutate("Average Donation" = total/n) %>%
  rename("Number of Donations" = n, "Occupation" = contributor_occupation, "Total Donated" = total)

occ_avg %>%
  arrange(desc(`Average Donation`))
This gives me the result I want but seems like a very cumbersome process. It seems I get the same result by using the following code; however, I am confused as to why it works:
avg_donation_occupation <- b %>%
  select(contributor_occupation, contribution_receipt_amount) %>%
  group_by(contributor_occupation) %>%
  summarize(avg_donation_by_occupation = sum(contribution_receipt_amount)/n()) %>%
  arrange(desc(avg_donation_by_occupation))
Wouldn't dividing by n() divide by the total number of rows (i.e., the number of occupations) as opposed to the number of donations from that occupation (which is what I used the count function for previously)?
Thanks for the help clearing up any confusion!
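We may need both sum and mean, along with n(), which gives the number of observations in each group of the grouped data. According to ?context, n() returns the current group size, so after group_by(contributor_occupation) it counts the donations within each occupation, not the rows of the whole data frame. ?mean describes the arithmetic mean, which is basically the sum of observations divided by the number of observations, so sum(x)/n() and mean(x) give the same per-group result.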
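A minimal sketch of this, assuming the data frame b contains only the sixteen rows shown in the question. The summary column names n_donations, avg_manual and avg_mean are illustrative and do not come from the original post:

```r
library(dplyr)

# Sample rows copied from the question's data example (assumed to be all of b)
b <- data.frame(
  contributor_occupation = c(
    "LISTING COORDINATOR", "NOT EMPLOYED", "TEACHER", "ELECTRICAL DESIGNER",
    "STUDENT", "SOFTWARE ENGINEER", "TRUCK DRIVER", "NOT EMPLOYED",
    "CONTRACTOR", "ENGINEER", "FARMER", "ARTIST",
    "CIRCUS ARTIST", "CIRCUS ARTIST", "INFORMATION SECURITY ANALYST", "LAWYER"
  ),
  contribution_receipt_amount = c(
    5, 2.7, 2.7, 2, 50, 10, 2.7, 50,
    5, 6, 2.7, 50, 100, 27, 2, 5
  )
)

occ_summary <- b %>%
  group_by(contributor_occupation) %>%
  summarise(
    total       = sum(contribution_receipt_amount),        # sum within the group
    n_donations = n(),                                     # rows in the current group
    avg_manual  = sum(contribution_receipt_amount) / n(),  # sum divided by group size
    avg_mean    = mean(contribution_receipt_amount)        # same value via mean()
  ) %>%
  arrange(desc(avg_mean))

occ_summary
```

Because n() is evaluated once per group, CIRCUS ARTIST and NOT EMPLOYED each get n_donations of 2 while every other occupation gets 1, and avg_manual matches avg_mean in every row. This replaces the two separate summaries and the cbind step with a single summarise call, which also avoids relying on both data frames being sorted identically.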