I'm working on psychological assessment data that was collected with a non-random sampling design. My data frame contains "sex" (male and female) and "level of education" ("elementary", "high school", "college"). However, the empirical distribution in my sample differs from the true distribution.
I know that the true proportions for sex are 0.7 female and 0.3 male. I also know that the true proportions for schooling are 0.5 elementary, 0.3 high school, and 0.2 college.
I would like code that could "cut" (adjust?) my data frame to match these characteristics. I know my final data frame will have fewer participants than my current one. I'm wondering whether a for-loop solution is doable in this case.
Data:
df2 = data.frame(
sex = rep(c("m","f"),135),
schooling = c("elementary","highschool","college")
)
prop.table(table(df2$sex))
prop.table(table(df2$schooling))
You could weight your observations by your desired proportions, then use dplyr::slice_sample().
Depending on the level of precision you want, you could iterate until the proportions fall within a certain tolerance of your targets.
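A minimal sketch of the weighting-and-resampling idea described above might look like the following. It is not necessarily the code originally posted. The names `p_sex`, `p_school`, `n_keep`, `tol`, `max_tries` and `max_dev()` are made up for illustration, and the sketch assumes the target joint distribution is simply the product of the two marginal targets (that is, sex and schooling are independent in the population), which may not hold for your data.

```r
library(dplyr)

set.seed(1)  # only so the sketch is reproducible

df2 <- data.frame(
  sex = rep(c("m", "f"), 135),
  schooling = c("elementary", "highschool", "college")
)

# Target marginal proportions from the question
p_sex    <- c(f = 0.7, m = 0.3)
p_school <- c(elementary = 0.5, highschool = 0.3, college = 0.2)

# Row weight = target share of the row's cell / number of rows in that cell.
# ASSUMPTION: the target joint distribution is the product of the margins.
weighted <- df2 %>%
  add_count(sex, schooling, name = "n_cell") %>%
  mutate(w = unname(p_sex[sex] * p_school[schooling]) / n_cell)

# Hypothetical settings: subsample size, tolerance, and retry limit
n_keep    <- 60
tol       <- 0.05
max_tries <- 1000

# Largest absolute gap between observed and target proportions
max_dev <- function(x, target) {
  obs <- prop.table(table(factor(x, levels = names(target))))
  max(abs(as.numeric(obs) - target))
}

# Redraw a weighted subsample until both margins are within tolerance
converged <- FALSE
for (i in seq_len(max_tries)) {
  out <- slice_sample(weighted, n = n_keep, weight_by = w)
  if (max_dev(out$sex, p_sex) <= tol &&
      max_dev(out$schooling, p_school) <= tol) {
    converged <- TRUE
    break
  }
}

prop.table(table(out$sex))
prop.table(table(out$schooling))
```

Because `slice_sample()` draws without replacement by default, the heavily weighted cells run out as `n_keep` grows, so the achievable proportions drift away from the targets for large subsamples. Keeping `n_keep` well below the size of the original data, and checking `converged` after the loop, helps guard against that.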
You may also want to look into the survey or srvyr packages for working with weighted data.
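As a rough illustration of the weighted-data route, the sketch below uses raking from the survey package to compute weights that reproduce the target margins while keeping every participant, instead of dropping rows. The constant starting weight `w0` and the object names `des` and `raked` are assumptions made for this example, and the population margins are scaled to the sample size purely for convenience.

```r
library(survey)

df2 <- data.frame(
  sex = rep(c("m", "f"), 135),
  schooling = c("elementary", "highschool", "college")
)
df2$w0 <- 1  # hypothetical equal starting weights

# Simple design with no clustering, equal starting weights
des <- svydesign(ids = ~1, weights = ~w0, data = df2)

# Target margins expressed as frequencies (proportion * sample size)
n <- nrow(df2)
pop_sex <- data.frame(sex = c("f", "m"), Freq = c(0.7, 0.3) * n)
pop_school <- data.frame(
  schooling = c("elementary", "highschool", "college"),
  Freq = c(0.5, 0.3, 0.2) * n
)

# Rake the weights so both weighted margins match the targets
raked <- rake(
  des,
  sample.margins = list(~sex, ~schooling),
  population.margins = list(pop_sex, pop_school)
)

# Weighted proportions after raking
svymean(~sex, raked)
svymean(~schooling, raked)
```

The resulting weights can be extracted with `weights(raked)` and passed to later analyses, which avoids discarding data at the cost of working with a weighted design.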