I'm struggling with what I imagine is a multi-level sampling procedure in R. Let's say I have a dataset that was collected with a very biased sampling method, so the results obtained from the participants are biased as well. I would like to adjust the dataset to match two demographic variables (sex and age), which are coded as factors in the dataset. The following image describes the situation.
I assume that I'll need to perform a "loop" calculation. For example, to adjust the sample size of the first age interval (15-19), I'll need to define a new total such that the final sample for that interval meets the 50%/50% definition. The same procedure will be needed for all the other age intervals.
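A minimal sketch of that per-interval loop, assuming the 50%/50% target refers to the two sexes within each age interval and that the adjustment is done by downsampling the larger sex to the size of the smaller one. The `toy` data frame and the `balance_group` helper are made up for illustration and are not part of the real dataset.

```r
set.seed(42)  # make the random draws reproducible

# Hypothetical toy data with the same two columns as the real dataset
toy <- data.frame(
  age_cat = rep(c("15-19", "20-24"), times = c(10, 12)),
  sex_cat = factor(c(rep("M", 7), rep("F", 3), rep("M", 4), rep("F", 8)),
                   levels = c("M", "F"))
)

# For one age interval: keep the same number of rows for each sex,
# equal to the size of the smaller sex group (assumed 50/50 target)
balance_group <- function(d) {
  n_each <- min(table(d$sex_cat))
  rows <- unlist(lapply(levels(d$sex_cat), function(s) {
    idx <- which(d$sex_cat == s)
    idx[sample.int(length(idx), n_each)]  # sample without replacement
  }))
  d[rows, , drop = FALSE]
}

# The "loop": apply the same rule to every age interval, then stack
balanced <- do.call(rbind, lapply(split(toy, toy$age_cat), balance_group))

table(balanced$age_cat, balanced$sex_cat)
# 15-19: 3 M and 3 F; 20-24: 4 M and 4 F
```

Downsampling discards rows, so a small interval limits how many observations survive; weighting the observations instead would be a different design choice.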
This is the most closely related topic I've found.
x<-structure(list(age_cat = c("25-29", "30-34", "25-29", "20-24",
"25-29", "20-24", "35-39", "30-34", "25-29", "30-34", "25-29",
"30-34", "35-39", "45-49", "40-45", "20-24", "20-24", "25-29",
"35-39", "35-39", "25-29", "20-24", "30-34", "30-34", "40-45",
"25-29", "25-29", "25-29", "20-24", "40-45", "20-24", "40-45",
"30-34", "25-29", "45-49", "30-34", "45-49", "40-45", "25-29",
"35-39", "40-45", "25-29", "45-49", "35-39", "45-49", "40-45",
"20-24", "45-49", "40-45", "25-29", "35-39", "30-34", "30-34",
"25-29", "20-24", "20-24", "40-45", "35-39", "25-29", "25-29",
"20-24", "40-45", "20-24", "20-24", "45-49", "20-24", "35-39",
"20-24", "35-39", "45-49", "15-19", "45-49", "35-39", "35-39",
"30-34", "35-39", "45-49", "35-39", "30-34", "20-24", "35-39",
"40-45", "40-45", "40-45", "30-34", "45-49", "20-24", "30-34",
"45-49", "35-39", "20-24", "20-24", "20-24", "45-49", "20-24",
"45-49", "35-39", "25-29", "40-45", "40-45", "25-29", "35-39",
"45-49", "30-34", "45-49", "45-49", "45-49", "15-19", "30-34",
"45-49", "30-34", "30-34", "35-39", "25-29", "40-45", "15-19",
"20-24", "20-24", "40-45", "40-45", "45-49", "45-49", "35-39",
"40-45", "30-34", "35-39", "35-39", "25-29", "25-29", "20-24",
"20-24", "40-45", "20-24", "35-39", "20-24", "20-24", "30-34",
"25-29", "45-49", "25-29", "35-39", "20-24", "35-39", "35-39",
"35-39", "40-45", "35-39", "35-39", "20-24", "30-34", "25-29",
"15-19", "30-34", "35-39", "15-19", "20-24", "20-24", "35-39",
"25-29", "25-29", "25-29", "25-29", "30-34", "40-45", "35-39",
"30-34", "35-39", "40-45", "25-29", "30-34", "25-29", "25-29",
"45-49", "30-34", "30-34", "25-29", "15-19", "25-29", "20-24",
"15-19", "20-24", "30-34", "20-24", "40-45", "25-29", "25-29",
"30-34", "30-34", "25-29", "20-24", "40-45", "45-49", "25-29",
"25-29", "40-45", "35-39", "25-29", "45-49", "35-39", "30-34",
"45-49", "30-34", "30-34", "45-49", "35-39", "20-24", "45-49",
"30-34", "25-29", "45-49", "45-49", "40-45", "25-29", "20-24",
"40-45", "30-34", "35-39", "30-34", "20-24", "35-39", "20-24",
"30-34", "20-24", "35-39", "35-39", "30-34", "45-49", "40-45",
"45-49", "25-29", "35-39", "40-45", "30-34", "35-39", "30-34",
"35-39", "20-24", "25-29", "35-39", "30-34", "30-34", "25-29",
"45-49", "45-49", "40-45", "40-45", "35-39", "30-34", "25-29",
"35-39", "20-24", "40-45", "20-24", "30-34", "40-45", "20-24",
"45-49", "20-24", "40-45", "25-29", "40-45", "25-29", "45-49",
"30-34", "30-34", "45-49", "40-45", "30-34", "30-34", "20-24",
"20-24", "35-39", "30-34", "15-19", "35-39", "25-29", "45-49",
"30-34", "25-29", "35-39", "15-19", "40-45", "45-49", "15-19",
"35-39", "45-49", "45-49", "25-29"), sex_cat = structure(c(1L,
2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L,
2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L,
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L,
2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L,
1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L,
1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L,
2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L,
1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("M",
"F"), class = "factor")), row.names = c(NA, -288L), class = c("tbl_df",
"tbl", "data.frame"))

Okay, so this was a bit of a doozy! Here is what I did:
There are people better than me at using `data.table`, but what I did here was first create an `id` column and `sex_cats`. `sex_cats` is used later, but keep it here for now. `x_cts` was created to check that the data you sent was copied and pasted correctly.
Then I create `x_raw`, which is a simulated version of the request; for each `age_cat` and `sex_cat` it includes a `percents` value giving the share of each `sex_cat` within each `age_cat`. These have to add up to 100% within each `age_cat`.
Then I use `pivot_wider` to get the `percents` into wide format across each `sex_cat`. Then I simulate the number of samples you want from each `age_cat`: this is entered manually, so if you need to change the number for any `age_cat`, feel free to. From here we calculate, for each `sex_cat`, the total number of samples in `x_raw_wd`.
Then we put this back into long format because of the requirements of the function `stratified` from `splitstackshape`. If you look at the `names_to` option, the names come out as `N_M` or `N_F`, which differ from the values of `sex_cat` (`sex_cat = 'M', 'F'`). That's why we created `sex_cats` at the beginning.
Finally, we pass everything to `stratified`. We create a `KEY` column to link `x_raw_wd_fin$value`, which is the total number of samples required for each `age_cat` and `sex_cat`, to the combination of `age_cat` and `sex_cat` for each observation in `x`.
Based on my percentages, mostly made up for demonstration purposes, I need 146 samples.
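The steps described above can be sketched roughly as follows. This is not the original code: it uses plain base R sampling as a stand-in for the `stratified` call from `splitstackshape`, skips the wide/long round trip, and runs on a small made-up data frame. The names `toy`, `targets`, `n_age`, `size`, and `sampled`, and all percentages and totals, are assumptions for illustration only.

```r
set.seed(1)  # make the random draws reproducible

# Hypothetical toy data with an id column, standing in for x
toy <- data.frame(
  id      = 1:40,
  age_cat = rep(c("20-24", "25-29"), each = 20),
  sex_cat = rep(c("M", "F", "M", "F"), times = c(12, 8, 9, 11))
)

# Made-up request: share of each sex within each age group (summing to
# 100 per age_cat) and the total number of samples wanted per age group
targets <- data.frame(
  age_cat  = c("20-24", "20-24", "25-29", "25-29"),
  sex_cat  = c("M", "F", "M", "F"),
  percents = c(50, 50, 40, 60),
  n_age    = c(10, 10, 10, 10)
)
stopifnot(all(tapply(targets$percents, targets$age_cat, sum) == 100))

# Number of rows required for each age_cat / sex_cat combination
targets$value <- round(targets$n_age * targets$percents / 100)

# KEY links each required count to the matching observations
targets$KEY <- paste(targets$age_cat, targets$sex_cat, sep = "_")
toy$KEY     <- paste(toy$age_cat, toy$sex_cat, sep = "_")
size <- setNames(targets$value, targets$KEY)

# Draw the required number of rows from each stratum, without replacement
picked <- unlist(lapply(names(size), function(k) {
  idx <- which(toy$KEY == k)
  stopifnot(length(idx) >= size[[k]])  # stratum must be large enough
  idx[sample.int(length(idx), size[[k]])]
}))
sampled <- toy[picked, ]

table(sampled$age_cat, sampled$sex_cat)
# 20-24: 5 M and 5 F; 25-29: 4 M and 6 F
sum(size)  # total number of samples requested: 20
```

If a stratum holds fewer rows than requested, the check inside the loop stops with an error; in that case either lower the total for that age group or sample with replacement.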
Here is my output: