Calculating weights for a large data frame based on multiple variables with several levels


I need to calculate some weights for my large data frame based on two different variables. Let's say they are x = c("a","b","c","d","e") and y = c("v","w","x","y","z"). There are 25 different combinations of these two variables.

The weight I will use depends on how frequently each level of x and y appears in my data frame. For example, "a" might show up 34% of the time, "b" 12% of the time, and so on. Let's say I have saved those proportions to x_prop and y_prop. For each row, the weight is the proportion of that row's level of x divided by the proportion of its level of y.

Is there a better, more efficient way to go about this than brute-forcing it with something like:

weights <- c()
for (i in 1:nrow(df)){
  if (df$x[i] == "a" & df$y[i] == "v"){weights[i] <- (x_prop[1] / y_prop[1])}
  else if (df$x[i] == "a" & df$y[i] == "w"){weights[i] <- (x_prop[1] / y_prop[2])}
  ...
  else if (df$x[i] == "e" & df$y[i] == "z"){weights[i] <- (x_prop[5] / y_prop[5])}
}

I actually need to do this for multiple combinations of variables. Doing it for a single pair already takes 25 lines of fairly redundant if statements, and I can't help but think there's a much more efficient way to do this.

Any advice?

Edit: my data frame looks like the following:

df <- data.frame(x = c("a", "a", "c", "e", "e", "b", "a", "d"),
                 y = c("v", "w", "z", "x", "x", "x", "y", "v"))

There are other variables in there also (and in reality, my data frame has thousands of data points), but what I need to do is take an action on another variable based on every possible combination of x and y.

Jon Spring

One way to do this would be to make a lookup table with all the possible x/y combinations and their respective weights.

df <- data.frame(x = c("a", "a", "c", "e", "e", "b", "a", "d"),
                 y = c("v", "w", "z", "x", "x", "x", "y", "v"))

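library(dplyr)

# Build every x/y combination along with each level's share of rows
lookup <- tidyr::crossing(
  count(df, x) |> transmute(x, x_share = n / sum(n)),  # share of rows per x level
  count(df, y) |> transmute(y, y_share = n / sum(n))) |>  # share of rows per y level
  mutate(weight = x_share / y_share)

# Attach the matching weight to each row (joins on x and y by default)
df |>
  left_join(lookup)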
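For comparison, the same lookup idea can be sketched in base R as a matrix of weights indexed by level names. This is only an illustration under the assumption that dplyr and tidyr are not available; the names x_share, y_share, and weight_lookup are made up for this sketch and do not come from the code above.

```r
df <- data.frame(x = c("a", "a", "c", "e", "e", "b", "a", "d"),
                 y = c("v", "w", "z", "x", "x", "x", "y", "v"))

# Named numeric vectors of shares, one entry per level
x_counts <- table(df$x)
y_counts <- table(df$y)
x_share <- setNames(as.numeric(x_counts) / sum(x_counts), names(x_counts))
y_share <- setNames(as.numeric(y_counts) / sum(y_counts), names(y_counts))

# Matrix with x levels as rows and y levels as columns;
# each cell holds x_share / y_share for that combination
weight_lookup <- outer(x_share, y_share, "/")

# Pick one cell per row of df using a two-column matrix of (x, y) names;
# as.character() guards against factor columns
df$weight <- weight_lookup[cbind(as.character(df$x), as.character(df$y))]
```

With five levels in each variable, weight_lookup has one cell for each of the 25 combinations, so no if statements are needed.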

Or we could skip the lookup table and use two joins instead. Either way should be quite fast and reasonably legible.

# Join each variable's per-level share onto df, then divide
df |>
  left_join(count(df, x) |> transmute(x, x_share = n / sum(n))) |>
  left_join(count(df, y) |> transmute(y, y_share = n / sum(n))) |>
  mutate(weight = x_share / y_share)

Result

Joining with `by = join_by(x, y)`
  x y x_share y_share    weight
1 a v   0.375   0.250 1.5000000
2 a w   0.375   0.125 3.0000000
3 c z   0.125   0.125 1.0000000
4 e x   0.250   0.375 0.6666667
5 e x   0.250   0.375 0.6666667
6 b x   0.125   0.375 0.3333333
7 a y   0.375   0.125 3.0000000
8 d v   0.125   0.250 0.5000000
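Since the question mentions repeating this for several pairs of variables, the calculation could be wrapped in a small helper and applied to each pair in turn. The following is a minimal base R sketch, not part of the answer above: level_share() and pair_weight() are hypothetical names, and the second pair (y over x) is included only to show the loop, because the sample data has just two columns.

```r
# Hypothetical helper: share of rows for each level, as a named numeric vector
level_share <- function(v) {
  counts <- table(v)
  setNames(as.numeric(counts) / sum(counts), names(counts))
}

# Hypothetical helper: one weight per row, the share of the row's `num` level
# divided by the share of its `den` level
pair_weight <- function(data, num, den) {
  num_share <- level_share(data[[num]])
  den_share <- level_share(data[[den]])
  # Look shares up by level name; as.character() guards against factor columns
  w <- num_share[as.character(data[[num]])] / den_share[as.character(data[[den]])]
  unname(w)
}

df <- data.frame(x = c("a", "a", "c", "e", "e", "b", "a", "d"),
                 y = c("v", "w", "z", "x", "x", "x", "y", "v"))

# Each element names a numerator column and a denominator column
pairs <- list(c("x", "y"), c("y", "x"))
for (p in pairs) {
  df[[paste("weight", p[1], p[2], sep = "_")]] <- pair_weight(df, p[1], p[2])
}
```

On the sample data, weight_x_y should match the weight column in the result above, and adding another pair only takes one more entry in pairs.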