Consolidating multiple OR and AND conditions in R


I want to consolidate multiple OR and AND conditions in R. I think x1 == 1 | x1 == 2 can be consolidated as x1 %in% c(1, 2). I'm wondering how x1 == 1 | y1 == 1 and x1 == 1 & y1 == 1 can be consolidated into more compact R code.

x1 <- c(1, 2, 3)
y1 <- c(1, 2, 4)

x1 == 1 | x1 == 2
#> [1]  TRUE  TRUE FALSE
x1 %in% c(1, 2)
#> [1]  TRUE  TRUE FALSE

x1 == 1 | y1 == 1
#> [1]  TRUE FALSE FALSE

intersect(x1, y1) == 1
#> [1]  TRUE FALSE
intersect(x1, y1) == 2
#> [1] FALSE  TRUE

intersect(x1, y1) %in% c(1, 2)
#> [1] TRUE TRUE

x1 == 1 & y1 == 1
#> [1]  TRUE FALSE FALSE

Edited

The code (x1 == 1 | x1 == 2) & (y1 == 1 | y1 == 2) is equal to Reduce(any, lapply(list(x1), `%in%`, c(1, 2))) & Reduce(any, lapply(list(y1), `%in%`, c(1, 2))). I'm wondering how to write this in a more compact way.


5
moodymudskipper On BEST ANSWER

I think you need lapply() and Reduce() for a clean "hackless" abstraction:

x1 <- c(1, 2, 3)
y1 <- c(1, 2, 4)
y1 == 1 | x1 == 1
#> [1]  TRUE FALSE FALSE
Reduce(`|`, lapply(list(x1, y1), `==`, 1))
#> [1]  TRUE FALSE FALSE

You can win a few characters with apply(cbind(x1, y1) == 1, 1, any) (using a matrix as an intermediate shape; swap any for all to get the & version) but I don't know if it's worth it.
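The same pattern seems to cover the edited part of the question as well. The following is a minimal sketch, assuming the goal is (x1 == 1 | x1 == 2) & (y1 == 1 | y1 == 2); the name vecs is only illustrative and not from the original post.

```r
x1 <- c(1, 2, 3)
y1 <- c(1, 2, 4)
vecs <- list(x1, y1)  # illustrative name, not from the question

# (x1 == 1 | x1 == 2) & (y1 == 1 | y1 == 2):
# apply %in% to each vector, then combine the results with &
res_and <- Reduce(`&`, lapply(vecs, `%in%`, c(1, 2)))
res_and
#> [1]  TRUE  TRUE FALSE

# Swapping `&` for `|` gives the OR combination instead
res_or <- Reduce(`|`, lapply(vecs, `%in%`, c(1, 2)))
res_or
#> [1]  TRUE  TRUE FALSE
```

Only the function passed to lapply() and the operator passed to Reduce() change between variants, so the approach should extend to any number of vectors in the list.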

Created on 2024-03-26 with reprex v2.0.2

8
SamR On

Create some data

As x1 == 1 | y1 == 1 doesn't seem too verbose, let's create more vectors:

set.seed(1)
lapply(1:10, \(.) sample(10, 10, replace = TRUE)) |>
    setNames(paste0(c("x", "y"), c(1:5, 1:5))) |>
    list2env(.GlobalEnv)
ls()
#  [1] "x1" "x2" "x3" "x4" "x5" "y1" "y2" "y3" "y4" "y5"

Now it does become quite tedious to write:

x1 == 1 | y2 == 1 | x3 == 1 | y4 == 1 | x5 == 1 | y1 == 1 | x2 == 1 | y3 == 1 | x4 == 1 | y5 == 1 
#  [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE

Use a data frame

We're using element-wise logical operators, so we can assume the vectors all have the same length (shorter ones would otherwise be recycled). Let's put them into a data frame to perform the comparison:

compare <- function(..., val = 1, op = c("==", "<", ">", ">=", "<=", "!=")) {
    fun <- match.fun(match.arg(op))
    rowSums(fun(data.frame(...), val)) > 0
}

We can then do:

compare(x1, y2, x3, y4, x5, y1, x2, y3, x4, y5)
#  [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE

Alternatively, if this is still too much typing, we can put the vectors in a list (which is better anyway) and supply that to the function:

# Create a list of x1:x5 and y1:y5
l  <- mget(ls(pattern = "[xy]\\d"))
compare(l)
#  [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE

Some tests

This works with == and other operators as well.

identical(
    compare(l, val = 1),
    x1 == 1 | y2 == 1 | x3 == 1 | y4 == 1 | x5 == 1 | y1 == 1 | x2 == 1 | y3 == 1 | x4 == 1 | y5 == 1
)
# [1] TRUE

identical(
    compare(l, op = ">", val = 9),
    x1 > 9 | y2 > 9 | x3 > 9 | y4 > 9 | x5 > 9 | y1 > 9 | x2 > 9 | y3 > 9 | x4 > 9 | y5 > 9
)
# [1] TRUE

Performance

An advantage of using a data frame (rather than, say, a matrix) is that it is just a list of pointers to each vector. This means no copies of the data are made. Additionally, data frames support element-wise comparison operations. As Jenny Bryan said, "Of course, someone has to write loops. It doesn’t have to be you."

A note on coercion and copies

It has been pointed out in the comments that rowSums(dat), where dat is a data.frame, will coerce dat to a matrix, i.e. make a copy.

This is true (and interesting!) but the input data is not copied in this case. The above code does not call rowSums(data.frame(...)), which would trigger coercion to a matrix. It runs rowSums(data.frame(...)==val). We can see an equivalent difference by using the debug browser:

debug(as.matrix)
rowSums(mtcars) # Triggers debug browser; debugging in: as.matrix(x)
rowSums(mtcars == 1) # Does not trigger browser

This is because mtcars==1 is already a matrix. Now, if you dig into the Ops.data.frame() source, which is the function that is called when you use == on a data frame, you may see that it calls matrix(value, nrow = nr, dimnames = list(rn,cn)). However, matrix() is called on the return value to ensure that mtcars==1 returns a matrix, and not on mtcars (or whatever the input data is) which is not copied.
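A quick way to check the point about the comparison result, without the debugger, is to inspect it directly. This is only a sketch using the built-in mtcars data; the variable name cmp is made up for illustration.

```r
# Comparing a data frame with a scalar returns a logical matrix,
# so rowSums() receives a matrix and has nothing left to coerce.
cmp <- mtcars == 1   # cmp is an illustrative name

is.data.frame(mtcars)
#> [1] TRUE
is.matrix(cmp)
#> [1] TRUE
typeof(cmp)
#> [1] "logical"
```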

Some benchmarks

Here are some benchmarks for the three approaches:

  1. reduce: The Reduce() approach by moodymudskipper.
  2. df: The compare() function which uses a data frame.
  3. cbind_mat: The `%==.|%` operator, defined as function(x, y) apply(t(x) == y, 2, any) in the answer from G. Grothendieck.

I varied both the number of vectors and length of vectors from 10 to 10k elements. The actual picture about whether a data frame or matrix is better is complicated. For shorter vectors, using a matrix is much quicker than a data frame. For longer vectors, a data frame is considerably quicker than a matrix, with much less memory allocation.

What is unequivocally clear is that Reduce() is the quickest option:

[Image: benchmark timings for df, cbind_mat and reduce, faceted by num_vectors and vec_length]

Similarly, as the data gets larger, the data frame approach uses a lot less RAM than coercing the vectors to matrices. However, the Reduce() approach uses the least RAM in all cases.

[Image: relative RAM usage for df, cbind_mat and reduce, faceted by num_vectors and vec_length]

Benchmark and plot code

results <- bench::press(
    num_vectors = 10^(1:4),
    vec_length = 10^(1:4),
    {
        l <- lapply(seq(num_vectors), \(.) sample(vec_length, vec_length, replace = TRUE))
        bench::mark(
            min_iterations = 10,
            max_iterations = 1000,
            relative = TRUE,
            df = {
                compare(l)
            },
            cbind_mat = {
                do.call(cbind, l) %==.|% 1
            },
            reduce = {
                Reduce(`|`, lapply(l, `==`, 1))
            }
        )
    }
)

# Plots

library(ggplot2)

# beeswarm plot
autoplot(results) +
    ggh4x::facet_grid2(
        vars(num_vectors), vars(vec_length),
        scales = "free_x", independent = "x",
        labeller = label_both
    ) +
    scale_y_continuous() +
    theme_minimal(base_size = 13) +
    theme(legend.position = "bottom")

# RAM usage plot
results |>
    dplyr::mutate(
        expr = attr(expression, "description"),
        mem_alloc = unclass(mem_alloc),
        size = as.factor(num_vectors)
    ) |>
    ggplot() +
    geom_col(aes(
        x = reorder(expr, mem_alloc),
        y = mem_alloc,
        fill = expr
    ), color = "black") +
    ggh4x::facet_grid2(
        vars(num_vectors), vars(vec_length),
        scales = "free", independent = "y"
    ) +
    labs(
        title = "Total RAM usage",
        y = "Relative RAM usage",
        x = "Expression"
    ) +
    theme_minimal(base_size = 14) +
    theme(legend.position = "bottom")
4
G. Grothendieck On

1) The following checks that x1[i] and y1[i] are both 1, i.e. AND.

x1 <- c(1, 2, 3, 1)
y1 <- c(1, 2, 4, 3)

paste(x1, y1) == "1 1"
## [1]  TRUE FALSE FALSE FALSE

2) If we knew that the vectors are composed of integers >= 1, as in the question, then we can use pmin and pmax. Note that they have optional na.rm= arguments which can be set to handle NAs.

pmin(x1, y1) == 1        # OR
## [1]  TRUE FALSE FALSE  TRUE

pmax(x1, y1) == 1        # AND
## [1]  TRUE FALSE FALSE FALSE

All of the above readily generalize to more than 2 vectors.
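As a rough sketch of that generalization, the three idioms can be applied to a list of vectors with do.call(). The third vector z1 and the name vecs are hypothetical additions, and the pmin()/pmax() lines still rely on every value being >= 1.

```r
x1 <- c(1, 2, 3, 1)
y1 <- c(1, 2, 4, 3)
z1 <- c(1, 1, 5, 2)        # hypothetical third vector
vecs <- list(x1, y1, z1)   # illustrative name

# AND via paste(): all three vectors equal 1 at that position
and_paste <- do.call(paste, vecs) == "1 1 1"
and_paste
## [1]  TRUE FALSE FALSE FALSE

# OR via pmin(): the smallest value is 1, so at least one vector equals 1
or_pmin <- do.call(pmin, vecs) == 1
or_pmin
## [1]  TRUE  TRUE FALSE  TRUE

# AND via pmax(): the largest value is 1, so every vector equals 1
and_pmax <- do.call(pmax, vecs) == 1
and_pmax
## [1]  TRUE FALSE FALSE FALSE
```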

3) The expression in the question involving just x1 can be shortened slightly

x1 %in% 1:2
## [1]  TRUE  TRUE FALSE  TRUE

4) We can regard the case with x1 and y1 as a generalized matrix multiplication where * is replaced with == and + is replaced with either | or & . In the APL language this is =.v and =.^ . (See https://rpubs.com/deleeuw/158476 and blockmodeling::genMatrixMult for other implementations.)

`%==.|%` <- function(x, y) apply(t(x) == y, 2, any)
`%==.&%` <- function(x, y) apply(t(x) == y, 2, all)

cbind(x1, y1) %==.&% c(1, 1)
## [1]  TRUE FALSE FALSE FALSE

cbind(x1, y1) %==.|% c(1, 1)
## [1]  TRUE FALSE FALSE  TRUE
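One thing the operators above appear to allow is a different target value per column, since t(x) == y recycles y down each column of the transposed matrix. The sketch below assumes targets of 1 for x1 and 3 for y1, chosen only for illustration.

```r
`%==.|%` <- function(x, y) apply(t(x) == y, 2, any)
`%==.&%` <- function(x, y) apply(t(x) == y, 2, all)

x1 <- c(1, 2, 3, 1)
y1 <- c(1, 2, 4, 3)

# Equivalent to x1 == 1 & y1 == 3
and_res <- cbind(x1, y1) %==.&% c(1, 3)
and_res
## [1] FALSE FALSE FALSE  TRUE

# Equivalent to x1 == 1 | y1 == 3
or_res <- cbind(x1, y1) %==.|% c(1, 3)
or_res
## [1]  TRUE FALSE FALSE  TRUE
```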
0
ThomasIsCoding On

Here is a generalization based on the answers by @moodymudskipper and @SamR

f <- function(
    ...,
    val = 1,
    op = c("==", "<", ">", ">=", "<=", "!="),
    how = c("|", "&")) {
    Reduce(as.symbol(how), lapply(list(...), as.symbol(op), val))
}
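
A possible usage sketch follows. The definition is restated so the snippet runs on its own, with match.arg() and match.fun() swapped in for as.symbol(); that substitution is an assumption on my part, not the answer's code, and is meant to make an unsupported operator name fail early.

```r
# Variant of f() that validates op and how against the listed choices
f <- function(
    ...,
    val = 1,
    op = c("==", "<", ">", ">=", "<=", "!="),
    how = c("|", "&")) {
    op <- match.fun(match.arg(op))    # defaults to "=="
    how <- match.fun(match.arg(how))  # defaults to "|"
    Reduce(how, lapply(list(...), op, val))
}

x1 <- c(1, 2, 3)
y1 <- c(1, 2, 4)

f(x1, y1)                                # x1 == 1 | y1 == 1
# [1]  TRUE FALSE FALSE
f(x1, y1, how = "&")                     # x1 == 1 & y1 == 1
# [1]  TRUE FALSE FALSE
f(x1, y1, op = ">", val = 2, how = "&")  # x1 > 2 & y1 > 2
# [1] FALSE FALSE  TRUE
```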
0
bert On

Paste the pieces together, and evaluate it as an expression, for example

x1 <- c(1, 2, 3)
y1 <- c(1, 2, 4)
string <- paste0(c('x','y'), 1, '==', c(1,2), collapse='&')
eval(parse(text=string))
# [1] FALSE FALSE FALSE
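
For comparison, the same per-vector condition (x1 == 1 & y1 == 2) can probably be built without parsing a string, by pairing each vector with its own target value through Map(). This is a different technique from the answer above, sketched under the assumption that the vectors are available as a list.

```r
x1 <- c(1, 2, 3)
y1 <- c(1, 2, 4)

# Map() pairs x1 with 1 and y1 with 2; Reduce() then combines with &
res <- Reduce(`&`, Map(`==`, list(x1, y1), c(1, 2)))
res
# [1] FALSE FALSE FALSE
```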