Is there a limit to the number of observations that dplyr filter can successfully detect?


I am working with a large dataset with over 200 million rows. I load the dataset using the vroom package to speed up processing time. When I filter the dataset using an %in% condition, rows that should match are missing from the result. I am wondering whether there is a limit on how many rows dplyr will successfully filter. The dataset is too large to load for a reproducible example, but the code I use for the filter step is (roughly):

    library(tidyverse)
    library(vroom)

    # Enlarge vroom's connection buffer before reading the large file
    Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 10)
    data <- vroom("data.csv", delim = ",")

    subset_data <- data %>%
      filter(ID %in% list)

Here, the file "data.csv" contains 200 million observations, "ID" is a column in the "data" dataframe, and "list" is a vector of ID numbers that fit the desired search criteria.

I expect about 6 million rows to meet the criteria, but a little over 3 million are returned. I am wondering if there is a limitation on the number of rows that filter can search. For example, if I can only search 100 million rows, it would explain why I am missing about half of the expected observations. Or, does loading the data using vroom impact the number of rows I can successfully filter?
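For what it's worth, I understand that `filter()` itself has no documented row limit, so one thing I have considered is a type or formatting mismatch between the column and the lookup vector, since `%in%` coerces silently. Below is a hedged toy sketch (the `lookup` vector and the leading-zero IDs are hypothetical, not my real data) of how such a mismatch could drop matches:

```r
library(dplyr)

# Hypothetical toy data: IDs with leading zeros get guessed as numeric
# on import (007 -> 7), while the lookup vector keeps them as text.
data <- tibble(ID = c(7, 42, 99))   # what the reader guessed (numeric)
lookup <- c("007", "042")           # what the lookup vector contains

# %in% coerces the numeric column to character ("7", "42"), so neither
# value matches "007"/"042" and zero rows come back.
nrow(filter(data, ID %in% lookup))  # 0

# Forcing both sides into the same representation recovers the matches.
nrow(filter(data, sprintf("%03d", ID) %in% lookup))  # 2
```

Comparing `class(data$ID)` against `class(list)` on the real data would rule this kind of issue in or out.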
