Function has 100-fold increase in duration for each 10-fold increase in input data


I wrote a simple function using sapply() to define the water year (Oct 1st - Sept 30th) for a vector of dates. It works, except that for every 10-fold increase in the length of the supplied vector, the function takes 100 times longer, making it prohibitive for large date vectors (I need to apply it to 300k dates). My understanding was that the apply family of functions is vectorized and should be an efficient way to work with large datasets. What am I missing here, and can I make it more efficient?

library(lubridate)

wateryear <- function(dates){
  y <- year(dates)
  m <- month(dates)
  sapply(m, function(x) {ifelse(x <= 9, paste0(y - 1, "-", y), paste0(y, "-", y + 1))})
}
d <- "2020/01/01"
system.time(wateryear(rep(d,100))) # 0.008
system.time(wateryear(rep(d,1000))) # 0.651
system.time(wateryear(rep(d,10000))) # 63.854
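
The quadratic growth comes from what happens inside the anonymous function, not from sapply() itself: `y` is the full vector of years, so `ifelse()` and `paste0()` build length-n vectors on every one of the n iterations, which is O(n²) total work. A minimal sketch of the problem (assuming lubridate is loaded):

```r
library(lubridate)

y <- year(ymd(rep("2020/01/01", 5)))
# inside the sapply() callback, `y` is still the full vector, so each
# of the n iterations builds two length-n character vectors instead of
# a single string:
out <- paste0(y - 1, "-", y)
length(out)  # 5 strings produced per iteration, repeated for all 5 iterations
```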

2 Answers

Answer by Jon Spring (accepted)

This vectorized approach runs about 300x faster for n = 1000, and takes about 0.4 s for n = 300k.

library(lubridate)
wateryear2 <- function(dates){
  d <- ymd(dates)
  y <- year(d) + (month(d) > 9)
  paste(y-1, y, sep = "-")
}
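
A quick sanity check across the September/October boundary (repeating the definition so the snippet runs on its own):

```r
library(lubridate)

wateryear2 <- function(dates){
  d <- ymd(dates)
  # October-December roll forward into the next water year
  y <- year(d) + (month(d) > 9)
  paste(y - 1, y, sep = "-")
}

wateryear2(c("2020/09/30", "2020/10/01"))
# [1] "2019-2020" "2020-2021"
```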

bench::mark(
  wateryear(rep(d,1000)),
  wateryear2(rep(d,1000))
)


# A tibble: 2 × 13
  expression                    min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory                 time       gc      
  <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list>                 <list>     <list>  
1 wateryear(rep(d, 1000))     968ms    968ms      1.03      32MB     0        1     0      968ms <chr [1,000]> <Rprofmem [4,296 × 3]> <bench_tm> <tibble>
2 wateryear2(rep(d, 1000))   3.04ms   3.09ms    322.       711KB     6.23   155     3      482ms <chr [1,000]> <Rprofmem [380 × 3]>   <bench_tm> <tibble>


Answer by William Doane

Following up on a question from Jon Spring via Mastodon,

test_rep <- function() {
  rep(1, 1000)
  return(NULL)
}

test_sapply <- function() {
  sapply(1:1000, function(x) { 1 })
  return(NULL)
}

test_lapply <- function() {
  unlist(lapply(1:1000, function(x) { 1 }))
  return(NULL)
}

test_vapply <- function() {
  vapply(1:1000, function(x) { 1 }, 1)
  return(NULL)
}

test_for_obj <- function() {
  result <- list()
  
  for (i in 1:1000) {
    result[[length(result) + 1]] <- 1
  }
  
  unlist(result)
  return(NULL)
}

test_for_no_obj <- function() {
  for (i in 1:1000) {
    1
  }
  
  return(NULL)
}

bench::mark(
  test_rep(),
  test_sapply(),
  test_lapply(),
  test_vapply(),
  test_for_obj(),
  test_for_no_obj()
)

It's the object creation that gets you... although a roughly 5-fold difference between the fully vectorized rep() and a bare for loop is still noteworthy!

# A tibble: 6 × 13
  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
1 test_rep()          1.07µs   1.27µs   591625.    7.86KB     59.2  9999     1     16.9ms <NULL> <Rprofmem> <bench_tm> <tibble>
2 test_sapply()     234.31µs  244.4µs     4008.   31.67KB     27.6  1888    13      471ms <NULL> <Rprofmem> <bench_tm> <tibble>
3 test_lapply()     221.69µs 234.68µs     4119.   15.72KB     29.9  1929    14    468.3ms <NULL> <Rprofmem> <bench_tm> <tibble>
4 test_vapply()     210.58µs 220.74µs     4404.    7.86KB     32.6  2027    15    460.3ms <NULL> <Rprofmem> <bench_tm> <tibble>
5 test_for_obj()    276.46µs 289.56µs     3391.  198.84KB     19.2  1588     9    468.3ms <NULL> <Rprofmem> <bench_tm> <tibble>
6 test_for_no_obj()   5.04µs   5.17µs   188676.   11.29KB      0   10000     0       53ms <NULL> <Rprofmem> <bench_tm> <tibble>
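
One way to test that conclusion (a sketch, not part of the original answer): preallocate the result vector so the loop no longer grows an object on each iteration. This avoids the repeated copying that makes test_for_obj() the slowest variant, and should bring a for loop much closer to the apply family:

```r
test_for_prealloc <- function() {
  result <- numeric(1000)   # allocate once, instead of growing a list
  for (i in 1:1000) {
    result[i] <- 1
  }
  return(NULL)
}
```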