Why does apply seem to drop the CRS?

93 Views Asked by At

The title really sums up my question. Apply seems to drop the CRS, but other functions don't. What is the best way to calculate a geographic function on a vector of points?

library(tidyverse)
library(sf)

# Generate 1000 lat longs, save as sf, and set crs
df1 <- data.frame(lat = runif(1000, 30, 33.4), long = runif(1000, -95, -82)) %>%
  st_as_sf(coords = c("long", "lat"),
           crs = 4326)

# Single point, with identical crs
df2 <- data.frame(lat = 32, long = -96) %>%
  st_as_sf(coords = c("long", "lat"),
           crs = 4326)

apply(df1, 1, function(x) st_distance(x, df2))

This gives the error: Error in st_distance(x, df2) : st_crs(x) == st_crs(y) is not TRUE

But these both work fine:

st_distance(df1[1,], df2)

final.df <- NULL
for(i in 1:nrow(df1)){
  ith.distance <- st_distance(df1[i,], df2)
  
  final.df <- rbind(final.df, ith.distance)
}

The for loop is certainly not the most efficient way to do this. ...Is it??

1

There are 1 best solutions below

1
r2evans On BEST ANSWER

Points:

  1. for and apply loops have no real difference in performance, that's an old condition (it used to be true) that was fixed many years ago. For the most part, the decision to use one over the other should be made on data-wrangling and proficiency reasons, not "speed". Typically when there are problems with for-loop performance, it has everything to do with what is done inside the loop, and nothing to do with for itself.

  2. Iteratively calling final.df <- rbind(final.df, ith.distance) is going to perform very poorly in the long haul. With each iteration, it makes a complete copy of all data; this is fine when you have a few rows, but as you get into the 100s and 1000s, it takes just a little bit more time each pass through the loop. Don't do this, it's better to append the results to a list and do a single call to rbind. I'll demonstrate this later.

  3. It is generally more efficient to loop over the smaller of the frames; in this case, that would be df2 instead of df1. (Perhaps this is just a matter of the sample data you generated, but I'm saying it in case.)

I suggest iterating over row numbers and doing your distance calculation that way.

### iterating over the LARGER frame
system.time(res1 <- sapply(seq_len(nrow(df1)), function(rn) as.numeric(sf::st_distance(df1[rn,], df2))))
#    user  system elapsed 
#   4.782   0.000   4.783 

### iterating over the SMALLER frame
system.time(res2 <- sapply(seq_len(nrow(df2)), function(rn) as.numeric(sf::st_distance(df1, df2[rn,]))))
#    user  system elapsed 
#   0.008   0.000   0.009 

identical(res1, c(res2))
# [1] TRUE
head(res1)
# [1]  974915.8  184703.5 1215753.1  260598.1 1063713.6  745808.6
head(res2)
#           [,1]
# [1,]  974915.8
# [2,]  184703.5
# [3,] 1215753.1
# [4,]  260598.1
# [5,] 1063713.6
# [6,]  745808.6

res1 is a vector, res2 is a matrix (1 column), and the values are identical. The latter ran much faster, this is the results of using R's vectorized calculations when available (and they are here).

For the record, here is a for loop but avoiding the per-pass rbind.

lst1 <- list()
system.time({
  for (rn in 1:nrow(df1)) {
    dist <- sf::st_distance(df1[rn,], df2)
    lst1 <- c(lst1, list(dist))
  }
  lst1 <- do.call(rbind, lst1)
})
#    user  system elapsed 
#   4.717   0.000   4.717 

lst2 <- list()
system.time({
  for (rn in 1:nrow(df2)) {
    dist <- sf::st_distance(df1, df2[rn,])
    lst2 <- c(lst2, list(dist))
  }
  lst2 <- do.call(rbind, lst2)
})
#    user  system elapsed 
#   0.010   0.000   0.011 

identical(lst1, lst2)
# [1] TRUE
identical(res2, lst2)
# [1] TRUE

As to why apply doesn't work: let's look at what is provided internally:

apply(df1, 1, function(x) {browser();1;})
Browse[1]> debug at #1: [1] 1
Browse[2]> x
$geometry
POINT (-85.94798 30.29108)

If you look at the start of apply (src/library/base/R/apply.R#29), one of the first things it does is convert the input to a matrix:

    ## Ensure that X is an array object
    dl <- length(dim(X))
    if(!dl) stop("dim(X) must have a positive length")
    if(is.object(X))
    X <- if(dl == 2L) as.matrix(X) else as.array(X)
###                   ^^^^^^^^^^^^  this is a problem for sf-objects

Among other things, this strips the CRS from the "sf"-class object.

(For the record, this call to as.matrix(X) is also a problem when doing row-wise operations on a frame with mixed character and non-string columns; if X is not sufficiently subsetted before apply, then all values may be converted to strings.)