This is my code; I run it in Databricks.
library(sparklyr)
library(dplyr)
library(arrow)

sc <- spark_connect(method = "databricks")
tbl_change_db(sc, "prod")

trip_ids <- spark_read_table(sc, "signals", memory = FALSE) %>%
  slice_sample(n = 10) %>%
  pull(trip_identifier)
The code is extremely slow and takes an hour to run, even though I am only querying 10 samples. Why is that? Is there a way to improve the performance?
Thank you!
It seems like you're using dplyr's slice_sample function to sample your dataset and then pulling a single column from the result. The problem is that the Spark engine does not know about this: the sampling happens in R. This means the full dataset is read from wherever it is stored and sent in its entirety to your R session, only to be subsampled there.

What you need to do is perform the subsetting and column selection within Spark itself. You can do that with the select (to grab a single column) and head (to grab N rows) functions:
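For example, here is a minimal sketch reusing the table and column names from your question. Both select and head are translated to Spark SQL by dbplyr, so only the requested rows ever leave the cluster:

trip_ids <- spark_read_table(sc, "signals", memory = FALSE) %>%
  select(trip_identifier) %>%  # column pruning is pushed down to Spark
  head(10) %>%                 # becomes LIMIT 10 in the generated Spark SQL
  pull(trip_identifier)        # only these 10 values are collected into R

Note that head returns an arbitrary 10 rows rather than a random sample. If you genuinely need random sampling performed on the cluster, sparklyr's sdf_sample function is the usual tool for that.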