This is my code; I run it in Databricks.
library(sparklyr)
library(dplyr)
library(arrow)

sc <- spark_connect(method = "databricks")
tbl_change_db(sc, "prod")

trip_ids <- spark_read_table(sc, "signals", memory = FALSE) %>%
  slice_sample(n = 10) %>%
  pull(trip_identifier)
The code is extremely slow and takes an hour to run, even though I am only querying 10 samples. Why is that? Is there a way to improve the performance?
Thank you!
It seems like you're using dplyr's slice_sample function to sample your dataset and then pulling a single column from the result. The problem is that the Spark engine does not know about this: the sampling happens in R. This means the full dataset is read from wherever it is stored and sent in its entirety to your R session, only to be subsampled there.

What you need to do is perform the subsetting and column selection within Spark itself. You can do that with the select (to grab a single column) and head (to grab N rows) functions:
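For example, here is a minimal sketch reusing the table and column names from your question. Both select and head are translated to Spark SQL by dbplyr, so only the requested rows ever leave the cluster:

trip_ids <- spark_read_table(sc, "signals", memory = FALSE) %>%
  select(trip_identifier) %>%  # column pruning is pushed down to Spark
  head(10) %>%                 # becomes LIMIT 10 in the generated Spark SQL
  pull(trip_identifier)        # only these 10 values are collected into R

Note that head returns an arbitrary 10 rows rather than a random sample. If you genuinely need random sampling performed on the cluster, sparklyr's sdf_sample function is the usual tool for that.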