Connect to BigQuery from Databricks using sparklyr

I'm pretty new to Databricks. I have it running on Google Cloud.

I have been able to connect to BigQuery, where I have my tables, and bring in some data using SparkR:

library(SparkR)

# Start (or attach to) the Spark session managed by Databricks
sparkR.session()

# Read the BigQuery table through the BigQuery data source
df <- loadDF(
  path = "project_id.dataset_id.table_id",
  source = "bigquery",
  header = "true",
  inferSchema = "true"
)

This works and returns a SparkDataFrame:

df
SparkDataFrame[variable:string, position:bigint, label:string, measurement_level:string, wave_id:bigint, country_id:bigint]
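
To confirm the data is really coming through, a few standard SparkR calls can preview it; a minimal sketch using the df loaded above:

# Show the first rows without collecting the whole table
showDF(df, numRows = 5)

# Pull a small sample into a local R data.frame
local_df <- collect(limit(df, 10))
str(local_df)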

I'm trying to do the same using sparklyr, but something is not working:

library(sparklyr)

sc <- spark_connect(method = "databricks")

df <- spark_read_bigquery(
  sc,
  name = "my_table",
  projectId = project_id,
  datasetId = dataset_id,
  tableId = table_id
)

This returns an error:

com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException:
Unable to provision, see the following errors:

1) Error in custom provider, java.lang.NullPointerException: null value in
entry: bq.staging_dataset.gcs_bucket=null at
com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:65)
while locating com.google.cloud.spark.bigquery.SparkBigQueryConfig

1 error

Run `sparklyr::spark_last_error()` to see the full Spark error (multiple lines)
To use the previous style of error message set
`options("sparklyr.simple.errors" = TRUE)`

I can see that bq.staging_dataset.gcs_bucket=null. Does that mean it is not finding the bucket? It worked with SparkR, though.

Am I doing something wrong?

Answer by dzegpi:

There are two things you have to do for your sparklyr connection to work:

  1. Add library(sparkbq) to your code. You might have to install it first, because it does not come preinstalled on Databricks clusters.
  2. Set the BigQuery defaults with sparkbq::bigquery_defaults(). Your error says that you didn't specify a gcsBucket in which to store temporary files. You will also need to specify a serviceAccountKeyFile, a datasetLocation ("US" or "EU") and a type ("direct" and "avro" are preferable).

If you are not a GCP admin, go to the GCP console and navigate to Buckets (you can type it in the search bar); there you can select an existing bucket and copy its name. Likewise, navigate to Service Accounts and click on the account you want to use; note that the serviceAccountKeyFile argument expects the path to that account's JSON key file, not just its ID.

Your code should look like this:

install.packages("sparkbq")
library(sparklyr)
library(sparkbq)

sc <- spark_connect(method = "databricks")

# Set BigQuery defaults used by spark_read_bigquery() below
bigquery_defaults(
  billingProjectId = project_id,
  gcsBucket = "your-preferred-gcs-bucket",
  datasetLocation = "US",
  serviceAccountKeyFile = "/path/to/service-account-key.json",
  type = "direct"
)

# Read the BigQuery table into a Spark table reference
my_table <- spark_read_bigquery(
  sc,
  name = "my_table",
  projectId = project_id,
  datasetId = dataset_id,
  tableId = table_id
)
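
Once the connection works, my_table is a lazy Spark table reference, so you can query it with dplyr and only collect what you need into R. A minimal sketch using the my_table object above (the country_id filter is just an illustration based on the schema shown in the question):

library(dplyr)

my_table %>%
  filter(country_id == 1) %>%  # translated and pushed down to Spark
  head(10) %>%                 # becomes a LIMIT in the generated query
  collect()                    # brings only these rows into local R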

Finally, in the docs of spark_read_bigquery() you can find a similar example with an explanation of the arguments passed to bigquery_defaults().