How to set up a venv or environment to run PySpark jobs on GCP Dataproc Serverless Spark without installing packages in the container image


I am working on a project where we want to release a Serverless Spark container image to a set of customers, who will use this image to run their Serverless Spark workloads.

Installing packages manually on the image is not practical, since the list of packages requested across all customers would be endless. So, to run their PySpark jobs, I am trying to figure out another way to provide the required packages.

I tried following this document: https://cloud.google.com/sdk/gcloud/reference/dataproc/batches/submit/pyspark

and used the --archives or --py-files option (bundling the group of Python files into a zip file), but I am facing issues with a few packages such as elasticsearch, numpy, and xgboost. It works fine for small packages.
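
Roughly, the approach I tried looks like the sketch below (the bucket name, file names, and region are illustrative, not my actual setup):

```bash
# Bundle a group of helper Python modules into a zip and stage it in GCS.
zip -r deps.zip helpers/ utils/
gsutil cp deps.zip gs://my-bucket/deps/deps.zip

# Submit the batch with the zip attached via --py-files. This works for
# small pure-Python dependencies, but not for packages like numpy or xgboost.
gcloud dataproc batches submit pyspark gs://my-bucket/jobs/main.py \
  --region=us-central1 \
  --py-files=gs://my-bucket/deps/deps.zip
```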

Can anybody suggest any other solutions?

1 Answer

Answer by Igor Dvorzhak:

If you do not want to create a custom container image, then you need to follow one of the options in the Spark Python Package Management documentation.
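
For example, one of those options is to pack a virtual environment and ship it with the job via --archives. The sketch below is illustrative only: the bucket, region, Python version, and package list are assumptions, and it presumes the #environment archive alias and the spark.pyspark.python properties behave on Dataproc Serverless the same way as with plain spark-submit.

```bash
# Build the venv on Linux with a Python version compatible with the
# Dataproc Serverless runtime (3.11 here is an assumption).
python3.11 -m venv pyspark_env
source pyspark_env/bin/activate
pip install numpy xgboost elasticsearch venv-pack

# Pack the whole virtual environment into a single relocatable archive
# and stage it in GCS (bucket path is hypothetical).
venv-pack -o pyspark_env.tar.gz
gsutil cp pyspark_env.tar.gz gs://my-bucket/envs/pyspark_env.tar.gz

# Submit the batch, extracting the archive under ./environment and pointing
# both the driver and the executors at that interpreter.
gcloud dataproc batches submit pyspark gs://my-bucket/jobs/main.py \
  --region=us-central1 \
  --archives=gs://my-bucket/envs/pyspark_env.tar.gz#environment \
  --properties="spark.pyspark.python=./environment/bin/python,spark.pyspark.driver.python=./environment/bin/python"
```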

Note that using a custom container image is the optimal solution, as Dataproc Serverless supports image streaming, which avoids downloading/pulling the virtual environment on each Spark node (driver/executor).
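
For completeness, a custom container image is supplied at submission time through the --container-image flag; the image URI and paths below are hypothetical:

```bash
# Run the batch on a customer-provided container image that already bundles
# the required Python packages (image URI is hypothetical).
gcloud dataproc batches submit pyspark gs://my-bucket/jobs/main.py \
  --region=us-central1 \
  --container-image=us-central1-docker.pkg.dev/my-project/my-repo/serverless-spark:latest
```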