How to set up a venv or environment to run PySpark jobs on GCP Dataproc Serverless Spark without installing packages in the container image


I am working on a project where we want to release a Serverless Spark container image to a set of customers, who will use this image to run their Serverless Spark workloads.

Installing packages manually on the image is not practical, since the list of packages requested across all customers would be endless. So, to run their PySpark jobs, I am trying to figure out another way to provide the required packages.

I tried following this document: https://cloud.google.com/sdk/gcloud/reference/dataproc/batches/submit/pyspark

and used the --archives or --py-files option (bundling the group of Python files into a zip file), but I am facing issues with a few packages such as elasticsearch, numpy, and xgboost. It works fine for small packages.
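
Roughly, the approach I tried looks like the sketch below (the bucket name, file names, and region are illustrative, not my actual setup):

```bash
# Bundle a group of helper Python modules into a zip and stage it in GCS.
zip -r deps.zip helpers/ utils/
gsutil cp deps.zip gs://my-bucket/deps/deps.zip

# Submit the batch with the zip attached via --py-files. This works for
# small pure-Python dependencies, but not for packages like numpy or xgboost.
gcloud dataproc batches submit pyspark gs://my-bucket/jobs/main.py \
  --region=us-central1 \
  --py-files=gs://my-bucket/deps/deps.zip
```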

Can anybody suggest any other solutions?

1 Answer

Answer by Igor Dvorzhak:

If you do not want to create a custom container image, then you need to follow one of the options in the Spark Python Package Management documentation.
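
For example, one of those options is to pack a virtual environment and ship it with the job via --archives. The sketch below is illustrative only: the bucket, region, Python version, and package list are assumptions, and it presumes the #environment archive alias and the spark.pyspark.python properties behave on Dataproc Serverless the same way as with plain spark-submit.

```bash
# Build the venv on Linux with a Python version compatible with the
# Dataproc Serverless runtime (3.11 here is an assumption).
python3.11 -m venv pyspark_env
source pyspark_env/bin/activate
pip install numpy xgboost elasticsearch venv-pack

# Pack the whole virtual environment into a single relocatable archive
# and stage it in GCS (bucket path is hypothetical).
venv-pack -o pyspark_env.tar.gz
gsutil cp pyspark_env.tar.gz gs://my-bucket/envs/pyspark_env.tar.gz

# Submit the batch, extracting the archive under ./environment and pointing
# both the driver and the executors at that interpreter.
gcloud dataproc batches submit pyspark gs://my-bucket/jobs/main.py \
  --region=us-central1 \
  --archives=gs://my-bucket/envs/pyspark_env.tar.gz#environment \
  --properties="spark.pyspark.python=./environment/bin/python,spark.pyspark.driver.python=./environment/bin/python"
```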

Note that using a custom container image is the optimal solution, as Dataproc Serverless supports image streaming, which avoids downloading/pulling the virtual environment on each Spark node (driver/executor).
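
For completeness, a custom container image is supplied at submission time through the --container-image flag; the image URI and paths below are hypothetical:

```bash
# Run the batch on a customer-provided container image that already bundles
# the required Python packages (image URI is hypothetical).
gcloud dataproc batches submit pyspark gs://my-bucket/jobs/main.py \
  --region=us-central1 \
  --container-image=us-central1-docker.pkg.dev/my-project/my-repo/serverless-spark:latest
```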