How to run a Spark job on Dataproc with custom conda env file


I'm trying to run a Spark job on Dataproc with a custom conda environment. Here's my environment yaml file:

name: parallel-jobs-on-dataproc
channels:
  - default
dependencies:
  - python=3.11
  - pyspark=3.5.0
  - prophet~=1.1.2
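
In case it helps with reproducing this, a minimal way to sanity-check the file locally before uploading it (assuming conda is installed on the workstation; local resolution obviously says nothing definitive about what the Dataproc image does) is:

# create the environment from the yaml above and activate it
conda env create -f environment.yaml
conda activate parallel-jobs-on-dataproc
# confirm the pinned libraries import
python -c "import pyspark, prophet; print(pyspark.__version__, prophet.__version__)"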

Following the official documentation, I upload the environment file and create the cluster with:

gsutil cp "environment.yaml" gs://my-bucket-1212/my_folder/environment.yaml

gcloud dataproc clusters create my_cluster \
    --region=us-east1 \
    --image-version=2.2-debian12 \
    --properties='dataproc:conda.env.config.uri=gs://my-bucket-1212/my_folder/environment.yaml'

When I run this, I get an error on each of the nodes, and all of them seem to have the same cause: the conda environment can't be activated. For the master node I first get:

Failed to initialize node my_cluster-m: Component miniconda3 failed to activate See output in: gs://dataproc-staging-us-east1-<project_id>-<job_id>/google-cloud-dataproc-metainfo/<another_id>/my_cluster-m/dataproc-startup-script_output

I then downloaded the output file referenced in that message.
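
For completeness, the download is just a gsutil copy of the path from the error message (placeholders kept as shown above):

gsutil cp gs://dataproc-staging-us-east1-<project_id>-<job_id>/google-cloud-dataproc-metainfo/<another_id>/my_cluster-m/dataproc-startup-script_output .

The relevant lines inside it are: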

<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: ++++ newval=/opt/conda/miniconda3/envs/parallel-jobs-on-dataproc/bin/x86_64-conda-linux-gnu-addr2line
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: ++++ '[' '!' -x /opt/conda/miniconda3/envs/parallel-jobs-on-dataproc/bin/x86_64-conda-linux-gnu-addr2line ']'
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: ++++ '[' apply = apply ']'
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: +++++ echo addr2line
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: +++++ tr a-z+-. A-ZX__
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: ++++ thing=ADDR2LINE
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: ++++ eval 'oldval=$ADDR2LINE'
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: /opt/conda/miniconda3/envs/parallel-jobs-on-dataproc/etc/conda/activate.d/activate-binutils_linux-64.sh: line 68: ADDR2LINE: unbound variable
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + exit_code=1
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: ++ date +%s.%N
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + local -r end=1710855099.903674478
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + local -r runtime_s=255
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + echo 'Component miniconda3 took 255s to activate'
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: Component miniconda3 took 255s to activate
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + local -r time_file=/tmp/dataproc/components/activate/miniconda3.time
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + touch /tmp/dataproc/components/activate/miniconda3.time
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + cat
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + [[ 1 -ne 0 ]]
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + echo 1
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + log_and_fail miniconda3 'Component miniconda3 failed to activate' 1
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + local component=miniconda3
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + local 'message=Component miniconda3 failed to activate'
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + local error_code=1
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + local client_error_indicator=
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + [[ 1 -eq 2 ]]
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: + echo 'StructuredError{miniconda3, Component miniconda3 failed to activate}'
<13>Mar 19 13:31:39 startup-script[1179]: <13>Mar 19 13:31:39 activate-component-miniconda3[3744]: StructuredError{miniconda3, Component miniconda3 failed to activate}

Any idea why this might happen? I have tried older Dataproc Debian images as well as the Ubuntu image, but none of them work. Am I doing something wrong? Can I fix this on my side?

Edit: It seems the problem is with the prophet package; removing it from the environment file allows the cluster to be created. However, I still don't know why that happens or how to fix it.
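
For reference, the cluster does come up with the same file minus the prophet line:

name: parallel-jobs-on-dataproc
channels:
  - default
dependencies:
  - python=3.11
  - pyspark=3.5.0

The workaround I am considering, unless someone can explain the activation failure, is to install prophet through the dataproc:pip.packages cluster property instead of the conda env file (an untested sketch; it assumes that property behaves as described in the cluster-properties docs and that the image's default Python/PySpark are acceptable):

# hypothetical alternative: skip the custom conda env and pip-install prophet
# into the image's base Python environment instead
gcloud dataproc clusters create my_cluster \
    --region=us-east1 \
    --image-version=2.2-debian12 \
    --properties='dataproc:pip.packages=prophet==1.1.2'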
