The code reads a CSV of 628,360 rows from GCS, applies a transformation to the resulting DataFrame with the withColumn method, and writes the output to a partitioned BigQuery table.
Despite this simple workflow, the job took 19h 42min to complete. What can I do to process this faster?
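To make the question concrete, this is roughly what the job looks like (the GCS path, table, bucket, column names and the sha2 call are placeholders, not the real de-identification logic):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-bigquery").getOrCreate()

# Read the CSV from GCS (path is a placeholder)
df = spark.read.option("header", "true").csv("gs://some-bucket/input/data.csv")

# The single withColumn transformation (placeholder logic)
df = df.withColumn("email", F.sha2(F.col("email"), 256))

# Write to the partitioned BigQuery table through the spark-bigquery connector
(df.write.format("bigquery")
    .option("table", "some_project.some_dataset.some_table")
    .option("temporaryGcsBucket", "some-temp-bucket")
    .mode("append")
    .save())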
I am using an autoscaling policy, and I know the cluster is not scaling up because there is no pending YARN memory, as you can see in the screenshot below.
The configuration of the cluster is the following:
gcloud dataproc clusters create $CLUSTER_NAME \
--project $PROJECT_ID_PROCESSING \
--region $REGION \
--image-version 2.0-ubuntu18 \
--num-masters 1 \
--master-machine-type n2d-standard-2 \
--master-boot-disk-size 100GB \
--confidential-compute \
--num-workers 4 \
--worker-machine-type n2d-standard-2 \
--worker-boot-disk-size 100GB \
--secondary-worker-boot-disk-size 100GB \
--autoscaling-policy $AUTOSCALING_POLICY \
--secondary-worker-type=non-preemptible \
--subnet $SUBNET \
--no-address \
--shielded-integrity-monitoring \
--shielded-secure-boot \
--shielded-vtpm \
--labels label \
--gce-pd-kms-key $KMS_KEY \
--service-account $SERVICE_ACCOUNT \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--zone "" \
--max-idle 3600s
As discussed in Google Dataproc Pyspark - BigQuery connector is super slow, I ran the same job without the de-identification transformation and it took 23 seconds to process a file of approximately 20,000 rows. I conclude that the transformation is what was delaying the job, but I still don't know whether I should be using the withColumn method for it.
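Continuing the sketch above, the transformation could be written either as a Python UDF or as a built-in column expression inside the same withColumn call; both variants here are hypothetical, since they are not my actual de-identifier:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import hashlib

# Variant 1: withColumn wrapping a Python UDF (rows are shipped to Python workers)
@F.udf(returnType=StringType())
def deidentify_udf(value):
    return hashlib.sha256(value.encode("utf-8")).hexdigest() if value else None

df_udf = df.withColumn("email", deidentify_udf(F.col("email")))

# Variant 2: withColumn wrapping a built-in expression (executed inside the JVM)
df_native = df.withColumn("email", F.sha2(F.col("email"), 256))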