TensorFlow canned Estimator problems when running with multiple workers on Google Cloud ML Engine


I am trying to train a model using the canned DNNClassifier estimator on Google Cloud ML Engine.

I am able to train the model successfully locally, in both single and distributed mode. Further, I am able to train it on the cloud with the provided BASIC and BASIC_GPU scale tiers.
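For context, the trainer is roughly along these lines (a simplified sketch only; the feature columns, data, and hyperparameters here are placeholders, not my real ones):

# Simplified sketch of trainer/task.py using the TF 1.4 tf.estimator API;
# the input data and feature columns are toy placeholders.
import argparse
import numpy as np
import tensorflow as tf

def make_input_fn():
    # Toy in-memory data standing in for the real input pipeline.
    x = {'x': np.random.rand(1000, 4).astype(np.float32)}
    y = np.random.randint(0, 2, size=1000)
    return tf.estimator.inputs.numpy_input_fn(
        x=x, y=y, batch_size=64, num_epochs=None, shuffle=True)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--job-dir', required=True)
    args, _ = parser.parse_known_args()

    feature_columns = [tf.feature_column.numeric_column('x', shape=[4])]
    estimator = tf.estimator.DNNClassifier(
        hidden_units=[32, 16],
        feature_columns=feature_columns,
        n_classes=2,
        model_dir=args.job_dir)

    # train_and_evaluate reads TF_CONFIG (set by ML Engine), so the same
    # code runs in single-node or distributed mode.
    train_spec = tf.estimator.TrainSpec(input_fn=make_input_fn(), max_steps=1000)
    eval_spec = tf.estimator.EvalSpec(input_fn=make_input_fn(), steps=100)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

if __name__ == '__main__':
    main()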

I am now trying to pass my own custom config file. When I specify only "masterType: standard" in the config file, without mentioning workers or parameter servers, the job runs successfully.

However, whenever I try adding workers, the job fails:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard
  workerType: standard
  workerCount: 4
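For reference, if parameter servers were added as well, the CUSTOM-tier config would presumably look something like this (the machine types and counts below are only an example, not what I ran):

trainingInput:
  scaleTier: CUSTOM
  masterType: standard
  workerType: standard
  workerCount: 4
  parameterServerType: standard
  parameterServerCount: 2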

Here is how I run the job (I get the same error without mentioning the staging bucket):

SCALE_TIER=CUSTOM
JOB_NAME=chasingdatajob_10252017_13
OUTPUT_PATH=gs://chasingdata/$JOB_NAME
STAGING_BUCKET=gs://chasingdata
gcloud ml-engine jobs submit training $JOB_NAME \
    --staging-bucket "$STAGING_BUCKET" \
    --scale-tier $SCALE_TIER \
    --config $SIMPLE_CONFIG \
    --job-dir $OUTPUT_PATH \
    --module-name trainer.task \
    --package-path trainer/ \
    --region $REGION \
    -- ...

My job log shows that the job exited with a non-zero status of 1. I see the following error for worker-replica-3:

Command '['gsutil', '-q', 'cp', u'gs://chasingdata/chasingdatajob_10252017_13/e476e75c04e89e4a0f2f5f040853ec21974ae0af2289a2563293d29179a81199/trainer-0.1.tar.gz', u'trainer-0.1.tar.gz']' returned non-zero exit status 1

I've checked my bucket (gs://chasingdata). I see the chasingdatajob_10252017_13 directory created by the engine, but there is no trainer-0.1.tar.gz file in it. Another thing to mention: I am passing "tensorflow==1.4.0rc0" as a PyPI package to the cloud in my setup.py file. I don't think this is the cause of the problem, but I thought I'd mention it anyway.

What could be the reason for this error? Can someone please help me out?

Perhaps I am doing something stupid. I have tried, unsuccessfully, to find an answer to this.

Thanks a lot!!


There is 1 answer below.

Answer from Guoqing Xu:

The user code contains logic that deletes the existing job-dir. This also deleted the staged user-code package in GCS (the trainer-0.1.tar.gz under the job-dir), so workers that started late were unable to download the package.

We recommend using a separate job-dir for each job to avoid similar issues.
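A rough sketch of a safer cleanup pattern (the prepare_model_dir helper below is hypothetical, not part of the original answer): clear only a model/ subdirectory, or use a fresh job-dir per job, rather than deleting the job-dir itself.

# Hypothetical sketch using the TF 1.x tf.gfile API: clean up only a model/
# subdirectory so the staged trainer-*.tar.gz under the job-dir stays intact.
import os
import tensorflow as tf

def prepare_model_dir(job_dir):
    # Risky (what caused the failure above): deleting job_dir itself also
    # removes the staged package that late-starting workers still need.
    #   tf.gfile.DeleteRecursively(job_dir)

    # Safer: only clear the model subdirectory used for checkpoints.
    model_dir = os.path.join(job_dir, 'model')
    if tf.gfile.Exists(model_dir):
        tf.gfile.DeleteRecursively(model_dir)
    tf.gfile.MakeDirs(model_dir)
    return model_dir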