Vertex pipeline model training component stuck running forever because of metadata issue

I'm attempting to run a Vertex pipeline (custom model training) which I was able to run successfully in a different project. As far as I'm aware, all the pieces of infrastructure (service accounts, buckets, etc.) are identical.

The error appears in a gray box in the pipeline UI when I click on the model training component, and reads as follows:

Retryable error reported. System is retrying.
com.google.cloud.ai.platform.common.errors.AiPlatformException: code=ABORTED, message=Specified Execution `etag`: `1662555654045` does not match server `etag`: `1662555533339`, cause=null System is retrying.
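For context, an `etag` mismatch reported as `ABORTED` is the usual signature of optimistic concurrency control: a writer sends back the `etag` it last read, and the server rejects the write if the record has changed in the meantime, so the caller is expected to re-read and retry. The following is only a minimal sketch of that pattern, assuming a hypothetical in-memory `ToyStore`; it does not call the Vertex AI API, and the names `ToyStore`, `EtagMismatch`, and `update_with_retry` are made up for illustration.

```python
import itertools


class EtagMismatch(Exception):
    """Raised when the caller's etag is stale (analogous to code=ABORTED)."""


class ToyStore:
    """Hypothetical in-memory record guarded by an etag (not the Vertex AI API)."""

    def __init__(self, value):
        self._counter = itertools.count(1)
        self.value = value
        self.etag = next(self._counter)

    def read(self):
        # Return the current value together with the etag that identifies it.
        return self.value, self.etag

    def update(self, new_value, etag):
        # Reject the write if the record changed since the caller read it.
        if etag != self.etag:
            raise EtagMismatch(
                f"specified etag {etag} does not match server etag {self.etag}"
            )
        self.value = new_value
        self.etag = next(self._counter)


def update_with_retry(store, transform, max_attempts=5, before_write=None):
    """Read-modify-write loop that re-reads and retries on a stale etag.

    `before_write` is an illustrative hook used to simulate a concurrent writer.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        value, etag = store.read()
        if before_write is not None:
            before_write(attempt)
        try:
            store.update(transform(value), etag)
            return attempt
        except EtagMismatch:
            continue  # someone else wrote first; re-read and try again
    raise RuntimeError("gave up after repeated etag mismatches")
```

In this sketch a single competing write costs one extra attempt. If something rewrites the record before every attempt, the loop never succeeds, which resembles the endless "System is retrying" state described here.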

I've looked in the Logs Explorer and found that the error logs are audit logs with the following fields:

protoPayload.methodName="google.cloud.aiplatform.internal.MetadataService.RefreshLineageSubgraph"

protoPayload.resourceName="projects/724306335858/locations/europe-west4/metadataStores/default"
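The two fields above can be combined into a single Logs Explorer query to isolate these entries. A small sketch, assuming the standard Cloud Logging query syntax with `AND` between comparisons; the helper name `build_log_filter` is hypothetical, and the values are the ones quoted above.

```python
def build_log_filter(method_name, resource_name):
    """Join the two audit-log fields into one Cloud Logging filter string."""
    clauses = [
        f'protoPayload.methodName="{method_name}"',
        f'protoPayload.resourceName="{resource_name}"',
    ]
    return " AND ".join(clauses)


query = build_log_filter(
    "google.cloud.aiplatform.internal.MetadataService.RefreshLineageSubgraph",
    "projects/724306335858/locations/europe-west4/metadataStores/default",
)
print(query)  # paste the printed filter into the Logs Explorer query box
```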

This leads me to think that there's an issue with the Vertex metadata store or the way my pipeline is using it. The audit logs are generated automatically, though, so I'm not sure.

I've tried purging the metadata store as well as deleting it completely. I've also tried running a different model training pipeline that previously worked in a different project, but with no luck.

There is 1 best solution below

Prajna Rai T On

The retryable error you were getting was a temporary issue, and it has now been resolved.

You should now be able to rerun the pipeline, and it is not expected to enter the infinite retry loop.