Unable to view Vertex AI pipeline node logs


I created a Vertex AI pipeline to run a simple ML flow: create a dataset, train a model on it, and predict on the test set. The model is trained in a Python function-based component (train-logistic-model). In that component I specified an invalid package, so the step in the pipeline fails; I know this is the cause because the step worked fine once I corrected the package name. However, I cannot see any logs for the failed pipeline run. When I click "VIEW JOB" under "Execution Info" on the pipeline Runtime Graph (pic attached), it takes me to the "CUSTOM JOB" page for the job the pipeline ran. There is a message:

Custom job failed with error message: The replica workerpool0-0 exited with a non-zero status of 1 ...

When I click the VIEW LOGS button, it takes me to the Logs Explorer, where there are NO logs. Why are there no logs? Do I need to enable logging somewhere in the pipeline for this? Or could it be a permission issue? Nothing mentions permissions, though; there is just this message in the Logs Explorer, with 0 logs below it:

Showing logs for time specified in query. To view more results update your query



There are 2 best solutions below

Avinash Gunda On

Find the pipeline job ID (the customJobs resource name) in the component logs and paste it into the code below:

from google.cloud import aiplatform

# Placeholders: set these to your own region and job resource name.
location = "us-central1"
job_name = "projects/{project-id}/locations/{your-location}/customJobs/{pipeline-id}"

def get_status_helper(client, name):
    # Fetch the custom job and return its state as a string.
    response = client.get_custom_job(name=name)
    return str(response.state)

# The client must target the regional endpoint that matches the job's location.
api_endpoint = f"{location}-aiplatform.googleapis.com"
client_options = {"api_endpoint": api_endpoint}
client = aiplatform.gapic.JobServiceClient(client_options=client_options)

print(get_status_helper(client, job_name))

Sample name (pipeline job ID) for reference:

projects/123456789101/locations/us-central1/customJobs/23456789101234567892

The name above can be found in the component logs.
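As a rough sketch of how the resource name can be reused, the helpers below split it into its parts and build a filter to paste into the Logs Explorer query box. The function names are hypothetical, and the filter assumes that custom job logs are written under the `ml_job` resource type with a `job_id` label, which is worth checking against your own project.

```python
import re

# Shape of a custom job resource name, as in the sample above.
_CUSTOM_JOB_RE = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/customJobs/(?P<job_id>[^/]+)$"
)


def parse_custom_job_name(name):
    """Split a custom job resource name into project, location and job_id."""
    match = _CUSTOM_JOB_RE.match(name)
    if match is None:
        raise ValueError(f"Not a custom job resource name: {name!r}")
    return match.groupdict()


def build_log_filter(name):
    """Build a Logs Explorer filter for one custom job.

    Assumption: the job's logs use resource.type="ml_job" and carry the
    numeric job ID in resource.labels.job_id.
    """
    job_id = parse_custom_job_name(name)["job_id"]
    return f'resource.type="ml_job" AND resource.labels.job_id="{job_id}"'


def api_endpoint_for(name):
    """Derive the regional API endpoint from the job's location."""
    location = parse_custom_job_name(name)["location"]
    return f"{location}-aiplatform.googleapis.com"
```

If the filter still returns nothing, widening the time range in the Logs Explorer is a cheap next check, since the default window may not cover the time the job ran.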

Robbe On

I ran into this as well. Apparently logging can fail on Vertex AI for steps that run on a small machine with a GPU. You need to use a larger machine type for logging to work.

From the docs:

Additionally, using smaller machine types like n1-highmem-2 with GPUs might cause logging to fail for some workloads because of CPU constraints. If your training job stops returning logs, consider selecting a larger machine type.
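For illustration, here is a minimal sketch of a worker pool spec that asks for a larger machine alongside the GPU. The helper name, the default machine and accelerator types, and the image URI are assumptions made for the example, not values from the docs quoted above; the dictionary follows the worker pool spec layout that custom jobs accept.

```python
def build_worker_pool_specs(
    image_uri,
    machine_type="n1-highmem-8",         # assumed: larger than n1-highmem-2
    accelerator_type="NVIDIA_TESLA_T4",  # assumed GPU type for the example
    accelerator_count=1,
):
    """Return a single-replica worker pool spec with a GPU attached."""
    return [
        {
            "machine_spec": {
                "machine_type": machine_type,
                "accelerator_type": accelerator_type,
                "accelerator_count": accelerator_count,
            },
            "replica_count": 1,
            "container_spec": {"image_uri": image_uri},
        }
    ]


# Hypothetical image URI, used only to show the call.
specs = build_worker_pool_specs("gcr.io/my-project/my-training-image:latest")
print(specs[0]["machine_spec"]["machine_type"])
```

Which machine type is large enough depends on the workload, so treat the default here as a starting point rather than a recommendation.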