Unable to properly register model and create Sagemaker Endpoint using Sagemaker Pipelines


I'm trying to register and deploy a custom model using a PyTorch container in SageMaker Pipelines inside SageMaker Studio, but calls to the endpoint via invoke_endpoint time out before a response comes back.

The code snippet is:

##### PYTORCH CONTAINER
# Step 1: Train Model
# create model training instance
model = PyTorch(
    entry_point="inference.py",
    framework_version='1.13',
    py_version='py39',
    source_dir="code",
    # sagemaker_session=pipeline_session, # I've tried this but doesn't work
    role=role,
    instance_type=training_instance,
    instance_count=1,
    base_job_name=f"{base_job_prefix}-{training_job_name}",
    output_path=s3_output_path,
    code_location=s3_training_output_path,
    # script_mode=True,
    hyperparameters={
        "model_name": model_name,
        "model_type": model_type,
        "bucket": bucket,
        'epsilon': 0.3
    },
    model_name=model_name + workflow_time
)

# called outside the step because fitting it inside TrainingStep doesn't work
model.fit()

step_train = TrainingStep(
    name=training_step_name,
    # step_args=model.fit(),  # I've tried this but it fails
    estimator=model,
)

# Step 2: Register Model to Model Registry
logger.info('Registering model to Model Registry')

step_register = RegisterModel(
    name=register_model_step_name,
    estimator=model,
    # model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=inference_instances,
    model_package_group_name=model_package_group_name,
    approval_status=model_approval_status,
    depends_on=[training_step_name]
)

This part registers the model in the Model Registry. I then fetch the most recent version, build the endpoint configuration, and deploy with:
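The "fetch the most recent version" part can be done with list_model_packages; a sketch of what I mean (the group name is whatever you registered under, and the client is injectable purely so the helper can be tried without AWS credentials):

```python
def latest_model_package_arn(group_name, sm=None):
    """Return the ARN of the newest model package in a Model Registry group."""
    if sm is None:
        import boto3  # assumed available where the pipeline runs
        sm = boto3.client("sagemaker")
    resp = sm.list_model_packages(
        ModelPackageGroupName=group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    return resp["ModelPackageSummaryList"][0]["ModelPackageArn"]
```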

# create an endpoint using the model registry config previously created
sm_client = boto3.client('sagemaker', region_name=AWS_REGION) 

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=ENDPOINT_NAME,
    EndpointConfigName=endpoint_config_name
)

The endpoint call just times out:

ReadTimeoutError: Read timeout on endpoint URL: "https://runtime.sagemaker.eu-west-1.amazonaws.com/endpoints/nba-vw-base-endpoint-TEST/invocations"
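One client-side thing I'd check (an assumption, since my invocation code isn't shown above): botocore's default read timeout is 60 seconds, so a container that loads or predicts slowly can surface as a ReadTimeoutError even when nothing is wrong server-side. Raising it looks roughly like this:

```python
def invoke_json(endpoint_name, payload, rt=None):
    """Invoke a SageMaker endpoint with a JSON payload and return raw bytes."""
    if rt is None:
        import boto3
        from botocore.config import Config
        # Longer read timeout so a slow cold start doesn't raise ReadTimeoutError
        rt = boto3.client("sagemaker-runtime", config=Config(read_timeout=180))
    resp = rt.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=payload,
    )
    return resp["Body"].read()
```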

I've tried different combinations of:

  • using Pipeline Session
  • adding .fit() inside step, outside, or using estimator arg
  • using RegisterModel() or model.register()

I've checked the logs for the endpoint and don't see an error.

But the issue is the same. I've followed many examples, such as this one, but when calling model.fit() without a pipeline_session, TrainingStep() complains that it needs an estimator or step_args argument, meaning .fit() is returning nothing.
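If it helps to see why, here's a toy mock-up (not the real SDK, just the pattern I believe it follows): when the estimator holds a PipelineSession, fit() doesn't run the job, it captures and returns the job arguments for the step; with a regular session it trains immediately and returns None.

```python
# Toy illustration of the sagemaker SDK's behavior (names are illustrative).

class Session:
    deferred = False      # regular session: API calls run immediately

class PipelineSession(Session):
    deferred = True       # pipeline session: calls are captured, not run

class Estimator:
    def __init__(self, session):
        self.session = session

    def fit(self):
        if self.session.deferred:
            # Pipeline mode: return the captured job arguments for the step
            return {"job": "training", "args": {"entry_point": "train.py"}}
        # Normal mode: the training job runs here, and nothing is returned
        return None
```

That would explain why TrainingStep(step_args=model.fit()) only works when the estimator was built with sagemaker_session=pipeline_session.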

UPDATE: Using .fit() inside TrainingStep() example: When following many of the examples like below:

step_train = TrainingStep(
    name=training_step_name,
    step_args=model.fit(), 
)

The training job runs fine in the logs, but I get this error:

2024-02-14 13:16:00 Completed - Training job completed
Training seconds: 112
Billable seconds: 112
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[22], line 1
----> 1 step_train = TrainingStep(
      2     name=training_step_name,
      3     step_args=model.fit(),  # need to fit the model to ensure it properly trains and creates inference logic
      4     # estimator=model,  # seems to be getting deprecated in future
      5 )

File /opt/conda/lib/python3.10/site-packages/sagemaker/workflow/steps.py:417, in TrainingStep.__init__(self, name, step_args, estimator, display_name, description, inputs, cache_config, depends_on, retry_policies)
    412 super(TrainingStep, self).__init__(
    413     name, StepTypeEnum.TRAINING, display_name, description, depends_on, retry_policies
    414 )
    416 if not (step_args is not None) ^ (estimator is not None):
--> 417     raise ValueError("Either step_args or estimator need to be given.")
    419 if step_args:
    420     from sagemaker.workflow.utilities import validate_step_args_input

ValueError: Either step_args or estimator need to be given.

Meaning .fit() isn't returning a value, so the step receives None. Not sure how all the other examples avoid the same issue.

Sorry for all the info but I'm not sure what to try next.

1 Answer

Answer by Cris Pineda:

SOLVED

I found a solution in the gokul-pv GitHub examples. It turns out you can't reuse the same PyTorch estimator for both training and registration: the estimator drives training, while registration needs a model object.

You need to create a new instance using PyTorchModel() and register that instead. It works now. Updated code below:

##### PYTORCH CONTAINER
# Step 1: Train Model
# create model training instance
model = PyTorch(
    entry_point="train.py",
    image_uri=pytorch_image_uri_training,
    source_dir="code",
    sagemaker_session=pipeline_session,
    role=role,
    instance_type=training_instance,
    instance_count=1,
    base_job_name=f"{base_job_prefix}-{training_job_name}",
    output_path=s3_output_path,
    code_location=s3_training_output_path,
    # script_mode=True,
    hyperparameters={
        "model_name": model_name,
        "model_type": model_type,
        "bucket": bucket,
        'epsilon': 0.3
    },
    model_name=model_name + workflow_time
)

training_step_args = model.fit()

step_train = TrainingStep(
    name=training_step_name,
    step_args=training_step_args,
)

# Step 2: Register Model to Model Registry
logger.info('Registering model to Model Registry')
model = PyTorchModel(
    entry_point="inference.py",
    source_dir="code",
    image_uri=pytorch_image_uri_inference,
    sagemaker_session=pipeline_session,
    role=role,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    framework_version="1.11.0",
)


reg_model_args = model.register(
    content_types=["application/json"],
    response_types=["application/json"],
    model_package_group_name=model_package_group_name,
    inference_instances=inference_instances,
    approval_status=model_approval_status,
    description="pipeline - nba vw model test"
)

# Register model step that will be conditionally executed
step_register = ModelStep(
    name=register_model_step_name,
    step_args=reg_model_args,
    # depends_on=[training_step_name]
)

The only other major change I needed was to make train.py independent of inference.py, rather than dependent on it.
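For completeness, the two steps still have to be wired into a Pipeline and upserted before anything runs; a minimal sketch (the pipeline name is a placeholder, and the Pipeline class is injectable purely so the helper can be tested without AWS credentials):

```python
def register_pipeline(steps, role_arn, name="nba-vw-pipeline", pipeline_cls=None):
    """Assemble pipeline steps and create or update the pipeline definition."""
    if pipeline_cls is None:
        from sagemaker.workflow.pipeline import Pipeline  # assumed available
        pipeline_cls = Pipeline
    pipeline = pipeline_cls(name=name, steps=steps)
    pipeline.upsert(role_arn=role_arn)
    return pipeline
```

With the code above, that would be register_pipeline([step_train, step_register], role).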