I'm trying to register and deploy a custom model using a PyTorch container in SageMaker Pipelines inside SageMaker Studio, but the endpoint times out when I send a request using invoke_endpoint.
The code snippet is:
##### PYTORCH CONTAINER
# Step 1: Train Model
# create model training instance
model = PyTorch(
    entry_point="inference.py",
    framework_version='1.13',
    py_version='py39',
    source_dir="code",
    # sagemaker_session=pipeline_session, # I've tried this but doesn't work
    role=role,
    instance_type=training_instance,
    instance_count=1,
    base_job_name=f"{base_job_prefix}-{training_job_name}",
    output_path=s3_output_path,
    code_location=s3_training_output_path,
    # script_mode=True,
    hyperparameters={
        "model_name": model_name,
        "model_type": model_type,
        "bucket": bucket,
        'epsilon': 0.3
    },
    model_name=model_name + workflow_time
)
# called outside the step because fitting inside TrainingStep doesn't work
model.fit()
step_train = TrainingStep(
    name=training_step_name,
    # step_args=model.fit(), # I've tried this but it fails
    estimator=model,
)
# Step 2: Register Model to Model Registry
logger.info('Registering model to Model Registry')
step_register = RegisterModel(
    name=register_model_step_name,
    estimator=model,
    # model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=inference_instances,
    model_package_group_name=model_package_group_name,
    approval_status=model_approval_status,
    depends_on=[training_step_name]
)
This part registers the model in the model registry. I then get the most recent version, build the endpoint configuration and deploy using:
# create an endpoint using the model registry model config previously created
sm_client = boto3.client('sagemaker', region_name=AWS_REGION)
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=ENDPOINT_NAME,
    EndpointConfigName=endpoint_config_name
)
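The steps before create_endpoint (getting the most recent version and building the endpoint configuration) aren't shown above. A minimal sketch of what they might look like with boto3 follows. The helper names, the "AllTraffic" variant name and the default instance type are assumptions for illustration, not the code actually used here. The client is passed in as a parameter so the helpers can be exercised without AWS access.

```python
def latest_approved_package_arn(sm_client, group_name):
    """Return the ARN of the newest approved model package in a group."""
    response = sm_client.list_model_packages(
        ModelPackageGroupName=group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    packages = response["ModelPackageSummaryList"]
    if not packages:
        raise LookupError(f"No approved model package in group {group_name!r}")
    return packages[0]["ModelPackageArn"]


def create_model_and_config(sm_client, package_arn, role_arn, model_name,
                            endpoint_config_name,
                            instance_type="ml.m5.large"):  # assumed default
    """Create a SageMaker model from a registry package, then an endpoint config."""
    sm_client.create_model(
        ModelName=model_name,
        ExecutionRoleArn=role_arn,
        # Point the container at the registered model package.
        Containers=[{"ModelPackageName": package_arn}],
    )
    sm_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",  # hypothetical variant name
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
    )
```

With a real client this would be called as latest_approved_package_arn(sm_client, model_package_group_name), and the resulting endpoint config name is then passed to create_endpoint as above.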
The endpoint just times out with: ReadTimeoutError: Read timeout on endpoint URL: "https://runtime.sagemaker.eu-west-1.amazonaws.com/endpoints/nba-vw-base-endpoint-TEST/invocations"
I've tried different combinations of:
- using a Pipeline Session
- calling .fit() inside the step, outside it, or passing the estimator arg instead
- using RegisterModel() or model.register()
But I get the same issue every time. I've checked the logs for the endpoint and don't see an error.
I've followed many examples, such as this one, but when I call model.fit() without pipeline_session, TrainingStep() complains that it needs an estimator or step_args argument, meaning .fit() is returning nothing.
UPDATE: Using .fit() inside TrainingStep() example:
When following many of the examples like below:
step_train = TrainingStep(
    name=training_step_name,
    step_args=model.fit(),
)
The training job runs fine in the logs, but I get this error:
2024-02-14 13:16:00 Completed - Training job completed
Training seconds: 112
Billable seconds: 112
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[22], line 1
----> 1 step_train = TrainingStep(
2 name=training_step_name,
3 step_args=model.fit(), # need to fit the model to ensure it properly trains and creates inference logic
4 # estimator=model, # seems to be getting deprecated in future
5 )
File /opt/conda/lib/python3.10/site-packages/sagemaker/workflow/steps.py:417, in TrainingStep.__init__(self, name, step_args, estimator, display_name, description, inputs, cache_config, depends_on, retry_policies)
412 super(TrainingStep, self).__init__(
413 name, StepTypeEnum.TRAINING, display_name, description, depends_on, retry_policies
414 )
416 if not (step_args is not None) ^ (estimator is not None):
--> 417 raise ValueError("Either step_args or estimator need to be given.")
419 if step_args:
420 from sagemaker.workflow.utilities import validate_step_args_input
ValueError: Either step_args or estimator need to be given.
Meaning .fit() isn't returning a value, so step_args ends up as None. I'm not sure how all the other examples avoid the same issue.
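The check on line 416 of the traceback is an exclusive-or: exactly one of step_args and estimator has to be non-None. The plain-Python sketch below reproduces that check without SageMaker. The function names are hypothetical stand-ins. As far as I can tell from the SDK's behaviour, fit() only returns step arguments when the estimator was created with a PipelineSession; with a regular session it runs the job and returns None, which is what the stub imitates.

```python
def check_training_step_args(step_args=None, estimator=None):
    """Mimic the validation shown in the traceback (hypothetical helper)."""
    # `^` binds tighter than `not`, so this raises when both or neither are set.
    if not (step_args is not None) ^ (estimator is not None):
        raise ValueError("Either step_args or estimator need to be given.")
    return "step_args" if step_args is not None else "estimator"


def fit_with_regular_session():
    """Stand-in for estimator.fit() on a normal session: runs the job, returns None."""
    return None


if __name__ == "__main__":
    try:
        # Same shape as TrainingStep(step_args=model.fit()) with no estimator.
        check_training_step_args(step_args=fit_with_regular_session())
    except ValueError as err:
        print(f"ValueError: {err}")
```

So the training job completing and the ValueError appearing right afterwards are consistent: the job ran eagerly, and the step then received None.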
Sorry for all the info but I'm not sure what to try next.
SOLVED
I found a solution on gokul-pv's GitHub. It seems that you can't use the same PyTorch estimator for both training and registration, for some reason.
You need to create a new instance using PyTorchModel() and then register that. It works now. Updated code below:
The only major change I needed to make was to make training.py independent from inference.py, not dependent on it.
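A minimal sketch of that idea, assuming the SageMaker Python SDK's PyTorchModel and RegisterModel classes and a PipelineSession. This is not the exact code from the linked repository. The helper names, the default step name and the default approval status are assumptions, and the SageMaker part is untested here.

```python
def split_entry_points(source_dir="code"):
    """Keep the training and inference scripts separate (hypothetical helper)."""
    training_kwargs = {"entry_point": "training.py", "source_dir": source_dir}
    inference_kwargs = {"entry_point": "inference.py", "source_dir": source_dir}
    return training_kwargs, inference_kwargs


def build_register_step(step_train, role, pipeline_session,
                        model_package_group_name, inference_instances,
                        name="RegisterModel",
                        approval_status="PendingManualApproval"):
    """Register a separate PyTorchModel built from the training step's artifacts."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from sagemaker.pytorch import PyTorchModel
    from sagemaker.workflow.step_collections import RegisterModel

    _, inference_kwargs = split_entry_points()
    inference_model = PyTorchModel(
        # Artifacts produced by the training step, resolved at pipeline run time.
        model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
        role=role,
        framework_version="1.13",
        py_version="py39",
        sagemaker_session=pipeline_session,
        **inference_kwargs,
    )
    # Pass the new model object rather than the training estimator.
    return RegisterModel(
        name=name,
        model=inference_model,
        content_types=["application/json"],
        response_types=["application/json"],
        inference_instances=inference_instances,
        model_package_group_name=model_package_group_name,
        approval_status=approval_status,
    )
```

The estimator used for training would then take the training.py kwargs, and the registered model carries only the inference script.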