I am building a machine learning pipeline for time series data, where the goal is to retrain and update the model frequently to make predictions.
- I have written preprocessing code that handles the time series variables and transforms them.
I am confused about how to use the same preprocessing code for both training and inference. Should I write an AWS Lambda function to preprocess my data, or is there another way?
Sources looked into:
The two examples given by the AWS SageMaker team use AWS Glue to do the ETL transform:
- inference_pipeline_sparkml_xgboost_abalone
- inference_pipeline_sparkml_blazingtext_dbpedia
I am new to AWS SageMaker and trying to learn, understand, and build the flow. Any help is appreciated!
Answering your questions in reverse order.
From your example, the piece of code below defines the inference pipeline, where two models are put together. Here, we need to remove sparkml_model and plug in our sklearn model instead.
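Since the original code block didn't carry over, here is a minimal sketch of what that pipeline definition looks like with the SageMaker Python SDK. `sklearn_preprocessor` is the preprocessor estimator created in the next step below, `xgb_estimator` is a placeholder for whatever trained predictor you use, and `role` is assumed to be defined already:

```python
# a minimal sketch, not the notebook's exact code: the SparkML model is
# replaced by the sklearn preprocessor, followed by the predictor model
from sagemaker.pipeline import PipelineModel

preprocessor_model = sklearn_preprocessor.create_model()  # SageMaker SKLearnModel
predictor_model = xgb_estimator.create_model()            # placeholder trained estimator

pipeline_model = PipelineModel(
    name="inference-pipeline",
    role=role,
    models=[preprocessor_model, predictor_model],  # containers run in this order per request
)

# a single endpoint hosts both containers: every request is preprocessed
# by the sklearn container before it reaches the predictor
pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c4.xlarge",
    endpoint_name="inference-pipeline-endpoint",
)
```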
Before placing the sklearn model in the pipeline, we need to create the SageMaker version of the SKLearn model.
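A minimal sketch, assuming `role`, `sagemaker_session`, and an S3 `train_input` are already defined (the instance type and framework version here are illustrative):

```python
from sagemaker.sklearn.estimator import SKLearn

script_path = "sklearn_abalone_featurizer.py"  # your preprocessing script

sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    framework_version="0.23-1",
    instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session,
)

# fit() runs the script's __main__ block as a training job,
# which fits the transformer and saves it to the model directory
sklearn_preprocessor.fit({"train": train_input})
```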
script_path - this is the Python script that contains all the preprocessing or transformation logic; it corresponds to 'sklearn_abalone_featurizer.py' in the link given below.
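As a rough skeleton of such a featurizer script (everything here, including the StandardScaler and the file names, is illustrative rather than taken from the notebook):

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler


def model_fn(model_dir):
    # serving: load the transformer that was fitted during training
    return joblib.load(os.path.join(model_dir, "model.joblib"))


def predict_fn(input_data, model):
    # serving: apply the exact same transformation used at training time;
    # the container's default input_fn/output_fn handle CSV/JSON (de)serialization
    return model.transform(input_data)


if __name__ == "__main__":
    # training: fit the transformer on the training channel and persist it
    parser = argparse.ArgumentParser()
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    args = parser.parse_args()

    df = pd.read_csv(os.path.join(args.train, "train.csv"))
    transformer = StandardScaler().fit(df)
    joblib.dump(transformer, os.path.join(args.model_dir, "model.joblib"))
```

Because the same file defines both the training entry point and the serving functions, the preprocessing fitted during training is reused unchanged at inference, which is exactly what your question asks for.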
For more details, refer to the link below:
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.ipynb