How to save transformers model in AzureML pipeline component


I'm trying to save a transformers model within a component of an AzureML pipeline. Below are the relevant parts of the prep.py file that the component YAML calls:

import pickle
import mlflow
from transformers import AutoTokenizer, AutoModelForCausalLM

def main(args):

    mlflow.autolog()

    def tokenize_function(example):
        return tokenizer(example['text'], padding="max_length", truncation=True, max_length = args.max_length)

    #dataset_path = os.path.join("..", "data", "dataset_hf_train_test.pkl")
    with open("data.pkl", "rb") as f:
        # Read the data from the file
        data_prepped = pickle.load(f)

    tokenizer = AutoTokenizer.from_pretrained(args.model_checkpoint)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(args.model_checkpoint, trust_remote_code=True).to(args.device)

    tokenized_dataset = data_prepped.map(tokenize_function)

    mlflow.transformers.save_model(model, args.model_output)

I get the following error when running the pipeline: AttributeError: 'CodeGenForCausalLM' object has no attribute 'model'

So it seems mlflow.transformers.save_model isn't appropriate here, but what is? Note that I also want to save (or return) the tokenizer and the tokenized data, but I'm trying to resolve saving just the model first.

Also, two side questions (less important, but answers would help):

  1. Do I actually need to include an output section in the component YAML in order to save this, or can I just hardcode the directory? (See the sketch after this list.)
  2. Why do we need to use MLflow to save the models? It seems to be the default option for saving objects within an AzureML pipeline.
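For context on the first side question: in an AzureML component, a declared output typically reaches the script as a mounted path passed on the command line. A minimal sketch, assuming argparse and argument names that mirror the code above:

import argparse

# Minimal sketch: AzureML passes the path of the output declared in the
# component YAML (e.g. a uri_folder output) as a command-line argument.
parser = argparse.ArgumentParser()
parser.add_argument("--model_checkpoint", type=str)
parser.add_argument("--max_length", type=int, default=512)
parser.add_argument("--device", type=str, default="cpu")
parser.add_argument("--model_output", type=str)  # mounted output folder
args = parser.parse_args()

# Files written under args.model_output are captured as the step's output;
# a hardcoded local directory would not be wired to downstream steps.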
There are 2 answers below.

Answered by Vaibhav Patil

mlflow.transformers.save_model expects a transformers pipeline or a dictionary of pipeline components, not a bare model object; passing only the model is what triggers the attribute error. Pass the model and tokenizer together as a dict:
def main(args):

    mlflow.autolog()

    def tokenize_function(example):
        return tokenizer(example['text'], padding="max_length", truncation=True, max_length=args.max_length)

    with open("data.pkl", "rb") as f:
        data_prepped = pickle.load(f)

    tokenizer = AutoTokenizer.from_pretrained(args.model_checkpoint)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(args.model_checkpoint, trust_remote_code=True).to(args.device)

    tokenized_dataset = data_prepped.map(tokenize_function)

    # Pass the pipeline components as a dict rather than the bare model object
    mlflow.transformers.save_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        task="text-generation",
        path=args.model_output,
    )
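Alternatively, the model and tokenizer can be wrapped in a transformers pipeline and saved as a single object. A minimal sketch, assuming a text-generation checkpoint and the same variables as above:

from transformers import pipeline

# Bundling model and tokenizer into one pipeline lets MLflow read the
# task metadata directly from the pipeline object.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
mlflow.transformers.save_model(transformers_model=pipe, path=args.model_output)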
Answered by Rishabh Meshram

To fix the issue, you can use the mlflow.pytorch.save_model() function instead.

Below is a sample code snippet I used to test this:

import pickle
import mlflow
import mlflow.pytorch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Sample data
data_prepped = ["Hello, how are you?", "I'm doing great, thank you!"]

# Model and tokenizer
model_checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

# Tokenization function
def tokenize_function(example):
    return tokenizer(example, padding="max_length", truncation=True, max_length=128)

# Tokenize the dataset
tokenized_dataset = [tokenize_function(example) for example in data_prepped]

# Save the model, tokenizer, and tokenized data
with mlflow.start_run():
    # Save the model (Modified)
    mlflow.pytorch.save_model(model, "model")
    #mlflow.transformers.save_model(model,"model")
    # Save the tokenizer
    tokenizer.save_pretrained("tokenizer")
    mlflow.log_artifact("tokenizer", "tokenizer")

    # Save the tokenized data
    with open("tokenized_data.pkl", "wb") as f:
        pickle.dump(tokenized_dataset, f)
    mlflow.log_artifact("tokenized_data.pkl", "tokenized_data")

With the above code I was able to save the model, tokenizer, and tokenized data.
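To sanity-check the round trip, the saved artifacts can be loaded back. A minimal sketch, assuming the same local paths ("model" and "tokenizer") as above:

import mlflow.pytorch
from transformers import AutoTokenizer

# Reload the model saved via mlflow.pytorch.save_model and the saved tokenizer directory
loaded_model = mlflow.pytorch.load_model("model")
loaded_tokenizer = AutoTokenizer.from_pretrained("tokenizer")

inputs = loaded_tokenizer("Hello, how are you?", return_tensors="pt")
outputs = loaded_model.generate(**inputs, max_new_tokens=20)
print(loaded_tokenizer.decode(outputs[0], skip_special_tokens=True))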