I'm trying to save a transformers model from within a component in an AzureML pipeline. Below are the relevant parts of the prep.py file that the component YAML file calls:
def main(args):
    mlflow.autolog()

    def tokenize_function(example):
        return tokenizer(example['text'], padding="max_length", truncation=True, max_length=args.max_length)

    # dataset_path = os.path.join("..", "data", "dataset_hf_train_test.pkl")
    with open("data.pkl", "rb") as f:
        # Read the data from the file
        data_prepped = pickle.load(f)

    tokenizer = AutoTokenizer.from_pretrained(args.model_checkpoint)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(args.model_checkpoint, trust_remote_code=True).to(args.device)
    tokenized_dataset = data_prepped.map(tokenize_function)

    mlflow.transformers.save_model(model, args.model_output)
I get the following error when running the pipeline:
AttributeError: 'CodeGenForCausalLM' object has no attribute 'model'
So it seems mlflow.transformers.save_model isn't appropriate here, but what is?
Note that I also want to save (or return) the tokenizer and the tokenized_dataset, but I'm looking to resolve just saving the model first.
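For reference, here is a minimal sketch of a fallback that bypasses mlflow entirely. It assumes Hugging Face-style objects: the model and tokenizer expose save_pretrained(path), and the tokenized dataset exposes save_to_disk(path). The helper name save_artifacts and the subfolder names are hypothetical, chosen only for illustration. As far as I can tell from the docs, mlflow.transformers.save_model expects a pipeline or a dict of components (e.g. "model" and "tokenizer" keys) rather than a bare model, which may explain the error above, but I have not confirmed that in this setup.

```python
import os


def save_artifacts(model, tokenizer, tokenized_dataset, output_dir):
    """Write each artifact into its own subfolder of output_dir.

    Assumption: model and tokenizer provide save_pretrained(path) and the
    dataset provides save_to_disk(path), as Hugging Face objects do.
    The subfolder names are arbitrary choices for this sketch.
    """
    os.makedirs(output_dir, exist_ok=True)
    paths = {
        "model": os.path.join(output_dir, "model"),
        "tokenizer": os.path.join(output_dir, "tokenizer"),
        "dataset": os.path.join(output_dir, "tokenized_dataset"),
    }
    model.save_pretrained(paths["model"])
    tokenizer.save_pretrained(paths["tokenizer"])
    tokenized_dataset.save_to_disk(paths["dataset"])
    # Return the locations so the caller can log or reuse them.
    return paths
```

Inside main this would be called as save_artifacts(model, tokenizer, tokenized_dataset, args.model_output) in place of the mlflow call.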
Also, two side questions (less important, but answers would help):
- Do I actually need to include an output section in the component YAML in order to save this, or can I just hardcode the directory?
- Why do we need to use mlflow to save the models? It seems to be the default option for saving objects within an AzureML pipeline.
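To make the first side question concrete, here is a rough sketch of how the script could take the output location either from the command line (as it would if the component declared an output and passed its path) or from a hardcoded fallback. The flag names mirror the args attributes used in the excerpt. The default values and the helper name ensure_output_dir are hypothetical. This does not settle whether AzureML would persist a hardcoded directory, which is the open question.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the arguments that main(args) reads in the excerpt."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_checkpoint", type=str, required=True)
    # Hypothetical defaults, for illustration only.
    parser.add_argument("--max_length", type=int, default=128)
    parser.add_argument("--device", type=str, default="cpu")
    # Falls back to a hardcoded directory when no output path is passed in.
    parser.add_argument("--model_output", type=str, default="outputs/model")
    return parser.parse_args(argv)


def ensure_output_dir(args):
    """Create the output directory if it does not exist and return its path."""
    os.makedirs(args.model_output, exist_ok=True)
    return args.model_output
```

With this shape, the same script runs unchanged whether or not the output path is supplied, so the two setups can be compared directly.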
