import sagemaker
import boto3
from sagemaker.huggingface import HuggingFace
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

hyperparameters = {
    'model_name_or_path': 't5-base',
    'output_dir': '/opt/ml/model'
    # add your remaining hyperparameters
    # more info here: https://github.com/huggingface/transformers/tree/v4.26.0/examples/pytorch/question-answering
}
# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.26.0'}
# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_qa.py',
    source_dir='./examples/pytorch/question-answering',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    hyperparameters=hyperparameters
)
# starting the train job
huggingface_estimator.fit()
Given the above script (launch_training.py), which can be found here: https://huggingface.co/t5-base, how should my data be structured for a generative question-answering task?
For context: I am training T5 on some synthetic company text data so that I can then prompt it with questions such as "How can CompanyX improve sales?" or "How can CompanyX reduce the turnover rate?"
I have tried formatting my data as question-answer pairs, e.g. {"question": "How can CompanyX improve the performance of their marketing campaigns?", "answer": "The recent marketing campaign of CompanyX attracted a 20% increase in new customers. It suggests that if CompanyX focuses on customer-centric strategies and amplifies their digital marketing efforts, they might achieve even better results."}, but this gives: ValueError: Need either a dataset name or a training/validation file.
I am passing an S3 URI to huggingface_estimator.fit(), namely huggingface_estimator.fit({"train_data_uri": "s3://fine-tuning/q-a_pairs.json"})
The code snippet you're using with SageMaker and the Hugging Face example script comes from https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering
The example expects the data to be formatted the same way as the squad dataset, https://huggingface.co/datasets/squad
Each example should look like this:
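A squad-style record has the fields id, title, context, question and answers (the values below are illustrative, in the style of the squad training split):

{
    "id": "5733be284776f41900661182",
    "title": "University_of_Notre_Dame",
    "context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. ...",
    "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "answers": {"text": ["Saint Bernadette Soubirous"], "answer_start": [515]}
}

Note that "answers" holds parallel lists: the answer strings and their character offsets into "context".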
The actual data file from squad would come from https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json and looks something like:
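Abridged, with only the nesting kept (data → paragraphs → qas → answers):

{
    "version": "1.1",
    "data": [
        {
            "title": "University_of_Notre_Dame",
            "paragraphs": [
                {
                    "context": "Architecturally, the school has a Catholic character. ...",
                    "qas": [
                        {
                            "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
                            "id": "5733be284776f41900661182",
                            "answers": [
                                {"text": "Saint Bernadette Soubirous", "answer_start": 515}
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}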
Breaking it down a little, if you have data in JSON format that looks like this:
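For illustration, your CompanyX example mapped onto the same fields might look like this (the context is adapted from your example answer; the answer text must be a literal span of the context, with answer_start giving its character offset):

{
    "id": "companyx-0001",
    "title": "CompanyX_marketing",
    "context": "The recent marketing campaign of CompanyX attracted a 20% increase in new customers by focusing on customer-centric strategies and amplifying digital marketing efforts.",
    "question": "How can CompanyX improve the performance of their marketing campaigns?",
    "answers": {"text": ["focusing on customer-centric strategies and amplifying digital marketing efforts"], "answer_start": [87]}
}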
Then to train a model, the easiest way is to push your dataset to the Hugging Face Hub, https://huggingface.co/docs/datasets/upload_dataset#upload-with-python
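A minimal sketch of the upload, assuming the records above are saved locally as q-a_pairs.json with one JSON object per line, and that "your-username/companyx-qa" is a placeholder for the Hub repo you want to create:

from datasets import load_dataset

# load the local JSON Lines file into a DatasetDict with a "train" split
dataset = load_dataset("json", data_files={"train": "q-a_pairs.json"})

# push it to the Hugging Face Hub (run `huggingface-cli login` first)
dataset.push_to_hub("your-username/companyx-qa")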
After that you can use load_dataset with your Hub dataset when you change the script at https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py#LL286C1-L293C10 and save a local copy of the script on your machine, e.g. ./scripts/run_qa.py. Finally, instead of using the git_config, you can do this:
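A sketch of the adjusted estimator, assuming the edited run_qa.py was saved under ./scripts and that your Hub repo (here the placeholder 'your-username/companyx-qa') is either hard-coded in the script at the lines linked above or passed as the dataset_name hyperparameter:

hyperparameters = {
    'model_name_or_path': 't5-base',
    'dataset_name': 'your-username/companyx-qa',  # placeholder Hub dataset
    'output_dir': '/opt/ml/model'
}

huggingface_estimator = HuggingFace(
    entry_point='run_qa.py',
    source_dir='./scripts',  # local copy of the edited example script
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    hyperparameters=hyperparameters
)

huggingface_estimator.fit()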