Currently, I'm trying to build an extractive QA pipeline, following the Hugging Face course on the matter. There, they show how to create a `compute_metrics()` function to evaluate the model after training. However, I was wondering if there's a way to obtain those metrics during training and pass the `compute_metrics()` function directly to the `Trainer`. In the course they train using only the training loss, and I would like to have the evaluation F1 score during training.

But, as I see it, it might be a little tricky, because they need the original answer spans to calculate the SQuAD metrics, and those original spans are not passed along in your tokenized training dataset. For example:
```python
import evaluate

metric = evaluate.load("squad")
predicted_answers = [{'id': '56be4db0acb8001400a502ec', 'prediction_text': 'Denver Broncos'}]
theoretical_answers = [{'id': '56be4db0acb8001400a502ec', 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}}]
metric.compute(predictions=predicted_answers, references=theoretical_answers)
```
That's why they write the whole `compute_metrics()` function, which takes a few extra parameters beyond the predictions output by the evaluation loop, since they need to rebuild those spans.
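For reference, the course's function has roughly this shape (signature paraphrased from chapter 7 of the course; the body that maps logits back to text spans is omitted here):

```python
# Rough shape of the course's compute_metrics. Besides the raw logits it also needs
# `features` (the tokenized dataset with offset mappings) and `examples` (the original,
# untokenized examples) in order to rebuild the character-level answer spans.
def compute_metrics(start_logits, end_logits, features, examples):
    # ...map the logits back to character spans, build the SQuAD-format
    # predicted_answers / theoretical_answers lists, and finally call
    # metric.compute(predictions=predicted_answers, references=theoretical_answers)
    ...
```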
The `compute_metrics` function can be passed into the `Trainer` so that it validates on the metrics you need. I'm not sure if it works out of the box with the code that processes the `train_dataset` and `validation_dataset` in the course code (https://huggingface.co/course/chapter7), but this one shows how `Trainer` + `compute_metrics` work together: https://huggingface.co/course/chapter3/3. Roughly, it boils down to the sketch below.
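Paraphrasing chapter 3 of the course (which fine-tunes on GLUE/MRPC text classification, not QA; `model`, `training_args`, `tokenized_datasets` and `data_collator` are defined earlier in that chapter):

```python
import numpy as np
import evaluate

# compute_metrics receives the (logits, labels) from the evaluation loop and must
# return a dict mapping metric names to values.
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# It is then passed straight to the Trainer, roughly like:
#
# trainer = Trainer(
#     model,
#     training_args,
#     train_dataset=tokenized_datasets["train"],
#     eval_dataset=tokenized_datasets["validation"],
#     data_collator=data_collator,
#     tokenizer=tokenizer,
#     compute_metrics=compute_metrics,
# )
```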
Before proceeding to read the rest of the answer, here are some disclaimers:

- Try to get through the full course, Chapters 1-9 (https://huggingface.co/course/); the way `compute_metrics` and the `evaluate` metrics are used there would make it clear why you can't plug an `evaluate` metric directly into the `Trainer` object.
- Alternatively, walking through this book would help too: https://www.oreilly.com/library/view/natural-language-processing/9781098136789/
And now, here goes...
Firstly, let's take a look at what the `evaluate` library is/does. From https://huggingface.co/spaces/evaluate-metric/squad:
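The usage example there looks roughly like this (paraphrased from the metric card; the IDs and answers are the card's own toy data):

```python
from evaluate import load

# Load the SQuAD metric and score one toy prediction against its reference.
squad_metric = load("squad")
predictions = [{"prediction_text": "1976", "id": "56e10a3be3433e1400422b22"}]
references = [{"answers": {"answer_start": [97], "text": ["1976"]}, "id": "56e10a3be3433e1400422b22"}]
results = squad_metric.compute(predictions=predictions, references=references)
print(results)
# [out]:
# {'exact_match': 100.0, 'f1': 100.0}
```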
Next, we take a look at what the `compute_metrics` argument in the `Trainer` expects. From around line 600 of https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py:
The `compute_metrics` argument in the `QuestionAnsweringTrainer` expects a function that:

- takes an `EvalPrediction` object as input
- returns a dictionary of metric names to values

Un momento! (Wait a minute!) What are these `QuestionAnsweringTrainer` and `EvalPrediction` objects?

Q: Why are you not using the normal `Trainer` object?

A: The `QuestionAnsweringTrainer` is a specific subclass of the `Trainer` that is used for the QA task. If you're going to train a model to evaluate on the SQuAD dataset, then the `QuestionAnsweringTrainer` is the most appropriate `Trainer` object to use.

[Suggestion]: Most probably the Hugging Face devs and dev advocates should add some notes on the `QuestionAnsweringTrainer` object in https://huggingface.co/course/chapter7/7?fw=pt

Q: What is this `EvalPrediction` object then?

A: Officially, I guess it's this: https://discuss.huggingface.co/t/what-does-evalprediction-predictions-contain-exactly/1691/5
If we look at the docs (https://huggingface.co/docs/transformers/internal/trainer_utils) and the code, it looks like the object is a custom container class that holds the (i) predictions, (ii) label_ids and (iii) inputs as `np.ndarray`. These are what the model's inference step needs to return in order for `compute_metrics` to work as expected.
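As a toy illustration (my own example, not from the docs), it behaves like a simple named container:

```python
import numpy as np
from transformers import EvalPrediction

# EvalPrediction just bundles the model's predictions and the gold label ids
# (and optionally the inputs) as numpy arrays.
p = EvalPrediction(
    predictions=np.array([[0.1, 0.9], [0.8, 0.2]]),  # e.g. logits
    label_ids=np.array([1, 0]),                      # e.g. gold labels
)
print(p.predictions.shape, p.label_ids)
```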
Hey, you still haven't answered the question of how I can use `evaluate.load('squad')` directly in the `compute_metrics` argument!

Yes, for now you can't use it directly, but it's a simple wrapper.
Step 1: Make sure the model you want to use produces the required `EvalPrediction` object that contains `predictions` and `label_ids`.

If you're using most of the models supported for QA in Hugging Face's `transformers` library, they should already produce the expected `EvalPrediction`. Otherwise, take a look at the models supported by https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering. For example:
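(The checkpoint name below is just an arbitrary illustration; any architecture with a `*ForQuestionAnswering` head supported by those example scripts should behave the same way.)

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Arbitrary example checkpoint; the QA head is freshly initialized here and
# only becomes useful after fine-tuning.
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
```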
Step 2: Since the model inference outputs an `EvalPrediction` but `compute_metrics` is expected to return a dictionary of metrics, you have to wrap the `evaluate` metric's `compute()` function, e.g. like the sketch below.
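A minimal sketch of such a wrapper, assuming the trainer's post-processing step (the `post_process_function` of the `QuestionAnsweringTrainer` in the example scripts) has already converted the raw logits and labels into SQuAD-format dictionaries:

```python
import evaluate
from transformers import EvalPrediction

metric = evaluate.load("squad")

def compute_metrics(p: EvalPrediction):
    # Assumes p.predictions is a list of {"id": ..., "prediction_text": ...} dicts
    # and p.label_ids is a list of {"id": ..., "answers": ...} dicts, i.e. the
    # SQuAD format produced by the trainer's post-processing step.
    return metric.compute(predictions=p.predictions, references=p.label_ids)

# Then pass it in when building the trainer (other arguments omitted), e.g.:
#
# trainer = QuestionAnsweringTrainer(
#     model=model,
#     args=training_args,
#     eval_examples=eval_examples,
#     post_process_function=post_processing_function,
#     compute_metrics=compute_metrics,
#     ...
# )
```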
Q: Do we really always need to write that wrapper function?
A: For now, yes. By design it is not directly integrated with the outputs of the `evaluate` metrics, to give the different metrics' developers freedom to define what they want their inputs/outputs to look like.

But there might be hope to make `compute_metrics` more integrated with the `evaluate` metrics if someone picks this feature request up! https://discuss.huggingface.co/t/feature-request-adding-default-compute-metrics-to-popular-evaluate-metrics/33909/3