I am working on Chinese sequence-to-sequence generation and have the following HuggingFace Transformers code to train a sequence-to-sequence model.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_testset,
    eval_dataset=tokenized_evalset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
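For context, training_args is built roughly like the sketch below; the concrete hyperparameter values are placeholders rather than my exact settings, but predict_with_generate=True is set so that compute_metrics receives generated token ids, and load_best_model_at_end / metric_for_best_model are there for the EarlyStoppingCallback.

from transformers import Seq2SeqTrainingArguments

# Sketch of the training arguments; the concrete values are placeholders.
training_args = Seq2SeqTrainingArguments(
    output_dir="seq2seq-output",        # placeholder output path
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    predict_with_generate=True,         # compute_metrics decodes generated token ids
    load_best_model_at_end=True,        # needed for EarlyStoppingCallback
    metric_for_best_model="bleu",       # key returned by compute_metrics
)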
And the compute_metrics() function is as follows:
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Some models return a tuple (predictions plus extra outputs); keep only the first element.
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace the -100 used to mask the loss with the real pad token id before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    # `metric` is loaded elsewhere (not shown here) and returns both scores.
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    results = {"bleu": result["sacrebleu_score"], "chrf": result["chr_f_score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    results["gen_len"] = np.mean(prediction_lens)
    results = {k: round(v, 4) for k, v in results.items()}
    return results
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    # Each label is wrapped in a list, i.e. the multi-reference format used by sacreBLEU/chrF.
    labels = [[label.strip()] for label in labels]
    return preds, labels
where the tokenizer is BertTokenizer.from_pretrained('fnlp/bart-base-chinese')
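For completeness, the model and tokenizer are loaded roughly like this; the model class below is my assumption based on the checkpoint's model card, and only the tokenizer line appears verbatim above.

from transformers import BertTokenizer, BartForConditionalGeneration

# Assumed loading code: the fnlp/bart-base-chinese card pairs BertTokenizer
# with BartForConditionalGeneration.
tokenizer = BertTokenizer.from_pretrained("fnlp/bart-base-chinese")
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")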
From the training run's evaluation, the BLEU and chrF++ scores are around 24.126 and 22.440 respectively. However, when I run the following code to evaluate one of the sentence pairs from the same dataset, the BLEU and chrF++ scores come out 10 to 20 points higher. This code lives inside a class, hence the self. prefix in the functions.
def calculate_bleu(self, reference, input_sentence):
    # sacreBLEU on a single pair, using its built-in Chinese tokenizer.
    bleu = self.sacrebleu.compute(predictions=[input_sentence], references=[reference], tokenize='zh')
    return bleu["score"]

def calculate_chrf(self, reference, input_sentence):
    # word_order=2 makes this chrF++; lowercase=True lowercases before scoring.
    chrf = self.chrf.compute(predictions=[input_sentence], references=[reference], word_order=2, lowercase=True)
    return chrf["score"]
where self.chrf and self.sacrebleu are defined as:
self.sacrebleu = evaluate.load("sacrebleu")
self.chrf = evaluate.load("chrf")
where evaluate is the evaluate library (import evaluate).
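To make the comparison concrete, I call these standalone scorers roughly like this; scorer is a hypothetical instance of the class, and the sentences are made-up placeholders rather than actual pairs from my dataset.

# Hypothetical single-pair evaluation; `scorer` stands in for an instance of my class.
reference = "今天天气非常好。"          # placeholder gold sentence
prediction = "今天的天气非常好。"        # placeholder model output
print(scorer.calculate_bleu(reference, prediction))   # sacreBLEU with tokenize='zh'
print(scorer.calculate_chrf(reference, prediction))   # chrF++ (word_order=2), lowercased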
I want to investigate the reason behind this gap, and I suspect the problem lies within the compute_metrics() function: the latter approach does not pass through the tokenizer's encode/decode step at all.
Is there a way to run compute_metrics() on its own? What should I pass in for the eval_preds parameter?
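What I have in mind is something like the sketch below, built from the output of trainer.predict(); I am not sure whether this is the right way to construct eval_preds, and the sentences in the second variant are placeholders.

import numpy as np

# Variant 1 (assumption): reuse the arrays Seq2SeqTrainer itself would produce.
# With predict_with_generate=True, predictions are generated token ids and
# label_ids are the -100-masked labels, which is what compute_metrics expects.
# (predict_output.metrics should already contain the Trainer-computed scores.)
predict_output = trainer.predict(tokenized_evalset)
eval_preds = (predict_output.predictions, predict_output.label_ids)
print(compute_metrics(eval_preds))

# Variant 2 (assumption): hand-build eval_preds for a single pair by encoding the
# strings myself, so they go through the same decode + postprocess path.
pred_ids = tokenizer("今天的天气非常好。", return_tensors="np").input_ids   # placeholder prediction
label_ids = tokenizer("今天天气非常好。", return_tensors="np").input_ids    # placeholder reference
print(compute_metrics((pred_ids, label_ids)))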