I'm studying and running some experiments in the image captioning field, and one thing I haven't been able to figure out is which of the NLTK smoothing functions I should use when evaluating the models I train.
When I run BLEU tests without a smoothing function, I get a warning telling me to use one, but there are seven of them. Since no image captioning paper specifies how it computes its BLEU metric, I'm a bit lost at this point.
Which one should I use, and why?
The standard BLEU score from 2002 is a corpus-level score and is implemented in nltk.translate.bleu_score.corpus_bleu. It typically does not need smoothing, because it computes the n-gram precisions over the entire corpus, where zeros are unlikely. The metric reported in machine translation and image captioning papers is corpus-level BLEU. The warning in NLTK is triggered when an n-gram precision is zero; that only happens when the output quality is low (or there is some bug), and the score should not be trusted much in that case.
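For illustration, a minimal corpus-level sketch could look like the following (the captions are made-up placeholders; in practice the hypotheses come from your model and the references from the dataset's ground-truth annotations):

```python
from nltk.translate.bleu_score import corpus_bleu

# Hypothetical tokenized captions produced by the model (placeholders).
hypotheses = [
    ["a", "dog", "runs", "on", "the", "beach"],
    ["two", "people", "riding", "bikes"],
]

# One list of references per hypothesis; an image usually has several captions.
list_of_references = [
    [["a", "dog", "running", "along", "the", "beach"],
     ["a", "brown", "dog", "runs", "on", "the", "sand"]],
    [["two", "cyclists", "riding", "down", "the", "road"]],
]

# BLEU-4 with the default weights (0.25, 0.25, 0.25, 0.25).
# Precisions are pooled over the whole corpus, so smoothing is usually
# not needed here.
print(corpus_bleu(list_of_references, hypotheses))
```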
The sentence-level variant of BLEU from 2014, implemented in nltk.translate.bleu_score.sentence_bleu, computes the n-gram precisions at the sentence level. This often produces zeros, which leads to high variance of the scores and low correlation with human judgment, so some kind of smoothing is typically necessary. Sentence-level BLEU is, however, not a good sentence-level metric, and there are better alternatives, such as the chrF score.

Please also note that the NLTK implementation of BLEU is not the reference implementation used in most research papers (it uses a different tokenization). For comparison with research papers, the SacreBLEU implementation should be used; especially in machine translation, it is the de facto standard.
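Coming back to the question: if you do compute a sentence-level score with NLTK, the seven smoothing methods the warning refers to are exposed as method1 through method7 on nltk.translate.bleu_score.SmoothingFunction (method0 applies no smoothing). A minimal sketch, using method1 purely as an illustration rather than a recommendation:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Placeholder tokenized captions.
references = [
    ["a", "dog", "running", "along", "the", "beach"],
    ["a", "brown", "dog", "runs", "on", "the", "sand"],
]
hypothesis = ["a", "dog", "is", "running", "on", "the", "beach"]

smoothie = SmoothingFunction()  # exposes method0 ... method7

# No 3-gram or 4-gram matches here, so the unsmoothed BLEU-4 collapses to 0
# and NLTK prints the warning from the question.
raw = sentence_bleu(references, hypothesis)

# With smoothing (method1 adds a small epsilon to zero precision counts).
smoothed = sentence_bleu(references, hypothesis,
                         smoothing_function=smoothie.method1)
print(raw, smoothed)
```

And for numbers comparable to published results, a SacreBLEU call works on detokenized strings and handles tokenization internally (assuming the sacrebleu package is installed; the strings are again placeholders):

```python
import sacrebleu

hypotheses = ["a dog runs on the beach", "two people riding bikes"]
# One reference stream: the i-th string is the reference for the i-th hypothesis.
references = [["a dog running along the beach", "two cyclists riding down the road"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU on a 0-100 scale
```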