I'm studying and running some experiments in the image captioning field, and one thing I haven't been able to figure out is which of the NLTK smoothing functions I should use when evaluating the models I train.
When I run BLEU tests without a smoothing function, I get a warning telling me to use one, but there are seven of them. Since no image captioning paper specifies how it computes its BLEU metric, I'm a bit lost at this point.
Which one should I use, and why?
The standard BLEU score from 2002 is a corpus-level score and is implemented in nltk.translate.bleu_score.corpus_bleu. It typically does not need smoothing, because it computes the n-gram precisions over the entire corpus, where zeros are unlikely. The metric reported in machine translation and image captioning papers is corpus-level BLEU. The warning in NLTK is triggered when an n-gram precision is zero; that only happens when the output quality is low (or there is some bug), and the score should not be trusted much in that case.
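For illustration, a minimal corpus-level sketch could look like the following (the captions are made-up placeholders; in practice the hypotheses come from your model and the references from the dataset's ground-truth annotations):

```python
from nltk.translate.bleu_score import corpus_bleu

# Hypothetical tokenized captions produced by the model (placeholders).
hypotheses = [
    ["a", "dog", "runs", "on", "the", "beach"],
    ["two", "people", "riding", "bikes"],
]

# One list of references per hypothesis; an image usually has several captions.
list_of_references = [
    [["a", "dog", "running", "along", "the", "beach"],
     ["a", "brown", "dog", "runs", "on", "the", "sand"]],
    [["two", "cyclists", "riding", "down", "the", "road"]],
]

# BLEU-4 with the default weights (0.25, 0.25, 0.25, 0.25).
# Precisions are pooled over the whole corpus, so smoothing is usually
# not needed here.
print(corpus_bleu(list_of_references, hypotheses))
```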
The sentence-level variant of BLEU from 2014, implemented in nltk.translate.bleu_score.sentence_bleu, computes the n-gram precisions at the sentence level. This often produces zeros, which leads to high variance of the scores and low correlation with human judgment, so some kind of smoothing is typically necessary. Sentence-level BLEU is, however, not a good sentence-level metric, and there are better alternatives, such as the chrF score.

Please also note that the NLTK implementation of BLEU is not the reference implementation used in most research papers (it uses a different tokenization). For comparison with research papers, the SacreBLEU implementation should be used; especially in machine translation, it is the de facto standard.
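Coming back to the question: if you do compute a sentence-level score with NLTK, the seven smoothing methods the warning refers to are exposed as method1 through method7 on nltk.translate.bleu_score.SmoothingFunction (method0 applies no smoothing). A minimal sketch, using method1 purely as an illustration rather than a recommendation:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Placeholder tokenized captions.
references = [
    ["a", "dog", "running", "along", "the", "beach"],
    ["a", "brown", "dog", "runs", "on", "the", "sand"],
]
hypothesis = ["a", "dog", "is", "running", "on", "the", "beach"]

smoothie = SmoothingFunction()  # exposes method0 ... method7

# No 3-gram or 4-gram matches here, so the unsmoothed BLEU-4 collapses to 0
# and NLTK prints the warning from the question.
raw = sentence_bleu(references, hypothesis)

# With smoothing (method1 adds a small epsilon to zero precision counts).
smoothed = sentence_bleu(references, hypothesis,
                         smoothing_function=smoothie.method1)
print(raw, smoothed)
```

And for numbers comparable to published results, a SacreBLEU call works on detokenized strings and handles tokenization internally (assuming the sacrebleu package is installed; the strings are again placeholders):

```python
import sacrebleu

hypotheses = ["a dog runs on the beach", "two people riding bikes"]
# One reference stream: the i-th string is the reference for the i-th hypothesis.
references = [["a dog running along the beach", "two cyclists riding down the road"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU on a 0-100 scale
```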