NLTK sentence_bleu() returns 0 while evaluating Chinese sentences

I'm trying to evaluate Chinese sentence BLEU scores with NLTK's sentence_bleu() function. The code is as follows:

import nltk
import jieba

from transformers import AutoTokenizer, BertTokenizer, BartForConditionalGeneration

src = '樓上漏水耍花招不處理可以怎麼做'
ref = '上層漏水耍手段不去處理可以怎麼做'

checkpoint = 'fnlp/bart-base-chinese'
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

hypothesis_translations = []

for sentence in [src]:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=100, return_token_type_ids=False)
    outputs = model.generate(**inputs)
    translated_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
    hypothesis_translations.append(translated_sentence)

# for Reference tokenization
inputs_ref = tokenizer(ref, return_tensors="pt", truncation=True, max_length=100, return_token_type_ids=False)
outputs_ref = model.generate(**inputs_ref)
tokenized_ref = tokenizer.decode(outputs_ref[0], skip_special_tokens=True)

nltk_bleu = nltk.translate.bleu_score.sentence_bleu(tokenized_ref, hypothesis_translations)
print(nltk_bleu)

The output of printing nltk_bleu is 0.

But when I use the corpus_score() method of the SacreBLEU library, it returns a normal, expected result:

import evaluate
from sacrebleu.metrics import BLEU

bleu = BLEU()
bleu_score = bleu.corpus_score(references=tokenized_ref, hypotheses=hypothesis_translations)
print(bleu_score)

which returns:

BLEU = 4.79 73.3/3.6/1.9/1.0 (BP = 1.000 ratio = 15.000 hyp_len = 15 ref_len = 1)

How can I make NLTK's sentence_bleu() return correct results?


UPDATE: after taking NLTK's smoothing method 3 into consideration:

from nltk.translate.bleu_score import SmoothingFunction
smooth_fn = SmoothingFunction()
nltk_bleu = nltk.translate.bleu_score.sentence_bleu(tokenized_ref, hypothesis_translations, smoothing_function=smooth_fn.method3)

the value of nltk_bleu is still 0.

2 Answers

igrinis (Best Answer)

The function sentence_bleu expects a list of token lists as the references and a single list of tokens as the hypothesis. The input you supplied simply does not match those expectations: tokenized_ref is a plain string and hypothesis_translations is a list of untokenized strings.
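To make the expected shapes concrete, here is a minimal sketch with toy English tokens of my own (not from the question):

from nltk.translate.bleu_score import sentence_bleu

# Correct shapes: references is a list of token lists, hypothesis is a
# single token list.
references = [['the', 'cat', 'sat', 'down']]  # one pre-tokenized reference
hypothesis = ['the', 'cat', 'sat', 'down']
print(sentence_bleu(references, hypothesis))  # 1.0 for this exact match

# Passing a raw string instead silently "works" because strings are
# iterable, but NLTK then treats each character as a token (and each
# character of the reference string as a separate reference), which is
# how the score collapses to 0.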

Once you fix it, you will get:

from nltk.translate.bleu_score import SmoothingFunction

smooth_fn = SmoothingFunction()
nltk_bleu = nltk.translate.bleu_score.sentence_bleu([tokenized_ref.split(' ')], hypothesis_translations[0].split(' '), smoothing_function=smooth_fn.method3)
print(nltk_bleu)

>>>
0.43560338053780967

Also, take into account that by default sentence_bleu computes BLEU-4 (n-grams up to length 4), and that the different smoothing functions give different scores.
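For instance, assuming the tokenized_ref and hypothesis_translations variables from the question, a sketch of passing explicit weights to score lower n-gram orders:

import nltk

ref_tokens = tokenized_ref.split(' ')
hyp_tokens = hypothesis_translations[0].split(' ')

# The default weights are (0.25, 0.25, 0.25, 0.25), i.e. BLEU-4; lower
# orders are less likely to run into a zero n-gram precision.
bleu1 = nltk.translate.bleu_score.sentence_bleu([ref_tokens], hyp_tokens, weights=(1.0,))
bleu2 = nltk.translate.bleu_score.sentence_bleu([ref_tokens], hyp_tokens, weights=(0.5, 0.5))
print(bleu1, bleu2)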

newtover

It's fairly obvious that SacreBLEU applies some kind of smoothing while NLTK by default does not. BLEU is a geometric mean of n-gram precisions, so a single zero precision (say, no 4-gram overlap) drives the whole score to 0 unless it is smoothed.

I downloaded SacreBLEU and looked into the defaults of BLEU:

    def __init__(self, lowercase: bool = False,
                 force: bool = False,
                 tokenize: Optional[str] = None,
                 smooth_method: str = 'exp',
                 smooth_value: Optional[float] = None,
                 max_ngram_order: int = MAX_NGRAM_ORDER,
                 effective_order: bool = False,
                 trg_lang: str = '',
                 references: Optional[Sequence[Sequence[str]]] = None):
    ...
    @staticmethod
    def compute_bleu(correct: List[int],
                     total: List[int],
                     sys_len: int,
                     ref_len: int,
                     smooth_method: str = 'none',
                     smooth_value=None,
                     effective_order: bool = False,
                     max_ngram_order: int = MAX_NGRAM_ORDER) -> BLEUScore:
        """Computes BLEU score from its sufficient statistics with smoothing.

        Smoothing methods (citing "A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU",
        Boxing Chen and Colin Cherry, WMT 2014: http://aclweb.org/anthology/W14-3346)

        - none: No smoothing.
        - floor: Method 1 (requires small positive value (0.1 in the paper) to be set)
        - add-k: Method 2 (Generalizing Lin and Och, 2004)
        - exp: Method 3 (NIST smoothing method i.e. in use with mteval-v13a.pl)

From that we see that SacreBLEU uses "Method 3" ('exp') for smoothing by default.
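To see the difference the default makes, here is a small sketch on a toy, space-tokenized English pair of my own (not the question's data):

from sacrebleu.metrics import BLEU

hyps = ['a cat lay on the mat']      # one hypothesis
refs = [['the cat sat on a mat']]    # one reference stream

# 'exp' is Method 3 and the default; 'none' mirrors NLTK's unsmoothed
# behaviour. This pair has unigram overlap but no bigram overlap.
print(BLEU(smooth_method='exp').corpus_score(hyps, refs))   # stays above 0
print(BLEU(smooth_method='none').corpus_score(hyps, refs))  # collapses to 0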

Now let's look into the NLTK's version:

help(nltk.translate.bleu_score.sentence_bleu)

...

To avoid this harsh behaviour when no ngram overlaps are found a smoothing
function can be used.

    >>> chencherry = SmoothingFunction()
    >>> sentence_bleu([reference1, reference2, reference3], hypothesis2,
    ...     smoothing_function=chencherry.method1) # doctest: +ELLIPSIS
    0.0370...

...

The SmoothingFunction class implements all the smoothing methods from the paper cited above. As noted, you will need method3:

help(nltk.translate.bleu_score.SmoothingFunction.method3)

Help on function method3 in module nltk.translate.bleu_score:

method3(self, p_n, *args, **kwargs)
    Smoothing method 3: NIST geometric sequence smoothing
    The smoothing is computed by taking 1 / ( 2^k ), instead of 0, for each
    precision score whose matching n-gram count is null.
    k is 1 for the first 'n' value for which the n-gram match count is null.

    For example, if the text contains:

    - one 2-gram match
    - and (consequently) two 1-gram matches

    the n-gram count for each individual precision score would be:

    - n=1  =>  prec_count = 2     (two unigrams)
    - n=2  =>  prec_count = 1     (one bigram)
    - n=3  =>  prec_count = 1/2   (no trigram,  taking 'smoothed' value of 1 / ( 2^k ), with k=1)
    - n=4  =>  prec_count = 1/4   (no fourgram, taking 'smoothed' value of 1 / ( 2^k ), with k=2)
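As a quick sanity check of that 1 / ( 2^k ) rule, here is a toy pair of my own with unigram overlap but no higher-order matches:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['the', 'cat', 'sat', 'down']]
hypothesis = ['a', 'cat', 'lay', 'down']  # matches 'cat' and 'down' only

chencherry = SmoothingFunction()
print(sentence_bleu(reference, hypothesis))  # 0 without smoothing
print(sentence_bleu(reference, hypothesis,
                    smoothing_function=chencherry.method3))  # small but > 0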