nltk.translate.bleu_score gives false result when ngram larger than maximum ngrams of given sentence #1539

Closed
StarWang opened this issue Dec 9, 2016 · 5 comments

StarWang commented Dec 9, 2016

Given weight = [0.25, 0.25, 0.25, 0.25] (the default value),
sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'c']) = 0
while sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd']) = 0.7598
Obviously the former score should be larger than the latter, or both scores should be 0.

alvations commented Dec 9, 2016

Which version of the code are you using?

$ python
>>> import nltk
>>> nltk.__version__
'3.2.1'

The BLEU implementation was recently fixed when #1330 was resolved. If you're using the develop branch of nltk, this should be the output:

>>> import nltk
>>> from nltk import bleu
>>> ref = hyp = 'abc'
>>> bleu([ref], hyp)
1.0
>>> ref, hyp = 'abc', 'abd'
>>> bleu([ref], hyp)
0.7598356856515925

Since a string is treated as a list of characters, and nltk exposes sentence_bleu() among its top-level imports, the code above is the same as:

>>> from nltk.translate.bleu_score import sentence_bleu
>>> sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'c'])
1.0
>>> sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd'])
0.7598356856515925

To install the latest develop branch, try:

pip install https://github.com/nltk/nltk/archive/develop.zip

(Do note that the develop branch is subject to more unexpected bugs, and it is recommended that users install the master branch or an official release.)


On a related note, though not directly involved with the current nltk implementation of BLEU: the previous implementation, without the #1330 fix, is subject to the same flaws as the popular multi-bleu.perl. You might find it interesting to know why it returned 0 before the recent fix: https://gist.github.com/alvations/e5922afa8c91472d25c58b2d712a93e7

StarWang commented Dec 9, 2016

Thanks @alvations. The version of nltk I originally used was 3.2. I have updated it to 3.2.1, and it now raises a ZeroDivisionError. I am using Python 3.5.2.

alvations commented Dec 9, 2016

The only stable version of BLEU is in the develop branch. Please wait for it to be released in NLTK 3.2.2, or install the develop branch (but do note that the development branch might be subject to untested bugs).

StarWang commented Dec 9, 2016

OK, I will wait. But in the case you mentioned above, if the weights are [0.25, 0.25, 0.25, 0.25], the results of sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'c']) and sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd']) should both be 0, according to the original paper.

alvations commented Dec 9, 2016

The original paper didn't account for the fact that p_n can be 0 when the reference/hypothesis is shorter than n; see the equation in Section 2.3 of http://www.aclweb.org/anthology/P02-1040.pdf. Because BLEU was meant to be a corpus-level score, the possibility of references/hypotheses shorter than n was not covered in the paper.

If we look at the formula in Section 2.3, it takes exp(sum_n w_n * log p_n), and when any p_n is 0 it runs into a math domain error, because the logarithm function y = log x has an asymptote at x = 0, so it is only defined for x > 0.
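
For illustration, here is a minimal sketch (not nltk's actual code) of that geometric mean with the default uniform weights; the zero 4-gram precision of a 3-token sentence is exactly what triggers the error:

>>> from math import exp, log
>>> weights = [0.25, 0.25, 0.25, 0.25]
>>> # 'a b c' scored against itself: unigram/bigram/trigram precisions are all 1,
>>> # but a 3-token sentence has no 4-grams, so p_4 = 0
>>> precisions = [1.0, 1.0, 1.0, 0.0]
>>> exp(sum(w * log(p) for w, p in zip(weights, precisions)))
Traceback (most recent call last):
  ...
ValueError: math domain error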

So a faithful implementation of the original BLEU should give the user a warning that says something like "BLEU can't be computed" whenever the math domain error occurs. Later versions of BLEU try to fix this with several different hacks; the history of those versions can be found at https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl#L17

Please note that the latest rendition of BLEU in nltk comes with the smoothing functions from the Chen and Cherry (2014) paper, which are not in the Moses version of mteval.pl.
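
For example (assuming the develop branch), one of the Chen and Cherry smoothing methods can be selected through the SmoothingFunction class; method1 below is just one of the available methods:

>>> from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
>>> chencherry = SmoothingFunction()
>>> # with smoothing, a zero higher-order precision no longer zeroes out the score
>>> score = sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd'],
...                       smoothing_function=chencherry.method1)
>>> score > 0
True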

I hope the explanation helps.
