nltk.translate.bleu_score gives false result when ngram larger than maximum ngrams of given sentence #1539

Closed
StarWang opened this issue Dec 9, 2016 · 5 comments

StarWang commented Dec 9, 2016

Given weight = [0.25, 0.25, 0.25, 0.25] (the default value),
sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'c']) = 0
while sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd']) = 0.7598
Obviously the former score should be larger than the latter, or both scores should be 0.

alvations commented Dec 9, 2016

Which version of the code are you using?

$ python
>>> import nltk
>>> nltk.__version__
'3.2.1'

The BLEU implementation was recently fixed when #1330 was resolved. If you're using the develop branch of nltk, this should be the output:

>>> import nltk
>>> from nltk import bleu
>>> ref = hyp = 'abc'
>>> bleu([ref], hyp)
1.0
>>> ref, hyp = 'abc', 'abd'
>>> bleu([ref], hyp)
0.7598356856515925

Since a string is treated as a list of characters, and nltk exposes sentence_bleu() among its top-level imports, the code above is the same as:

>>> from nltk.translate.bleu_score import sentence_bleu
>>> sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'c'])
1.0
>>> sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd'])
0.7598356856515925

To install the latest develop branch, try:

pip install https://github.com/nltk/nltk/archive/develop.zip

(Do note that the develop branch is subject to more unexpected bugs, and it is recommended that users install the master branch or an official release.)


On a related note, though not directly involved with the current nltk implementation of BLEU: the previous implementation, without the #1330 fix, is subject to the same flaws as the popular multi-bleu.perl. You might find it interesting to know why it returned 0 before the recent fix: https://gist.github.com/alvations/e5922afa8c91472d25c58b2d712a93e7

StarWang commented Dec 9, 2016

Thanks @alvations. The version of nltk I originally used was 3.2. I have updated it to 3.2.1, and it now raises a ZeroDivisionError. I am using Python 3.5.2.

alvations commented Dec 9, 2016

The only stable version of BLEU is in the develop branch. Please wait for it to be released in NLTK 3.2.2, or install the develop branch (but do note that the development branch might be subject to untested bugs).

StarWang commented Dec 9, 2016

OK, I will wait. But in the case you mentioned above, if the weights are [0.25, 0.25, 0.25, 0.25], the results of sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'c']) and sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd']) should both be 0, according to the original paper.

alvations commented Dec 9, 2016

The original paper didn't account for the fact that p_n can be 0 when the reference/hypothesis is shorter than n; see the equation in Section 2.3 of http://www.aclweb.org/anthology/P02-1040.pdf. Because BLEU was meant to be a corpus-level score, the possibility of references/hypotheses shorter than n was not covered in the paper.

If we look at the formula in Section 2.3, it takes exp(sum_n w_n * log p_n), and when any p_n is 0 it runs into a math domain error, because the logarithm function y = log x has an asymptote at x = 0, so it is only defined for x > 0.
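
For illustration, here is a minimal sketch (not nltk's actual code) of that geometric mean with the default uniform weights; the zero 4-gram precision of a 3-token sentence is exactly what triggers the error:

>>> from math import exp, log
>>> weights = [0.25, 0.25, 0.25, 0.25]
>>> # 'a b c' scored against itself: unigram/bigram/trigram precisions are all 1,
>>> # but a 3-token sentence has no 4-grams, so p_4 = 0
>>> precisions = [1.0, 1.0, 1.0, 0.0]
>>> exp(sum(w * log(p) for w, p in zip(weights, precisions)))
Traceback (most recent call last):
  ...
ValueError: math domain error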

So a faithful implementation of the original BLEU should give the user a warning that says something like "BLEU can't be computed" whenever the math domain error occurs. Later versions of BLEU try to fix this with several different hacks; the history of those versions can be found at https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl#L17

Please note that the latest rendition of BLEU in nltk comes with the smoothing functions from the Chen and Cherry (2014) paper, which are not in the Moses version of mteval.pl.
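
For example (assuming the develop branch), one of the Chen and Cherry smoothing methods can be selected through the SmoothingFunction class; method1 below is just one of the available methods:

>>> from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
>>> chencherry = SmoothingFunction()
>>> # with smoothing, a zero higher-order precision no longer zeroes out the score
>>> score = sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd'],
...                       smoothing_function=chencherry.method1)
>>> score > 0
True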

I hope the explanation helps.
