
Hotfix/bug on bleu score #1545

Closed
wants to merge 2 commits

Conversation

liuq901 commented Dec 14, 2016

The total BLEU score is the geometric mean of the n-gram precisions. If any n-gram precision equals zero, the BLEU score should be zero.
The original code ignores the zero values to avoid log(0.0), but it then returns a non-zero score, which I think is wrong.

The unigram check is also wrong. Even with no unigram matches, the fraction can be non-zero because of the smoothing function.

Sorry for my poor English.

alvations (Contributor) commented Dec 14, 2016

Could you explain the problem further with some test cases / examples and why the changes you've made would fix that?

From a quick look, your changes would revert the fixes from #1330, which tries to emulate mteval-v13a.pl to reproduce the BLEU scores from the original paper. I think your commits would also break the fringe-case tests in nltk.test.unit.translate.test_bleu.py.


The total BLEU score is the geometric mean of the n-gram precisions. If any n-gram precision equals zero, the BLEU score should be zero.

Yes, the code at https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L163 checks for that. Although this is not described in the original BLEU paper, it is a logical thing to do: there is no need to check the other n-gram orders and end up with a math domain error later on when computing the geometric mean of the modified n-gram precisions.
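
A minimal sketch of the logic being described (illustrative only, not the exact nltk source):

    import math

    def bleu_core_sketch(p_n, weights, bp=1.0):
        # Illustrative sketch: p_n holds the modified n-gram precisions
        # (fractions.Fraction values in nltk). With zero unigram overlap,
        # every higher order is zero too, so short-circuit to 0 instead
        # of hitting math.log(0.0) in the weighted sum below.
        if p_n[0].numerator == 0:
            return 0
        # Behavior discussed here: zero higher-order precisions are
        # skipped rather than zeroing out the whole geometric mean.
        s = (w * math.log(p) for w, p in zip(weights, p_n) if p)
        return bp * math.exp(math.fsum(s))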

The original code ignores the zero values to avoid log(0.0), but it then returns a non-zero score, which I think is wrong.

Is it wrong because it's not emulating the original BLEU paper? We're not following the original paper because it didn't cover many fringe cases, which the MT community has found perplexing, especially when using BLEU at the sentence level. A historical account of how BLEU has changed can be found at https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl#L17

The unigram check is also wrong. Even with no unigram matches, the fraction can be non-zero because of the smoothing function.

Please explain this claim further. Without unigram matches, there should be 0 counts for all higher n-gram orders. E.g. if the hypothesis sentence is 'this is a foo bar sentence' and the reference sentence is 'the cats are on the mat', there should be no n-gram overlaps, which naturally returns zero counts for every n-gram order:

>>> from nltk import everygrams, ngrams

# 1 grams.
>>> hyp = list(ngrams('this is a foo bar sentence'.split(), n=1))
>>> ref = list(ngrams('the cats are on the mat'.split(), n=1))
>>> set(hyp).intersection(ref)
set([])

# 4 grams.
>>> hyp = list(everygrams('this is a foo bar sentence'.split(), 1, 4))
>>> ref = list(everygrams('the cats are on the mat'.split(), 1, 4))
>>> hyp
[('this',), ('is',), ('a',), ('foo',), ('bar',), ('sentence',), ('this', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar'), ('bar', 'sentence'), ('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar'), ('foo', 'bar', 'sentence'), ('this', 'is', 'a', 'foo'), ('is', 'a', 'foo', 'bar'), ('a', 'foo', 'bar', 'sentence')]
>>> ref
[('the',), ('cats',), ('are',), ('on',), ('the',), ('mat',), ('the', 'cats'), ('cats', 'are'), ('are', 'on'), ('on', 'the'), ('the', 'mat'), ('the', 'cats', 'are'), ('cats', 'are', 'on'), ('are', 'on', 'the'), ('on', 'the', 'mat'), ('the', 'cats', 'are', 'on'), ('cats', 'are', 'on', 'the'), ('are', 'on', 'the', 'mat')]
>>> set(hyp).intersection(ref)
set([])

There would be cases where there are 0 unigram matches but still an okay translation, and I think those cases are the ones that you wanted to cover in your PR:

>>> ref = list(ngrams('the cats are on the mat'.split(), n=1))
>>> hyp = list(ngrams('those feline at carpet'.split(), n=1))
>>> set(hyp).intersection(ref)
set([])

I suppose the hypothesis above should get some credit even though there are zero matches, but I think METEOR would be a better metric for such cases. If we allow this kind of zero-unigram-match case to slip through and get a smoothed score, unrelated sentences would also be smoothed. I'm not sure which is the better approach: allow every hypothesis to have a non-zero score (i.e. smooth every possible sentence), or be strict and enforce zero scores when there are no unigram matches (i.e. allow zero BLEU scores).

Even with no unigram matches, the fraction can be non-zero because of the smoothing function.

In the case that there are no unigram matches, I don't think we should smooth the BLEU score at all. This is partly because there is no reason to do so at the corpus level, and at the sentence level, assigning a smoothed BLEU score when there are no unigram matches would be too permissive.

Furthermore, some of the smoothing methods summarized in Chen and Cherry (2014) require the previous n-gram orders to have non-zero counts, e.g. method 3 or method 6.
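
For reference, a rough sketch of method 3 (NIST geometric sequence smoothing) as Chen and Cherry describe it; the function name and shape here are illustrative:

    def nist_geometric_smoothing(p_n):
        # Chen & Cherry (2014), method 3: the k-th zero precision
        # (counting from 1) is replaced by 1 / 2**k, so each successive
        # zero-count order receives a progressively smaller value.
        k = 1
        smoothed = []
        for p in p_n:
            if p == 0:
                smoothed.append(1.0 / 2 ** k)
                k += 1
            else:
                smoothed.append(p)
        return smoothed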

Possibly the hard constraint of skipping the smoothing when there are no unigram counts is wrong. I may not be the best person to vouch for this; perhaps someone more experienced in MT research should advise us on whether the fixes you suggested are more appropriate.


You might find the discussion on #1539 related to this issue.

return bp * math.exp(math.fsum(s))
except ValueError:
    # math.log(0.0) raises ValueError; a zero precision zeroes the
    # geometric mean, so return 0.
    return 0.0

alvations (Contributor) commented Dec 15, 2016


This try-except for ValueError should not trigger. The purpose of the smoothing is to ensure that the summation over the logs doesn't contain any 0 values. Although the default is no smoothing, the if condition at https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L170 skips any n-gram order with 0 counts to avoid log(0.0), so it should still not result in a ValueError.

Could you write some test cases or examples that cause the ValueError?

Also, with your PR, could you run the unit tests to check that it doesn't break other parts of the BLEU implementation:

~/git-stuff/nltk$ ls
AUTHORS.md              LICENSE.txt             RELEASE-HOWTO.txt       nltk/                   pip-req.txt             tox.ini
CONTRIBUTING.md         MANIFEST.in             coverage.xml            nosetests.xml           setup.cfg               unittest.out
ChangeLog               Makefile                jenkins-job-config.xml  nosetests_scrubbed.xml  setup.py                web/
INSTALL.txt             README.md               jenkins.sh*             output                  tools/
~/git-stuff/nltk$ python -c "import nltk; nltk.download('wmt15_eval')"
~/git-stuff/nltk$ python -m unittest nltk.test.unit.translate.test_bleu
..........
----------------------------------------------------------------------
Ran 10 tests in 0.631s

OK

liuq901 (Author) commented Dec 15, 2016

I have added some commits to remove my change to the unigram check.

The problem is that, for example, we have

[image: a reference sentence with two hypotheses, hyp1 and hyp2]

hyp1 is better than hyp2, but if we calculate the BLEU score with weights = (0.5, 0.5), the original code gives

[image: BLEU scores where hyp2 scores higher than hyp1]

a higher score for hyp2.

This is because the 2-gram count of hyp2 is 0: the original code ignores it and calculates only the 1-gram score, which is unfair to other sentences whose 1-gram and 2-gram scores are both counted.

If all the n-gram counts in p_n are non-zero, my code and the original code behave the same. If some value in p_n equals zero, the original code IGNORES it, which can give a very high BLEU score, as in my example; I think the right behavior is to return 0. So my code removes the zero-value check at https://github.com/liuq901/nltk/blob/hotfix/bug-on-bleu_score/nltk/translate/bleu_score.py#L174 and instead catches the exception and returns 0.
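
To make the asymmetry concrete, here is a small self-contained sketch with made-up precision values (not the numbers from my screenshots):

    import math

    def geo_mean_bleu(p_n, weights, skip_zeros):
        # Simplified BLEU core (brevity penalty omitted): weighted
        # geometric mean of the modified n-gram precisions p_n.
        terms = []
        for w, p in zip(weights, p_n):
            if p == 0:
                if skip_zeros:
                    continue   # original behavior: drop the zero term
                return 0.0     # this PR's behavior: the whole score is 0
            terms.append(w * math.log(p))
        return math.exp(math.fsum(terms))

    # hyp1: good 1-gram and 2-gram overlap      -> p_n = (0.8, 0.5)
    # hyp2: perfect 1-gram, zero 2-gram overlap -> p_n = (1.0, 0.0)
    print(geo_mean_bleu((0.8, 0.5), (0.5, 0.5), skip_zeros=True))  # ~0.63
    print(geo_mean_bleu((1.0, 0.0), (0.5, 0.5), skip_zeros=True))  # 1.0

With zero-skipping, hyp2 outscores hyp1 even though it matches no bigram at all.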

And my code passed the unit tests:

[image: unittest output showing all tests passing]

alvations (Contributor) commented Dec 15, 2016

I don't think returning zero in this example is the right thing to do: hyp2 does have a unigram match, so its score should be non-zero. But you're right that hyp2 should not get a higher score than hyp1 with uniform weights.

Imagine you have a reference and hypothesis as follows; with the fix you suggested, bleu([ref], hyp) would return 0, but that shouldn't be the case:

>>> from random import shuffle
>>> ref = list(range(10))
>>> hyp = list(range(10))
>>> shuffle(hyp)
>>> ref
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> hyp
[1, 0, 6, 3, 2, 7, 9, 5, 4, 8]

Maybe someone else more experienced in MT should advise on this. I would just enforce smoothing; possibly setting the default to always smooth is the right thing to do when there is a unigram count but no higher n-gram counts.


Regarding the commits that undo your deletion of the unigram check: please rebase your branch onto the main develop branch, then recommit your changes without the deletion and re-adding of that code.

liuq901 (Author) commented Dec 15, 2016

I am also not familiar with MT, but some people say one disadvantage of the BLEU score is that if the sentence is short, the 3-grams and 4-grams may have zero counts, which drives the final BLEU score to 0. This means that even though hyp2 has a unigram match, the total score is still zero, because it is a geometric mean. I think the right way to avoid this is to use appropriate weights and a smoothing function.
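
For example, assuming nltk's sentence_bleu API, a short hypothesis can be scored with lower-order weights only (the sentences here are made up):

>>> from nltk.translate.bleu_score import sentence_bleu
>>> ref = 'it is a small cat'.split()
>>> hyp = 'it is cat'.split()
>>> # Use unigram and bigram weights only, so the missing 3-grams and
>>> # 4-grams of the short hypothesis cannot zero out the geometric mean.
>>> score = sentence_bleu([ref], hyp, weights=(0.5, 0.5))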

And I will rebase my commits. Thanks

alvations (Contributor) commented Dec 15, 2016

But some people say one disadvantage of the BLEU score is that if the sentence is short, the 3-grams and 4-grams may have zero counts, which drives the final BLEU score to 0. This means that even though hyp2 has a unigram match, the total score is still zero, because it is a geometric mean.

This is not logical, especially when distortion and reordering are taken into account. Within MT applications, or any other sequential task that needs BLEU, assigning zero to a hypothesis that only contains n-gram orders below the n set by the user is not rational. E.g., imagine the sentence:

>>> from nltk import bleu
>>> src = "eat the sushi".split()
>>> ref = ["寿司", "を", "食べろ"] # "the", "object-marker", "eat"
>>> hyp = ["食べろ" ,"寿司"] # "eat", "sushi"
>>> bleu([ref], hyp)
0.6065306597126334

In this case the unigram chain is totally adequate though not fluent, and giving it a zero BLEU would be wrong.

P/S: Don't just look at the geometric mean formula in the original BLEU paper; even mteval-v13a.pl has heavily revised its implementation. The users of these functions will end up raising more issues if our BLEU implementation doesn't fit the typical use cases =)

liuq901 (Author) commented Dec 15, 2016

But according to the original formula from the paper,

    BLEU = BP * exp( sum_{n=1}^{N} w_n * log(p_n) )

when some p_n = 0, the sum in the parentheses is -inf, and exp(-inf) = 0.

I think that for a short sentence, calculating high-order n-grams is not suitable. Giving a 0 BLEU score is not a problem, because it can be made non-zero by adjusting the weights and the smoothing function parameters. We should not accept a wrong BLEU score even if it seems to make sense.

alvations (Contributor) commented Dec 15, 2016

P/S: Don't just look at the geometric mean formula in the original BLEU paper; even mteval-v13a.pl has heavily revised its implementation. The users of these functions will end up raising more issues if our BLEU implementation doesn't fit the typical use cases =)

It's a strict implementation that follows the paper versus a functional implementation that suits typical usage. We need to arbitrate between the two; I would go for the latter.

Please take a look at https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl#L17

liuq901 (Author) commented Dec 15, 2016

Sorry, I cannot understand Perl scripts. I think the problem is that the default setting of BLEU counts up to 4-grams and has no smoothing function.

In your case, the value makes sense because the code ignores the 2-gram to 4-gram values and only counts the 1-gram score. But the user does not know that the code ONLY calculated the 1-gram value; according to the documentation, it should calculate the value from 1-grams to 4-grams.

In my case, one sentence gets credit for its 1-grams and 2-grams while another is scored only on 1-grams, which is unfair. I believe this is the bigger problem, because the original code can give a higher BLEU score to a bad sentence. My code may drive some good sentences to 0, but the bad ones are also 0 in those cases.

I understand your issue: some good examples may get a 0 value because they are too short. But the key problem is, using the same weights with no smoothing function, how do we avoid the unfair behavior? Could you give me any other comments?

alvations (Contributor) commented Dec 16, 2016

When using BLEU, I think it's best to smooth by default with method4, especially for short translations and especially for the examples raised in this issue:

    def method4(self, p_n, references, hypothesis, hyp_len):
        """
        Smoothing method 4:
        Shorter translations may have inflated precision values due to having
        smaller denominators; therefore, we give them proportionally
        smaller smoothed counts. Instead of scaling to 1/(2^k), Chen and Cherry
        suggests dividing by 1/ln(len(T)), where T is the length of the translation.
        """
        incvnt = 1
        for i, p_i in enumerate(p_n):
            if p_i.numerator == 0 and hyp_len != 0:
                p_n[i] = incvnt * self.k / math.log(hyp_len) # Note that this K is different from the K from NIST.
                incvnt+=1
        return p_n
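
For example, assuming the SmoothingFunction API in nltk.translate.bleu_score, a caller would opt into method4 like this (sentences made up, output not shown):

>>> from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
>>> ref = 'the cat is on the mat'.split()
>>> hyp = 'the cat sat on mat'.split()
>>> # method4 only fills in the zero higher-order precisions.
>>> score = sentence_bleu([ref], hyp,
...                       smoothing_function=SmoothingFunction().method4)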

I'm not sure what the default in NLTK should be, though.

liuq901 (Author) commented Dec 16, 2016

All the problems come from 0 n-gram counts. In the original code, ignoring them leads to wrong results; in my code, returning 0.0 leads to scores that don't make sense.

I think using a smoothing function is the best way, but changing the default value may cause other problems.

What about throwing an exception when we find 0 n-gram counts? This would help users understand what happened, and the error message could suggest using a smoothing function.
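
A minimal sketch of that idea (the helper name and the message wording are hypothetical):

    import warnings

    def check_zero_ngram_counts(p_n):
        # Hypothetical helper: flag zero modified n-gram precisions so
        # users know why an unsmoothed BLEU score collapses to 0.
        if any(p == 0 for p in p_n):
            warnings.warn(
                "Some n-gram orders have 0 counts; BLEU will be 0 unless "
                "a smoothing function is used.",
                UserWarning,
            )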

alvations (Contributor) commented Dec 19, 2016

After some more testing, here's a clear example of why this bug is quite serious. It seems that discarding the higher n-gram orders when they have no matches leads to problems with distortion (i.e. word-ordering issues).

With the current BLEU implementation:

>>> from nltk import bleu
>>> ref = 'let it go'.split()
>>> hyp = 'let go it'.split()
>>> bleu([ref], hyp)
1.0

(All three unigrams match, so p_1 = 1, and the zero higher-order precisions are simply skipped, leaving a perfect score.) But the fix in this PR wouldn't really solve the problem, since it would return:

>>> from nltk import bleu
>>> ref = 'let it go'.split()
>>> hyp = 'let go it'.split()
>>> bleu([ref], hyp)
0.0

A distorted hypothesis should get a non-zero score, but surely not a score of 1.0. We'll have to look at how mteval-v13a.pl and Moses' BLEU handle such cases and get back to this PR / issue.

stevenbird (Member) commented Dec 29, 2016

Thanks @liuq901 and @alvations – I agree #1554 is a good solution.
