Hotfix/bug on bleu score #1545
Conversation
Could you explain the problem further with some test cases / examples and why the changes you've made would fix that?
Yes, the code at https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L163 checks for that. Although this check is not described in the original BLEU paper, it is a logical thing to do: when there are no unigram matches, there is no need to check the higher n-gram orders only to end up with a math domain error later, when taking the logs inside the geometric mean of the modified n-gram precisions.
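For reference, this is the failure mode being guarded against: Python's `math.log` raises a `ValueError` ("math domain error") on zero input, which is exactly what happens inside the geometric mean if an order with zero counts slips through. A minimal sketch with made-up precision values:

```python
import math

# Weighted log-sum of modified precisions, as inside BLEU's geometric mean.
# A single zero precision blows up the whole sum with a ValueError:
weights = (0.5, 0.5)
precisions = [0.5, 0.0]  # hypothetical p_1, p_2 where no bigram matched
try:
    score = math.exp(math.fsum(w * math.log(p)
                               for w, p in zip(weights, precisions)))
except ValueError as err:
    score, error = 0.0, str(err)  # "math domain error"
```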
Is it wrong because it doesn't emulate the original BLEU paper? We deliberately don't follow the original paper to the letter, because it didn't cover many fringe cases that the MT community has found perplexing, especially when using BLEU at the sentence level. A historical account of how BLEU has changed can be found at https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl#L17
Please explain this claim further. Without any unigram matches, there should be zero counts for all higher n-gram orders. E.g. if the hypothesis sentence is 'this is a foo bar sentence' and the reference sentence is 'the cats are on the mat', there should be no n-gram overlap at any order, which naturally yields n-gram precisions of zero:

>>> from nltk import everygrams, ngrams
# 1-grams.
>>> hyp = list(ngrams('this is a foo bar sentence'.split(), n=1))
>>> ref = list(ngrams('the cats are on the mat'.split(), n=1))
>>> set(hyp).intersection(ref)
set([])
# 1-grams to 4-grams.
>>> hyp = list(everygrams('this is a foo bar sentence'.split(), 1, 4))
>>> ref = list(everygrams('the cats are on the mat'.split(), 1, 4))
>>> hyp
[('this',), ('is',), ('a',), ('foo',), ('bar',), ('sentence',), ('this', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar'), ('bar', 'sentence'), ('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar'), ('foo', 'bar', 'sentence'), ('this', 'is', 'a', 'foo'), ('is', 'a', 'foo', 'bar'), ('a', 'foo', 'bar', 'sentence')]
>>> ref
[('the',), ('cats',), ('are',), ('on',), ('the',), ('mat',), ('the', 'cats'), ('cats', 'are'), ('are', 'on'), ('on', 'the'), ('the', 'mat'), ('the', 'cats', 'are'), ('cats', 'are', 'on'), ('are', 'on', 'the'), ('on', 'the', 'mat'), ('the', 'cats', 'are', 'on'), ('cats', 'are', 'on', 'the'), ('are', 'on', 'the', 'mat')]
>>> set(hyp).intersection(ref)
set([])

There would be cases with zero unigram matches that are still okay translations, and I think those are the cases you wanted to cover in your PR:

>>> ref = list(ngrams('the cats are on the mat'.split(), n=1))
>>> hyp = list(ngrams('those feline at carpet'.split(), n=1))
>>> set(hyp).intersection(ref)
set([])

I suppose the hypothesis above should get some credit although there are zero matches, but I think METEOR would be a better metric for such cases. If we allow this zero-unigram case to slip through and receive a smoothed score, completely unrelated sentences would also be smoothed. I'm not sure which is the better approach: allow every hypothesis to have a non-zero score (i.e. smooth every possible sentence), or be strict and enforce zero scores when there are no unigram matches (i.e. allow zero BLEU scores).
In the case that there are no unigram matches, I don't think we should smooth the BLEU score at all. This is partly because there is no reason to do so at the corpus level, and at the sentence level, assigning a smoothed BLEU score when there are no unigram matches would be too permissive. Furthermore, some of the smoothing methods summarized in Chen and Cherry (2014), e.g. method 3 or method 6, require the previous n-gram order to have non-zero counts. Possibly the hard constraint of skipping the smoothing when there are no unigram counts is wrong. I may not be the best person to vouch for this; perhaps someone more experienced with MT research should advise us on whether the fixes you suggested are more appropriate. You might find the discussion on #1539 related to this issue.
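To make that concrete, here is a rough sketch of Chen and Cherry's method 3 (NIST-style geometric smoothing). This is illustrative only, not NLTK's `SmoothingFunction` code; it assumes the convention that the k-th zero-count order gets numerator 1/2^k, and it represents precisions as plain (numerator, denominator) pairs:

```python
from math import exp, log

def nist_geometric_smoothing(p_n):
    """Illustrative method-3 smoothing over (numerator, denominator) pairs:
    the k-th order with a zero numerator is replaced by 1 / (2**k * denom),
    so successive empty orders receive geometrically less credit."""
    k, smoothed = 1, []
    for num, den in p_n:
        if num == 0:
            smoothed.append(1 / (2**k * den))
            k += 1
        else:
            smoothed.append(num / den)
    return smoothed

# Hypothetical modified precisions for a short hypothesis: unigrams match,
# the higher orders do not.
p_n = [(5, 6), (0, 5), (0, 4), (0, 3)]
score = exp(sum(0.25 * log(p) for p in nist_geometric_smoothing(p_n)))
```

In this simplified form the smoothing applies regardless of the lower orders, and the score stays non-zero even when every order above the unigrams is empty; method 6, by contrast, interpolates with the previous orders' precisions, which is where the dependence on non-zero lower-order counts comes in.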
        return bp * math.exp(math.fsum(s))
    except ValueError as e:
        # ValueError means log(0.0). Return 0.
        return 0.0
This try-except with `ValueError` should not happen. The purpose of the smoothing is to ensure that the summation over the logs doesn't contain any zero values. Although the default is no smoothing, to avoid log(0.0) the if-condition at https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L170 would skip any n-gram order with zero counts, so that should still not result in a `ValueError`.
Could you write some test cases or examples that cause the `ValueError`?
Also, with your PR, could you help run the unit tests to check that it doesn't break other parts of BLEU:
~/git-stuff/nltk$ ls
AUTHORS.md LICENSE.txt RELEASE-HOWTO.txt nltk/ pip-req.txt tox.ini
CONTRIBUTING.md MANIFEST.in coverage.xml nosetests.xml setup.cfg unittest.out
ChangeLog Makefile jenkins-job-config.xml nosetests_scrubbed.xml setup.py web/
INSTALL.txt README.md jenkins.sh* output tools/
~/git-stuff/nltk$ python -c "import nltk; nltk.download('wmt15_eval')"
~/git-stuff/nltk$ python -m unittest nltk.test.unit.translate.test_bleu
..........
----------------------------------------------------------------------
Ran 10 tests in 0.631s
OK
I have some commits that delete the unigram check. The problem is that, for example, given a hyp1 that is better than hyp2, if we calculate the BLEU score with weights = (0.5, 0.5), the original code gives hyp2 the higher score. This is because the 2-gram count of hyp2 is 0, so the original code ignores it and only calculates the 1-gram score, which is unfair to other sentences whose scores combine the 1-gram and 2-gram precisions. If all the n-gram counts in p_n are non-zero, my code and the original code are the same. If some values in p_n equal zero, the original code will IGNORE them, which gives a very high BLEU score, just like my example; I think the right behaviour is to return 0. So my code doesn't have the zero-value check at https://github.com/liuq901/nltk/blob/hotfix/bug-on-bleu_score/nltk/translate/bleu_score.py#L174; instead I catch the exception and return 0.
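To make the unfairness concrete, here is a self-contained sketch (with made-up sentences, not the ones from this thread, and with the brevity penalty omitted for clarity) comparing the two behaviours: skipping zero-count orders, as the current code does, versus returning 0, as this PR does:

```python
from collections import Counter
from math import exp, log

def modified_precision(ref, hyp, n):
    """Clipped n-gram precision as (numerator, denominator)."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(c, ref_ngrams[ng]) for ng, c in hyp_ngrams.items())
    return clipped, max(sum(hyp_ngrams.values()), 1)

def geo_mean(ref, hyp, weights, skip_zeros):
    """Weighted geometric mean of n-gram precisions; brevity penalty omitted."""
    logs = []
    for n, w in enumerate(weights, start=1):
        num, den = modified_precision(ref, hyp, n)
        if num == 0:
            if skip_zeros:
                continue      # current behaviour: drop this order entirely
            return 0.0        # PR behaviour: geometric mean is zero
        logs.append(w * log(num / den))
    return exp(sum(logs))

ref = 'the cat is on the mat'.split()
hyp1 = 'the cat sat on the mat'.split()   # fluent, shares several bigrams
hyp2 = 'cat the on is mat the'.split()    # scrambled, zero bigram matches

# With zero-count orders skipped, the scrambled hyp2 beats the better hyp1:
w = (0.5, 0.5)
score1 = geo_mean(ref, hyp1, w, skip_zeros=True)
score2 = geo_mean(ref, hyp2, w, skip_zeros=True)
```

Here hyp2's unigram precision is 6/6 and its empty bigram order is silently dropped, so it scores 1.0 and outranks hyp1's sqrt(5/6 * 3/5), roughly 0.707; with the PR's behaviour, hyp2 instead scores 0.0.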
I don't think returning zero in this example is the right thing to do. Imagine you have a reference and hypothesis as follows, with the fixes you suggested.
Maybe someone else more experienced in MT should advise on this. I would just enforce smoothing; possibly setting the default to always smooth might be the right thing to do when there are unigram counts but no higher n-gram counts. About the commits undoing your deletion of the unigram check: please rebase your branch onto the main develop branch, then recommit your changes without the deletes and re-adds of the code you had deleted.
I am also not familiar with MT. But some people say one disadvantage of the BLEU score is that for short sentences the 3-grams and 4-grams may have zero counts, which makes the final BLEU score 0. This means that even though hyp2 has a unigram match, the total score is still zero because it is a geometric mean. I think the right way to avoid this is to use appropriate weights and a smoothing function. And I will rebase my commits. Thanks
This is not logical, especially when distortion and reordering are taken into account. Within MT applications, or any other sequential task that needs BLEU, assigning zero to a hypothesis that only contains n-gram orders below the n set by the user is not rational. E.g., imagine the sentence:

>>> from nltk import bleu
>>> src = "eat the sushi".split()
>>> ref = ["寿司", "を", "食べろ"]  # "sushi", "object-marker", "eat"
>>> hyp = ["食べろ", "寿司"]  # "eat", "sushi"
>>> bleu([ref], hyp)
0.6065306597126334

In this case the unigram chain is totally adequate, though not fluent, and giving it a zero BLEU would be wrong. P/S: Don't just look at the geometric mean formula in the original BLEU paper.
But according to the original formula from the paper, I think calculating high-order n-grams is not suitable for short sentences. I think giving a 0 BLEU score is not a problem, because it can be made non-zero by adjusting the weights and the smoothing function parameters. We should not accept a wrong BLEU score even if it makes sense.
It's a strict implementation that follows the paper vs. a functional implementation that suits typical usage. We need to arbitrate between the two, and I would go for the latter. Please take a look at https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl#L17
Sorry, I cannot understand Perl scripts. I think the problem is that the default setting of BLEU counts up to 4-grams and has no smoothing function. In your case the value makes sense because the code ignores the 2-gram to 4-gram values and only counts the 1-gram score. But the user doesn't know that the code ONLY calculates the 1-gram value; according to the documentation it should calculate the values from 1-grams to 4-grams. In my case, one sentence is scored on 1-grams and 2-grams while the other is scored only on 1-grams, which is unfair. And I understand your issue: some good examples may get a 0 value because they are too short. But the key problem is: using the same weights with no smoothing function, how do we avoid the unfair behaviour? Could you give me any other comments?
When using BLEU, I think it's best to smooth by default with method4, especially for short translations and especially for the examples raised in this issue:

def method4(self, p_n, references, hypothesis, hyp_len):
    """
    Smoothing method 4:
    Shorter translations may have inflated precision values due to having
    smaller denominators; therefore, we give them proportionally
    smaller smoothed counts. Instead of scaling to 1/(2^k), Chen and Cherry
    suggest dividing by 1/ln(len(T)), where T is the length of the translation.
    """
    incvnt = 1
    for i, p_i in enumerate(p_n):
        if p_i.numerator == 0 and hyp_len != 0:
            p_n[i] = incvnt * self.k / math.log(hyp_len)  # Note that this K is different from the K from NIST.
            incvnt += 1
    return p_n

I'm not sure what the default in NLTK should be, though.
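As a quick illustration of what this does to a p_n that is empty above the unigrams, here is the same logic transcribed from the snippet above into a standalone function, using plain (numerator, denominator) pairs and k=5 (assumed here; check SmoothingFunction for NLTK's actual constant):

```python
from math import log

def method4_smoothing(p_n, hyp_len, k=5):
    """Transcription of the method4 snippet above: each zero-numerator
    order is replaced by incvnt * k / ln(hyp_len), with incvnt growing
    by one for every zero order encountered."""
    incvnt, out = 1, []
    for num, den in p_n:
        if num == 0 and hyp_len != 0:
            out.append(incvnt * k / log(hyp_len))
            incvnt += 1
        else:
            out.append(num / den)
    return out

# Hypothetical precisions for a 6-token hypothesis with no bigram matches:
smoothed = method4_smoothing([(4, 6), (0, 5)], hyp_len=6)
```

Note that for short hypotheses the substituted value incvnt * k / ln(hyp_len) can exceed 1, which is one reason the right default for NLTK isn't obvious.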
All the problems are because of 0 n-gram counts. In the original code, ignoring them leads to wrong results; in my code, returning 0.0 leads to scores that don't make sense. I think using a smoothing function is the best way. What about throwing an exception when we find 0 n-gram counts? This would help users understand what happened, and the error message could suggest using a smoothing function.
After some more testing, here's a clear example of why this bug is quite serious. Discarding the higher n-gram orders when they have no matches leads to problems with regard to distortion (i.e. word-ordering issues). With the current BLEU implementation:

>>> from nltk import bleu
>>> ref = 'let it go'.split()
>>> hyp = 'let go it'.split()
>>> bleu([ref], hyp)
1.0

But the fix in this PR wouldn't really solve the problem, since it returns:

>>> from nltk import bleu
>>> ref = 'let it go'.split()
>>> hyp = 'let go it'.split()
>>> bleu([ref], hyp)
0.0

A distorted hypothesis should have a non-zero score, but surely not a 1.0 score. We'll have to look at how …
Thanks @liuq901 and @alvations, I agree #1554 is a good solution.
The total BLEU score is the geometric mean of the n-gram precisions. If one n-gram count equals zero, the BLEU score should be zero.
The original code ignores the zero value to avoid log(0.0), but it then returns a non-zero value, which I think is wrong.
The unigram check is also wrong: even with no unigram matches, the fraction can be non-zero due to the smoothing function.
Sorry for my poor English.
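The last point can be checked directly with an epsilon-style smoothing in the spirit of Chen and Cherry's method 1 (this sketch is illustrative; NLTK's SmoothingFunction has its own epsilon constant): even with zero matches at every order, the smoothed geometric mean comes out non-zero, so a separate unigram check would be redundant once smoothing is on.

```python
from math import exp, log

def epsilon_smoothing(p_n, epsilon=0.1):
    """Method-1 style smoothing over (numerator, denominator) pairs:
    zero numerators are replaced by a small epsilon."""
    return [num / den if num != 0 else epsilon / den for num, den in p_n]

# No n-gram of any order matches, yet the smoothed score is non-zero:
p_n = [(0, 6), (0, 5), (0, 4), (0, 3)]
score = exp(sum(0.25 * log(p) for p in epsilon_smoothing(p_n)))
```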