KneserNeyInterpolated has problem with OOV words during testing and perplexity is always inf #3211
Comments
I will look into this. I get the same error as well.
I tried some things, and this is what I got (the second set of numbers is MLE):

```
# none out of context
1.3831080559223603
0.4679138720730883
inf
inf
# first bigram position out of context
1.7481417052901929
0.8058221352008101
inf
inf
# second bigram position out of context
inf
inf
inf
inf
# all out of context
```

I am not too sure whether out-of-context words have a specific way they are supposed to be handled. Can you do some more testing to figure out whether this is a math issue or something else, and maybe provide expected and actual outcomes? From what I can tell, it is challenging to say what the expected functionality of KneserNeyInterpolated for OOV words is supposed to be.
This also might be related to issue #2727
Hello, I had the same issue. I solved it by downgrading NLTK to version 3.6.1. There seems to be a bug in version 3.8.1, because the exact same code works on version 3.6.1.
I can confirm that this error does not appear when using nltk 3.6.1. What differs between 3.6.1 and 3.8.1 is the implementation of Kneser-Ney: in 3.8.1, discounting is implemented together with continuation counts, which matter because n-grams of different orders (lower vs. higher) should be treated differently. I believe this is an improvement to the Kneser-Ney algorithm, so it makes sense to have it: uni- and bigrams should have a different discount factor than higher-order n-grams. Does that mean the problem is in the implementation of the continuation counts? If so, then the code in 3.6.1 does absolute discounting rather than Kneser-Ney discounting, so it would be incorrect to use the code from nltk==3.6.1 and treat it as the Kneser-Ney method, because it is not complete. The code in 3.8.1 looks more complete, but something is clearly wrong in the calculation of the continuation counts.
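To illustrate what continuation counts change, here is a toy sketch of my own (not nltk's code, and the corpus is made up): a word like "francisco" can have a high raw count yet a low continuation count because it only ever follows one context, while a word like "glasses" completes many distinct bigram types.

```python
# Toy sketch (my own, not nltk's implementation) contrasting raw unigram
# counts with Kneser-Ney continuation counts.
from collections import Counter, defaultdict

tokens = ("san francisco " * 5).split() + \
    "new york eye glasses old glasses red glasses".split()
bigrams = list(zip(tokens, tokens[1:]))

raw = Counter(tokens)
left_contexts = defaultdict(set)
for w1, w2 in bigrams:
    left_contexts[w2].add(w1)

# Raw unigram probability (what plain absolute discounting interpolates with):
p_raw = {w: raw[w] / len(tokens) for w in raw}

# Continuation probability: fraction of distinct bigram types a word completes:
total_bigram_types = sum(len(s) for s in left_contexts.values())
p_cont = {w: len(left_contexts[w]) / total_bigram_types for w in left_contexts}

# "francisco" is frequent but only ever follows "san", so its continuation
# probability is low; "glasses" follows three different words, so it is high.
```

Under this toy corpus, the two distributions rank "francisco" and "glasses" in opposite orders, which is exactly the behaviour continuation counts are meant to introduce at the lower orders.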
I am taking some of my words back: nltk 3.8.1 seems to correctly implement Kneser-Ney with discounting. However, it does not seem to handle OOV words during testing, and this is where the problem occurs. Jurafsky discusses this issue and a possible solution in Eq. 3.42. Can someone point to the part of the code that implements that?
I'm not super familiar with this, but it looks to me like the unknown-word handling described by Jurafsky was captured in 3.6.1 by the fixed discount factor of 1.0/vocab_length in unigram_score. Now unigram_score no longer includes that term, so unknown words, which of course have a continuation count of 0, get a score of 0. I think that, in addition to dividing the continuation count by the total count, it should also add the lambda(epsilon) * 1/V term seen in Jurafsky Eq. 3.41. That way unknown unigrams would not be zero, just close to it, and all other unigrams would have their score very slightly increased by the uniform distribution.
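A minimal sketch of that suggestion (a hypothetical function of my own, not nltk's actual unigram_score) would interpolate the discounted continuation probability with a uniform 1/V term, along the lines of Jurafsky & Martin's Eq. 3.41:

```python
from collections import defaultdict

def kn_unigram_score(word, bigrams, vocab_size, discount=0.75):
    """Sketch of J&M Eq. 3.41: discounted continuation probability
    interpolated with the uniform distribution 1/V. Hypothetical code,
    not nltk's implementation."""
    left_contexts = defaultdict(set)
    for w1, w2 in bigrams:
        left_contexts[w2].add(w1)
    total_bigram_types = sum(len(s) for s in left_contexts.values())
    # .get avoids inserting a key for unseen words; OOV words get count 0.
    continuation = len(left_contexts.get(word, ()))

    p_cont = max(continuation - discount, 0) / total_bigram_types
    # lambda(epsilon): probability mass freed by discounting,
    # spread uniformly over the vocabulary of size V.
    lam = discount * len(left_contexts) / total_bigram_types
    return p_cont + lam / vocab_size
```

With this form an OOV word scores lambda(epsilon)/V instead of 0, so its log probability and hence the perplexity stay finite, and every seen word's score is very slightly increased by the uniform term, which is exactly the behaviour described above.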
nltk 3.8.1
I am training and testing a language model on my corpus of sentences using KneserNeyInterpolated.
It looks like the nltk implementation of this smoothing algorithm does not know what to do with out-of-vocabulary words: during testing, the model's perplexity on them is infinite. But the problem appears only in specific situations; details below.
Training:
Then I test it on bigrams (following some suggestions from #3065)
And the result is
I have looked at the perplexity of the model for each of these bigrams and found that for ('of', 'livestock') perplexity is inf, but for ('livestock', '</s>') perplexity is some finite number. The problematic word (I suspect it's 'livestock') has not been seen during training: lm.vocab['livestock'] is 0. I tested another OOV word ('radiator') and observed exactly the same situation: perplexity is inf when 'radiator' is the second word in the bigram (the first word can be anything), but when 'radiator' is the first word, perplexity is some finite number regardless of what the second word is. Why would such behaviour occur?
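The asymmetry is consistent with how interpolated scoring falls through the orders. A toy interpolated Kneser-Ney bigram scorer (my own simplified sketch with made-up training data, not nltk's code) reproduces it: when the OOV word is in the second position its continuation count is 0, so without a 1/V term the whole interpolated sum is 0 and perplexity is infinite; when the OOV word is the first word, the unseen context makes the scorer fall back to the (nonzero) unigram continuation score of the second word.

```python
from collections import Counter, defaultdict
from math import inf

def kn_bigram_score(w1, w2, bigrams, discount=0.75):
    """Simplified interpolated Kneser-Ney bigram score (illustrative sketch,
    not nltk's implementation; deliberately lacks a 1/V term, mirroring
    the reported bug)."""
    left_contexts = defaultdict(set)
    followers = defaultdict(set)
    pair_counts = Counter(bigrams)
    for a, b in bigrams:
        left_contexts[b].add(a)
        followers[a].add(b)
    total_bigram_types = sum(len(s) for s in left_contexts.values())
    # Unigram continuation probability of w2 (0 if w2 is OOV).
    p_cont = len(left_contexts.get(w2, ())) / total_bigram_types

    c_w1 = sum(c for (a, _), c in pair_counts.items() if a == w1)
    if c_w1 == 0:
        # Unseen/OOV context: fall back to the unigram score of w2.
        return p_cont
    lam = discount * len(followers[w1]) / c_w1
    return max(pair_counts[(w1, w2)] - discount, 0) / c_w1 + lam * p_cont

def bigram_perplexity(w1, w2, bigrams):
    p = kn_bigram_score(w1, w2, bigrams)
    return inf if p == 0.0 else 1.0 / p  # perplexity of a single bigram

# Toy training bigrams; 'livestock' is OOV with respect to them.
train = [("<s>", "the"), ("the", "cat"), ("cat", "</s>"),
         ("<s>", "a"), ("a", "dog"), ("dog", "</s>")]
```

With this sketch, a bigram whose second word is OOV scores 0 (infinite perplexity), while a bigram whose first word is OOV but whose second word was seen gets a finite perplexity, matching the 'livestock'/'radiator' observations above.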
MLE and Laplace work fine; the problem does not occur with them.
What could be wrong?