Sentence tokenizer not splitting correctly #1210
Comments
This looks very hard to fix in the sentence tokenizer if you consider that "S. Fits" may be a first and a last name of a person.
I think the way to go is to subclass or copy-paste the default NLTK sentence tokenizer and modify it to fit your application. E.g. if you don't expect such person names in the text, then remove the rules which handle person names. Another option is to use a workaround, like replacing the problematic substring before tokenizing (sketched below).
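A minimal sketch of that replacement workaround, assuming the goal is simply to stop the tokenizer from treating "S." as an initial. The helper name, the placeholder string, and the example input are illustrative assumptions, not part of the original comment, and whether the split actually happens still depends on the trained Punkt parameters.

```python
# Hedged sketch: rewrite the problematic substring so the tokenizer no longer
# sees a single-letter "initial", tokenize, then restore the original wording.
from nltk.tokenize import sent_tokenize

def tokenize_with_replacement(text, problem="size S.", placeholder="size SMALL."):
    rewritten = text.replace(problem, placeholder)
    sentences = sent_tokenize(rewritten)
    # Undo the substitution in each returned sentence.
    return [s.replace(placeholder, problem) for s in sentences]

print(tokenize_with_replacement("Model wears size S. Fits size."))
# If Punkt now splits, this prints: ['Model wears size S.', 'Fits size.']
```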
Hmmm. Just tried again. So the first case that I presented is not splitting correctly, but if I use different characters then it sometimes splits! That is why I wrote a quick test (sketched below). @kmike, as you can see, it is very inconsistent.
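The original test snippet and its output did not survive here; below is a hedged reconstruction of such a test, assuming it simply swaps different tokens in front of the period and prints how `sent_tokenize` splits each variant. The candidate strings are illustrative, not the original ones.

```python
# Hedged reconstruction of a "quick test" along the lines described above.
# The point is that splitting can differ between single-character and
# multi-character tokens before the ".".
from nltk.tokenize import sent_tokenize

candidates = [
    "Model wears size S. Fits size.",
    "Model wears size M. Fits size.",
    "Model wears size XL. Fits size.",
    "Model wears size 8. Fits size.",
]

for text in candidates:
    print(repr(text), "->", sent_tokenize(text))
```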
@JernejJerin It's not a rule-based tokenizer, so there is no way to control or explain the "rules" of splitting with a regex-like explanation. The algorithm used to train the default tokenizer (Punkt) is unsupervised: its behaviour comes from statistics learned over a training corpus rather than from hand-written rules.
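Since the behaviour comes from learned parameters, one option (a sketch, not something proposed in the thread) is to train a Punkt model on text from your own domain. The `domain_text` variable below is a tiny stand-in; in practice you would train on a large sample of your own raw text.

```python
# Hedged sketch: train Punkt on domain text instead of relying on the
# pre-trained English model behind sent_tokenize.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

domain_text = (
    "Model wears size S. Fits size. "
    "Model wears size M. Fits true to size. "
)  # placeholder; replace with a real corpus

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True          # learn collocations more aggressively
trainer.train(domain_text, finalize=False)
trainer.finalize_training()

tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize("Model wears size S. Fits size."))
```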
Just want to add a real-world example from BookCorpus, extracted from "Three Plays", published by Mike Suttons at Smashwords.
The output confirmed that NLTK didn't recognize the sentence boundary in that passage either.
I think there is a bug in the standard sentence tokenizer `sent_tokenize`. The problem is that it does not split text into sentences in a certain case. Here is the case where the tokenizer fails to split the text into two sentences: it returns `['Model wears size S. Fits size.']` instead of `['Model wears size S.', 'Fits size.']`. The problem seems to appear when the token before the `.` contains only one character. If the number of characters is `>= 2`, then it correctly splits the text.
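The code snippet from the original report is not preserved above; a minimal reproduction along those lines, assuming plain `sent_tokenize` usage, would look like this:

```python
# Hedged reconstruction of the reported reproduction (the exact original
# snippet is not preserved); uses the example strings quoted in the report.
from nltk.tokenize import sent_tokenize

print(sent_tokenize("Model wears size S. Fits size."))   # reported: stays one sentence
print(sent_tokenize("Model wears size XL. Fits size."))  # reported: splits correctly
```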