
Sentence tokenizer not splitting correctly #1210

Closed
jeryini opened this issue Nov 23, 2015 · 5 comments

Comments

@jeryini

jeryini commented Nov 23, 2015

I think there is a bug in the standard sentence tokenizer sent_tokenize: under certain conditions it does not split text into sentences. Here is a case where the tokenizer fails to split the text into two sentences:

[sent for sent in nltk.sent_tokenize('Model wears size S. Fits size.')]

This returns ['Model wears size S. Fits size.'] instead of ['Model wears size S.', 'Fits size.']. The problem seems to appear when the token before the . contains only one character. If the token has two or more characters, the text is split correctly.

@kmike
Member

kmike commented Nov 23, 2015

This looks very hard to fix in the sentence tokenizer if you consider that S. Fits may be a person's first and last name.

@kmike
Member

kmike commented Nov 23, 2015

I think the way to go is to subclass or copy-paste the default NLTK sentence tokenizer and modify it to fit your application, e.g. if you don't expect such person names in the text, remove the rules which handle person names. Another option is a workaround: replace size <X> with size_<X> before tokenization and replace it back after the text has been split into sentences.
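For illustration, a minimal sketch of that pre/post-substitution workaround. The helper name and the exact regex are made up for this example, and whether the protected form then gets split correctly still depends on the trained Punkt model:

import re
import nltk

def sent_tokenize_protecting_sizes(text):
    # Protect "size <X>" so the single letter is not mistaken for an initial.
    protected = re.sub(r'\bsize ([A-Z])\b', r'size_\1', text)
    sentences = nltk.sent_tokenize(protected)
    # Undo the substitution in each resulting sentence.
    return [re.sub(r'\bsize_([A-Z])\b', r'size \1', s) for s in sentences]

print(sent_tokenize_protecting_sizes('Model wears size S. Fits size.'))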

@jeryini
Author

jeryini commented Nov 23, 2015

Hmm. I just tried again. The first case that I presented is not split correctly, but if I use different characters, it sometimes splits! That is why I wrote this quick test:

import nltk
import pprint

pp = pprint.PrettyPrinter(indent=4)
s = 'Test {}. Test {}.'
# Try the same pattern with every lowercase letter as the one-character token.
for char in 'abcdefghijklmnopqrstuvwxyz':
    pp.pprint(nltk.sent_tokenize(s.format(char, char)))

Output:

['Test a.', 'Test a.']
['Test b.', 'Test b.']
['Test c. Test c.']
['Test d. Test d.']
['Test e. Test e.']
['Test f. Test f.']
['Test g. Test g.']
['Test h. Test h.']
['Test i.', 'Test i.']
['Test j.', 'Test j.']
['Test k. Test k.']
['Test l. Test l.']
['Test m. Test m.']
['Test n. Test n.']
['Test o.', 'Test o.']
['Test p. Test p.']
['Test q.', 'Test q.']
['Test r. Test r.']
['Test s. Test s.']
['Test t. Test t.']
['Test u.', 'Test u.']
['Test v. Test v.']
['Test w. Test w.']
['Test x.', 'Test x.']
['Test y.', 'Test y.']
['Test z.', 'Test z.']

@kmike, as you can see, it is very inconsistent.

@alvations
Contributor

@JernejJerin It's not a rule-based tokenizer, so there is no way to control or explain the splitting "rules" with a regex-like explanation.

The algorithm used to train the sent_tokenize model is the Kiss and Strunk (2006) Punkt algorithm. It is a statistical system that tries to learn sentence boundaries, so it is not perfect, but it is consistent with the probabilities generated from the model (which are not necessarily human-like rules).
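If the default pre-trained model does not fit your text, one option is to retrain Punkt on in-domain text. A minimal sketch, assuming a plain-text corpus file is available (the file name here is hypothetical, and how well the retrained model handles the size S. case depends entirely on the training data):

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Learn abbreviation and collocation statistics from in-domain text.
trainer = PunktTrainer()
with open('domain_corpus.txt') as f:
    trainer.train(f.read())

# Build a tokenizer from the learned parameters and try the failing example.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize('Model wears size S. Fits size.'))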

@yoquankara

Just want to add a real-world example from BookCorpus, extracted from "Three Plays", published by Mike Suttons at Smashwords:

sent_tokenize('The weather is terrible, and my day was ok. You are supposed to take your medicine.')

Output:

['The weather is terrible, and my day was ok. You are supposed to take your medicine.']

It confirms that NLTK did not recognize the period after "ok" as a sentence boundary.
