
Sentence tokenizer not splitting correctly #1210

Closed
jeryini opened this issue Nov 23, 2015 · 5 comments

Comments

@jeryini

jeryini commented Nov 23, 2015

I think there is a bug in the standard sentence tokenizer sent_tokenize: under certain conditions it does not split text into sentences. Here is a case where the tokenizer fails to split the text into two sentences:

[sent for sent in nltk.sent_tokenize('Model wears size S. Fits size.')]

This returns ['Model wears size S. Fits size.'] instead of ['Model wears size S.', 'Fits size.']. The problem seems to appear when the token before the . contains only one character. If the token has two or more characters, the text is split correctly.

@kmike
Member

kmike commented Nov 23, 2015

This looks very hard to fix in the sentence tokenizer if you consider that S. Fits may be a person's first and last name.

@kmike
Member

kmike commented Nov 23, 2015

I think the way to go is to subclass or copy-paste the default NLTK sentence tokenizer and modify it to fit your application, e.g. if you don't expect such person names in the text, remove the rules which handle person names. Another option is a workaround: replace size <X> with size_<X> before tokenization and replace it back after the text has been split into sentences.
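For illustration, a minimal sketch of that pre/post-substitution workaround. The helper name and the exact regex are made up for this example, and whether the protected form then gets split correctly still depends on the trained Punkt model:

import re
import nltk

def sent_tokenize_protecting_sizes(text):
    # Protect "size <X>" so the single letter is not mistaken for an initial.
    protected = re.sub(r'\bsize ([A-Z])\b', r'size_\1', text)
    sentences = nltk.sent_tokenize(protected)
    # Undo the substitution in each resulting sentence.
    return [re.sub(r'\bsize_([A-Z])\b', r'size \1', s) for s in sentences]

print(sent_tokenize_protecting_sizes('Model wears size S. Fits size.'))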

@jeryini
Author

jeryini commented Nov 23, 2015

Hmm. I just tried again. The first case that I presented is not split correctly, but if I use different characters, it sometimes splits! That is why I wrote this quick test:

import nltk
import pprint

pp = pprint.PrettyPrinter(indent=4)
s = 'Test {}. Test {}.'
# Try the same pattern with every lowercase letter as the one-character token.
for char in 'abcdefghijklmnopqrstuvwxyz':
    pp.pprint(nltk.sent_tokenize(s.format(char, char)))

Output:

['Test a.', 'Test a.']
['Test b.', 'Test b.']
['Test c. Test c.']
['Test d. Test d.']
['Test e. Test e.']
['Test f. Test f.']
['Test g. Test g.']
['Test h. Test h.']
['Test i.', 'Test i.']
['Test j.', 'Test j.']
['Test k. Test k.']
['Test l. Test l.']
['Test m. Test m.']
['Test n. Test n.']
['Test o.', 'Test o.']
['Test p. Test p.']
['Test q.', 'Test q.']
['Test r. Test r.']
['Test s. Test s.']
['Test t. Test t.']
['Test u.', 'Test u.']
['Test v. Test v.']
['Test w. Test w.']
['Test x.', 'Test x.']
['Test y.', 'Test y.']
['Test z.', 'Test z.']

@kmike, as you can see, it is very inconsistent.

@alvations
Contributor

@JernejJerin It's not a rule-based tokenizer, so there is no way to control or explain the splitting "rules" with a regex-like explanation.

The algorithm used to train the sent_tokenize model is the Kiss and Strunk (2006) Punkt algorithm. It is a statistical system that tries to learn sentence boundaries, so it is not perfect, but it is consistent with the probabilities generated from the model (which are not necessarily human-like rules).
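If the default pre-trained model does not fit your text, one option is to retrain Punkt on in-domain text. A minimal sketch, assuming a plain-text corpus file is available (the file name here is hypothetical, and how well the retrained model handles the size S. case depends entirely on the training data):

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Learn abbreviation and collocation statistics from in-domain text.
trainer = PunktTrainer()
with open('domain_corpus.txt') as f:
    trainer.train(f.read())

# Build a tokenizer from the learned parameters and try the failing example.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize('Model wears size S. Fits size.'))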

@yoquankara

Just want to add a real-world example from BookCorpus, extracted from "Three Plays", published by Mike Suttons at Smashwords:

sent_tokenize('The weather is terrible, and my day was ok. You are supposed to take your medicine.')

Output:

['The weather is terrible, and my day was ok. You are supposed to take your medicine.']

It confirms that NLTK did not recognize the period after "ok" as a sentence boundary.
