
Incorporate more accurate sentence-splitter, tokenizer, and/or lemmatizer for English? #1214

Open
nschneid opened this issue Nov 30, 2015 · 27 comments


@nschneid
Contributor

Among open issues, we have (not an exhaustive list):

I'm not an expert on these tasks but I know that Rebecca Dridan, for instance, has recently published methods for some of them. Given that segmentation and lemmatization are so widely used for preprocessing, state-of-the-art methods may deserve some attention.

The issue of genre (news/edited text, informal web text, tweets/SMS) is important as well: hence the separate Twitter tokenizer.

@alvations
Contributor

+1 @nschneid Most of Rebecca's work is in HPSG, which I would love to integrate into NLTK, but it's a tough nut. @goodmami, @fcbond and the DELPH-IN group have done quite a lot of work with https://github.com/delph-in/pydelphin

Possibly a Python wrapper around REPP would be worth the effort =)

@nschneid
Contributor Author

There's one on word tokenization, and this and this on sentence splitting.

Looking at citing papers, I see this and this for various genres of web text. Also this for the news domain.

@alvations
Contributor

@nschneid after some trawling through the REPP code, there are quite a lot of LISP rules spread across separate files. Maybe the first thing we could try is to organize all of them into a single text file and then write a wrapper to read those rules. Or possibly just wrapping the whole tool itself might be easier, like what we did with nltk.tag.stanford:

alvas@ubi:~/repp$ cat test.txt
Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve. 
But rule-based tokenizers are hard to maintain and their rules language specific. 
We show that high accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. 
We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.

alvas@ubi:~/repp$ cat test.txt|src/repp -c erg/repp.set 
Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve .
But rule-based tokenizers are hard to maintain and their rules language specific .
We show that high accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning .
We evaluated our method on three languages and obtained error rates of 0.27 ‰ ( English ) , 0.35 ‰ ( Dutch ) and 0.76 ‰ ( Italian ) for our best models .

alvas@ubi:~/repp$ cat test.txt|src/repp -c erg/repp.set --format triple
(0, 12, Tokenization)
(13, 15, is)
(16, 22, widely)
(23, 31, regarded)
(32, 34, as)
(35, 36, a)
(37, 43, solved)
(44, 51, problem)
(52, 55, due)
(56, 58, to)
(59, 62, the)
(63, 67, high)
(68, 76, accuracy)
(77, 81, that)
(82, 91, rulebased)
(92, 102, tokenizers)
(103, 110, achieve)
(110, 111, .)

(0, 3, But)
(4, 14, rule-based)
(15, 25, tokenizers)
(26, 29, are)
(30, 34, hard)
(35, 37, to)
(38, 46, maintain)
(47, 50, and)
(51, 56, their)
(57, 62, rules)
(63, 71, language)
(72, 80, specific)
(80, 81, .)

(0, 2, We)
(3, 7, show)
(8, 12, that)
(13, 17, high)
(18, 26, accuracy)
(27, 31, word)
(32, 35, and)
(36, 44, sentence)
(45, 57, segmentation)
(58, 61, can)
(62, 64, be)
(65, 73, achieved)
(74, 76, by)
(77, 82, using)
(83, 93, supervised)
(94, 102, sequence)
(103, 111, labeling)
(112, 114, on)
(115, 118, the)
(119, 128, character)
(129, 134, level)
(135, 143, combined)
(144, 148, with)
(149, 161, unsupervised)
(162, 169, feature)
(170, 178, learning)
(178, 179, .)

(0, 2, We)
(3, 12, evaluated)
(13, 16, our)
(17, 23, method)
(24, 26, on)
(27, 32, three)
(33, 42, languages)
(43, 46, and)
(47, 55, obtained)
(56, 61, error)
(62, 67, rates)
(68, 70, of)
(71, 75, 0.27)
(76, 77, ‰)
(78, 79, ()
(79, 86, English)
(86, 87, ))
(87, 88, ,)
(89, 93, 0.35)
(94, 95, ‰)
(96, 97, ()
(97, 102, Dutch)
(102, 103, ))
(104, 107, and)
(108, 112, 0.76)
(113, 114, ‰)
(115, 116, ()
(116, 123, Italian)
(123, 124, ))
(125, 128, for)
(129, 132, our)
(133, 137, best)
(138, 144, models)
(144, 145, .)
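For illustration, a thin subprocess wrapper in the spirit of nltk.tag.stanford might look roughly like the sketch below; the class name and default paths are hypothetical, and it assumes a locally built repp binary and the ERG repp.set used in the transcript above.

# Minimal sketch of wrapping the REPP command-line tool. The class name and
# default paths are hypothetical and would need to be configurable, as with
# the other third-party wrappers in NLTK.
import subprocess

class ReppWrapper:
    def __init__(self, repp_bin="src/repp", config="erg/repp.set"):
        self.repp_bin = repp_bin
        self.config = config

    def tokenize_sents(self, text):
        """Return one list of tokens per sentence, using REPP's plain output."""
        proc = subprocess.Popen(
            [self.repp_bin, "-c", self.config],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        out, _ = proc.communicate(text.encode("utf8"))
        return [line.split() for line in out.decode("utf8").splitlines() if line.strip()]

# usage, assuming the binary and config exist locally:
# print(ReppWrapper().tokenize_sents("But rule-based tokenizers are hard to maintain."))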

Possibly there are constraints further up the tool chain when it reaches the HPSG parser too. Maybe directly using ACE (http://sweaglesw.org/linguistics/ace/) through the pyDelphin interface would be better.


On a side note, there are also the tokenizer scripts from the Moses MT toolkit: https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer . It would be good to have these for better consistency between nltk.translate and mosesdecoder.


For reference's sake, there are also the Penn Treebank tokenizer, "the art of tokenization", and bio/medical tokenizers.

@goodmami
Contributor

goodmami commented Dec 2, 2015

@alvations I don't think we would gain much by using a full parser like ACE just for tokenization, unless you also want morphological analysis, or pronoun detection, or quantifier scope, or something else. I don't think it would be all that difficult to implement a REPP tokenizer in pyDelphin that doesn't require the full grammars or parsers (see the newly created delph-in/pydelphin#43), or perhaps such an implementation could be a standalone module (i.e. separate from pyDelphin), which would make it easier to incorporate into NLTK.

But REPP is basically just a system of regular expressions with a few extra niceties, so I think the primary gain for NLTK is that it could use the systems developed for the DELPH-IN grammars. However, the paper that @nschneid linked to says that Dridan and Oepen "eliminate[d] two thirds of the remaining tokenization errors" on the PTB data compared to the best-performing off-the-shelf system at the time by using REPP. On a related note, Fokkens et al. 2013 showed that tokenization (among other often-overlooked factors) has significant effects on system performance and, thus, reproducibility of results. So maybe a REPP implementation for NLTK would be useful? In general I'd be happy to help with the implementation, but I don't have spare cycles to do it all myself.

@alvations
Contributor

@nschneid for now, the simplest solution seems to be wrapping REPP and reading its output, like we do for other third-party tools in NLTK. It seems simple enough, and there are several ways to parse the tuple-like output; I'm not sure which one to choose given the variety of solutions: http://stackoverflow.com/questions/34416365/unpacking-tuple-like-textfile

Which implementation should we use to wrap REPP in NLTK?
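For what it's worth, reading the --format triple output shown above doesn't need anything fancy; here is a minimal sketch, assuming one "(start, end, token)" line per token and blank lines between sentences, as in the transcript:

# Minimal sketch for parsing REPP's `--format triple` output: one
# "(start, end, token)" line per token, blank lines separating sentences.
def read_triples(lines):
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        start, end, token = line[1:-1].split(", ", 2)
        current.append((int(start), int(end), token))
    if current:
        sentences.append(current)
    return sentences

# e.g. read_triples(open("repp_output.txt"))
# -> [[(0, 12, 'Tokenization'), (13, 15, 'is'), ...], ...]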

@goodmami Meanwhile, the slower but easier-to-maintain option is to rewrite the LISP + Perl + Boost regexes, but I think we can keep that in pyDelphin instead. Some notes: http://stackoverflow.com/questions/34048609/converting-c-boost-regexes-to-python-re-regexes It will be slow going for me as well, since I'm finishing up my PhD work. I'll try my best when traveling on trains/buses; that should kill some time while porting the regexes =)

@nschneid
Contributor Author

I have no particular objection to REPP, but it may be worth exploring other tools that are out there.

For example, https://github.com/armatthews/TokenizeAnything is written in Python and claims to do something reasonable for most languages. (It's a fairly young implementation, but it is based on https://github.com/redpony/cdec/blob/master/corpus/tokenize-anything.sh, which has been around for a while.) @armatthews, do you think it would be a good idea to include it in NLTK?

@nschneid
Contributor Author

Another question, more broadly, is what options/functionality we want the tokenizer to support. E.g., I think it would be useful to have (see the sketch after this list):

  • an option to separate clitics like "'s" and "n't" or not in English (--no_english_apos in TokenizeAnything)
  • offsets back into the original string
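Offsets back into the original string are something NLTK's tokenizer interface can already express via span_tokenize(); a quick illustration with an existing tokenizer (WhitespaceTokenizer, purely for demonstration):

# NLTK's TokenizerI interface provides span_tokenize(), which yields
# (start, end) offsets into the original string; any new tokenizer we
# adopt could implement the same method.
from nltk.tokenize import WhitespaceTokenizer

text = "But rule-based tokenizers are hard to maintain."
for start, end in WhitespaceTokenizer().span_tokenize(text):
    print(start, end, text[start:end])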

@alvations
Contributor

+1 for TokenizeAnything. There's also https://github.com/jonsafari/tok-tok from @jonsafari.

@alvations
Contributor

I've written a small wrapper for REPP: https://github.com/alvations/nltk/blob/repp/nltk/tokenize/repp.py. Will do a PR once the translate modules are more stable.

@alvations
Contributor

Also ported tok-tok.pl to Python: https://github.com/alvations/nltk/blob/repp/nltk/tokenize/toktok.py.

@goodmami
Contributor

@alvations so you wrapped a REPP binary instead of implementing a REPP processor? It would then be good to provide a link to where someone could get such a binary. And I haven't used the REPP binary directly---do you know how portable it is?

@alvations
Contributor

@goodmami I'll put the information up on https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software once I get some time to do a PR. I wrote the wrapper while stuck on a train without wifi =)

I'm not sure whether REPP works outside of Linux, though; possibly we'll have to ask Rebecca or Stefan whether they've tried installing REPP on Windows/Mac. I would still like to reimplement REPP at some point, but I can't commit that much time for a while yet.

@alvations
Contributor

If anyone else is interested in reimplementing/wrapping other tokenizers/stemmers/lemmatizers, I would suggest the following list of tools, which would make good first contributions to NLTK.

Sorry, I closed the issue by mistake. It's reopened now.

@alvations
Contributor

alvations commented Dec 22, 2016

Now that the Moses tokenizer and detokenizer are working (#1551, #1553), does any brave soul want to try reimplementing Elephant with sklearn?
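In case it helps whoever picks this up: Elephant does character-level sequence labelling (CRFs plus unsupervised feature learning), but the core idea can be sketched with plain sklearn. The training sentence, spans, and labels below are made up for illustration; a real reimplementation would use a proper CRF and much more data.

# Toy sketch of tokenization as character-level sequence labelling:
# tag each character as B (begins a token), I (inside a token) or
# O (outside, e.g. whitespace), then rebuild tokens from the tags.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def char_features(text, i, window=2):
    feats = {"is_space": text[i].isspace(), "is_alnum": text[i].isalnum()}
    for off in range(-window, window + 1):
        j = i + off
        feats[f"char[{off}]"] = text[j] if 0 <= j < len(text) else "<pad>"
    return feats

def labels_from_spans(text, spans):
    # spans are gold (start, end) character offsets of tokens
    labels = ["O"] * len(text)
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels

train_text = "But rule-based tokenizers are hard to maintain."
train_spans = [(0, 3), (4, 14), (15, 25), (26, 29), (30, 34),
               (35, 37), (38, 46), (46, 47)]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit([char_features(train_text, i) for i in range(len(train_text))],
          labels_from_spans(train_text, train_spans))

def tokenize(text):
    tags = model.predict([char_features(text, i) for i in range(len(text))])
    tokens, current = [], ""
    for ch, tag in zip(text, tags):
        if tag == "B":
            if current:
                tokens.append(current)
            current = ch
        elif tag == "I":
            current += ch
        else:
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens

print(tokenize("But their rules are language specific."))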

@alvations
Contributor

From #1860, it looks like the TreebankTokenizer that we're using as the default word_tokenize() is rather outdated, and URLs and dates aren't really handled.

Taking a closer look at the TreebankTokenizer regexes and MosesTokenizer regexes, there's no easy solution.

For the TreebankTokenizer, it might be easier to tease apart the colons from the commas at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L76
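Just to illustrate the idea (this is not the actual rule in treebank.py): ':' and ',' could be handled by separate patterns so each gets its own exceptions, e.g. leaving commas and colons that sit between digits ("1,000", "5:30") intact.

# Illustration only, not the actual treebank.py rule: split ':' and ','
# unless they sit between two digits; the extra spaces are harmless
# because the tokenizer splits on whitespace at the end.
import re

text = "At 5:30, prices rose by 1,000 points: a record."
text = re.sub(r"(?<!\d):|:(?!\d)", r" : ", text)
text = re.sub(r"(?<!\d),|,(?!\d)", r" , ", text)
print(text)
# At 5:30 ,  prices rose by 1,000 points :  a record.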

For Moses, it looks like the default tokenizer doesn't keep dates or URLs together:

$ echo 'This is a sentence with dates like 23/12/1923 and 05/11/2013, and an URL like https://hello.world.com' > test.in

$ ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < test.in 
This is a sentence with dates like 23 / 12 / 1923 and 05 / 11 / 2013 , and an URL like https : / / hello.world.com

Perhaps we need to consider a more modern tokenizer as the NLTK default word_tokenize()?

@no-identd

Perhaps, we need to consider a more modern tokenizer as NLTK default word_tokenize()?

Consider #2202? :)

@no-identd

Also, of note, the current bug (I mean #1214) seems to be missing the tokenizer label.

@no-identd

@alvations still missing the label :p

@andreasvc

Would there be interest in including syntok as the default sentence splitter and word tokenizer? It is pure Python.

@no-identd

@andreasvc have you seen #2202 (comment) +sequel? I wonder how one could accommodate both simultaneously and/or in a hybrid manner 🤔

@andreasvc

I saw it; it sounds more complicated, requiring big data files, etc.

syntok is a simple, self-contained multilingual regex sentence splitter and tokenizer which keeps track of the original string indices. It seems like it would be a good default since it wouldn't require a download.
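To give a feel for it, usage is roughly the following, based on my reading of the syntok README (so treat the exact names as approximate): segmenter.process() yields paragraphs of sentences of tokens, and each token keeps its offset into the original string.

# Rough usage sketch of syntok, based on its README; the API details are
# from memory and should be checked against the current documentation.
import syntok.segmenter as segmenter

text = "Mr. Smith bought cheapsite.com for 1.5 million dollars. He paid a lot for it."
for paragraph in segmenter.process(text):
    for sentence in paragraph:
        print([(token.offset, token.value) for token in sentence])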

@goodmami
Contributor

As another alternative, and following up from earlier comments in this issue, the pure Python REPP implementation in PyDelphin (docs) has been available for about 3 years now and I'll soon be adding masking support (so things like email addresses don't get split on punctuation). A few notes:

  • It depends on a few other minor modules in PyDelphin but it wouldn't be hard to make it standalone. The main repp module isn't that big.
  • It depends on the regex module because the tokenizer definition files use some PCRE features. If these files are packaged for the NLTK they could be made compliant with Python's re module.
  • It has some nice features, such as a trace mode that shows each rule application as a diff, and sub-tokenizers (e.g., for LaTeX code, HTML tags, etc.) that can be selectively disabled.

I'm happy to help someone with porting it to the NLTK, but otherwise I'm not strongly pushing it as it already works well in PyDelphin and other options (syntok, etc.) may be good enough for the NLTK.
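For anyone evaluating it, usage looks roughly like the sketch below; the repp.set path is hypothetical (it comes from a grammar such as the ERG), and the exact method and attribute names should be checked against the delphin.repp docs.

# Rough sketch of tokenizing with PyDelphin's REPP implementation, assuming
# a local copy of a grammar's repp.set definition (path is hypothetical);
# see the delphin.repp documentation for the authoritative API.
from delphin.repp import REPP

r = REPP.from_config("erg/pet/repp.set")
result = r.tokenize("But rule-based tokenizers are hard to maintain.")
for token in result.tokens:
    print(token.form, token.lnk)   # surface form and character span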

@andreasvc

@goodmami that sounds good. Is it multilingual? You can get decent tokenization from a language-independent tokenizer, but good results require some language-specific rules/data. And from your assessment it does sound more difficult to port.

I develop pyre2 so I know about the hassles of different regex flavors...

@goodmami
Contributor

Is it multilingual?

The implementation is just the engine for applying systems of regexes, tracking original string indices, etc. The rules for tokenization are defined separately, mainly by implemented HPSG grammars. For instance, the English Resource Grammar has a rather advanced tokenizer definition, and the Indonesian grammar INDRA has a significant one, among some others. But, alas, definitions for many languages would need to be created or expanded to make a more multilingual offering. I suspect it (or any regex-based tokenizer) wouldn't work so well for languages that generally use dictionary-based tokenizers/morphological-analyzers, like Chinese and Japanese. The good news is that the sub-tokenizers (for handling LaTeX, HTML, XML, email addresses, etc.) can likely be reused across languages.

@stevenbird
Member

Before we decide to port anything it would be good to consider how it will be maintained as the original is improved. Another solution is to provide recipes for combining functionality from multiple existing packages which are independently maintained.

@andreasvc

My aim in proposing to incorporate syntok is to have a simple default tokenizer/splitter that is (a little) better than the current anglocentric (TreebankTokenizer) or unsupervised (PunktSentenceTokenizer) defaults.

The question about maintenance is a good one; however, providing recipes for using an independently maintained package wouldn't address my concern, which is that I would like a better default NLTK tokenizer/splitter that I can recommend to beginners. Ideally this wouldn't require any extra packages or downloads beyond standard NLTK.

@andreasvc

I just noticed the issue title says "for English", so talking about language independence and multilinguality is a bit off-topic...
