
Incorporate more accurate sentence-splitter, tokenizer, and/or lemmatizer for English? #1214

Open
nschneid opened this issue Nov 30, 2015 · 27 comments


@nschneid
Contributor

Among open issues, we have (not an exhaustive list):

I'm not an expert on these tasks but I know that Rebecca Dridan, for instance, has recently published methods for some of them. Given that segmentation and lemmatization are so widely used for preprocessing, state-of-the-art methods may deserve some attention.

The issue of genre (news/edited text, informal web text, tweets/SMS) is important as well: hence the separate Twitter tokenizer.

@alvations
Contributor

+1 @nschneid Most of Rebecca's work is in HPSG, which I would love to integrate into NLTK, but it's a tough nut. @goodmami, @fcbond and the DELPH-IN group have done quite a lot of work with https://github.com/delph-in/pydelphin

Possibly a Python wrapper around REPP would be worth the effort =)

@nschneid
Contributor Author

There's one on word tokenization, and this and this on sentence splitting.

Looking at citing papers, I see this and this for various genres of web text. Also this for the news domain.

@alvations
Contributor

@nschneid after some trawling through the REPP code, there are quite a lot of LISP rules spread across separate files. Maybe the first thing we could try is to organize all of them into a single text file and then write a wrapper to read those rules. Or possibly just wrapping the whole tool itself might be easier, like what we did with nltk.tag.stanford:

alvas@ubi:~/repp$ cat test.txt
Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve. 
But rule-based tokenizers are hard to maintain and their rules language specific. 
We show that high accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. 
We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.

alvas@ubi:~/repp$ cat test.txt|src/repp -c erg/repp.set 
Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve .
But rule-based tokenizers are hard to maintain and their rules language specific .
We show that high accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning .
We evaluated our method on three languages and obtained error rates of 0.27 ‰ ( English ) , 0.35 ‰ ( Dutch ) and 0.76 ‰ ( Italian ) for our best models .

alvas@ubi:~/repp$ cat test.txt|src/repp -c erg/repp.set --format triple
(0, 12, Tokenization)
(13, 15, is)
(16, 22, widely)
(23, 31, regarded)
(32, 34, as)
(35, 36, a)
(37, 43, solved)
(44, 51, problem)
(52, 55, due)
(56, 58, to)
(59, 62, the)
(63, 67, high)
(68, 76, accuracy)
(77, 81, that)
(82, 91, rulebased)
(92, 102, tokenizers)
(103, 110, achieve)
(110, 111, .)

(0, 3, But)
(4, 14, rule-based)
(15, 25, tokenizers)
(26, 29, are)
(30, 34, hard)
(35, 37, to)
(38, 46, maintain)
(47, 50, and)
(51, 56, their)
(57, 62, rules)
(63, 71, language)
(72, 80, specific)
(80, 81, .)

(0, 2, We)
(3, 7, show)
(8, 12, that)
(13, 17, high)
(18, 26, accuracy)
(27, 31, word)
(32, 35, and)
(36, 44, sentence)
(45, 57, segmentation)
(58, 61, can)
(62, 64, be)
(65, 73, achieved)
(74, 76, by)
(77, 82, using)
(83, 93, supervised)
(94, 102, sequence)
(103, 111, labeling)
(112, 114, on)
(115, 118, the)
(119, 128, character)
(129, 134, level)
(135, 143, combined)
(144, 148, with)
(149, 161, unsupervised)
(162, 169, feature)
(170, 178, learning)
(178, 179, .)

(0, 2, We)
(3, 12, evaluated)
(13, 16, our)
(17, 23, method)
(24, 26, on)
(27, 32, three)
(33, 42, languages)
(43, 46, and)
(47, 55, obtained)
(56, 61, error)
(62, 67, rates)
(68, 70, of)
(71, 75, 0.27)
(76, 77, ‰)
(78, 79, ()
(79, 86, English)
(86, 87, ))
(87, 88, ,)
(89, 93, 0.35)
(94, 95, ‰)
(96, 97, ()
(97, 102, Dutch)
(102, 103, ))
(104, 107, and)
(108, 112, 0.76)
(113, 114, ‰)
(115, 116, ()
(116, 123, Italian)
(123, 124, ))
(125, 128, for)
(129, 132, our)
(133, 137, best)
(138, 144, models)
(144, 145, .)
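For illustration, a thin subprocess wrapper in the spirit of nltk.tag.stanford might look roughly like the sketch below; the class name and default paths are hypothetical, and it assumes a locally built repp binary and the ERG repp.set used in the transcript above.

# Minimal sketch of wrapping the REPP command-line tool. The class name and
# default paths are hypothetical and would need to be configurable, as with
# the other third-party wrappers in NLTK.
import subprocess

class ReppWrapper:
    def __init__(self, repp_bin="src/repp", config="erg/repp.set"):
        self.repp_bin = repp_bin
        self.config = config

    def tokenize_sents(self, text):
        """Return one list of tokens per sentence, using REPP's plain output."""
        proc = subprocess.Popen(
            [self.repp_bin, "-c", self.config],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        out, _ = proc.communicate(text.encode("utf8"))
        return [line.split() for line in out.decode("utf8").splitlines() if line.strip()]

# usage, assuming the binary and config exist locally:
# print(ReppWrapper().tokenize_sents("But rule-based tokenizers are hard to maintain."))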

Possibly there are constraints further up the tool chain when it reaches the HPSG parser too. Maybe directly using ACE (http://sweaglesw.org/linguistics/ace/) through the pyDelphin interface would be better.


On a side note, there are also the tokenizer scripts from the Moses MT toolkit: https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer . It would be good to have these for better consistency between nltk.translate and mosesdecoder.


For reference's sake, there are also the Penn Treebank tokenizer, "the art of tokenization", and bio/medical tokenizers.

@goodmami
Contributor

goodmami commented Dec 2, 2015

@alvations I don't think we would gain much by using a full parser like ACE just for tokenization, unless you also want morphological analysis, or pronoun detection, or quantifier scope, or something else. I don't think it would be all that difficult to implement a REPP tokenizer in pyDelphin that doesn't require the full grammars or parsers (see the newly created delph-in/pydelphin#43), or perhaps such an implementation could be a standalone module (i.e. separate from pyDelphin), which would make it easier to incorporate into NLTK.

But REPP is basically just a system of regular expressions with a few extra niceties, so I think the primary gain for NLTK is that it could use the systems developed for the DELPH-IN grammars. However, the paper that @nschneid linked to says that Dridan and Oepen "eliminate[d] two thirds of the remaining tokenization errors" on the PTB data compared to the best-performing off-the-shelf system at the time by using REPP. On a related note, Fokkens et al. 2013 showed that tokenization (among other often-overlooked factors) has significant effects on system performance and, thus, reproducibility of results. So maybe a REPP implementation for NLTK would be useful? In general I'd be happy to help with the implementation, but I don't have spare cycles to do it all myself.

@alvations
Contributor

@nschneid for now, the simplest solution seems to be wrapping REPP and reading its output, like we do for other third-party tools in NLTK. It seems simple enough, and there are several ways to parse the tuple-like output; I'm not sure which one to choose given the variety of solutions: http://stackoverflow.com/questions/34416365/unpacking-tuple-like-textfile

Which implementation should we use to wrap REPP in NLTK?
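For what it's worth, reading the --format triple output shown above doesn't need anything fancy; here is a minimal sketch, assuming one "(start, end, token)" line per token and blank lines between sentences, as in the transcript:

# Minimal sketch for parsing REPP's `--format triple` output: one
# "(start, end, token)" line per token, blank lines separating sentences.
def read_triples(lines):
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        start, end, token = line[1:-1].split(", ", 2)
        current.append((int(start), int(end), token))
    if current:
        sentences.append(current)
    return sentences

# e.g. read_triples(open("repp_output.txt"))
# -> [[(0, 12, 'Tokenization'), (13, 15, 'is'), ...], ...]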

@goodmami Meanwhile, the slower but easier-to-maintain option is to rewrite the LISP + Perl + Boost regexes, but I think we can keep that in pyDelphin instead. Some notes: http://stackoverflow.com/questions/34048609/converting-c-boost-regexes-to-python-re-regexes It will be slow going for me as well, since I'm finishing up my PhD work. I'll try my best when traveling on trains/buses; that should kill some time while porting the regexes =)

@nschneid
Contributor Author

I have no particular objection to REPP, but it may be worth exploring other tools that are out there.

For example, https://github.com/armatthews/TokenizeAnything is written in Python and claims to do something reasonable for most languages. (It's a fairly young implementation, but it is based on https://github.com/redpony/cdec/blob/master/corpus/tokenize-anything.sh, which has been around for a while.) @armatthews, do you think it would be a good idea to include it in NLTK?

@nschneid
Contributor Author

Another question, more broadly, is what options/functionality we want the tokenizer to support. E.g., I think it would be useful to have (see the sketch after this list):

  • an option to separate clitics like "'s" and "n't" or not in English (--no_english_apos in TokenizeAnything)
  • offsets back into the original string
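Offsets back into the original string are something NLTK's tokenizer interface can already express via span_tokenize(); a quick illustration with an existing tokenizer (WhitespaceTokenizer, purely for demonstration):

# NLTK's TokenizerI interface provides span_tokenize(), which yields
# (start, end) offsets into the original string; any new tokenizer we
# adopt could implement the same method.
from nltk.tokenize import WhitespaceTokenizer

text = "But rule-based tokenizers are hard to maintain."
for start, end in WhitespaceTokenizer().span_tokenize(text):
    print(start, end, text[start:end])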

@alvations
Contributor

+1 for TokenizeAnything. There's also https://github.com/jonsafari/tok-tok from @jonsafari.

@alvations
Contributor

I've written a small wrapper for REPP: https://github.com/alvations/nltk/blob/repp/nltk/tokenize/repp.py. Will do a PR once the translate modules are more stable.

@alvations
Contributor

Also ported tok-tok.pl to Python: https://github.com/alvations/nltk/blob/repp/nltk/tokenize/toktok.py.

@goodmami
Contributor

@alvations so you wrapped a REPP binary instead of implementing a REPP processor? It would then be good to provide a link to where someone could get such a binary. And I haven't used the REPP binary directly---do you know how portable it is?

@alvations
Contributor

@goodmami I'll put the information up on https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software once I get some time to do a PR. I wrote the wrapper while stuck on a train without wifi =)

I'm not sure whether REPP works outside of Linux, though; possibly we'll have to ask Rebecca or Stefan whether they've tried installing REPP on Windows/Mac. I would still like to reimplement REPP at some point, but I can't commit that much time for a while yet.

@alvations
Contributor

If anyone else is interested in reimplementing/wrapping other tokenizers/stemmers/lemmatizers, I would suggest the following list of tools, which would make good first contributions to NLTK.

Sorry, I closed the issue by mistake. It's reopened now.

@alvations
Contributor

alvations commented Dec 22, 2016

Now that the Moses tokenizer and detokenizer are working (#1551, #1553), does any brave soul want to try reimplementing Elephant with sklearn?
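In case it helps whoever picks this up: Elephant does character-level sequence labelling (CRFs plus unsupervised feature learning), but the core idea can be sketched with plain sklearn. The training sentence, spans, and labels below are made up for illustration; a real reimplementation would use a proper CRF and much more data.

# Toy sketch of tokenization as character-level sequence labelling:
# tag each character as B (begins a token), I (inside a token) or
# O (outside, e.g. whitespace), then rebuild tokens from the tags.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def char_features(text, i, window=2):
    feats = {"is_space": text[i].isspace(), "is_alnum": text[i].isalnum()}
    for off in range(-window, window + 1):
        j = i + off
        feats[f"char[{off}]"] = text[j] if 0 <= j < len(text) else "<pad>"
    return feats

def labels_from_spans(text, spans):
    # spans are gold (start, end) character offsets of tokens
    labels = ["O"] * len(text)
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels

train_text = "But rule-based tokenizers are hard to maintain."
train_spans = [(0, 3), (4, 14), (15, 25), (26, 29), (30, 34),
               (35, 37), (38, 46), (46, 47)]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit([char_features(train_text, i) for i in range(len(train_text))],
          labels_from_spans(train_text, train_spans))

def tokenize(text):
    tags = model.predict([char_features(text, i) for i in range(len(text))])
    tokens, current = [], ""
    for ch, tag in zip(text, tags):
        if tag == "B":
            if current:
                tokens.append(current)
            current = ch
        elif tag == "I":
            current += ch
        else:
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens

print(tokenize("But their rules are language specific."))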

@alvations
Contributor

From #1860, it looks like the TreebankTokenizer that we're using as the default word_tokenize() is rather outdated, and URLs and dates aren't really handled.

Taking a closer look at the TreebankTokenizer regexes and MosesTokenizer regexes, there's no easy solution.

For the TreebankTokenizer, it might be easier to tease apart the colons from the commas at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L76
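Just to illustrate the idea (this is not the actual rule in treebank.py): ':' and ',' could be handled by separate patterns so each gets its own exceptions, e.g. leaving commas and colons that sit between digits ("1,000", "5:30") intact.

# Illustration only, not the actual treebank.py rule: split ':' and ','
# unless they sit between two digits; the extra spaces are harmless
# because the tokenizer splits on whitespace at the end.
import re

text = "At 5:30, prices rose by 1,000 points: a record."
text = re.sub(r"(?<!\d):|:(?!\d)", r" : ", text)
text = re.sub(r"(?<!\d),|,(?!\d)", r" , ", text)
print(text)
# At 5:30 ,  prices rose by 1,000 points :  a record.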

For Moses, it looks like the default tokenizer doesn't keep dates or URLs together:

$ echo 'This is a sentence with dates like 23/12/1923 and 05/11/2013, and an URL like https://hello.world.com' > test.in

$ ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < test.in 
This is a sentence with dates like 23 / 12 / 1923 and 05 / 11 / 2013 , and an URL like https : / / hello.world.com

Perhaps we need to consider a more modern tokenizer as the NLTK default word_tokenize()?

@no-identd

Perhaps, we need to consider a more modern tokenizer as NLTK default word_tokenize()?

Consider #2202? :)

@no-identd

Also, of note, the current bug (I mean #1214) seems to be missing the tokenizer label.

@no-identd

@alvations still missing the label :p

@andreasvc

Would there be interest in including syntok as the default sentence splitter and word tokenizer? It is pure Python.

@no-identd

@andreasvc have you seen #2202 (comment) +sequel? I wonder how one could accommodate both simultaneously and/or in a hybrid manner 🤔

@andreasvc

I saw it; it sounds more complicated, requiring big data files, etc.

syntok is a simple, self-contained multilingual regex sentence splitter and tokenizer which keeps track of the original string indices. It seems like it would be a good default since it wouldn't require a download.
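To give a feel for it, usage is roughly the following, based on my reading of the syntok README (so treat the exact names as approximate): segmenter.process() yields paragraphs of sentences of tokens, and each token keeps its offset into the original string.

# Rough usage sketch of syntok, based on its README; the API details are
# from memory and should be checked against the current documentation.
import syntok.segmenter as segmenter

text = "Mr. Smith bought cheapsite.com for 1.5 million dollars. He paid a lot for it."
for paragraph in segmenter.process(text):
    for sentence in paragraph:
        print([(token.offset, token.value) for token in sentence])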

@goodmami
Contributor

As another alternative, and following up from earlier comments in this issue, the pure Python REPP implementation in PyDelphin (docs) has been available for about 3 years now and I'll soon be adding masking support (so things like email addresses don't get split on punctuation). A few notes:

  • It depends on a few other minor modules in PyDelphin but it wouldn't be hard to make it standalone. The main repp module isn't that big.
  • It depends on the regex module because the tokenizer definition files use some PCRE features. If these files are packaged for the NLTK they could be made compliant with Python's re module.
  • It has some nice features, such as a trace mode that shows each rule application as a diff, and sub-tokenizers (e.g., for LaTeX code, HTML tags, etc.) that can be selectively disabled.

I'm happy to help someone with porting it to the NLTK, but otherwise I'm not strongly pushing it as it already works well in PyDelphin and other options (syntok, etc.) may be good enough for the NLTK.
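For anyone evaluating it, usage looks roughly like the sketch below; the repp.set path is hypothetical (it comes from a grammar such as the ERG), and the exact method and attribute names should be checked against the delphin.repp docs.

# Rough sketch of tokenizing with PyDelphin's REPP implementation, assuming
# a local copy of a grammar's repp.set definition (path is hypothetical);
# see the delphin.repp documentation for the authoritative API.
from delphin.repp import REPP

r = REPP.from_config("erg/pet/repp.set")
result = r.tokenize("But rule-based tokenizers are hard to maintain.")
for token in result.tokens:
    print(token.form, token.lnk)   # surface form and character span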

@andreasvc

@goodmami that sounds good. Is it multilingual? You can get decent tokenization from a language-independent tokenizer, but good results require some language-specific rules/data. And from your assessment it does sound more difficult to port.

I develop pyre2 so I know about the hassles of different regex flavors...

@goodmami
Contributor

Is it multilingual?

The implementation is just the engine for applying systems of regexes, tracking original string indices, etc. The rules for tokenization are defined separately, mainly by implemented HPSG grammars. For instance, the English Resource Grammar has a rather advanced tokenizer definition, and the Indonesian grammar INDRA has a significant one, among some others. But, alas, definitions for many languages would need to be created or expanded to make a more multilingual offering. I suspect it (or any regex-based tokenizer) wouldn't work so well for languages that generally use dictionary-based tokenizers/morphological-analyzers, like Chinese and Japanese. The good news is that the sub-tokenizers (for handling LaTeX, HTML, XML, email addresses, etc.) can likely be reused across languages.

@stevenbird
Member

Before we decide to port anything it would be good to consider how it will be maintained as the original is improved. Another solution is to provide recipes for combining functionality from multiple existing packages which are independently maintained.

@andreasvc

My aim in proposing to incorporate syntok is to have a simple default tokenizer/splitter that is (a little) better than the current anglocentric (TreebankTokenizer) or unsupervised (PunktSentenceTokenizer) defaults.

The question about maintenance is a good one; however, providing recipes for using an independently maintained package wouldn't address my concern, which is that I would like a better default NLTK tokenizer/splitter that I can recommend to beginners. Ideally this wouldn't require any extra packages or downloads beyond standard NLTK.

@andreasvc

I just noticed the issue title says "for English", so talking about language independence and multilinguality is a bit off-topic...
