Incorporate more accurate sentence-splitter, tokenizer, and/or lemmatizer for English? #1214
Comments
+1 @nschneid Most of Rebecca's work is in HPSG, which I would love to integrate into NLTK, but it's a tough nut. @goodmami, @fcbond and the DELPH-IN group have done quite some work with https://github.com/delph-in/pydelphin. Possibly a Python wrapper for REPP would be worth the code =)
@nschneid After some trawling through the REPP code: there are quite a lot of LISP rules written in separate files. Maybe the first thing we could try is to collect all of them into a single text file and then write a wrapper to read those rules. Or possibly wrapping the whole tool itself might be easier, like what we did with

alvas@ubi:~/repp$ cat test.txt
Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve.
But rule-based tokenizers are hard to maintain and their rules language specific.
We show that high accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning.
We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.
alvas@ubi:~/repp$ cat test.txt|src/repp -c erg/repp.set
Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve .
But rule-based tokenizers are hard to maintain and their rules language specific .
We show that high accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning .
We evaluated our method on three languages and obtained error rates of 0.27 ‰ ( English ) , 0.35 ‰ ( Dutch ) and 0.76 ‰ ( Italian ) for our best models .
alvas@ubi:~/repp$ cat test.txt|src/repp -c erg/repp.set --format triple
(0, 12, Tokenization)
(13, 15, is)
(16, 22, widely)
(23, 31, regarded)
(32, 34, as)
(35, 36, a)
(37, 43, solved)
(44, 51, problem)
(52, 55, due)
(56, 58, to)
(59, 62, the)
(63, 67, high)
(68, 76, accuracy)
(77, 81, that)
(82, 91, rulebased)
(92, 102, tokenizers)
(103, 110, achieve)
(110, 111, .)
(0, 3, But)
(4, 14, rule-based)
(15, 25, tokenizers)
(26, 29, are)
(30, 34, hard)
(35, 37, to)
(38, 46, maintain)
(47, 50, and)
(51, 56, their)
(57, 62, rules)
(63, 71, language)
(72, 80, specific)
(80, 81, .)
(0, 2, We)
(3, 7, show)
(8, 12, that)
(13, 17, high)
(18, 26, accuracy)
(27, 31, word)
(32, 35, and)
(36, 44, sentence)
(45, 57, segmentation)
(58, 61, can)
(62, 64, be)
(65, 73, achieved)
(74, 76, by)
(77, 82, using)
(83, 93, supervised)
(94, 102, sequence)
(103, 111, labeling)
(112, 114, on)
(115, 118, the)
(119, 128, character)
(129, 134, level)
(135, 143, combined)
(144, 148, with)
(149, 161, unsupervised)
(162, 169, feature)
(170, 178, learning)
(178, 179, .)
(0, 2, We)
(3, 12, evaluated)
(13, 16, our)
(17, 23, method)
(24, 26, on)
(27, 32, three)
(33, 42, languages)
(43, 46, and)
(47, 55, obtained)
(56, 61, error)
(62, 67, rates)
(68, 70, of)
(71, 75, 0.27)
(76, 77, ‰)
(78, 79, ()
(79, 86, English)
(86, 87, ))
(87, 88, ,)
(89, 93, 0.35)
(94, 95, ‰)
(96, 97, ()
(97, 102, Dutch)
(102, 103, ))
(104, 107, and)
(108, 112, 0.76)
(113, 114, ‰)
(115, 116, ()
(116, 123, Italian)
(123, 124, ))
(125, 128, for)
(129, 132, our)
(133, 137, best)
(138, 144, models)
(144, 145, .)
Possibly there are constraints further up the tool chain when it reaches the HPSG parser too. Maybe directly using ACE (http://sweaglesw.org/linguistics/ace/) is another option. On a side note, there are also the tokenizers from the Moses MT toolkit: https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer. It would be good to have these for better consistency. For reference's sake, there's also the Penn Treebank tokenizer, the "art of tokenization", and bio/medical tokenizers.
@alvations I don't think we would gain much by using a full parser like ACE just for tokenization, unless you also want morphological analysis, or pronoun detection, or quantifier scope, or something else. I don't think it would be all that difficult to implement a REPP tokenizer in pyDelphin that doesn't require the full grammars or parsers (see the newly created delph-in/pydelphin#43), or perhaps such an implementation could be a standalone module (i.e. separate from pyDelphin), which would make it easier to incorporate into NLTK.

But REPP is basically just a system of regular expressions with a few extra niceties, so I think the primary gain for NLTK is that it could use the systems developed for the DELPH-IN grammars. However, the paper that @nschneid linked to says that Dridan and Oepen "eliminate[d] two thirds of the remaining tokenization errors" on the PTB data compared to the best-performing off-the-shelf system at the time by using REPP. On a related note, Fokkens et al. 2013 showed that tokenization (among other often-overlooked factors) has significant effects on system performance and, thus, reproducibility of results. So maybe a REPP implementation for NLTK would be useful?

In general I'd be happy to help with the implementation, but I don't have spare cycles to do it all myself.
@nschneid For now, the simplest solution seems to be wrapping REPP and reading its output files, like other third-party tools in NLTK. It seems simple enough, and there are several possible implementations; I'm not sure which one to choose given the variety of solutions: http://stackoverflow.com/questions/34416365/unpacking-tuple-like-textfile

Which implementation should we use to wrap REPP in NLTK? @goodmami

Meanwhile, the slower but easier-to-maintain option is to rewrite the LISP + Perl + Boost regexes, but I think we can keep that for later.
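For what it's worth, turning the `--format triple` output shown earlier into tuples only takes a few lines. This is a hypothetical helper sketch, not code from any of the linked implementations; it assumes blank lines separate sentences in the binary's output:

```python
import re

def parse_triples(lines):
    """Parse REPP --format triple lines like '(0, 12, Tokenization)'
    into (start, end, token) tuples, grouped into sentences."""
    triple = re.compile(r"^\((\d+), (\d+), (.*)\)$")
    sentence, sentences = [], []
    for line in lines:
        m = triple.match(line.strip())
        if m:
            start, end, token = m.groups()
            sentence.append((int(start), int(end), token))
        elif sentence:
            # a non-matching (e.g. blank) line ends the current sentence
            sentences.append(sentence)
            sentence = []
    if sentence:
        sentences.append(sentence)
    return sentences
```

The character offsets index into the original untokenized string, which is what makes this format more useful than plain space-separated output.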
I have no particular objection to REPP, but it may be worth exploring other tools that are out there. For example, https://github.com/armatthews/TokenizeAnything is written in Python and claims to do something reasonable for most languages. (It's a fairly young implementation, but is based on https://github.com/redpony/cdec/blob/master/corpus/tokenize-anything.sh, which has been around for a while.) @armatthews, do you think it would be a good idea to include it in NLTK?
Another question more broadly is what options/functionality we want the tokenizer to support. E.g., I think it would be useful to have:
+1 for TokenizeAnything. There's also https://github.com/jonsafari/tok-tok from @jonsafari.
I've written a small wrapper for REPP: https://github.com/alvations/nltk/blob/repp/nltk/tokenize/repp.py. Will do a PR once the
Ported
@alvations so you wrapped a REPP binary instead of implementing a REPP processor? It would then be good to provide a link to where someone could get such a binary. And I haven't used the REPP binary directly; do you know how portable it is?
@goodmami I would put the information up on https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software once I get some time to do a PR. I wrote the wrapper while stuck on a train without wifi =) I'm not sure whether REPP works outside of Linux, though; possibly we'll have to ask Rebecca or Stefan whether they've tried installing REPP on Windows/Mac. I would still like to reimplement REPP at some point, but I can't commit much time for a while yet.
If anyone else is interested in reimplementing/wrapping other tokenizers/stemmers/lemmatizers, I would suggest the following list of tools. (Sorry, I pressed close on the issue by mistake. It's reopened now.)
From #1860, it looks like the TreebankTokenizer that we're using as the default has problems. Taking a closer look at the TreebankTokenizer regexes and MosesTokenizer regexes, there's no easy solution. For Moses, it looks like the default tokenizer didn't care about keeping the numbers together:
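The number-handling difference can be illustrated with two toy comma rules; neither is the actual Treebank or Moses regex, they are invented here purely for illustration:

```python
import re

text = "It costs 1,000,000 dollars, really."

# naive rule: pad every comma with spaces -- this breaks the number apart
naive = re.sub(r",", " , ", text).split()

# protected rule: only pad commas NOT flanked by digits on both sides,
# so digit-internal commas (1,000,000) survive as part of the token
protected = re.sub(r"(?<!\d),|,(?!\d)", " , ", text).split()
```

With the naive rule the number is shattered into `1`, `,`, `000`, `,`, `000`; with the lookaround-protected rule, `1,000,000` stays a single token while the comma after `dollars` is still split off.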
Perhaps we need to consider a more modern tokenizer as the NLTK default.
Consider #2202? :)
Also, of note, the current bug (I mean #1214) seems to be missing the tokenizer label.
@alvations still missing the label :p
Would there be interest in including syntok as the default sentence splitter and word tokenizer? It is pure Python. |
@andreasvc have you seen #2202 (comment) and its follow-up? I wonder how one could accommodate both simultaneously and/or in a hybrid manner 🤔
I saw it; it sounds more complicated, requiring big data files etc. syntok is a simple, self-contained, multilingual regex sentence splitter and tokenizer which keeps track of the original string indices. It seems like it would be a good default since it wouldn't require a download.
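To illustrate the index-tracking idea (this is a toy sketch, not syntok's actual API): a tokenizer built on `re.finditer` gets the original string offsets for free, since each match carries its own start/end positions.

```python
import re

def tokenize_with_offsets(text):
    """Return (start, end, token) triples indexing into the original
    string. Word-internal hyphens, periods, and commas (e.g. rule-based,
    0.27) stay inside the token; other punctuation is split off."""
    return [
        (m.start(), m.end(), m.group())
        for m in re.finditer(r"\w+(?:[-.,]\w+)*|[^\w\s]", text)
    ]
```

Keeping offsets means downstream code can always map a token back to its exact span in the raw text, which is the property that makes de-tokenization and stand-off annotation painless.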
As another alternative, and following up from earlier comments in this issue, the pure Python REPP implementation in PyDelphin (docs) has been available for about 3 years now and I'll soon be adding masking support (so things like email addresses don't get split on punctuation). A few notes:
I'm happy to help someone with porting it to the NLTK, but otherwise I'm not strongly pushing it, as it already works well in PyDelphin and other options (syntok, etc.) may be good enough for the NLTK.
@goodmami that sounds good. Is it multilingual? You can get decent tokenization from a language-independent tokenizer, but good results require some language-specific rules/data. And from your assessment it does sound more difficult to port. I develop pyre2 so I know about the hassles of different regex flavors...
The implementation is just the engine for applying systems of regexes, tracking original string indices, etc. The rules for tokenization are defined separately, mainly by implemented HPSG grammars. For instance, the English Resource Grammar has a rather advanced tokenizer definition, and the Indonesian grammar INDRA has a significant one, among some others. But, alas, definitions for many languages would need to be created or expanded to make a more multilingual offering. I suspect it (or any regex-based tokenizer) wouldn't work so well for languages that generally use dictionary-based tokenizers/morphological-analyzers, like Chinese and Japanese. The good news is that the sub-tokenizers (for handling LaTeX, HTML, XML, email addresses, etc.) can likely be reused across languages. |
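For the curious, the engine/rules separation described here can be sketched very roughly: an ordered list of rewrite rules applied in sequence, then a final split. The two rules below are invented for illustration and are not from the ERG or any actual REPP definition (and a real engine also tracks character offsets through each rewrite, which this sketch omits):

```python
import re

# an ordered list of (pattern, replacement) rewrite rules;
# order matters, since each rule sees the previous rule's output
RULES = [
    (re.compile(r"([.!?])$"), r" \1"),  # detach sentence-final punctuation
    (re.compile(r"([()])"), r" \1 "),   # pad parentheses with spaces
]

def apply_rules(text, rules=RULES):
    """Apply each rewrite rule in order, then split on whitespace."""
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    return text.split()
```

Swapping in a different rule file is all it takes to change languages or genres, which is exactly why the grammar-specific rule definitions, not the engine, are the scarce resource.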
Before we decide to port anything it would be good to consider how it will be maintained as the original is improved. Another solution is to provide recipes for combining functionality from multiple existing packages which are independently maintained. |
My aim in proposing to incorporate syntok is to have a simple default tokenizer/splitter which is (a little) better than the current anglocentric (TreebankTokenizer) or unsupervised (PunktSentenceTokenizer) defaults. The question about maintenance is a good one; however, providing recipes for using an independently maintained package wouldn't address my concern, which is that I would like a better default NLTK tokenizer/splitter that I can recommend to beginners. Ideally this wouldn't require any extra packages or downloads beyond standard NLTK.
I just noticed the issue title says "for English", so talking about language independence and multilinguality is a bit off-topic...
Among open issues, we have (not an exhaustive list):
I'm not an expert on these tasks but I know that Rebecca Dridan, for instance, has recently published methods for some of them. Given that segmentation and lemmatization are so widely used for preprocessing, state-of-the-art methods may deserve some attention.
The issue of genre (news/edited text, informal web text, tweets/SMS) is important as well: hence the separate Twitter tokenizer.