Include some pre-packaged NLP tools #95

Open
amir-zeldes opened this issue Sep 19, 2018 · 2 comments

Comments

@amir-zeldes
Contributor

For example, provide a built-in tokenizer that can be called directly instead of only through an external REST API.

@lgessler
Collaborator

NLTK has several tokenizers, and we could let users choose among them with a single line in the config.
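As a rough illustration of choosing a tokenizer from one config line, here is a minimal sketch using only the standard library. The `[nlp]` section, the `tokenizer` key, the `TOKENIZERS` registry and `load_tokenizer` are all hypothetical names, not part of GitDox; the two registered callables are stand-ins, and the `tokenize` methods of NLTK tokenizer objects could presumably be registered the same way.

```python
import configparser
import re

# Hypothetical registry mapping config values to tokenizer callables.
# These are stdlib stand-ins; NLTK tokenizers could be added here instead.
TOKENIZERS = {
    "whitespace": str.split,
    "wordpunct": lambda text: re.findall(r"\w+|[^\w\s]+", text),
}


def load_tokenizer(config_text):
    """Return the tokenizer named in the (assumed) [nlp] section of the config."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Fall back to whitespace splitting if the section or key is missing.
    name = parser.get("nlp", "tokenizer", fallback="whitespace")
    if name not in TOKENIZERS:
        raise ValueError(
            f"Unknown tokenizer {name!r}; choose from {sorted(TOKENIZERS)}"
        )
    return TOKENIZERS[name]


EXAMPLE_CONFIG = """
[nlp]
tokenizer = wordpunct
"""

tokenize = load_tokenizer(EXAMPLE_CONFIG)
print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Keeping the registry as a plain dictionary means an unknown name fails early with a readable error rather than at tokenization time.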

@amir-zeldes
Contributor Author

One issue with NLTK is that it is not XML-preserving: if users need to be able to transform data to spreadsheet mode, we need a tokenizer that produces TT-SGML (or we need to offer different ways of transforming to spreadsheets). The TreeTagger tokenizer does this, but it is written in Perl (this is what GU GitDox currently uses via a service call). However, I recently ported this tokenizer to Python here:

https://github.com/amir-zeldes/HebPipe/blob/master/lib/whitespace_tokenize.py

This could be a candidate for a generic tokenizer: it preserves XML, outputs TT format, and lets you plug in different abbreviation files so that language-specific abbreviations are not split.
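To make the idea concrete, here is a minimal sketch of an XML-preserving tokenizer that emits one token or tag per line and skips splitting for known abbreviations. It is not the HebPipe port linked above; `tokenize_tt`, `split_word` and the punctuation set are hypothetical, and it assumes tags never contain a literal `>` inside attribute values.

```python
import re

# Matches a single XML/SGML tag; the capture group keeps tags in re.split output.
TAG_RE = re.compile(r"(<[^>]+>)")
# Punctuation split off word edges (an assumed, deliberately small set).
PUNCT = ".,;:!?\"'()[]"


def split_word(word, abbreviations):
    """Split edge punctuation off a word unless it is a known abbreviation."""
    if word in abbreviations:
        return [word]
    leading, trailing = [], []
    while word and word[0] in PUNCT:
        leading.append(word[0])
        word = word[1:]
    # Stop stripping as soon as what remains is a known abbreviation.
    while word and word[-1] in PUNCT and word not in abbreviations:
        trailing.insert(0, word[-1])
        word = word[:-1]
    core = [word] if word else []
    return leading + core + trailing


def tokenize_tt(text, abbreviations=frozenset()):
    """Return a list of lines: tags kept whole, text split into tokens."""
    lines = []
    for chunk in TAG_RE.split(text):
        if not chunk:
            continue
        if TAG_RE.fullmatch(chunk):
            lines.append(chunk)  # tags pass through untouched
            continue
        for word in chunk.split():
            lines.extend(split_word(word, abbreviations))
    return lines


sample = '<p n="1">Dr. Smith arrived (late).</p>'
print("\n".join(tokenize_tt(sample, abbreviations={"Dr."})))
```

In this sketch the abbreviation set would be loaded from a per-language file, one entry per line, which is what makes the same code reusable across languages.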
