This version introduces new and improved tools, focusing on out-of-domain accuracy and robustness:
- New 3-step normalization framework using Foma
- Added smart rebinding module (`-d 3`) by @lgessler
- New stacked segmentation, now using xgboost and better handling of ambiguous groups
- New POS tagger using Marmot
- Hyperparameter optimization
- Various data/lexicon/ruleset improvements and bugfixes
- Complete unit test suite in `run_tests.py` and evaluation suite in `eval/`