Releases: CopticScriptorium/coptic-nlp

V3.0.0 - New tools and improved accuracy

20 Sep 16:25
c8931ba

This version introduces new and improved tools, focusing on out-of-domain accuracy and robustness:

  • New three-step normalization framework using Foma
  • Added smart rebinding module (-d 3) by @lgessler
  • New stacked segmentation, now using XGBoost, with better handling of ambiguous groups
  • New POS tagger using MarMoT
  • Hyperparameter optimization
  • Various data/lexicon/ruleset improvements and bugfixes
  • Complete unit test suite in run_tests.py and evaluation suite in eval/

V2.2.0 - bugfix and better interface to detokenizer

17 Jun 14:06
a35476b
  • Refactor rf_tokenizer for relative import as single file
  • Smarter auto 'line' tag detection in api.py
  • Adjust boundaries within thetas in 'from pipes' mode (bug fix)
  • Add detokenizer to web interface
  • More control over the detokenizer: aggressive/conservative modes and splitting of norms at the group merge point
  • Option to merge gold trees into pipeline

V2.1.0 - MWE detection, detokenizer, improved whitespace/punctuation handling

26 Oct 14:41
57c668e
  • Multiword expression detection based on Coptic Dictionary Online (use -m option)
  • Detokenizer to auto-adjust bound groups to Layton's segmentation standards:
    • -d 1 = conservative (only re-bind high certainty groups from alternative editorial practices)
    • -d 2 = aggressive (re-bind anything that doesn't look like it should be separate)
    • --segment_merged option: enforces a boundary at detokenized merge positions
  • Improved --space option to separate punctuation spelled together with bound groups
  • Various bug fixes and performance improvements
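
The V2.1.0 options above could be combined on one command line. The following is a minimal sketch of assembling such a call, not the project's own tooling: the script name `coptic_nlp.py`, the helper `build_command`, and the ordering of arguments are assumptions made for illustration; only the flags `-m`, `-d 1`, `-d 2` and `--segment_merged` come from these notes.

```python
def build_command(infile, detok_level=None, mwe=False, segment_merged=False,
                  script="coptic_nlp.py"):
    """Build an argument list from the options named in the V2.1.0 notes.

    `script` and this helper are hypothetical; check the repository's own
    documentation for the real entry point and argument order.
    """
    # Only the two detokenizer levels documented for V2.1.0 are accepted here.
    if detok_level not in (None, 1, 2):
        raise ValueError("V2.1.0 documents only -d 1 and -d 2")

    cmd = ["python", script]
    if mwe:
        cmd.append("-m")                      # multiword expression detection
    if detok_level is not None:
        cmd.extend(["-d", str(detok_level)])  # 1 = conservative, 2 = aggressive
    if segment_merged:
        cmd.append("--segment_merged")        # boundary at detokenized merge positions
    cmd.append(infile)
    return cmd


if __name__ == "__main__":
    # Print the assembled command instead of running it.
    print(" ".join(build_command("input.txt", detok_level=1, mwe=True)))
```

The helper only returns the list, so the result can be inspected or passed to `subprocess.run` once the actual script path is confirmed.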