ERRANT v2.3.0

@chrisjbryant chrisjbryant released this 15 Jul 16:48

v2.3.0 (15-07-21)

  1. Added new rules that reclassify many OTHER-type edits, especially 1:1 edits, under more specific error types. There are now ~40% fewer 1:1 OTHER edits and ~15% fewer n:n OTHER edits overall (tested on the FCE and W&I training sets combined). The changes are as follows:

    • A possessive suffix at the start of a merge sequence is now always split:
    Example: people life -> people 's lives
    Old: life -> 's lives (R:OTHER)
    New: ε -> 's (M:NOUN:POSS), life -> lives (R:NOUN:NUM)
    • NUM <-> DET edits are now classified as R:DET; e.g. one (cat) -> a (cat). Thanks to @katkorre!

    • Changed the string similarity score in the classifier from the Levenshtein ratio to the Levenshtein distance normalised by the length of the longer input string. This is because we felt some ratio scores were unintuitive; e.g. smt -> something has a ratio score of 0.5 despite the insertion of 6 characters (the new normalised score is 0.33).

    • The non-word spelling error rules were updated slightly to take the new normalised Levenshtein score into account. Additionally, dissimilar strings are now classified based on the POS tag of the correction rather than as OTHER; e.g. amougnht -> number (R:NOUN).

    • The new normalised Levenshtein score is also used to classify many of the remaining 1:1 replacement edits that were previously classified as OTHER. Many of these are real-word spelling errors (e.g. their <-> there), but there are also some morphological errors (e.g. health -> healthy) and POS-based errors (e.g. transport -> travel). Note that these rules are a little complex and depend on both the similarity score and the lengths of the original and corrected strings. For example, form -> from (R:SPELL) and eventually -> finally (R:ADV) have the same similarity score of 0.5 but are assigned different error types because of their string lengths.

  2. Various minor updates:

    • out_m2 in parallel_to_m2.py and m2_to_m2.py is now opened and closed properly. #20
    • Fixed a bracketing error that deleted a valid edit in rare circumstances. #26 #28
    • Updated the English wordlist.
    • Minor changes to the readme.
    • Tidied up some code comments.
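
For readers who want to reproduce the similarity scores quoted in item 1, here is a minimal, dependency-free sketch. It is not ERRANT's own code. It assumes the old ratio follows the common definition (summed length minus an edit distance with substitution cost 2, divided by the summed length) and that the new score is 1 minus the unit-cost Levenshtein distance divided by the length of the longer string. Both assumptions match the 0.5 and 0.33 figures given above. The function names are hypothetical.

```python
def levenshtein(a: str, b: str, sub_cost: int = 1) -> int:
    """Edit distance with unit insert/delete cost and a configurable substitution cost."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                  # delete ca
                curr[j - 1] + 1,                              # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def ratio(a: str, b: str) -> float:
    """Assumed old score: substitutions cost 2, normalised by the summed length."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - levenshtein(a, b, sub_cost=2)) / total


def normalised_similarity(a: str, b: str) -> float:
    """Assumed new score: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


if __name__ == "__main__":
    for orig, cor in [("smt", "something"), ("form", "from"), ("eventually", "finally")]:
        print(orig, "->", cor,
              round(ratio(orig, cor), 2),
              round(normalised_similarity(orig, cor), 2))
```

Under these assumptions, smt -> something scores 0.5 with the ratio and 0.33 with the normalised score, while form -> from and eventually -> finally both score 0.5 with the normalised score. This is why the classifier also needs the string lengths to tell those last two apart.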