ERRANT v2.2.0
Christopher Bryant committed May 6, 2020
1 parent e1e6066 commit 1a56544
Showing 6 changed files with 33 additions and 57 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,12 @@

This log describes all the significant changes made to ERRANT since its release.

+## v2.2.0 (06-05-20)
+
+1. ERRANT now works with spaCy v2.2. This makes it 4x slower, but the change was necessary for ERRANT to work on Python 3.7.
+
+2. SpaCy 2 uses slightly different POS tags to spaCy 1 (e.g. auxiliary verbs are now tagged AUX rather than VERB), so I updated some of the merging rules to maintain performance.
+
## v2.1.0 (09-01-20)

1. The character level cost in the sentence alignment function is now computed by the much faster [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) library instead of python's native `difflib.SequenceMatcher`. This makes ERRANT 3x faster!
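The tagging difference behind item 2 of the v2.2.0 notes is easy to see directly. A minimal sketch, assuming spaCy v2.2 and its default English model (installed via `python3 -m spacy download en`):

```python
# Print the universal POS tag for each token (spaCy 2.x).
import spacy

nlp = spacy.load("en")
for tok in nlp("He has eaten"):
    print(tok.text, tok.pos_)
# spaCy 2 typically tags the auxiliary "has" as AUX; spaCy 1 tagged it VERB.
```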
25 changes: 12 additions & 13 deletions README.md
@@ -1,4 +1,4 @@
-# ERRANT v2.1.0
+# ERRANT v2.2.0

This repository contains the grammatical ERRor ANnotation Toolkit (ERRANT) described in:

@@ -37,20 +37,23 @@ source errant_env/bin/activate
pip3 install errant
python3 -m spacy download en
```
-This will create and activate a new python3 environment called `errant_env` in the current directory. `pip` will then install ERRANT, [spaCy v1.9.0](https://spacy.io/), [NLTK](http://www.nltk.org/), [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but must remember to activate it again whenever you want to use ERRANT.
+This will create and activate a new python3 environment called `errant_env` in the current directory. `pip` will then install ERRANT, [spaCy](https://spacy.io/), [NLTK](http://www.nltk.org/), [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but you must remember to activate it again whenever you want to use ERRANT.

-#### BEA-2019 Shared Task
+#### ERRANT and spaCy 2

+ERRANT was originally designed to work with spaCy v1.9.0 and works best with this version. SpaCy v1.9.0 does not work with Python >= 3.7, however, so we were forced to update ERRANT to be compatible with spaCy 2. Since spaCy 2 uses a neural system that trades speed for accuracy (see the [official spaCy benchmarks](https://spacy.io/usage/facts-figures#spacy-models)), ERRANT v2.2.0 is **over 4x slower** than ERRANT v2.1.0.
+
-ERRANT v2.0.0 was designed to be fully compatible with the [BEA-2019 Shared Task](https://www.cl.cam.ac.uk/research/nl/bea2019st/). If you want to directly compare against the results in the shared task, you should make sure to install ERRANT v2.0.0 as newer versions may produce slightly different scores.
+There is no way around this if you use Python >= 3.7, but we recommend installing ERRANT v2.1.0 if you use Python < 3.7.
```
-pip3 install errant==2.0.0
+pip3 install errant==2.1.0
```

-#### ERRANT and spaCy 2
-
-ERRANT was originally designed to work with spaCy v1.9.0 and so only officially supports this version. We nevertheless tested ERRANT v2.1.0 with spaCy v2.2.3 and found it to be **over 4x slower and ~2% less accurate**.
+#### BEA-2019 Shared Task

-This is mainly because spaCy 2 uses a neural system to trade speed for accuracy (see the [official spaCy benchmarks](https://spacy.io/usage/facts-figures#spacy-models)), but also because some Universal POS tag mappings changed, and so certain ERRANT rules no longer worked as intended. Although we could offset the accuracy loss by modifying ERRANT rules for the new POS mappings, there is nothing we can do about the significant speed loss, and so do not recommend spaCy 2 with ERRANT at this time.
+ERRANT v2.0.0 was designed to be fully compatible with the [BEA-2019 Shared Task](https://www.cl.cam.ac.uk/research/nl/bea2019st/). If you want to directly compare against the results in the shared task, you should make sure to install ERRANT v2.0.0, as newer versions may produce slightly different scores. You can also use [Codalab](https://competitions.codalab.org/competitions/20228) to evaluate anonymously on the shared task datasets. ERRANT v2.0.0 is not compatible with Python >= 3.7.
+```
+pip3 install errant==2.0.0
+```
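Which ERRANT version to install therefore depends on your Python version. A small decision sketch (the cut-offs are taken from the two sections above; this is not an official errant helper):

```python
# Choose an ERRANT version from the constraints described above.
import sys

if sys.version_info >= (3, 7):
    print("pip3 install errant==2.2.0  # needs spaCy 2")
else:
    print("pip3 install errant==2.1.0  # spaCy 1.9.0; faster")
# To compare against BEA-2019 results: errant==2.0.0 (Python < 3.7 only).
```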

## Source Install

@@ -100,10 +103,6 @@ Three main commands are provided with ERRANT: `errant_parallel`, `errant_m2` and

All these scripts also have additional advanced command line options which can be displayed using the `-h` flag.

-#### Runtime
-
-In terms of speed, ERRANT processes ~500 sents/sec in the fully automatic edit extraction and classification setting, but ~1000 sents/sec in the classification setting alone. These figures were calculated on an Intel Core i5-6600 @ 3.30GHz machine, but results will vary depending on how different/long the original and corrected sentences are.
-
## API

As of v2.0.0, ERRANT now also comes with an API.
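For orientation, here is a minimal sketch of the API entry point touched by this commit; `errant.load(lang, nlp=None)` appears in the `errant/__init__.py` diff below, and the annotator methods follow the full README (not shown in this excerpt):

```python
# Extract and classify edits between an original and corrected sentence.
import errant

annotator = errant.load("en")
orig = annotator.parse("This are gramatical sentence .")
cor = annotator.parse("This is a grammatical sentence .")
for edit in annotator.annotate(orig, cor):
    print(edit)
```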
7 changes: 1 addition & 6 deletions errant/__init__.py
@@ -1,10 +1,9 @@
from importlib import import_module
-import logging
import spacy
from errant.annotator import Annotator

# ERRANT version
-__version__ = '2.1.0'
+__version__ = '2.2.0'

# Load an ERRANT Annotator object for a given language
def load(lang, nlp=None):
@@ -15,10 +14,6 @@ def load(lang, nlp=None):

# Load spacy
nlp = nlp or spacy.load(lang, disable=["ner"])
-    # Warning for spacy 2
-    if spacy.__version__[0] == "2":
-        logging.warning("ERRANT is 4x slower and 2% less accurate with spaCy 2. "
-            "We strongly recommend spaCy 1.9.0!")

# Load language edit merger
merger = import_module("errant.%s.merger" % lang)
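Because of the `nlp = nlp or spacy.load(lang, disable=["ner"])` default above, callers can hand `load` a pre-built pipeline. A sketch of the pattern, assuming you want to reuse one spaCy instance rather than load a second one:

```python
# Reuse an existing spaCy pipeline instead of letting errant.load create one.
import spacy
import errant

nlp = spacy.load("en", disable=["ner"])  # same default errant.load uses
annotator = errant.load("en", nlp)       # skips the internal spacy.load
```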
38 changes: 7 additions & 31 deletions errant/en/classifier.py
@@ -2,7 +2,7 @@
import Levenshtein
from nltk.stem import LancasterStemmer
import spacy
-import spacy.parts_of_speech as POS
+import spacy.symbols as POS

# Load Hunspell word list
def load_word_list(path):
@@ -201,7 +201,7 @@ def get_two_sided_type(o_toks, c_toks):
if o_toks[0].text not in spell and \
o_toks[0].lower_ not in spell:
# Check if both sides have a common lemma
-if same_lemma(o_toks[0], c_toks[0]):
+if o_toks[0].lemma == c_toks[0].lemma:
# Inflection; often count vs mass nouns or e.g. got vs getted
if o_pos == c_pos and o_pos[0] in {"NOUN", "VERB"}:
return o_pos[0]+":INFL"
@@ -227,7 +227,7 @@

# 3. MORPHOLOGY
# Only ADJ, ADV, NOUN and VERB can have inflectional changes.
-if same_lemma(o_toks[0], c_toks[0]) and \
+if o_toks[0].lemma == c_toks[0].lemma and \
o_pos[0] in open_pos2 and \
c_pos[0] in open_pos2:
# Same POS on both sides
@@ -316,7 +316,7 @@ def get_two_sided_type(o_toks, c_toks):
if len(set(o_pos+c_pos)) == 1:
# Final verbs with the same lemma are tense; e.g. eat -> has eaten
if o_pos[0] == "VERB" and \
-    same_lemma(o_toks[-1], c_toks[-1]):
+    o_toks[-1].lemma == c_toks[-1].lemma:
return "VERB:TENSE"
# POS-based tags.
elif o_pos[0] not in rare_pos:
@@ -328,19 +328,19 @@
# Infinitives, gerunds, phrasal verbs.
if set(o_pos+c_pos) == {"PART", "VERB"}:
# Final verbs with the same lemma are form; e.g. to eat -> eating
-if same_lemma(o_toks[-1], c_toks[-1]):
+if o_toks[-1].lemma == c_toks[-1].lemma:
return "VERB:FORM"
# Remaining edits are often verb; e.g. to eat -> consuming, look at -> see
else:
return "VERB"
# Possessive nouns; e.g. friends -> friend 's
if (o_pos == ["NOUN", "PART"] or c_pos == ["NOUN", "PART"]) and \
-    same_lemma(o_toks[0], c_toks[0]):
+    o_toks[0].lemma == c_toks[0].lemma:
return "NOUN:POSS"
# Adjective forms with "most" and "more"; e.g. more free -> freer
if (o_toks[0].lower_ in {"most", "more"} or \
c_toks[0].lower_ in {"most", "more"}) and \
-    same_lemma(o_toks[-1], c_toks[-1]) and \
+    o_toks[-1].lemma == c_toks[-1].lemma and \
len(o_toks) <= 2 and len(c_toks) <= 2:
return "ADJ:FORM"

@@ -369,30 +369,6 @@ def exact_reordering(o_toks, c_toks):
return True
return False

-# Input 1: A spacy orig token
-# Input 2: A spacy cor token
-# Output: Boolean; the tokens have the same lemma
-# Spacy only finds lemma for its predicted POS tag. Sometimes these are wrong,
-# so we also consider alternative POS tags to improve chance of a match.
-def same_lemma(o_tok, c_tok):
-    # Basic lemmatisation for spacy >= 2 (avoids an error at least)
-    if spacy.__version__ != "1.9.0":
-        if o_tok.lemma == c_tok.lemma:
-            return True
-        return False
-    # Multi-POS lemmatisation for spacy 1.9.0
-    o_lemmas = []
-    c_lemmas = []
-    for pos in open_pos1:
-        # Lemmatise the lower cased form of the word
-        o_lemmas.append(nlp.vocab.morphology.lemmatize(
-            pos, o_tok.lower, nlp.vocab.morphology.tag_map))
-        c_lemmas.append(nlp.vocab.morphology.lemmatize(
-            pos, c_tok.lower, nlp.vocab.morphology.tag_map))
-    if set(o_lemmas).intersection(set(c_lemmas)):
-        return True
-    return False
-
# Input 1: An original text spacy token.
# Input 2: A corrected text spacy token.
# Output: Boolean; both tokens have a dependant auxiliary verb.
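The deleted `same_lemma` helper existed to try several POS tags with spaCy 1.9.0's lemmatiser; under spaCy 2 the classifier now compares `token.lemma` directly, an integer hash that is equal exactly when the lemma strings are equal. A toy illustration, assuming spaCy 2 and its default English model (results can vary, since spaCy only lemmatises for its predicted POS tag):

```python
# token.lemma is an integer hash of token.lemma_, so equal hashes
# mean equal lemma strings.
import spacy

nlp = spacy.load("en")
o_tok = nlp("She eats .")[1]        # "eats"
c_tok = nlp("She has eaten .")[2]   # "eaten"
print(o_tok.lemma_, c_tok.lemma_)   # expected: eat eat
print(o_tok.lemma == c_tok.lemma)   # True when the lemmas match
```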
10 changes: 5 additions & 5 deletions errant/en/merger.py
@@ -2,11 +2,11 @@
from re import sub
from string import punctuation
import Levenshtein
-import spacy.parts_of_speech as POS
+import spacy.symbols as POS
from errant.edit import Edit

# Merger resources
-open_pos = {POS.ADJ, POS.ADV, POS.NOUN, POS.VERB}
+open_pos = {POS.ADJ, POS.AUX, POS.ADV, POS.NOUN, POS.VERB}

# Input: An Alignment object
# Output: A list of Edit objects
@@ -78,11 +78,11 @@ def process_seq(seq, alignment):
return process_seq(seq[:start], alignment) + \
merge_edits(seq[start:end+1]) + \
process_seq(seq[end+1:], alignment)
-# Merge same POS or infinitive/phrasal verbs:
+# Merge same POS or auxiliary/infinitive/phrasal verbs:
# [to eat -> eating], [watch -> look at]
pos_set = set([tok.pos for tok in o]+[tok.pos for tok in c])
-if (len(pos_set) == 1 and len(o) != len(c)) or \
-        pos_set == {POS.PART, POS.VERB}:
+if len(o) != len(c) and (len(pos_set) == 1 or \
+        pos_set.issubset({POS.AUX, POS.PART, POS.VERB})):
return process_seq(seq[:start], alignment) + \
merge_edits(seq[start:end+1]) + \
process_seq(seq[end+1:], alignment)
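The rewritten condition above merges a multi-token edit when all its tokens share one POS, or when the POS set mixes only AUX, PART and VERB, which covers spaCy 2's retagging of auxiliaries. A toy check of just the boolean (in `process_seq`, `o` and `c` are the aligned original and corrected token spans; here their POS lists are stubbed in by hand):

```python
# Evaluate the updated merging test on a hand-built example.
import spacy.symbols as POS

o_pos = [POS.AUX, POS.VERB]  # e.g. "has eaten"
c_pos = [POS.VERB]           # e.g. "ate"
pos_set = set(o_pos + c_pos)
merge = len(o_pos) != len(c_pos) and \
    (len(pos_set) == 1 or pos_set.issubset({POS.AUX, POS.PART, POS.VERB}))
print(merge)  # True: [has eaten -> ate] becomes a single edit
```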
4 changes: 2 additions & 2 deletions setup.py
@@ -10,7 +10,7 @@

setup(
name = "errant",
version = "2.1.0",
version = "2.2.0",
license = "MIT",
description = "The ERRor ANnotation Toolkit (ERRANT). Automatically extract and classify edits in parallel sentences.",
long_description = readme,
@@ -20,7 +20,7 @@
url = "https://github.com/chrisjbryant/errant",
keywords = ["automatic annotation", "grammatical errors", "natural language processing"],
python_requires = ">= 3.3",
install_requires = ["spacy==1.9.0", "nltk==3.4.5", "python-Levenshtein==0.12.0"],
install_requires = ["spacy>=2.2.0", "nltk==3.4.5", "python-Levenshtein==0.12.0"],
packages = find_packages(),
include_package_data=True,
entry_points = {
