Merge pull request #87 from zaibacu/rita-rust-engine

Rita rust engine
zaibacu · Aug 29, 2020 · d88d50e
2 parents 91d7718 + 4b5387c
commit d88d50e
Showing 22 changed files with 278 additions and 46 deletions.
diff --git a/.coveragerc b/.coveragerc
@@ -3,5 +3,8 @@ branch = True
 source =
     rita
 
+omit = rita/engine/translate_rust.py
+
 [report]
 show_missing = True
+omit = rita/engine/translate_rust.py
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,61 @@
+0.6.0 (2020-08-29)
+****************************
+
+Features
+--------
+
+- Implemented ability to alias macros, eg.:
+
+  .. code-block::
+
+      numbers = {"one", "two", "three"}
+      @alias IN_LIST IL
+
+      IL(numbers) -> MARK("NUMBER")
+
+  Now using "IL" will actually call "IN_LIST" macro.
+  #66
+- Introduce the TAG element as a module. It needs a new parser for the spaCy translation.
+  It allows more flexible matching of detailed part-of-speech tags, such as all adjectives or nouns: TAG("^NN|^JJ").
+
+  Implemented by:
+  Roland M. Mueller (https://github.com/rolandmueller)
+  #81
+- Add a new module for a PLURALIZE tag.
+  For a noun or a list of nouns, it will match both the singular and the plural form of each word.
+
+  Implemented by:
+  Roland M. Mueller (https://github.com/rolandmueller)
+  #82
+- Add a new configuration option implicit_hyphon (default false) for automatically adding hyphen characters (-) to the rules.
+
+  Implemented by:
+  Roland M. Mueller (https://github.com/rolandmueller)
+  #84
+- Allow passing a custom regex implementation. By default, `re` is used.
+  #86
+- Add an interface for using the Rust engine.
+
+  In general it's identical to `standalone`, but differs in one crucial part: all of the rules are compiled into actual binary code, which provides a large performance boost.
+  The engine itself is proprietary because it comes with various caveats: it is a bit more fragile and needs to be tuned for a very specific case
+  (e.g. a few long texts with many matches vs. many short texts with few matches).
+  #87
+
+Fix
+---
+
+- Fix `-` bug when it is used as a standalone word
+  #71
+- Fix regex matching when the shortest word is selected from IN_LIST
+  #72
+- Fix IN_LIST regex so that it doesn't match only part of a word
+  #75
+- Fix IN_LIST operation bug - it was ignoring them
+  #77
+- Use list branching only when using spaCy Engine
+  #80
+
+
 0.5.0 (2020-06-18)
 ****************************
 

diff --git a/changes/66.feature.rst b/changes/66.feature.rst
diff --git a/changes/71.fix.rst b/changes/71.fix.rst
diff --git a/changes/72.fix.rst b/changes/72.fix.rst
diff --git a/changes/75.fix.rst b/changes/75.fix.rst
diff --git a/changes/77.fix.rst b/changes/77.fix.rst
diff --git a/changes/80.fix.rst b/changes/80.fix.rst
diff --git a/changes/81.feature.rst b/changes/81.feature.rst
diff --git a/changes/82.feature.rst b/changes/82.feature.rst
diff --git a/changes/84.feature.rst b/changes/84.feature.rst
diff --git a/changes/86.feature.rst b/changes/86.feature.rst
diff --git a/docs/engines.md b/docs/engines.md
@@ -0,0 +1,36 @@
+# Engines
+
+In RITA, an `engine` is the system that rules are compiled to and that does the heavy lifting afterwards.
+
+Currently there are three engines:
+
+## spaCy
+
+Activated by using `rita.compile(<rules_file>, use_engine="spacy")`
+
+Using this engine, all of the RITA rules will be compiled into spaCy patterns, which can be natively used by spaCy in various scenarios.
+Most often this is used to improve NER (Named Entity Recognition) by adding extra entities derived from your rules.
+
+It requires the spaCy package to be installed (`pip install spacy`), and to actually use it, a language model needs to be downloaded (`python -m spacy download <language_code>`).
+
+## Standalone
+
+Activated by using `rita.compile(<rules_file>, use_engine="standalone")`. It compiles into pure regex and can be used with zero dependencies.
+By default, it uses the Python `re` library. Since version `0.5.10`, you can pass a custom regex implementation to use,
+e.g. the `regex` package: `rita.compile(<rules_file>, use_engine="standalone", regex_impl=regex)`
+
+It is very lightweight and very fast (compared to spaCy), but it lacks some functionality that only a proper language model can provide:
+- Patterns by entity (PERSON, ORGANIZATION, etc)
+- Patterns by Lemmas
+- Patterns by POS (Part Of Speech)
+
+Only generic things, like WORD, NUMBER can be matched.
+
+
+## Rust (new in `0.6.0`)
+
+The code contains only an interface; the engine itself is proprietary.
+
+In general it's identical to `standalone`, but differs in one crucial part: all of the rules are compiled into actual binary code, which provides a large performance boost.
+The engine is proprietary because it comes with various caveats: it is a bit more fragile and needs to be tuned for a very specific case
+(e.g. a few long texts with many matches vs. many short texts with few matches).
diff --git a/docs/modules.md b/docs/modules.md
@@ -0,0 +1,56 @@
+# Modules
+
+Modules are like plugins to the system. They usually provide additional functionality at some cost: they need additional dependencies, support only a specific language, etc.
+That's why they are not included in the core system, but they can easily be imported into your rules.
+
+eg.
+```
+!IMPORT("rita.modules.fuzzy")
+
+FUZZY("squirrel") -> MARK("CRITTER")
+```
+
+**NOTE**: the import path can be any valid Python import path, so you can add extra functionality without modifying RITA's source code.
+More on that in the [Extending section](./extend.md).
+
+## Fuzzy
+
+This is more of an example than a proper module. Its main goal is to generate likely misspelled variants of a given word so that the rule matches more cases.
+It is very useful when dealing with actual natural language, e.g. comments and social media posts. The word `you` can be matched automatically as both `you` and `u`, `for` as both `for` and `4`, etc.
+
+Usage:
+```
+!IMPORT("rita.modules.fuzzy")
+
+FUZZY("squirrel") -> MARK("CRITTER")
+```
+
+## Pluralize
+
+Takes a list of words (or a single word) and creates the plural version of each.
+
+Requires the `inflect` library (`pip install inflect`). Works only on English words.
+
+Usage:
+
+```
+!IMPORT("rita.modules.pluralize")
+
+vehicles={"car", "motorbike", "bicycle", "ship", "plane"}
+{NUM, PLURALIZE(vehicles)}->MARK("VEHICLES")
+```
+
+## Tag
+
+Is used for generating POS/TAG patterns based on a regex,
+e.g. TAG("^NN|^JJ") for nouns or adjectives.
+
+Works only with the spaCy engine.
+
+Usage:
+
+```
+!IMPORT("rita.modules.tag")
+
+{WORD*, TAG("^NN|^JJ")}->MARK("TAGGED_MATCH")
+```
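
Review note on the Tag module docs: the pattern `^NN|^JJ` is an ordinary regex applied to fine-grained part-of-speech tags. The sketch below only illustrates which tags such a regex selects; the tag list is a hand-picked sample of Penn Treebank tags, and `tags_matching` is a hypothetical helper, not part of RITA. How the module turns the selection into spaCy patterns is not shown in this diff.

```python
import re

# A hand-picked sample of Penn Treebank tags (assumption: English tag set).
TAGS = ["NN", "NNS", "NNP", "JJ", "JJR", "VB", "VBD", "RB", "PRP"]


def tags_matching(pattern, tags):
    """Return the tags selected by the regex; ^ anchors at the start of the tag."""
    compiled = re.compile(pattern)
    return [tag for tag in tags if compiled.search(tag)]


selected = tags_matching("^NN|^JJ", TAGS)
print(selected)  # noun and adjective tags only
```

Anchoring both alternatives with `^` matters: an unanchored `NN|JJ` would also select any tag that merely contains those letters.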
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -7,6 +7,8 @@ nav:
   - Quickstart: quickstart.md
   - Syntax: syntax.md
   - Macros: macros.md
+  - Engines: engines.md
+  - Modules: modules.md
   - Extending: extend.md
   - Config: config.md
   - Advanced: advanced.md

diff --git a/pyproject.toml b/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "rita-dsl"
-version = "0.5.10"
+version = "0.6.0"
 description = "DSL for building language rules"
 authors = [
     "Šarūnas Navickas <zaibacu@gmail.com>"

diff --git a/rita/__init__.py b/rita/__init__.py
@@ -10,7 +10,7 @@
 
 logger = logging.getLogger(__name__)
 
-__version__ = (0, 5, 10, os.getenv("VERSION_PATCH"))
+__version__ = (0, 6, 0, os.getenv("VERSION_PATCH"))
 
 
 def get_version():

diff --git a/rita/config.py b/rita/config.py
@@ -8,6 +8,7 @@
     pass
 
 from rita.engine.translate_standalone import compile_rules as standalone_engine
+from rita.engine.translate_rust import compile_rules as rust_engine
 
 from rita.utils import SingletonMixin
 
@@ -27,6 +28,7 @@ def __init__(self):
             # spacy_engine is not imported
             pass
         self.register_engine(2, "standalone", standalone_engine)
+        self.register_engine(3, "rust", rust_engine)
 
     def register_engine(self, priority, key, compile_fn):
         self.available_engines.append((priority, key, compile_fn))
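
Review note on the engine registration: engines are stored as `(priority, key, compile_fn)` tuples. The sketch below is a minimal stand-in for that bookkeeping, assuming that the lowest priority number becomes the default engine (this matches the test where spaCy, when present, is the default, but the selection code is not part of this diff). `EngineRegistry`, `get_engine`, and the two dummy compile functions are hypothetical names used only for illustration.

```python
class EngineRegistry:
    """Hypothetical stand-in for the config object touched in this diff."""

    def __init__(self):
        self.available_engines = []

    def register_engine(self, priority, key, compile_fn):
        # Same bookkeeping as in the diff: store (priority, key, compile_fn).
        self.available_engines.append((priority, key, compile_fn))

    @property
    def default_engine(self):
        # Assumption: the engine with the lowest priority number is the default.
        _, _, compile_fn = min(self.available_engines, key=lambda e: e[0])
        return compile_fn

    def get_engine(self, key):
        # Hypothetical helper: look an engine up by its registered name.
        for _, name, compile_fn in self.available_engines:
            if name == key:
                return compile_fn
        raise KeyError("engine not registered: {}".format(key))


def standalone_engine(rules, config, **kwargs):
    return "standalone"  # dummy compile function


def rust_engine(rules, config, **kwargs):
    return "rust"  # dummy compile function


registry = EngineRegistry()
registry.register_engine(2, "standalone", standalone_engine)
registry.register_engine(3, "rust", rust_engine)
```

Under that assumption, registering `rust` with priority 3 keeps it opt-in: it is available by name but never chosen implicitly while `standalone` is registered.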

diff --git a/rita/engine/translate_rust.py b/rita/engine/translate_rust.py
@@ -0,0 +1,89 @@
+import os
+import logging
+
+from ctypes import (c_char_p, c_size_t, c_uint, Structure, cdll, POINTER)
+
+from rita.engine.translate_standalone import rules_to_patterns, RuleExecutor
+
+logger = logging.getLogger(__name__)
+
+
+class ResultEntity(Structure):
+    _fields_ = [
+        ("label", c_char_p),
+        ("text", c_char_p),
+        ("start", c_size_t),
+        ("end", c_size_t),
+    ]
+
+
+class ResultsWrapper(Structure):
+    _fields_ = [
+        ("count", c_uint),
+        ("results", (ResultEntity * 32))
+    ]
+
+
+class Context(Structure):
+    _fields_ = []
+
+
+def load_lib():
+    try:
+        if os.name == "nt":
+            lib = cdll.LoadLibrary("rita_rust.dll")
+        elif os.uname().sysname == "Darwin":  # os.name is "posix" on Linux too
+            lib = cdll.LoadLibrary("librita_rust.dylib")
+        else:
+            lib = cdll.LoadLibrary("librita_rust.so")
+        lib.compile.restype = POINTER(Context)
+        lib.execute.argtypes = [POINTER(Context), c_char_p]
+        lib.execute.restype = ResultsWrapper
+        lib.clean_env.argtypes = [POINTER(Context)]
+        return lib
+    except Exception as ex:
+        logger.error("Failed to load rita-rust library, reason: {}\n\n"
+                     "Most likely you don't have required shared library to use it".format(ex))
+
+
+class RustRuleExecutor(RuleExecutor):
+    def __init__(self, patterns, config):
+        self.config = config
+        self.context = None
+
+        self.lib = load_lib()
+        self.patterns = [self._build_regex_str(label, rules)
+                         for label, rules in patterns]
+
+        self.compile()
+
+    @staticmethod
+    def _build_regex_str(label, rules):
+        return r"(?P<{0}>{1})".format(label, "".join(rules))
+
+    def compile(self):
+        flag = 0 if self.config.ignore_case else 1
+        c_array = (c_char_p * len(self.patterns))(*[p.encode("UTF-8") for p in self.patterns])
+        self.context = self.lib.compile(c_array, len(c_array), flag)
+        return self.context
+
+    def _results(self, text):
+        raw = self.lib.execute(self.context, text.encode("UTF-8"))
+        for i in range(0, raw.count):
+            match = raw.results[i]
+            yield {
+                "start": match.start,
+                "end": match.end,
+                "text": match.text.decode("UTF-8").strip(),
+                "label": match.label.decode("UTF-8"),
+            }
+
+    def clean_context(self):
+        self.lib.clean_env(self.context)
+
+
+def compile_rules(rules, config, **kwargs):
+    logger.info("Using rita-rust rule implementation")
+    patterns = [rules_to_patterns(*group) for group in rules]
+    executor = RustRuleExecutor(patterns, config)
+    return executor
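
Review note on `_results`: the return value of `execute` is a struct holding a `count` and a fixed array of 32 entries, of which only the first `count` are meaningful. The sketch below reproduces the two structures from this file and walks them the same way, without needing the shared library. `fake_execute` and `iter_results` are hypothetical helpers written for this note, and the clamp on `count` is an added precaution, not something the diff does.

```python
from ctypes import Structure, c_char_p, c_size_t, c_uint

MAX_RESULTS = 32  # fixed array size used by ResultsWrapper in the diff


class ResultEntity(Structure):
    _fields_ = [
        ("label", c_char_p),
        ("text", c_char_p),
        ("start", c_size_t),
        ("end", c_size_t),
    ]


class ResultsWrapper(Structure):
    _fields_ = [
        ("count", c_uint),
        ("results", ResultEntity * MAX_RESULTS),
    ]


def iter_results(raw):
    """Yield one dict per valid entry, mirroring RustRuleExecutor._results."""
    # Clamp so an oversized count cannot run past the fixed-size array.
    for i in range(min(raw.count, MAX_RESULTS)):
        match = raw.results[i]
        yield {
            "start": match.start,
            "end": match.end,
            "text": match.text.decode("UTF-8").strip(),
            "label": match.label.decode("UTF-8"),
        }


def fake_execute():
    """Hypothetical stand-in for lib.execute: fills the struct from Python."""
    raw = ResultsWrapper()
    raw.results[0] = ResultEntity(b"NUMBER", b"one ", 0, 3)
    raw.results[1] = ResultEntity(b"NUMBER", b"two", 8, 11)
    raw.count = 2
    return raw


for entity in iter_results(fake_execute()):
    print(entity)
```

The fixed-size array also means that a single `execute` call can report at most 32 matches through this struct layout.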
diff --git a/tests/test_config.py b/tests/test_config.py
@@ -20,7 +20,7 @@ def test_registered_engines(cfg):
 def test_registered_engines_has_spacy(cfg):
     pytest.importorskip("spacy", minversion="2.1")
     from rita.engine.translate_spacy import compile_rules
-    assert len(cfg.available_engines) == 2
+    assert len(cfg.available_engines) == 3
     assert cfg.default_engine == compile_rules