Merge pull request #87 from zaibacu/rita-rust-engine
Rita rust engine
zaibacu committed Aug 29, 2020
2 parents 91d7718 + 4b5387c commit d88d50e
Showing 22 changed files with 278 additions and 46 deletions.
3 changes: 3 additions & 0 deletions .coveragerc
@@ -3,5 +3,8 @@ branch = True
source =
    rita

omit = rita/engine/translate_rust.py

[report]
show_missing = True
omit = rita/engine/translate_rust.py
58 changes: 58 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,61 @@
0.6.0 (2020-08-29)
****************************

Features
--------

- Implemented the ability to alias macros, e.g.:

.. code-block::

numbers = {"one", "two", "three"}
@alias IN_LIST IL

IL(numbers) -> MARK("NUMBER")

Using "IL" now calls the "IN_LIST" macro.
#66
- Introduce the TAG element as a module. It needs a new parser for the spaCy translation.
It allows more flexible matching of detailed part-of-speech tags, such as all adjectives or nouns: TAG("^NN|^JJ").

Implemented by:
Roland M. Mueller (https://github.com/rolandmueller)
#81
- Add a new module for a PLURALIZE tag.
For a noun or a list of nouns, it matches both the singular and the plural form of each word.

Implemented by:
Roland M. Mueller (https://github.com/rolandmueller)
#82
- Add a new configuration option `implicit_hyphon` (default false) for automatically adding hyphen characters (`-`) to the rules.

Implemented by:
Roland M. Mueller (https://github.com/rolandmueller)
#84
- Allow passing a custom regex implementation. By default, `re` is used.
#86
- An interface for using the Rust engine.

In general, it is identical to `standalone` but differs in one crucial respect: all of the rules are compiled into actual binary code, which provides a large performance boost.
The engine is proprietary because it comes with various caveats: it is somewhat more fragile and needs to be tuned for a very specific use case
(e.g. a few long texts with many matches vs. many short texts with few matches).
#87

Fix
---

- Fix `-` bug when it is used as a standalone word
#71
- Fix regex matching when the shortest word is selected from IN_LIST
#72
- Fix IN_LIST regex so that it doesn't match part of a word
#75
- Fix IN_LIST operation bug - it was ignoring them
#77
- Use list branching only when using spaCy Engine
#80


0.5.0 (2020-06-18)
****************************

10 changes: 0 additions & 10 deletions changes/66.feature.rst

This file was deleted.

1 change: 0 additions & 1 deletion changes/71.fix.rst

This file was deleted.

1 change: 0 additions & 1 deletion changes/72.fix.rst

This file was deleted.

1 change: 0 additions & 1 deletion changes/75.fix.rst

This file was deleted.

1 change: 0 additions & 1 deletion changes/77.fix.rst

This file was deleted.

1 change: 0 additions & 1 deletion changes/80.fix.rst

This file was deleted.

5 changes: 0 additions & 5 deletions changes/81.feature.rst

This file was deleted.

5 changes: 0 additions & 5 deletions changes/82.feature.rst

This file was deleted.

4 changes: 0 additions & 4 deletions changes/84.feature.rst

This file was deleted.

1 change: 0 additions & 1 deletion changes/86.feature.rst

This file was deleted.

36 changes: 36 additions & 0 deletions docs/engines.md
@@ -0,0 +1,36 @@
# Engines

In RITA, an `engine` is the system that rules are compiled to, and which does the heavy lifting after that.

Currently there are three engines:

## spaCy

Activated by using `rita.compile(<rules_file>, use_engine="spacy")`.

With this engine, all RITA rules are compiled into spaCy patterns, which spaCy can use natively in various scenarios.
Most often, they are used to improve NER (Named Entity Recognition) by adding entities derived from your rules.

It requires the spaCy package to be installed (`pip install spacy`), and to actually use it, a language model needs to be downloaded (`python -m spacy download <language_code>`).

## Standalone

Activated by using `rita.compile(<rules_file>, use_engine="standalone")`. It compiles into pure regex and can be used with zero dependencies.
By default, it uses the Python `re` library. Since version `0.5.10`, you can pass a custom regex implementation,
e.g. the `regex` package: `rita.compile(<rules_file>, use_engine="standalone", regex_impl=regex)`

It is very lightweight and very fast (compared to spaCy), but it lacks some functionality that only a proper language model can provide:

- Patterns by entity (PERSON, ORGANIZATION, etc.)
- Patterns by lemma
- Patterns by POS (Part Of Speech)

Only generic things, like WORD and NUMBER, can be matched.
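
To illustrate what "compiles into pure regex" can look like, here is a minimal sketch using only Python's `re`. The pattern list and the helper names `compile_patterns` and `find_entities` are assumptions made for illustration; they are not RITA's actual output or API.

```python
import re

# Hypothetical (label, regex body) pairs standing in for compiled rules.
PATTERNS = [
    ("NUMBER", r"\d+"),
    ("GREETING", r"hello|hi"),
]


def compile_patterns(patterns, ignore_case=True):
    """Wrap each pattern in a named group and join them into one regex."""
    groups = ["(?P<{0}>{1})".format(label, body) for label, body in patterns]
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile("|".join(groups), flags)


def find_entities(regex, text):
    """Yield one dict per match, labelled by the named group that matched."""
    for match in regex.finditer(text):
        yield {
            "start": match.start(),
            "end": match.end(),
            "text": match.group(),
            "label": match.lastgroup,
        }


regex = compile_patterns(PATTERNS)
results = list(find_entities(regex, "Hello, I have 3 cats"))
```

Because everything here is plain regex, swapping in another implementation with a compatible interface should only require changing the module used for `compile`.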


## Rust (new in `0.6.0`)

Only the interface is included in the code; the engine itself is proprietary.

In general, it is identical to `standalone` but differs in one crucial respect: all of the rules are compiled into actual binary code, which provides a large performance boost.
The engine is proprietary because it comes with various caveats: it is somewhat more fragile and needs to be tuned for a very specific use case
(e.g. a few long texts with many matches vs. many short texts with few matches).
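
The Python side talks to the native library through `ctypes`. Since the library itself is not available, the following is only a sketch of how a fixed-size result structure can be declared and unpacked. The structure names and the capacity of 32 follow the interface added in this commit; the `unpack` helper and the hand-filled data are assumptions for illustration.

```python
from ctypes import Structure, c_char_p, c_size_t, c_uint

MAX_RESULTS = 32  # fixed capacity of the result buffer


class ResultEntity(Structure):
    _fields_ = [
        ("label", c_char_p),
        ("text", c_char_p),
        ("start", c_size_t),
        ("end", c_size_t),
    ]


class ResultsWrapper(Structure):
    _fields_ = [
        ("count", c_uint),
        ("results", ResultEntity * MAX_RESULTS),
    ]


def unpack(raw):
    """Convert the filled part of the C structure into plain dicts."""
    return [
        {
            "label": raw.results[i].label.decode("UTF-8"),
            "text": raw.results[i].text.decode("UTF-8"),
            "start": raw.results[i].start,
            "end": raw.results[i].end,
        }
        for i in range(raw.count)
    ]


# Stand-in for what a native call would return (filled by hand here).
raw = ResultsWrapper()
raw.results[0].label = b"NUMBER"
raw.results[0].text = b"42"
raw.results[0].start = 4
raw.results[0].end = 6
raw.count = 1
entities = unpack(raw)
```

A fixed-size array keeps the structure simple to pass by value, at the cost of capping the number of matches that can be returned in one call.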
56 changes: 56 additions & 0 deletions docs/modules.md
@@ -0,0 +1,56 @@
# Modules

Modules are like plugins to the system, usually providing additional functionality at some cost: they may need additional dependencies, support only a specific language, etc.
That's why they are not included in the core system, but they can easily be included in your rules.

e.g.
```
!IMPORT("rita.modules.fuzzy")
FUZZY("squirrel") -> MARK("CRITTER")
```

**NOTE**: the import path can be any valid Python import path, so you can add extra functionality without modifying RITA's source code.
More on that in the [Extending section](./extend.md).
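
As a rough sketch of why any importable path works, a dotted path can be resolved at runtime with the standard `importlib`. How RITA resolves `!IMPORT` internally is not shown here, so treat `load_module` as a hypothetical helper.

```python
import importlib


def load_module(import_path):
    """Import a module from its dotted path, e.g. "rita.modules.fuzzy"."""
    return importlib.import_module(import_path)


# Any importable dotted path works; a standard-library module is used here
# so that the sketch runs without extra dependencies.
module = load_module("json")
encoded = module.dumps({"label": "CRITTER"})
```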

## Fuzzy

This is more of an example than a proper module. The main goal is to generate possible misspelled variants of a given word, so that a rule matches more cases.
Very useful when dealing with actual natural language, e.g. comments or social media posts. The word `you` can be matched automatically as both `you` and `u`, `for` as `for` and `4`, etc.

Usage:
```
!IMPORT("rita.modules.fuzzy")
FUZZY("squirrel") -> MARK("CRITTER")
```
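
The idea of matching a word together with its informal variants can be sketched with a plain regex alternation. The `VARIANTS` table and the `fuzzy_pattern` helper below are hypothetical and only illustrate the concept; they are not the module's real data or implementation.

```python
import re

# Hypothetical substitution table, not the module's real data.
VARIANTS = {
    "you": ["you", "u"],
    "for": ["for", "4"],
}


def fuzzy_pattern(word):
    """Build a regex alternation that matches the word or its variants."""
    options = VARIANTS.get(word, [word])
    return r"\b(?:{})\b".format("|".join(re.escape(o) for o in options))


pattern = re.compile(fuzzy_pattern("you"), re.IGNORECASE)
matches = pattern.findall("thank u, and you too")
```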

## Pluralize

Takes a list of words (or a single word) and creates the plural version of each.

Requires the `inflect` library (`pip install inflect`). Works only on English words.

Usage:

```
!IMPORT("rita.modules.pluralize")
vehicles={"car", "motorbike", "bicycle", "ship", "plane"}
{NUM, PLURALIZE(vehicles)}->MARK("VEHICLES")
```
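
To show the effect without installing anything, the sketch below builds a pattern that matches each noun in singular or plural form. It uses a deliberately naive pluralisation rule as a stand-in for `inflect`; both `naive_plural` and `pluralize_pattern` are hypothetical helpers, not part of the module.

```python
import re


def naive_plural(noun):
    """Very rough English pluralisation; a stand-in for `inflect`."""
    if noun.endswith(("s", "x", "ch", "sh")):
        return noun + "es"
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


def pluralize_pattern(nouns):
    """Regex alternation matching each noun in singular or plural form."""
    forms = [form for noun in nouns for form in (noun, naive_plural(noun))]
    return r"\b(?:{})\b".format("|".join(re.escape(f) for f in forms))


vehicles = ["car", "motorbike", "bicycle", "ship", "plane"]
pattern = re.compile(pluralize_pattern(vehicles))
matches = pattern.findall("2 cars, 1 ship and 3 bicycles")
```

Irregular plurals are exactly where such a naive rule breaks down, which is why a dedicated library is the better choice in practice.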

## Tag

Used for generating POS/TAG patterns based on a regex,
e.g. `TAG("^NN|^JJ")` for nouns or adjectives.

Works only with the spaCy engine.

Usage:

```
!IMPORT("rita.modules.tag")
{WORD*, TAG("^NN|^JJ")}->MARK("TAGGED_MATCH")
```
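
To see which tags a regex such as `"^NN|^JJ"` selects, it can be tried against a handful of Penn Treebank tags with plain `re`. The tag list is a small sample and `matching_tags` is a hypothetical helper; how the module turns the regex into a spaCy pattern is not shown here.

```python
import re

# A small sample of Penn Treebank part-of-speech tags.
TAGS = ["NN", "NNS", "NNP", "JJ", "JJR", "VB", "VBD", "RB"]


def matching_tags(tag_regex, tags=TAGS):
    """Return the tags that the regex matches."""
    compiled = re.compile(tag_regex)
    return [tag for tag in tags if compiled.search(tag)]


selected = matching_tags("^NN|^JJ")
```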
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -7,6 +7,8 @@ nav:
- Quickstart: quickstart.md
- Syntax: syntax.md
- Macros: macros.md
- Engines: engines.md
- Modules: modules.md
- Extending: extend.md
- Config: config.md
- Advanced: advanced.md
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "rita-dsl"
version = "0.5.10"
version = "0.6.0"
description = "DSL for building language rules"
authors = [
"Šarūnas Navickas <zaibacu@gmail.com>"
2 changes: 1 addition & 1 deletion rita/__init__.py
@@ -10,7 +10,7 @@

logger = logging.getLogger(__name__)

__version__ = (0, 5, 10, os.getenv("VERSION_PATCH"))
__version__ = (0, 6, 0, os.getenv("VERSION_PATCH"))


def get_version():
2 changes: 2 additions & 0 deletions rita/config.py
@@ -8,6 +8,7 @@
    pass

from rita.engine.translate_standalone import compile_rules as standalone_engine
from rita.engine.translate_rust import compile_rules as rust_engine

from rita.utils import SingletonMixin

@@ -27,6 +28,7 @@ def __init__(self):
            # spacy_engine is not imported
            pass
        self.register_engine(2, "standalone", standalone_engine)
        self.register_engine(3, "rust", rust_engine)

    def register_engine(self, priority, key, compile_fn):
        self.available_engines.append((priority, key, compile_fn))
89 changes: 89 additions & 0 deletions rita/engine/translate_rust.py
@@ -0,0 +1,89 @@
import os
import sys
import logging

from ctypes import (c_char_p, c_size_t, c_uint, Structure, cdll, POINTER)

from rita.engine.translate_standalone import rules_to_patterns, RuleExecutor

logger = logging.getLogger(__name__)


class ResultEntity(Structure):
    _fields_ = [
        ("label", c_char_p),
        ("text", c_char_p),
        ("start", c_size_t),
        ("end", c_size_t),
    ]


class ResultsWrapper(Structure):
    _fields_ = [
        ("count", c_uint),
        ("results", (ResultEntity * 32))  # fixed-size buffer: at most 32 matches per call
    ]


class Context(Structure):
    _fields_ = []  # opaque handle owned by the native library


def load_lib():
    try:
        if "nt" in os.name:
            lib = cdll.LoadLibrary("rita_rust.dll")
        elif sys.platform == "darwin":  # os.name is "posix" on both macOS and Linux
            lib = cdll.LoadLibrary("librita_rust.dylib")
        else:  # Linux and other POSIX systems
            lib = cdll.LoadLibrary("librita_rust.so")
        lib.compile.restype = POINTER(Context)
        lib.execute.argtypes = [POINTER(Context), c_char_p]
        lib.execute.restype = ResultsWrapper
        lib.clean_env.argtypes = [POINTER(Context)]
        return lib
    except Exception as ex:  # logs and implicitly returns None; callers fail on first use
        logger.error("Failed to load rita-rust library, reason: {}\n\n"
                     "Most likely you don't have the required shared library to use it".format(ex))


class RustRuleExecutor(RuleExecutor):
    def __init__(self, patterns, config):
        self.config = config
        self.context = None

        self.lib = load_lib()
        self.patterns = [self._build_regex_str(label, rules)
                         for label, rules in patterns]
        self.compile()

    @staticmethod
    def _build_regex_str(label, rules):
        return r"(?P<{0}>{1})".format(label, "".join(rules))

    def compile(self):
        flag = 0 if self.config.ignore_case else 1
        c_array = (c_char_p * len(self.patterns))(*[p.encode("UTF-8") for p in self.patterns])
        self.context = self.lib.compile(c_array, len(c_array), flag)
        return self.context

    def _results(self, text):
        raw = self.lib.execute(self.context, text.encode("UTF-8"))
        for i in range(0, raw.count):
            match = raw.results[i]
            yield {
                "start": match.start,
                "end": match.end,
                "text": match.text.decode("UTF-8").strip(),
                "label": match.label.decode("UTF-8"),
            }

    def clean_context(self):
        self.lib.clean_env(self.context)


def compile_rules(rules, config, **kwargs):
    logger.info("Using rita-rust rule implementation")
    patterns = [rules_to_patterns(*group) for group in rules]
    executor = RustRuleExecutor(patterns, config)
    return executor
2 changes: 1 addition & 1 deletion tests/test_config.py
@@ -20,7 +20,7 @@ def test_registered_engines(cfg):
def test_registered_engines_has_spacy(cfg):
    pytest.importorskip("spacy", minversion="2.1")
    from rita.engine.translate_spacy import compile_rules
    assert len(cfg.available_engines) == 2
    assert len(cfg.available_engines) == 3
    assert cfg.default_engine == compile_rules

