Commit
Merge pull request #87 from zaibacu/rita-rust-engine
Rita rust engine
Showing 22 changed files with 278 additions and 46 deletions.
@@ -0,0 +1,36 @@
# Engines

In RITA, an `engine` is the system that rules are compiled into and that does the heavy lifting afterwards.

Currently there are three engines:

## spaCy

Activated by using `rita.compile(<rules_file>, use_engine="spacy")`.

With this engine, all RITA rules are compiled into spaCy patterns, which spaCy can use natively in various scenarios.
Most often they are used to improve NER (Named Entity Recognition) by adding entities derived from your rules.

It requires the spaCy package to be installed (`pip install spacy`), and to actually use it, a language model needs to be downloaded (`python -m spacy download <language_code>`).

## Standalone

Activated by using `rita.compile(<rules_file>, use_engine="standalone")`. It compiles rules into pure regex and can be used with zero dependencies.
By default, it uses the Python `re` library. Since version `0.5.10`, you can supply a custom regex implementation,
e.g. the `regex` package: `rita.compile(<rules_file>, use_engine="standalone", regex_impl=regex)`

It is very lightweight and very fast (compared to spaCy), but it lacks some functionality that only a proper language model can provide:
- Patterns by entity (PERSON, ORGANIZATION, etc.)
- Patterns by lemmas
- Patterns by POS (Part Of Speech)

Only generic things, like WORD and NUMBER, can be matched.

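To illustrate what "compiles into pure regex" can look like, here is a minimal sketch using only the standard library. It mirrors the named-group construction used by the Rust interface in this commit; the `execute` helper, its `regex_impl` and `ignore_case` parameters, and the sample patterns are assumptions made for this example and are not RITA's actual implementation.

```python
import re


def build_regex_str(label, rules):
    # Wrap the joined rule fragments in a named group, so the group
    # name tells us which rule produced a match.
    return r"(?P<{0}>{1})".format(label, "".join(rules))


def execute(patterns, text, regex_impl=re, ignore_case=True):
    # `regex_impl` is any module exposing the same API as `re`
    # (hypothetical parameter for this sketch).
    flags = regex_impl.IGNORECASE if ignore_case else 0
    compiled = regex_impl.compile("|".join(patterns), flags)
    for match in compiled.finditer(text):
        yield {
            "start": match.start(),
            "end": match.end(),
            "text": match.group(),
            "label": match.lastgroup,
        }


# Hand-written sample patterns, not generated by RITA
patterns = [
    build_regex_str("COLOR", [r"\b(?:red|green|blue)\b"]),
    build_regex_str("NUMBER", [r"\b\d+\b"]),
]
results = list(execute(patterns, "She owns 2 red cars"))
```

Because each rule sits in its own named group, a single pass over the text is enough to report both the matched span and the label that produced it.
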
## Rust (new in `0.6.0`)

Only the interface is included in the code; the engine itself is proprietary.

In general it is identical to `standalone`, but differs in one crucial part: all of the rules are compiled into actual binary code, which provides a large performance boost.
It is proprietary because there are various caveats: the engine itself is a bit more fragile and needs to be tuned for a very specific case
(e.g. a few long texts with many matches vs. many short texts with few matches).
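
Since the shared library is not distributed with the code, it may be useful to check whether it can be loaded before choosing an engine. The sketch below is one possible way to do that with `ctypes`. The helper `pick_engine` is hypothetical, the library file names are taken from the interface added in this commit, and the engine name `"rust"` is assumed by analogy with `"spacy"` and `"standalone"`.

```python
from ctypes import cdll

# File names the interface in this commit tries to load
CANDIDATES = ("librita_rust.so", "librita_rust.dylib", "rita_rust.dll")


def pick_engine(candidates=CANDIDATES, loader=cdll.LoadLibrary):
    # Try each candidate; the first one that loads means the native
    # engine is available. `loader` is injectable to ease testing.
    for name in candidates:
        try:
            loader(name)
        except OSError:
            continue
        return "rust"  # assumed engine name
    return "standalone"


engine = pick_engine()
```

The returned string could then be passed as `use_engine` to `rita.compile`, falling back to the dependency-free engine when the library is missing.
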
@@ -0,0 +1,56 @@
# Modules

Modules are like plugins to the system. They usually provide additional functionality at some cost: they need additional dependencies, support only a specific language, etc.
That is why they are not included in the core system, but they can be easily imported into your rules.

e.g.
```
!IMPORT("rita.modules.fuzzy")
FUZZY("squirrel") -> MARK("CRITTER")
```

**NOTE**: the import path can be any valid Python import path, so this allows you to add extra functionality without modifying RITA's source code.
More on that in the [Extending section](./extend.md).

## Fuzzy

This is more of an example than a proper module. The main goal is to generate possible misspelled variants of a given word, so that the rule matches more cases.
It is very useful when dealing with actual natural language, e.g. comments and social media posts. The word `you` can be automatically matched as both `you` and `u`, `for` as both `for` and `4`, etc.

Usage:
```
!IMPORT("rita.modules.fuzzy")
FUZZY("squirrel") -> MARK("CRITTER")
```

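The idea of expanding a word into misspelled variants can be sketched as follows. This is not the module's actual algorithm: the substitution table, the "collapse a doubled letter" heuristic, and the helper names `variants` and `to_pattern` are all assumptions chosen for illustration.

```python
import re

# Hypothetical shorthand table, based on the examples in the text above
SHORTHAND = {"you": ["u"], "for": ["4"]}


def variants(word):
    found = {word}
    found.update(SHORTHAND.get(word, []))
    # Collapse doubled letters, e.g. "squirrel" -> "squirel"
    for i in range(len(word) - 1):
        if word[i] == word[i + 1]:
            found.add(word[:i] + word[i + 1:])
    # Longest first, so the regex prefers the fullest spelling
    return sorted(found, key=lambda v: (-len(v), v))


def to_pattern(word):
    # Build a single alternation that matches any variant as a whole word
    options = "|".join(re.escape(v) for v in variants(word))
    return r"\b(?:{0})\b".format(options)


pattern = to_pattern("squirrel")
hit = re.search(pattern, "I saw a squirel today")
```

A rule built this way matches both the correct spelling and the generated variants without the rule author having to list them by hand.
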
## Pluralize

Takes a list of words (or a single word) and creates the plural version of each.

Requires the `inflect` library (`pip install inflect`) to be installed before use. Works only with English words.

Usage:

```
!IMPORT("rita.modules.pluralize")
vehicles={"car", "motorbike", "bicycle", "ship", "plane"}
{NUM, PLURALIZE(vehicles)}->MARK("VEHICLES")
```

## Tag

Used for generating POS/TAG patterns based on a regex,
e.g. `TAG("^NN|^JJ")` for nouns or adjectives.

Works only with the spaCy engine.

Usage:

```
!IMPORT("rita.modules.tag")
{WORD*, TAG("^NN|^JJ")}->MARK("TAGGED_MATCH")
```
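
To show what the regex `^NN|^JJ` selects, here is a small standard-library sketch that filters hand-written (token, tag) pairs. The sample data and the `filter_by_tag` helper are hypothetical, and applying the regex with `search` is an assumption about how such a pattern is evaluated.

```python
import re

# Hand-written (token, fine-grained tag) pairs, as a POS tagger might emit
tagged = [
    ("The", "DT"),
    ("quick", "JJ"),
    ("foxes", "NNS"),
    ("jump", "VBP"),
    ("high", "RB"),
]


def filter_by_tag(tokens, tag_regex):
    # Keep tokens whose tag matches the regex anywhere (the ^ anchors
    # in the pattern pin it to the start of the tag)
    pattern = re.compile(tag_regex)
    return [token for token, tag in tokens if pattern.search(tag)]


matches = filter_by_tag(tagged, "^NN|^JJ")
```

Anchoring with `^` means tags such as `NNS` or `JJR` are covered by the same short pattern, since they share the prefix.
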
@@ -0,0 +1,89 @@
import os
import logging

from ctypes import (c_char_p, c_size_t, c_uint, Structure, cdll, POINTER)

from rita.engine.translate_standalone import rules_to_patterns, RuleExecutor

logger = logging.getLogger(__name__)


class ResultEntity(Structure):
    # A single match as returned by the native library
    _fields_ = [
        ("label", c_char_p),
        ("text", c_char_p),
        ("start", c_size_t),
        ("end", c_size_t),
    ]


class ResultsWrapper(Structure):
    # Fixed-size buffer: the first `count` entries of `results` are valid (32 slots)
    _fields_ = [
        ("count", c_uint),
        ("results", (ResultEntity * 32))
    ]


class Context(Structure):
    # Opaque handle to the compiled rules held by the native library
    _fields_ = []


def load_lib():
    try:
        if os.name == "nt":
            lib = cdll.LoadLibrary("rita_rust.dll")
        elif os.name == "posix" and os.uname().sysname == "Darwin":
            # macOS only: os.name is "posix" on Linux as well, so checking
            # os.name alone would never reach the .so branch below
            lib = cdll.LoadLibrary("librita_rust.dylib")
        else:
            lib = cdll.LoadLibrary("librita_rust.so")
        lib.compile.restype = POINTER(Context)
        lib.execute.argtypes = [POINTER(Context), c_char_p]
        lib.execute.restype = ResultsWrapper
        lib.clean_env.argtypes = [POINTER(Context)]
        return lib
    except Exception as ex:
        logger.error("Failed to load rita-rust library, reason: {}\n\n"
                     "Most likely you don't have required shared library to use it".format(ex))


class RustRuleExecutor(RuleExecutor):
    def __init__(self, patterns, config):
        self.config = config
        self.context = None

        self.lib = load_lib()
        self.patterns = [self._build_regex_str(label, rules)
                         for label, rules in patterns]

        self.compile()

    @staticmethod
    def _build_regex_str(label, rules):
        # Wrap the joined rule fragments in a named group carrying the label
        return r"(?P<{0}>{1})".format(label, "".join(rules))

    def compile(self):
        flag = 0 if self.config.ignore_case else 1
        # Pass the patterns to the native library as an array of C strings
        c_array = (c_char_p * len(self.patterns))(*[p.encode("UTF-8") for p in self.patterns])
        self.context = self.lib.compile(c_array, len(c_array), flag)
        return self.context

    def _results(self, text):
        raw = self.lib.execute(self.context, text.encode("UTF-8"))
        for i in range(raw.count):
            match = raw.results[i]
            yield {
                "start": match.start,
                "end": match.end,
                "text": match.text.decode("UTF-8").strip(),
                "label": match.label.decode("UTF-8"),
            }

    def clean_context(self):
        # Hand the context back to the native library for cleanup
        self.lib.clean_env(self.context)


def compile_rules(rules, config, **kwargs):
    logger.info("Using rita-rust rule implementation")
    patterns = [rules_to_patterns(*group) for group in rules]
    executor = RustRuleExecutor(patterns, config)
    return executor
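
The way `_results` unpacks the fixed-size result buffer can be tried without the native library. The sketch below redefines the two structures, fills one slot by hand as a stand-in for what `execute` would return, and decodes it. The `decode` helper and the sample values are assumptions for this example.

```python
from ctypes import Structure, c_char_p, c_size_t, c_uint


class ResultEntity(Structure):
    _fields_ = [
        ("label", c_char_p),
        ("text", c_char_p),
        ("start", c_size_t),
        ("end", c_size_t),
    ]


class ResultsWrapper(Structure):
    # Same layout as in the interface: a count plus 32 result slots
    _fields_ = [("count", c_uint), ("results", ResultEntity * 32)]


def decode(raw):
    # Only the first `count` slots hold valid data
    return [
        {
            "start": raw.results[i].start,
            "end": raw.results[i].end,
            "text": raw.results[i].text.decode("UTF-8").strip(),
            "label": raw.results[i].label.decode("UTF-8"),
        }
        for i in range(raw.count)
    ]


# Stand-in for a value returned by the native `execute` call
raw = ResultsWrapper()
entry = raw.results[0]  # shares memory with `raw`
entry.label = b"COLOR"
entry.text = b"red "
entry.start = 9
entry.end = 12
raw.count = 1

decoded = decode(raw)
```

With this layout, a single call can carry at most 32 matches, which is worth keeping in mind for long texts.
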