Skip to content

Optum/nertag

Repository files navigation

Contributors Forks Stargazers Issues Apache 2.0 License


NERTag

An automated named-entity recognition tagger

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. Contact

About The Project

NERTag is a nimble python package that automatically tags or labels named entities (IOB2 format) using a taxonomy or dictionary of terms. It aims to be simple, modular, and fast, and utilizes multiprocessing for large-scale data-labeling. Labeling a document of 1.8 billion characters with a taxonomy of 15k terms and 300 entities takes roughly ~60 minutes, with the main bottleneck stemming from disk I/O.

(back to top)

Getting Started

To use NERTag, it is recommended to create a conda virtual environment with Python 3.9+.

user $ conda create --name nertag python=3.9
user $ conda activate nertag
user (nertag) $ git clone https://github.com/Optum/nertag
user (nertag) $ pip install .

(back to top)

Usage

NERTag is made up of 3 distinct components:

  1. Text preprocessing: defines text preprocessing to use
  2. Base labeling behavior: defines default text labeling behavior
  3. Tagging algorithm: defines how to label entities

In this example, the tagging algorithm utilizes a taxonomy of terms.


A toy example is provided below.

Sample code
import pandas as pd
from nertag import ner, preprocess, utils, tagging

# --- Taxonomy
path = "taxonomy.csv"
df = utils.preprocess_df(
    pd.read_csv(path), stemmer=utils.lemmatizer, filters=utils.stop_words, tokenizer=utils.tokenizer
)
dct = utils.setup_dict(df)

# --- Data
texts = [
    "my account is closed",
    "my account is locked what account can i use to unlock it",
    "how do i check my fsa",
    "Also known as second molars. Back teeth that come in at around the age of 12.",
    "Hello. Need to update my account info, where can I make updates to my account address information?",
    "Need to update address information because I want to make account updates.",
]

# --- Pipeline
# NOTE: By default, punctuation is removed, words are stemmed, whitespaces are replaced with a single space, and stopwords are removed.
preprocessor = ner.Preprocessor(
    preprocess.preprocess,
    stemmer=utils.lemmatizer,
    stop_words=utils.stop_words,
    start=4,
    stop=0,
    step=-1,
)
baselabeler = ner.BaseLabeler(utils.base_label, utils.tokenizer)
tagger = ner.Tagger(tagging.ner_tagging, dct)
pipeline = ner.NER(preprocessor, baselabeler, tagger)

results = pipeline.sequential_labeling(texts)

print(results)
Sample output
my account is closed
            entity     word  index  start  end
0                O       my      0      0    2
1  B-cancellations  account      1      3   10
2  I-cancellations       is      2     11   13
3  I-cancellations   closed      3     14   20

my account is locked what account can i use to unlock it
              entity     word  index  start  end
0                  O       my      0      0    2
1   B-account locked  account      1      3   10
2   I-account locked       is      2     11   13
3   I-account locked   locked      3     14   20
4   I-account locked     what      4     21   25
5   I-account locked  account      5     26   33
6                  O      can      6     34   37
7                  O        i      7     38   39
8         B-common1k      use      8     40   43
9                  O       to      9     44   46
10                 O   unlock     10     47   53
11                 O       it     11     54   56

how do i check my fsa
                            entity   word  index  start  end
0                                O    how      0      0    3
1                                O     do      1      4    6
2                                O      i      2      7    8
3                       B-common1k  check      3      9   14
4                                O     my      4     15   17
5  B-FSA:Flexible Spending Account    fsa      5     18   21

Also known as second molars. Back teeth that come in at around the age of 12.
              entity     word  index  start  end
0         B-common1k     Also      0      0    4
1                  O    known      1      5   10
2                  O       as      2     11   13
3         B-common1k   second      3     14   20
4           B-dental  molars.      4     21   28
5         B-common1k     Back      5     29   33
6   B-35:Dental Care    teeth      6     34   39
7                  O     that      7     40   44
8         B-common1k     come      8     45   49
9                  O       in      9     50   52
10                 O       at     10     53   55
11                 O   around     11     56   62
12                 O      the     12     63   66
13                 O      age     13     67   70
14                 O       of     14     71   73
15                 O      12.     15     74   77

Hello. Need to update my account info, where can I make updates to my account address information?
              entity          word  index  start  end
0                  O        Hello.      0      0    6
1   B-account update          Need      1      7   11
2   I-account update            to      2     12   14
3   I-account update        update      3     15   21
4                  O            my      4     22   24
5                  O       account      5     25   32
6                  O         info,      6     33   38
7                  O         where      7     39   44
8                  O           can      8     45   48
9                  O             I      9     49   50
10  B-account update          make     10     51   55
11  I-account update       updates     11     56   63
12                 O            to     12     64   66
13                 O            my     13     67   69
14                 O       account     14     70   77
15         B-address       address     15     78   85
16         I-address  information?     16     86   98

Need to update address information because I want to make account updates.
              entity         word  index  start  end
0         B-common1k         Need      0      0    4
1                  O           to      1      5    7
2   B-account update       update      2      8   14
3   I-account update      address      3     15   22
4   I-account update  information      4     23   34
5                  O      because      5     35   42
6                  O            I      6     43   44
7         B-common1k         want      7     45   49
8                  O           to      8     50   52
9         B-common1k         make      9     53   57
10  B-account update      account     10     58   65
11  I-account update     updates.     11     66   74

For more examples, please refer to notebooks and examples

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

Contact

(back to top)

Releases

No releases published

Packages

No packages published

Languages