# Annotators


## Overview

Annotators are components (connectors/services) that annotate a given user's utterance.

An example of an annotator is NER, which may return a dictionary with `tokens` and `tags` keys:

{"tokens": ["Paris"], "tags": ["I-LOC"]}

Another example is the Sentiment Classification annotator, which can return a list of labels, e.g.:

["neutral", "speech"]

## Available English Annotators

| Name | Requirements | Description |
| --- | --- | --- |
| ASR | 40 MB RAM | calculates the overall ASR confidence for a given utterance and grades it as either very low, low, medium, or high (for Amazon markup) |
| Badlisted Words | 150 MB RAM | detects words and phrases from the badlist |
| Combined Classification | 1.5 GB RAM, 3.5 GB GPU | BERT-based model covering topic, dialog act, sentiment, toxicity, emotion, and factoid classification |
| COMeT Atomic | 2 GB RAM, 1.1 GB GPU | COMeT commonsense prediction model for Atomic |
| COMeT ConceptNet | 2 GB RAM, 1.1 GB GPU | COMeT commonsense prediction model for ConceptNet |
| Convers Evaluator Annotator | 1 GB RAM, 4.5 GB GPU | trained on Alexa Prize data from previous competitions; predicts whether the candidate response is interesting, comprehensible, on-topic, engaging, or erroneous |
| Emotion Classification | 2.5 GB RAM | emotion classification annotator |
| Entity Detection | 1.5 GB RAM, 3.2 GB GPU | extracts entities and their types from utterances |
| Entity Linking | 2.5 GB RAM, 1.3 GB GPU | finds Wikidata entity ids for the entities detected with Entity Detection |
| Entity Storer | 220 MB RAM | rule-based component that stores entities from the user's and socialbot's utterances, along with the detected attitude, in the dialogue state whenever an opinion expression is detected by patterns or by the MIDAS Classifier |
| Fact Random | 50 MB RAM | returns random facts for a given entity (for entities from the user utterance) |
| Fact Retrieval | 7.4 GB RAM, 1.2 GB GPU | extracts facts from Wikipedia and wikiHow |
| Intent Catcher | 1.7 GB RAM, 2.4 GB GPU | classifies user utterances into a number of predefined intents, trained on a set of phrases and regexps |
| KBQA | 2 GB RAM, 1.4 GB GPU | answers the user's factoid questions based on the Wikidata KB |
| MIDAS Classification | 1.1 GB RAM, 4.5 GB GPU | BERT-based model trained on a semantic classes subset of the MIDAS dataset |
| MIDAS Predictor | 30 MB RAM | BERT-based model trained on a semantic classes subset of the MIDAS dataset |
| NER | 2.2 GB RAM, 5 GB GPU | extracts person names and names of locations and organizations from uncased text |
| News API Annotator | 80 MB RAM | extracts the latest news about entities or topics using the GNews API; DeepPavlov Dream deployments use our own API key |
| Personality Catcher | 30 MB RAM | changes the system's personality description via the chat interface; works as a system command, and the response is a system-like message |
| Prompt Selector | 50 MB RAM | annotator that uses Sentence Ranker to rank prompts and select the `N_SENTENCES_TO_RETURN` most relevant ones (based on the questions provided in the prompts) |
| Property Extraction | 6.3 GiB RAM | extracts user attributes from utterances |
| Rake Keywords | 40 MB RAM | extracts keywords from utterances with the RAKE algorithm |
| Relative Persona Extractor | 50 MB RAM | annotator that uses Sentence Ranker to rank persona sentences and select the `N_SENTENCES_TO_RETURN` most relevant ones |
| Sentrewrite | 200 MB RAM | rewrites the user's utterances by replacing pronouns with specific names, providing more useful information to downstream components |
| Sentseg | 1 GB RAM | handles long and complex user utterances by splitting them into sentences and recovering punctuation |
| Spacy Nounphrases | 180 MB RAM | extracts noun phrases using spaCy and filters out generic ones |
| Speech Function Classifier | 1.1 GB RAM, 4.5 GB GPU | hierarchical algorithm based on several linear models and a rule-based approach that predicts the speech functions described by Eggins and Slade |
| Speech Function Predictor | 1.1 GB RAM, 4.5 GB GPU | yields probabilities of speech functions that can follow a speech function predicted by the Speech Function Classifier |
| Spelling Preprocessing | 50 MB RAM | pattern-based component that rewrites colloquial expressions into a more formal conversational style |
| Topic Recommendation | 40 MB RAM | offers a topic for further conversation using information about the discussed topics and the user's preferences; the current version is based on Reddit personalities (see the Dream Report for Alexa Prize 4) |
| Toxic Classification | 3.5 GB RAM, 3 GB GPU | toxic classification model from Transformers, specified via `PRETRAINED_MODEL_NAME_OR_PATH` |
| User Persona Extractor | 40 MB RAM | determines which age category the user belongs to based on keywords |
| Wiki Parser | 100 MB RAM | extracts Wikidata triplets for the entities detected with Entity Linking |
| Wiki Facts | 1.7 GB RAM | model that extracts related facts from Wikipedia and wikiHow pages |
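
Since each annotator above is deployed as a separate service, it can be queried (and smoke-tested) on its own. A hypothetical client call, assuming an annotator exposed locally on port 8021 with the same `/respond` endpoint and `sentences` payload as the sketch in the overview (all three are assumptions, not any specific component's contract):

```python
# Hypothetical smoke test for a locally deployed annotator;
# the URL, endpoint, and payload key are assumptions (see the sketch above).
import requests

response = requests.post(
    "http://localhost:8021/respond",
    json={"sentences": ["I moved to Paris last year."]},
    timeout=5,
)
response.raise_for_status()
print(response.json())  # e.g. [{"tokens": [...], "tags": [...]}]
```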

## Available Russian Annotators

| Name | Requirements | Description |
| --- | --- | --- |
| Badlisted Words | 50 MB RAM | detects obscene Russian words from the badlist |
| Entity Detection | 5.5 GB RAM | extracts entities and their types from utterances |
| Entity Linking | 400 MB RAM | finds Wikidata entity ids for the entities detected with Entity Detection |
| Fact Retrieval | 6.5 GiB RAM, 1 GiB GPU | extracts Wikipedia paragraphs relevant to the dialogue history |
| Intent Catcher | 900 MB RAM | classifies user utterances into a number of predefined intents, trained on a set of phrases and regexps |
| NER | 1.7 GB RAM, 4.9 GB GPU | extracts person names and names of locations and organizations from uncased text using a ruBERT-based (PyTorch) model |
| Sentseg | 2.4 GB RAM, 4.9 GB GPU | recovers punctuation using a ruBERT-based (PyTorch) model and splits text into sentences |
| Spacy Annotator | 250 MB RAM | token-wise annotations by spaCy |
| Spelling Preprocessing | 8 GB RAM | Russian Levenshtein correction model |
| Toxic Classification | 3.5 GB RAM, 3 GB GPU | toxic classification model from Transformers, specified via `PRETRAINED_MODEL_NAME_OR_PATH` |
| Wiki Parser | 100 MB RAM | extracts Wikidata triplets for the entities detected with Entity Linking |
| DialogRPT | 3.8 GB RAM, 2 GB GPU | DialogRPT model based on Russian DialoGPT by DeepPavlov and fine-tuned on Russian Pikabu comment sequences |

## Developing Your Own Annotator

TBD

## Resources