
Chatbot Evaluation and Maintenance

An interactive tool for exploratory data analysis, model understanding, and evaluation of chatbot training data and user feedback.

The tool is built to answer questions such as:

  • Which examples does my model perform poorly on?
    • In terms of classification?
    • In terms of user feedback?
  • Can user feedback be attributed to adversarial behaviour?
  • Is there mislabeled text in the training set?

Features

The tool runs as a browser-based dashboard. Standard features include:

  • A visualization of a low-dimensional projection of the embeddings for both user queries and stored data (see the sketch after this list).
  • Aggregate analysis on chatbot metrics:
    • language
    • feedback type (upvote, downvote, something else, none)
    • chatbot FAQ ID
    • confidence of top intent
    • outlier scores for training data
    • novelty scores for feedback data
  • Future metrics to be added include:
    • ranking of delivered content from bot API
    • visible ranking when presented to users
    • timestamp
    • website
    • IP address
    • session ID
    • attached policies (the set of rules governing a chatbot)
    • user-annotated FAQ ID
    • distance of query from FAQ ID
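
The embedding visualization and the outlier/novelty scores can be produced with off-the-shelf components. Below is a minimal sketch in Python, assuming sentence-transformers for embeddings, umap-learn for the 2-D projection, and scikit-learn's LocalOutlierFactor for scoring; the model name, example texts, and variable names are illustrative, not this repo's actual configuration.

```python
# Minimal sketch of the embedding projection plus outlier/novelty scoring.
# Assumes sentence-transformers, umap-learn, and scikit-learn are installed;
# the model and example texts are illustrative, not this repo's actual setup.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import LocalOutlierFactor
import umap

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

train_texts = [
    "how do i reset my password",
    "i forgot my password",
    "what are your opening hours",
    "when are you open",
    "asdf qwerty zxcv",  # junk example that should score as an outlier
]
feedback_texts = [
    "reset password please",
    "do you ship to canada",  # novel topic absent from the training data
]

train_emb = model.encode(train_texts)
feedback_emb = model.encode(feedback_texts)

# Outlier scores for training data: LOF fitted on the training embeddings.
lof = LocalOutlierFactor(n_neighbors=3, novelty=True).fit(train_emb)
train_outlier_scores = -lof.negative_outlier_factor_  # higher = more outlying

# Novelty scores for feedback data: score unseen points against that fit.
feedback_novelty_scores = -lof.score_samples(feedback_emb)

# 2-D projection of every embedding for the dashboard's scatter plot.
coords = umap.UMAP(n_components=2, n_neighbors=3).fit_transform(
    np.vstack([train_emb, feedback_emb])
)
```

Each point in `coords` can then be coloured by language, feedback type, top-intent confidence, or its outlier/novelty score to surface the clusters the dashboard is meant to expose.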

Feedback Workflows

Workflow templates provide standard solutions for chatbot performance evaluation and maintenance.

  • Confirmation that upvote user feedback agrees with chatbot predictions (a triage sketch follows this list).
    • If user labels agree with chatbot labels, the feedback is likely genuine and can be quickly added to the training data.
    • If user labels fall into many different categories, the classifier performed poorly even though users say the results are correct. These examples should be checked before being added to the training data.
    • If user labels fall into a single other category, the feedback can be attributed to adversarial behaviour.
  • Confirmation that downvote user feedback agrees with chatbot predictions.
    • If user labels agree with chatbot labels, the feedback can be attributed to adversarial behaviour.
    • If user labels fall into many different categories, the classifier performed poorly and the users confirm it. These examples should be checked and reviewed.
    • If user labels fall into a single other category, the feedback is likely genuine but needs confirmation. These examples should be checked before being added to the training data.
  • Confirmation that "something else" button presses agree with chatbot predictions.
    • When a user chooses "something else", the confidences of the previously returned chatbot intents should be low.
    • Examples should be checked before deciding whether to add the data to an existing category or a new one.
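
The upvote workflow's branching logic amounts to a simple triage over feedback rows. The sketch below is one way to express it with pandas; the column names (`predicted_faq_id`, `user_faq_id`), the bucket names, and the threshold for "many different categories" are all assumptions, not the repo's schema.

```python
# Minimal sketch of the upvote-feedback triage described above. Column names,
# bucket names, and the "many categories" threshold are assumptions.
import pandas as pd

def triage_upvotes(df: pd.DataFrame) -> pd.DataFrame:
    agrees = df["user_faq_id"] == df["predicted_faq_id"]
    # Count distinct user labels per predicted label: many distinct labels
    # suggests the classifier performed poorly on that category.
    n_user_labels = (
        df.groupby("predicted_faq_id")["user_faq_id"].transform("nunique")
    )
    out = df.copy()
    out["bucket"] = "adversarial"                    # disagrees, one other category
    out.loc[n_user_labels > 2, "bucket"] = "review"  # scattered labels: check first
    out.loc[agrees, "bucket"] = "add_to_training"    # agreement: likely genuine
    return out

fb = pd.DataFrame({
    "predicted_faq_id": [1, 1, 1, 2],
    "user_faq_id":      [1, 3, 4, 5],
})
print(triage_upvotes(fb)["bucket"].tolist())
# ['add_to_training', 'review', 'review', 'adversarial']
```

The downvote workflow is the mirror image: agreement becomes the adversarial signal and a single disagreeing category the likely-genuine one, so the same structure applies with the bucket assignments swapped.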
