Maitreyapatel/reliability-checklist

Description

reliability-checklist is a Python framework (available via a CLI) for comprehensively evaluating the reliability of NLP systems.

reliability-checklist accepts any model and dataset as input and facilitates comprehensive evaluation across a wide range of reliability-related aspects such as accuracy, selective prediction, novelty detection, stability, sensitivity, and calibration.

Why you might want to use it:

✅ No coding needed
Pre-defined templates are available so you can integrate your models/datasets from the command line alone.

✅ Bring Your own Model (BYoM)
Is your model template missing? We have you covered: check out BYoM to create your own model-specific config file.

✅ Bring Your own Data (BYoD)
Is your dataset template missing? Check out BYoD to create your own dataset-specific config file.

✅ Reliability metrics
Currently, we support the following reliability-related aspects:

  • Accuracy/F1/Precision/Recall
  • Calibration: Reliability Diagram, Expected Calibration Error (ECE), Expected Overconfidence Error (EOE) (see the ECE sketch after this list)
  • Selective Prediction: Risk-Coverage Curve (RCC), AUC of risk-coverage curve
  • Sensitivity
  • Stability
  • Out-of-Distribution
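
As a quick reference, here is a minimal NumPy sketch of Expected Calibration Error. It is a generic illustration of the metric, not reliability-checklist's own implementation, and the function name is used only for this example.

# Generic ECE sketch: bin predictions by confidence and average the
# |accuracy - confidence| gap, weighted by the fraction of samples per bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)  # max softmax probability per sample
    correct = np.asarray(correct, dtype=float)          # 1.0 if the prediction was right, else 0.0
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(bins[:-1], bins[1:]):
        mask = (confidences > low) & (confidences <= high)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap                     # mask.mean() == |bin| / n
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))  # ~0.2375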

Upcoming Reliability Aspects:

  • Adversarial Attack: Model-in-the-loop adversarial attacks to evaluate a model's robustness.
  • Task-Specific Augmentations: Task-specific augmentations to check the reliability on augmented inputs.
  • Novelty
  • Other Measures: We plan to incorporate other measures such as bias, fairness, toxicity, and faithfulness of models. We also plan to measure the reliability of generative models on crucial parameters such as hallucinations.

Workflow

✅ Want to integrate more features?
Our easy-to-extend infrastructure lets developers seamlessly contribute models, datasets, augmentations, and evaluation metrics to the workflow.

[workflow diagram]

How to install?

# install reliability-checklist from GitHub
pip install git+https://github.com/Maitreyapatel/reliability-checklist

# download the spaCy model and NLTK wordnet data required by the framework
python -m spacy download en_core_web_sm
python -c "import nltk;nltk.download('wordnet')"

How to use?

Evaluate an example model/dataset with the default configuration

# eval on CPU
recheck

# eval on GPU
recheck trainer=gpu +trainer.gpus=[1,2,3]

Evaluate a model with a chosen dataset-specific experiment configuration from reliability_checklist/configs/task/

recheck task=<task_name>

Specify a custom model_name as shown in the following MNLI example

# if the same model_name is used for the tokenizer as well
recheck task=mnli custom_model="bert-base-uncased-mnli"

# if the tokenizer uses a different model_name
recheck task=mnli custom_model="bert-base-uncased-mnli" custom_model.tokenizer.model_name="ishan/bert-base-uncased-mnli"

Add custom_model config

# create config folder structure similar to reliability_checklist/configs/
mkdir ./configs/
mkdir ./configs/custom_model/

# run the following command after creating a new config file inside ./configs/custom_model/<your-config>.yaml
recheck task=mnli custom_model=<your-config>
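
For illustration only, the snippet below writes a minimal config file whose keys simply mirror the custom_model and custom_model.tokenizer.model_name overrides shown earlier; the actual schema is defined by the BYoM templates in reliability_checklist/configs/custom_model/, so treat these field names as assumptions.

# hypothetical sketch: write ./configs/custom_model/my-model.yaml
# the keys below only mirror the CLI overrides above and may differ from the real BYoM schema
from pathlib import Path

config_dir = Path("./configs/custom_model")
config_dir.mkdir(parents=True, exist_ok=True)
(config_dir / "my-model.yaml").write_text(
    'model_name: "bert-base-uncased-mnli"\n'
    "tokenizer:\n"
    '  model_name: "ishan/bert-base-uncased-mnli"\n'
)

You can then point the CLI at it, e.g. recheck task=mnli custom_model=my-model, following the pattern above.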

Visualization of results

reliability-checklist supports a wide range of visualization tools. By default it logs to the wandb online visualizer, and it also generates highly informative plots that are stored in the logs directory.

🤝 Contributing to reliability-checklist

Any kind of positive contribution is welcome! Please help us grow by contributing to the project.

If you wish to contribute, you can work on any of the features/issues listed here or create one of your own. After adding your code, please send us a Pull Request.

Please read CONTRIBUTING for details on our CODE OF CONDUCT and the process for submitting pull requests to us.


A ⭐️ for reliability-checklist helps us build more reliable language models.