StarCoder Safety Evaluation

This repository contains the code for the toxicity evaluations in the StarCoder paper. The code for the social bias evaluations is available here.

Install

git clone git@github.com:McGill-NLP/StarCoderSafetyEval.git
cd StarCoderSafetyEval
virtualenv env && source env/bin/activate
pip install -e .
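
A quick import check can confirm the environment is working; this assumes PyTorch and Hugging Face Transformers are pulled in as dependencies of the package, which is an assumption and not stated in this README:

python3 -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"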

Evaluating Response Toxicity

To evaluate using RealToxicityPrompts, you'll first need to download the dataset from here. Once you've downloaded the dataset, use the following commands to prepare the data:

# Uncompress dataset.
tar -xvf realtoxicityprompts-data.tar.gz

# Copy to working directory.
cp realtoxicityprompts-data/prompts.jsonl .
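
Each line of prompts.jsonl is a standalone JSON record. A minimal sketch for inspecting the first record, assuming the field layout of the public RealToxicityPrompts release (where the prompt text sits under prompt -> text); verify the keys against your local copy:

# Peek at the first record of prompts.jsonl to confirm the schema.
import json

with open("prompts.jsonl") as f:
    first = json.loads(next(f))

print(sorted(first.keys()))
print(first["prompt"]["text"])  # assumed field layout; check against your copy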

You can then use the following command to launch toxicity evaluation:

python3 real_toxicity_prompts_evaluation.py \
    --model_name_or_path ${model_name_or_path} \
    --batch_size 8 \
    --data_file_path ${data_file_path} \
    --num_example 10000 \
    --output_dir ${output_dir}

This script does two things: (1) generates a response for each prompt and (2) scores the generated responses for toxicity. The resulting scores are written to output_dir. Note the num_example argument, which limits the number of prompts evaluated (RealToxicityPrompts contains ~100K prompts in total). We currently use two tools for evaluating generated responses:
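
Whichever scorers are used, the overall flow is the same: generate a continuation for each prompt, then score the continuation for toxicity. The sketch below illustrates that two-stage flow with Hugging Face transformers pipelines and unitary/toxic-bert as a stand-in classifier; it is not the implementation in real_toxicity_prompts_evaluation.py, and the model names are only examples:

import json
from transformers import pipeline

# Stage 1: generate continuations for each prompt.
# "bigcode/starcoderbase" is illustrative; substitute your own model_name_or_path.
generator = pipeline("text-generation", model="bigcode/starcoderbase", device_map="auto")

# Stage 2: score generated text with an off-the-shelf toxicity classifier.
# unitary/toxic-bert is a stand-in, not necessarily what the script uses.
scorer = pipeline("text-classification", model="unitary/toxic-bert")

results = []
with open("prompts.jsonl") as f:
    for i, line in enumerate(f):
        if i >= 10:  # small smoke test; --num_example plays this role in the script
            break
        prompt = json.loads(line)["prompt"]["text"]
        continuation = generator(
            prompt, max_new_tokens=20, do_sample=True, return_full_text=False
        )[0]["generated_text"]
        score = scorer(continuation)[0]  # e.g. {"label": "toxic", "score": 0.97}
        results.append({"prompt": prompt, "continuation": continuation, **score})

print(json.dumps(results[:2], indent=2))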
