The AI Teacher Test

This repository contains the code and data for the paper:

Title: The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues
Authors: Anaïs Tack & Chris Piech
Abstract: How can we test whether state-of-the-art generative models, such as Blender and GPT-3, are good AI teachers, capable of replying to a student in an educational dialogue? Designing an AI teacher test is challenging: although evaluation methods are much-needed, there is no off-the-shelf solution to measuring pedagogical ability. This paper reports on a first attempt at an AI teacher test. We built a solution around the insight that you can run conversational agents in parallel to human teachers in real-world dialogues, simulate how different agents would respond to a student, and compare these counterpart responses in terms of three abilities: speak like a teacher, understand a student, help a student. Our method builds on the reliability of comparative judgments in education and uses a probabilistic model and Bayesian sampling to infer estimates of pedagogical ability. We find that, even though conversational agents (Blender in particular) perform well on conversational uptake, they are quantifiably worse than real teachers on several pedagogical dimensions, especially with regard to helpfulness (Blender: ∆ ability = −0.75; GPT-3: ∆ ability = −0.93).

Dependencies

Code

The code in this repository depends on the ParlAI framework, the OpenAI API, the Hugging Face transformers library, and the Stan library.

pip install -r src/requirements.txt

Data

The data in this repository depends on student-teacher utterances from two datasets. For copyright reasons, these texts were removed from the repository and replaced by the tag {COPYRIGHTED-TEXT}. To repopulate the data, you must:

Download the Teacher-Student Chatroom Corpus. Put the *.tsv files into data/0_datasets/tscc/.
Download the Educational Uptake Dataset. Put uptake_data.csv into data/0_datasets/uptake/.

Run the following commands to repopulate the data with missing utterances and prompts.

python -m src.utils.repopulate -t TSCC -d data/0_datasets/tscc
python -m src.utils.repopulate -t EduUptake -d data/0_datasets/uptake
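The idea behind repopulation can be illustrated with a minimal sketch. This is not the code in src/utils/repopulate; the record layout, the field names (id, text), and the function name are hypothetical and only show how a {COPYRIGHTED-TEXT} tag could be swapped for the original utterance once the source dataset is available locally.

```python
# Illustrative sketch only: the "id" and "text" fields are assumptions,
# not the repository's actual data schema.
PLACEHOLDER = "{COPYRIGHTED-TEXT}"


def repopulate(records, originals, text_field="text", id_field="id"):
    """Return copies of records with placeholder text restored from originals.

    records   -- list of dicts whose text may be the placeholder tag
    originals -- dict mapping an utterance id to its original text
    """
    restored = []
    for record in records:
        record = dict(record)  # copy, so the input is left unchanged
        if record.get(text_field) == PLACEHOLDER:
            key = record[id_field]
            if key in originals:
                record[text_field] = originals[key]
        restored.append(record)
    return restored


records = [
    {"id": "u1", "text": "{COPYRIGHTED-TEXT}"},
    {"id": "u2", "text": "{COPYRIGHTED-TEXT}"},
    {"id": "u3", "text": "already present"},
]
originals = {"u1": "Hello, how are you?"}
restored = repopulate(records, originals)
```

In this sketch, records whose id is missing from the downloaded dataset keep the placeholder, so an incomplete download is easy to spot afterwards.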

Note

Please cite both datasets when using the data in your research. See data/0_datasets/tscc/ and data/0_datasets/uptake/.

Method

Simulating Agent Responses

Download the pre-trained models into downloads/models/.

python -m src.parlai.scripts.download_models downloads/ blender/blender_90M blender/blender_400Mdistill blender/blender_3B blender/blender_9B

Run a Blender model on the data. For example:

python -m src.parlai.scripts.run -t TSCC -d data/0_datasets/tscc/ -M downloads/models -m blender/blender_9B -O results/
python -m src.parlai.scripts.run -t EduUptake -d data/0_datasets/uptake/ -M downloads/models -m blender/blender_9B -O results/

Run a GPT-3 model on the data. For example:

python -m src.parlai.scripts.run -m src.parlai.models.gpt3:GPT3Davinci -o src/parlai/opts/gpt3.json -t TSCC -d data/0_datasets/tscc/ -O results/
python -m src.parlai.scripts.run -m src.parlai.models.gpt3:GPT3Davinci -o src/parlai/opts/gpt3.json -t EduUptake -d data/0_datasets/uptake/ -O results/

Measuring Pedagogical Ability

Detect outliers among human raters.

python -m src.stan.bradley_terry data/2_comparisons/items.jsonl --per-rater
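To give an intuition for what an outlying rater looks like, here is a much simpler stand-in than the per-rater model run by the command above: it measures how often each rater agrees with the majority choice on an item. The tuple layout (item, rater, choice), the labels, and the 0.5 threshold are assumptions made for illustration; they do not reflect the format of items.jsonl or the criterion used in the paper.

```python
from collections import Counter, defaultdict


def rater_agreement(judgments):
    """Fraction of items on which each rater agrees with the majority choice.

    judgments -- iterable of (item_id, rater_id, choice) tuples (hypothetical layout)
    """
    by_item = defaultdict(list)
    for item_id, rater_id, choice in judgments:
        by_item[item_id].append((rater_id, choice))

    agree, total = Counter(), Counter()
    for votes in by_item.values():
        top = Counter(choice for _, choice in votes).most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            continue  # tied item: no majority to compare against
        majority = top[0][0]
        for rater_id, choice in votes:
            total[rater_id] += 1
            agree[rater_id] += choice == majority
    return {rater: agree[rater] / total[rater] for rater in total}


judgments = [
    ("q1", "r1", "teacher"), ("q1", "r2", "teacher"), ("q1", "r3", "blender"),
    ("q2", "r1", "teacher"), ("q2", "r2", "teacher"), ("q2", "r3", "gpt3"),
    ("q3", "r1", "gpt3"), ("q3", "r2", "gpt3"), ("q3", "r3", "gpt3"),
]
rates = rater_agreement(judgments)
# Flag raters below an arbitrary, illustrative threshold.
flagged = [rater for rater, rate in rates.items() if rate < 0.5]
```

On this toy data only r3 falls below the threshold, since r3 sides with the majority on one item out of three.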

Estimate pedagogical abilities after outlier removal.

python -m src.stan.bradley_terry data/2_comparisons/items.jsonl --outliers data/2_comparisons/outliers.yaml
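For readers unfamiliar with the Bradley-Terry model, the sketch below fits one to a handful of pairwise outcomes. It is deliberately not the repository's method: the paper uses a probabilistic model with Bayesian sampling in Stan, whereas this sketch swaps in a plain maximum-likelihood fit using the classic iterative (minorization-maximization) update. The agent labels and the toy comparisons are invented for illustration.

```python
import math
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=200):
    """Maximum-likelihood Bradley-Terry fit via the iterative MM update.

    comparisons -- list of (winner, loser) pairs
    Returns {agent: log-ability}, centered to mean zero.
    Assumes every agent wins at least once (otherwise its strength goes to 0).
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)  # number of comparisons per unordered pair
    agents = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        agents.update((winner, loser))

    strength = {a: 1.0 for a in agents}
    for _ in range(iterations):
        updated = {}
        for a in agents:
            # Sum n_ab / (p_a + p_b) over every opponent b that a has faced.
            denom = sum(
                n / (strength[a] + strength[b])
                for pair, n in pair_counts.items() if a in pair
                for b in pair - {a}
            )
            updated[a] = wins[a] / denom
        total = sum(updated.values())
        strength = {a: s * len(agents) / total for a, s in updated.items()}

    logs = {a: math.log(s) for a, s in strength.items()}
    mean = sum(logs.values()) / len(logs)
    return {a: value - mean for a, value in logs.items()}


comparisons = (
    [("teacher", "blender")] * 3 + [("blender", "teacher")]
    + [("teacher", "gpt3")] * 3 + [("gpt3", "teacher")]
    + [("blender", "gpt3")] * 2 + [("gpt3", "blender")] * 2
)
abilities = fit_bradley_terry(comparisons)
delta = abilities["blender"] - abilities["teacher"]  # negative: below the teacher
```

In this toy data the teacher wins three of every four comparisons against each agent, so the fitted difference in log-ability between an agent and the teacher is -log(3). A Bayesian fit as in the paper would additionally yield uncertainty around such differences, which a point estimate like this one does not provide.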

Citation

More information can be found in this paper. When using the data or code in your research or publication, please cite this paper as well.

@inproceedings{tack_ai_2022,
   title = {The {{AI Teacher Test}}: {{Measuring}} the {{Pedagogical Ability}} of {{Blender}} and {{GPT-3}} in {{Educational Dialogues}}},
   booktitle = {The 15th {{International Conference}} on {{Educational Data Mining}}},
   author = {Tack, Ana{\"i}s and Piech, Chris},
   year = {2022},
   pages = {accepted},
   copyright = {All rights reserved}
   }

Acknowledgments

This research was funded by a fellowship of the BAEF (Belgian American Educational Foundation) and by a grant from Stanford HAI.

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[1.0.0] - 2022-05-10

Added

Publication of data and code for the EDM 2022 conference
