Aligning Faithful Interpretations with their Social Attribution

This repository contains the code for the TACL 2020 paper of the same name. The code here is a simple, minimalistic implementation of the contrastive highlights procedure described in the paper.

Getting Started

This code is based on the AllenNLP library. After cloning this repo, create a new AllenNLP environment:

conda create -n <env-name> python=3.8
conda activate <env-name>
pip install allennlp==1.3.0

And then run these scripts:

bash download_ag.sh  # download the AG News dataset
bash train_sequence_classification.sh  # fine-tune a RoBERTa-Large model on AG News (change the gpu parameter inside the script/jsonnet)

Then, run the contrastive_highlights.ipynb Jupyter notebook on the trained model to derive the contrastive highlights for it.
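
For reference, the trained model can be loaded in the notebook (or any script) through the standard AllenNLP predictor API. The snippet below is a minimal sketch, not code from this repo: the archive path and the "text_classifier" predictor name are assumptions, so point them at wherever train_sequence_classification.sh writes its serialization directory.

# Minimal sketch: load the trained archive and classify one sentence.
from allennlp.models.archival import load_archive
from allennlp.predictors import Predictor

# Hypothetical path -- use the serialization directory set in the training jsonnet.
archive = load_archive("output/ag_news_roberta_large/model.tar.gz")
predictor = Predictor.from_archive(archive, predictor_name="text_classifier")

result = predictor.predict_json({"sentence": "Wall St. slides as oil prices climb."})
print(result["label"], result["probs"])  # output keys depend on the model's output dict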

Unfortunately, GitHub does not render colored text in its web Markdown viewer. I use text coloring to highlight the text inside the Jupyter notebook; example_output.png shows the notebook output with color:

Colored output of the Jupyter notebook example

The color will display properly when you open the notebook through Jupyter.
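
For the curious, the coloring itself is just HTML rendered inline by Jupyter. The snippet below is illustrative only (not the notebook's exact code) and shows one way to display a highlight by wrapping the highlighted tokens in colored spans:

# Illustrative sketch: render a highlight as colored HTML inside a notebook cell.
from IPython.display import HTML, display

def show_highlight(tokens, highlight_indices, color="yellow"):
    # Wrap highlighted tokens in a colored <span>; leave the rest as plain text.
    pieces = []
    for i, tok in enumerate(tokens):
        if i in highlight_indices:
            pieces.append(f'<span style="background-color: {color}">{tok}</span>')
        else:
            pieces.append(tok)
    display(HTML(" ".join(pieces)))

show_highlight("Wall St. slides as oil prices climb .".split(), highlight_indices={4, 5, 6})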

Paper

Paper link: https://arxiv.org/abs/2006.01067

Contact: alonjacovi at gmail (please feel free to contact me for any questions or discussion)

Abstract:

We find that the requirement of model interpretations to be faithful is vague and incomplete. With interpretation by textual highlights as a case-study, we present several failure cases. Borrowing concepts from social science, we identify that the problem is a misalignment between the causal chain of decisions (causal attribution) and the attribution of human behavior to the interpretation (social attribution). We re-formulate faithfulness as an accurate attribution of causality to the model, and introduce the concept of aligned faithfulness: faithful causal chains that are aligned with their expected social behavior. The two steps of causal attribution and social attribution together complete the process of explaining behavior. With this formalization, we characterize various failures of misaligned faithful highlight interpretations, and propose an alternative causal chain to remedy the issues. Finally, we implement highlight explanations of the proposed causal format using contrastive explanations.

Disclaimer

As mentioned, this code is deliberately kept as simple and minimalistic as possible to make the paper easier to follow. If you wish to use the procedure to derive real interpretations, you will likely want to use additional techniques to:

  1. Ensure that masked inputs remain in-distribution for the model (e.g., via Interpretation of NLP models through input marginalization), or use some other manipulation that keeps the input in-distribution while removing the non-highlighted information.
  2. Ensure that your highlight space is expressive enough for your needs. For simplicity, this repo only considers contiguous highlights, but you may want a larger or smaller highlight space (e.g., non-contiguous highlights, for an exponentially large space); a sketch of the contiguous case is given below.
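
As a concrete (and deliberately naive) illustration of the contiguous highlight space in point 2, the sketch below enumerates every contiguous span and masks out its complement. It is not the repo's implementation: the "<mask>" string is an assumption (RoBERTa's mask token), and per point 1 a real implementation should keep the masked input in-distribution (e.g., via input marginalization) rather than rely on a single mask token.

# Sketch: enumerate contiguous highlight candidates and mask everything else.
from typing import Iterator, List, Tuple

def contiguous_highlights(num_tokens: int) -> Iterator[Tuple[int, int]]:
    # All (start, end) spans of contiguous token indices -- O(n^2) candidates.
    for start in range(num_tokens):
        for end in range(start + 1, num_tokens + 1):
            yield start, end

def mask_complement(tokens: List[str], span: Tuple[int, int], mask_token: str = "<mask>") -> List[str]:
    # Keep the highlighted span; replace every other token with the mask token.
    start, end = span
    return [tok if start <= i < end else mask_token for i, tok in enumerate(tokens)]

tokens = "Wall St. slides as oil prices climb .".split()
for span in contiguous_highlights(len(tokens)):
    masked = mask_complement(tokens, span)
    # score `masked` with the trained classifier here, e.g. against a contrast class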

On replicating the examples from the paper:

Unfortunately, the exact examples in the paper are coupled to the model I used, which I cannot upload here. The model used for the examples in the paper is a fine-tuned bert-base-cased, while the model shown in the Jupyter notebook's outputs is a fine-tuned roberta-large.
