Automated interpretability

Quickstart

  1. Set up your environment by following neuron_explainer/README.md
  2. Open neuron_explainer/demos/generate_and_score_explanation.ipynb and enter your OpenAI API key
  3. Run the notebook. You can change the MODE, to_print, and neuron_record variables to look at different methods/neurons; the core flow is sketched after this list.
  4. Compare the results with the GPT-4 explanations and activations shown in the Neuron Viewer
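
For reference, the notebook follows the upstream demo closely. Here is a minimal sketch assuming the upstream neuron_explainer API; this fork's MODE switch swaps in the explainer classes described below, so the exact cell contents differ:

```python
from neuron_explainer.activations.activation_records import calculate_max_activation
from neuron_explainer.activations.activations import ActivationRecordSliceParams, load_neuron
from neuron_explainer.explanations.calibrated_simulator import UncalibratedNeuronSimulator
from neuron_explainer.explanations.explainer import TokenActivationPairExplainer
from neuron_explainer.explanations.prompt_builder import PromptFormat
from neuron_explainer.explanations.scoring import simulate_and_score
from neuron_explainer.explanations.simulator import ExplanationNeuronSimulator

# Load a neuron record and split it into explanation and scoring sets.
neuron_record = load_neuron(9, 6236)
slice_params = ActivationRecordSliceParams(n_examples_per_split=5)
train_records = neuron_record.train_activation_records(activation_record_slice_params=slice_params)
valid_records = neuron_record.valid_activation_records(activation_record_slice_params=slice_params)

# Generate an explanation from the top-activating examples.
explainer = TokenActivationPairExplainer(
    model_name="gpt-4", prompt_format=PromptFormat.HARMONY_V4, max_concurrent=1
)
explanations = await explainer.generate_explanations(  # notebooks allow top-level await
    all_activation_records=train_records,
    max_activation=calculate_max_activation(train_records),
    num_samples=1,
)
explanation = explanations[0]

# Simulate the neuron from the explanation and score how well it matches.
simulator = UncalibratedNeuronSimulator(
    ExplanationNeuronSimulator(
        "text-davinci-003", explanation, max_concurrent=1,
        prompt_format=PromptFormat.INSTRUCTION_FOLLOWING,
    )
)
scored = await simulate_and_score(simulator, valid_records)
print(explanation, scored.get_preferred_score())
```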

Changes summary (Tuomas)

  1. Implemented a new load_neuron method in activations.py so the repo works without an Azure account (see the first sketch after this list).

  2. Added 3 new Explainer classes in explainer.py that change how the explainer model is prompted (SummaryExplainer, HighlightExplainer, and HighlightSummaryExplainer); see the second sketch after this list.

    • Also added the supporting functions for these Explainers in activation_records.py
    • Added a to_print option to Explainer to print the generated prompt
  3. Modified the notebooks in demos

    • Updated generate_and_score_explanation.ipynb and explain_puzzles.ipynb to work with the new explainers (and commented out the simulation step to save on compute)
    • Added compare_explanation.ipynb to easily generate explanations with different prompts and compare them
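
For item 1, one plausible shape for the Azure-free loader is to fetch the public data over plain HTTPS rather than the az:// protocol. A minimal sketch, assuming the az://openaipublic bucket maps to the openaipublic.blob.core.windows.net HTTPS endpoint and that the fork keeps upstream's fast_dataclasses deserializer (the actual implementation may differ):

```python
import urllib.request

from neuron_explainer.fast_dataclasses import loads  # upstream NeuronRecord deserializer

# Assumed HTTPS endpoint for the az://openaipublic blob container.
BASE = "https://openaipublic.blob.core.windows.net/neuron-explainer/data"

def load_neuron(layer_index: int, neuron_index: int):
    """Fetch a NeuronRecord JSON file directly, with no Azure credentials."""
    url = f"{BASE}/collated-activations/{layer_index}/{neuron_index}.json"
    with urllib.request.urlopen(url) as response:
        return loads(response.read())
```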

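For item 2, the new classes are described as drop-in Explainer variants. A hypothetical usage sketch; the MODE values and the constructor arguments other than to_print are assumptions, not the fork's actual API:

```python
from neuron_explainer.explanations.explainer import (
    HighlightExplainer,
    HighlightSummaryExplainer,
    SummaryExplainer,
    TokenActivationPairExplainer,
)
from neuron_explainer.explanations.prompt_builder import PromptFormat

# Hypothetical MODE switch mirroring the demo notebooks.
MODE = "highlight_summary"
explainer_cls = {
    "baseline": TokenActivationPairExplainer,
    "summary": SummaryExplainer,
    "highlight": HighlightExplainer,
    "highlight_summary": HighlightSummaryExplainer,
}[MODE]

explainer = explainer_cls(
    model_name="gpt-4",
    prompt_format=PromptFormat.HARMONY_V4,
    max_concurrent=1,
    to_print=True,  # fork-added option: print the generated prompt
)
```
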
Code and tools

This repository contains code and tools associated with the Language models can explain neurons in language models paper, specifically:

  • Code for automatically generating, simulating, and scoring explanations of neuron behavior using the methodology described in the paper. See the neuron-explainer README for more information.

  • A tool for viewing neuron activations and explanations, accessible here. See the neuron-viewer README for more information.

Public datasets

Together with this code, we're also releasing public datasets of GPT-2 XL neurons and explanations. Here's an overview of those datasets; a short fetch sketch follows the list.

  • Neuron activations: az://openaipublic/neuron-explainer/data/collated-activations/{layer_index}/{neuron_index}.json
    • Tokenized text sequences and their activations for the neuron. We provide multiple sets of tokens and activations: top-activating sequences, random samples from several quantiles, and a completely random sample. We also provide some basic statistics for the activations.
    • Each file contains a JSON-formatted NeuronRecord dataclass.
  • Neuron explanations: az://openaipublic/neuron-explainer/data/explanations/{layer_index}/{neuron_index}.jsonl
    • Scored model-generated explanations of the behavior of the neuron, including simulation results.
    • Each file contains a JSON-formatted NeuronSimulationResults dataclass.
  • Related neurons: az://openaipublic/neuron-explainer/data/related-neurons/weight-based/{layer_index}/{neuron_index}.json
    • Lists of the upstream and downstream neurons with the most positive and negative connections (see below for definition).
    • Each file contains a JSON-formatted dataclass whose definition is not included in this repo.
  • Tokens with high average activations: az://openaipublic/neuron-explainer/data/related-tokens/activation-based/{layer_index}/{neuron_index}.json
    • Lists of tokens with the highest average activations for individual neurons, and their average activations.
    • Each file contains a JSON-formatted TokenLookupTableSummaryOfNeuron dataclass.
  • Tokens with large inbound and outbound weights: az://openaipublic/neuron-explainer/data/related-tokens/weight-based/{layer_index}/{neuron_index}.json
    • List of the most-positive and most-negative input and output tokens for individual neurons, as well as the associated weight (see below for definition).
    • Each file contains a JSON-formatted WeightBasedSummaryOfNeuron dataclass.
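
All of the az:// paths above should also be reachable over plain HTTPS. A minimal sketch for the explanations dataset, assuming the usual openaipublic.blob.core.windows.net URL mapping (field names inside the records depend on the dataclass definitions, so the sketch only inspects keys):

```python
import json
import urllib.request

BASE = "https://openaipublic.blob.core.windows.net/neuron-explainer/data"

def fetch_explanations(layer_index: int, neuron_index: int) -> list[dict]:
    """Download the scored explanations for one neuron (JSONL: one record per line)."""
    url = f"{BASE}/explanations/{layer_index}/{neuron_index}.jsonl"
    with urllib.request.urlopen(url) as response:
        body = response.read().decode("utf-8")
    return [json.loads(line) for line in body.splitlines() if line.strip()]

for record in fetch_explanations(9, 6236):
    print(sorted(record.keys()))  # inspect the NeuronSimulationResults fields
```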

Definition of connection weights

Refer to the GPT-2 model code to understand the model weight conventions used below.

Neuron-neuron: For two neurons (l1, n1) and (l2, n2) with l1 < l2, the connection strength is defined as h{l1}.mlp.c_proj.w[:, n1, :] @ diag(h{l2}.ln_2.g) @ h{l2}.mlp.c_fc.w[:, :, n2].

Neuron-token: For token t and neuron (l, n), the input weight is computed as wte[t, :] @ diag(h{l}.ln_2.g) @ h{l}.mlp.c_fc.w[:, :, n] and the output weight is computed as h{l}.mlp.c_proj.w[:, n, :] @ diag(ln_f.g) @ wte[t, :].
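
A sketch of these three dot products using the Hugging Face GPT-2 XL checkpoint. The mapping from the TF-style shapes above to HF's Conv1D layout (c_fc.weight as [d_model, 4*d_model], c_proj.weight as [4*d_model, d_model]) is our assumption, not something this repo ships:

```python
import torch
from transformers import GPT2Model

state = GPT2Model.from_pretrained("gpt2-xl").state_dict()

def neuron_neuron(l1: int, n1: int, l2: int, n2: int) -> float:
    """h{l1}.mlp.c_proj.w[:, n1, :] @ diag(h{l2}.ln_2.g) @ h{l2}.mlp.c_fc.w[:, :, n2]"""
    assert l1 < l2
    out_vec = state[f"h.{l1}.mlp.c_proj.weight"][n1, :]  # how n1 writes to the residual stream
    gain = state[f"h.{l2}.ln_2.weight"]                  # LayerNorm gain before l2's MLP
    in_vec = state[f"h.{l2}.mlp.c_fc.weight"][:, n2]     # how n2 reads from the residual stream
    return torch.dot(out_vec * gain, in_vec).item()

def token_input(t: int, l: int, n: int) -> float:
    """wte[t, :] @ diag(h{l}.ln_2.g) @ h{l}.mlp.c_fc.w[:, :, n]"""
    return torch.dot(
        state["wte.weight"][t, :] * state[f"h.{l}.ln_2.weight"],
        state[f"h.{l}.mlp.c_fc.weight"][:, n],
    ).item()

def token_output(t: int, l: int, n: int) -> float:
    """h{l}.mlp.c_proj.w[:, n, :] @ diag(ln_f.g) @ wte[t, :]"""
    return torch.dot(
        state[f"h.{l}.mlp.c_proj.weight"][n, :] * state["ln_f.weight"],
        state["wte.weight"][t, :],
    ).item()
```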

Misc Lists of Interesting Neurons

Lists of neurons we thought were interesting according to different criteria, with some preliminary descriptions.
