This repo contains code for applying sparse coding to activation vectors in language models, including the code used for the results in the paper Sparse Autoencoders Find Highly Interpretable Features in Language Models. Work done with Logan Riggs and Aidan Ewart, advised by Lee Sharkey.
The repo is designed to train multiple sparse autoencoders simultaneously, each with a different L1 value, on either a single GPU or across multiple GPUs. big_sweep_experiments contains a number of example run functions.
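To illustrate what training several autoencoders at once with different L1 values can look like, here is a minimal NumPy sketch. It is not the code in this repo: the function names (init_ensemble, ensemble_losses, sgd_step), the untied ReLU architecture, the plain gradient-descent update, and the random stand-in data are all assumptions chosen for brevity. The point is only that stacking parameters along a leading "model" axis lets one batch of activations update every autoencoder in a single pass.

```python
import numpy as np

def init_ensemble(n_models, d_act, n_feats, rng):
    """Create stacked parameters for n_models independent autoencoders (hypothetical helper)."""
    W_e = rng.normal(scale=0.1, size=(n_models, d_act, n_feats))  # encoders
    W_d = rng.normal(scale=0.1, size=(n_models, n_feats, d_act))  # decoders
    b = np.zeros((n_models, n_feats))                             # encoder biases
    return W_e, b, W_d

def forward(x, W_e, b, W_d):
    """Encode and decode the same batch with every model at once."""
    # codes has shape (model, batch, feature); ReLU keeps activations non-negative
    codes = np.maximum(np.einsum("bd,mdf->mbf", x, W_e) + b[:, None, :], 0.0)
    x_hat = np.einsum("mbf,mfd->mbd", codes, W_d)
    return codes, x_hat

def ensemble_losses(x, W_e, b, W_d, l1_values):
    """Return (reconstruction, sparsity, total) losses, one entry per model."""
    codes, x_hat = forward(x, W_e, b, W_d)
    recon = ((x_hat - x[None]) ** 2).sum(axis=-1).mean(axis=-1)
    sparsity = np.abs(codes).sum(axis=-1).mean(axis=-1)
    return recon, sparsity, recon + l1_values * sparsity

def sgd_step(x, W_e, b, W_d, l1_values, lr=1e-2):
    """One manual gradient-descent step on every model simultaneously."""
    batch = x.shape[0]
    codes, x_hat = forward(x, W_e, b, W_d)
    d_xhat = 2.0 * (x_hat - x[None]) / batch
    active = codes > 0  # ReLU gradient mask; also the sign of the codes
    d_codes = (np.einsum("mbd,mfd->mbf", d_xhat, W_d)
               + l1_values[:, None, None] * active / batch)
    d_pre = d_codes * active
    new_W_d = W_d - lr * np.einsum("mbf,mbd->mfd", codes, d_xhat)
    new_W_e = W_e - lr * np.einsum("bd,mbf->mdf", x, d_pre)
    new_b = b - lr * d_pre.sum(axis=1)
    return new_W_e, new_b, new_W_d

# Usage: three autoencoders that differ only in their L1 coefficient.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))            # random stand-in for model activations
l1_values = np.array([0.0, 1e-3, 1e-2])  # one coefficient per autoencoder
params = init_ensemble(len(l1_values), d_act=16, n_feats=32, rng=rng)
for _ in range(50):
    params = sgd_step(x, *params, l1_values)
recon, sparsity, total = ensemble_losses(x, *params, l1_values)
```

Because each model only touches its own slice of the stacked arrays, the autoencoders remain fully independent; the shared batch dimension is what makes the sweep cheap.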
interpret.py contains tools to interpret learned dictionaries using OpenAI's automatic interpretation protocol. Set --load_interpret_autoencoder to the location of the autoencoder you want to test, and use --model_name, --layer and --layer_loc to specify which activations should be used. --activation_tranform should be set to feature_dict to interpret a learned dictionary, but many baselines can also be run, including pca, ica, nmf, neuron_basis, and random.
If you run interpret.py read_results --kwargs.. and select the --model_name, --layer and --layer_loc, this will produce a series of plots comparing the results for that selection in terms of their sparsity and fraction of variance left unexplained.
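For readers unfamiliar with the two quantities on those plots, the sketch below shows one common way to compute them. It is an assumption about the definitions rather than a copy of what interpret.py does: fraction of variance unexplained is taken as the residual sum of squares divided by the total sum of squares around the mean activation, and sparsity is taken as the average number of active features per input. The function names are hypothetical.

```python
import numpy as np

def fraction_variance_unexplained(x, x_hat):
    """FVU = residual sum of squares / total sum of squares around the mean."""
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0, keepdims=True)) ** 2).sum()
    return residual / total

def mean_active_features(codes, threshold=0.0):
    """Average count of features above threshold per input (an L0-style sparsity)."""
    return (codes > threshold).sum(axis=-1).mean()

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# A perfect reconstruction leaves nothing unexplained.
print(fraction_variance_unexplained(x, x))  # 0.0

# Always predicting the mean activation explains none of the variance.
mean_recon = np.tile(x.mean(axis=0), (3, 1))
print(fraction_variance_unexplained(x, mean_recon))  # 1.0

# Two inputs with 1 and 2 active features respectively.
codes = np.array([[0.0, 0.5, 0.0], [1.0, 0.2, 0.0]])
print(mean_active_features(codes))  # 1.5
```

Lower is better on both axes: a good dictionary reconstructs activations well (low FVU) while using few features per input.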
If you'd like to train your own sparse autoencoders, we recommend using the sparse_autoencoder library, which is currently under development and should be easier to use and keep up with best practices as they develop.