Style Vectors for Steering Generative Large Language Models

This is the code for the paper "Style Vectors for Steering Generative Large Language Models", published in Findings of the Association for Computational Linguistics: EACL 2024.

Authors: Kai Konen, Sophie Jentzsch, Diaoulé Diallo, Peer Schütt, Oliver Bensch, Roxanne El Baff, Dominik Opitz, and Tobias Hecking

Description

This research explores strategies for steering the output of large language models (LLMs) towards specific styles, such as sentiment, emotion, or writing style, by adding style vectors to the activations of hidden layers during text generation. We show that style vectors can be simply computed from recorded layer activations for input texts in a specific style in contrast to more complex training-based approaches. Through a series of experiments, we demonstrate the effectiveness of activation engineering using such style vectors to influence the style of generated text in a nuanced and parameterisable way, distinguishing it from prompt engineering. The presented research constitutes a significant step towards developing more adaptive and effective AI-empowered interactive systems.

Figures: "Steering the LLM with our vectors" and "Getting the vectors from the LLM".

Installation

All required packages are listed in requirements.txt. We recommend setting up an Anaconda environment with these packages:

conda create --name steering python=3.8.8 --channel conda-forge
conda activate steering
conda install pip # make sure pip is installed
pip install -r requirements.txt
pip install transformers==4.30.0
conda install conda-forge::python-dotenv
pip install -e . # install the package itself - see setup.py
pip install -U accelerate
pip install seaborn==0.12.2
pip install openpyxl

Datasets

We use three different datasets:

Yelp Review Dataset: https://github.com/shentianxiao/language-style-transfer - yelp
Shakespeare Dataset: https://github.com/harsh19/Shakespearizing-Modern-English.git - shakes
GoEmotion Dataset: https://huggingface.co/datasets/go_emotions - GoEmo

They are processed and loaded using dataset_loader.py.

Yelp: We removed duplicates from the dataset because we wanted steering vectors for as many different target sentences as possible. This is done in dataset_loader.py.

GoEmotion: To base the analyses on a stronger theoretical foundation, only 5k samples that could be unambiguously mapped to the six established basic emotion categories proposed by Ekman were used. To do this, we load all values and immediately filter the dataframe using the function goemo_get_only_ekman.
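
One possible reading of this filtering step, as a minimal sketch: keep only samples that carry exactly one label and whose label is one of Ekman's six basic emotions. The function name keep_unambiguous_ekman and the sample layout (dicts with "text" and "labels" keys) are hypothetical; goemo_get_only_ekman in dataset_loader.py may apply a different rule.

```python
from typing import Dict, List

# Ekman's six basic emotions; all six also occur as label names in GoEmotions.
EKMAN_EMOTIONS = {"anger", "disgust", "fear", "joy", "sadness", "surprise"}


def keep_unambiguous_ekman(samples: List[Dict]) -> List[Dict]:
    """Keep samples with exactly one label that is a basic Ekman emotion."""
    kept = []
    for sample in samples:
        labels = sample["labels"]
        # Multi-label samples and non-Ekman labels are treated as ambiguous.
        if len(labels) == 1 and labels[0] in EKMAN_EMOTIONS:
            kept.append(sample)
    return kept
```

Under this rule, a sample labelled both "joy" and "anger" is dropped, as is a sample labelled only "admiration".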

Environment variables

The paths to the different folders for all kinds of vectors and datasets are defined in the .env file and loaded during script execution.
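
Before starting a long run, it can help to check that the variables are actually set. The sketch below is illustrative and not part of the repository: the variable names are the ones mentioned in this README, the helper find_missing_variables is hypothetical, and it assumes the .env file has already been loaded into the process environment (for example with python-dotenv, which is installed above).

```python
import os
from typing import List, Mapping

# Variable names taken from this README; the check itself is illustrative.
REQUIRED_VARIABLES = (
    "ALPACA_WEIGHTS_FOLDER",
    "TRAINED_STEERING_VECTOR_PATH",
    "PATH_TO_ACTIVATION_STORAGE",
)


def find_missing_variables(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


if __name__ == "__main__":
    missing = find_missing_variables(os.environ)
    if missing:
        print("Missing in .env:", ", ".join(missing))
```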

Preparing the LLM

In our experiments, we used the Alpaca 7B model. You need to download the weights yourself and save them locally, then set ALPACA_WEIGHTS_FOLDER in the .env file to that folder.

Training Steering Vectors

A steering vector can be trained so that it makes the model output only the specified tokens/sentence (see Sec. 3.1 of the paper). There is one training script per dataset; the scripts can be found in scripts/training:

Usage:

conda activate steering
python scripts/training/train_training_based_vectors_yelp.py

You can define for which layers you want to train steering vectors, by modifying INSERTION_LAYERS.

After training, the steering vectors are saved in TRAINED_STEERING_VECTOR_PATH, which is defined in .env.

The optimization procedure is time- and compute-intensive. On our usual setup (NVIDIA Quadro GV100 with 32GB) we were only able to train 470 vectors in 100 hours.

Extracting Activation Vectors

Extracting and saving the hidden layer activations (see Sec. 3.2) can be done using get_hidden_activations.py:

conda activate steering
python scripts/training/get_hidden_activations.py

You have to define the name of the dataset in the DATASET_NAME variable. The activations will then be stored at PATH_TO_ACTIVATION_STORAGE.

Please keep in mind that storing the activations of all layers for all entries in a dataset can take a couple of hours and produces several hundred GB of .pkl files. For the Yelp dataset, our largest one, this process used ~334 GB of disk space.
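
A back-of-the-envelope estimate can help when planning disk space. The sketch assumes one float32 vector of the model's hidden size per layer and per sample, and uses the LLaMA-7B dimensions (32 decoder layers, hidden size 4096) that Alpaca 7B is built on. What get_hidden_activations.py actually writes to the .pkl files may differ, so treat the result as an order-of-magnitude guide only; the function name is hypothetical.

```python
def estimate_storage_gb(num_samples: int,
                        num_layers: int = 32,
                        hidden_size: int = 4096,
                        bytes_per_value: int = 4) -> float:
    """Rough size in GB (10^9 bytes) of one vector per layer per sample."""
    total_bytes = num_samples * num_layers * hidden_size * bytes_per_value
    return total_bytes / 1e9
```

Under these assumptions, 1,000 samples take roughly 0.5 GB before any pickle overhead.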

Steering Text Generation with Style Vectors

Once we have trained the vectors or extracted the activations, we can calculate the style vectors from them and add them to the LLM to guide the model's output. For example, when prompting the model to write a review about a restaurant, we can add "positive" style vectors (SVs) to generate a more positive review.
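
As a minimal sketch of the idea for a single layer, assuming a difference-of-means formulation (mean activation of the target style minus the mean activation of all other samples) and a scalar strength factor: the function names, the plain-list vectors, and the strength parameter are illustrative and are not the repository's API.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def style_vector(activations: Dict[str, List[Vector]], target: str) -> Vector:
    """Mean activation of `target` minus mean activation of all other styles."""
    others = [
        vec
        for style, vecs in activations.items()
        if style != target
        for vec in vecs
    ]
    target_mean = mean_vector(activations[target])
    other_mean = mean_vector(others)
    return [t - o for t, o in zip(target_mean, other_mean)]


def steer(hidden_state: Vector, vector: Vector, strength: float) -> Vector:
    """Add the scaled style vector to one layer's hidden state."""
    return [h + strength * v for h, v in zip(hidden_state, vector)]
```

In this sketch, a larger strength pushes the hidden state further towards the target style, and a strength of 0 leaves the model unchanged.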

By default, we steer the model's output when it is prompted with the factual and subjective sentences specified in the folder evaluation_prompts. See Sec. 4.4 in our paper for an analysis of this process and Appendix A for a full list of the sentences.

We have scripts for all datasets and steering with training-based or activation-based style vectors:

GoEmotions: steering_go_emo.py (this script is the best documented steering script)
Yelp: steering_yelp.py
Shakespeare: steering_shakes_activations.py

Usage:

conda activate steering
python scripts/generation/steering_go_emo.py

In each of the scripts, you have to choose one of the three methods ["training_based", "activation_based_fair", "activation_based_all"]:

training_based: Use the trained steering vectors to calculate the style vectors
activation_based_fair: Use the activation vectors, for which we have corresponding trained vectors, to calculate the style vectors
activation_based_all: Use all activation vectors to calculate the style vectors (RECOMMENDED)

You have to set your preferred method with the SETTING variable.
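
The difference between the three settings can be sketched as follows. This is an illustration under assumptions, not the scripts' actual code: it supposes that trained vectors and activation vectors are both stored in dicts keyed by a shared sample id, and select_vectors is a hypothetical helper.

```python
from typing import Dict, List

Vector = List[float]


def select_vectors(setting: str,
                   trained: Dict[str, Vector],
                   activations: Dict[str, Vector]) -> Dict[str, Vector]:
    """Pick the vectors used to build style vectors for a given SETTING."""
    if setting == "training_based":
        return dict(trained)
    if setting == "activation_based_fair":
        # Only activations whose sample also has a trained steering vector.
        return {key: vec for key, vec in activations.items() if key in trained}
    if setting == "activation_based_all":
        return dict(activations)
    raise ValueError(f"Unknown SETTING: {setting!r}")
```

The "fair" setting therefore compares both approaches on the same samples, while "all" uses every available activation.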

In these scripts, you also need to define to which layers the style vectors should be added. This is done by changing INSERTION_LAYERS to your preferred layers. Please keep in mind that for the training-based style vectors you need to train the vectors for the specific layers beforehand. This isn't necessary for the activations, because get_hidden_activations.py extracted the activations for all layers already.

The resulting csv files can be found at scripts/evaluation/results/. We also provide helper scripts that save the csv files as xlsx files: run either csv_to_excel_shakes.py or csv_to_excel_yelp.py.

Generating the plots

To generate the paper figures (see Figs. 4, 5, and 10-13) from the previously generated csv files WITHOUT PROMPTING BASELINE, use the scripts in scripts/evaluation:

GoEmotions: plotting_goemo.py
Yelp: plotting_yelp.py
Shakespeare: plotting_shakes.py

You have to change SETTING to your preferred method.

To generate the paper figures (see Figs. 4, 5, and 10-13) from the previously generated csv files WITH PROMPTING BASELINE, use the scripts in scripts/evaluation:

GoEmotions: get_prompt_eval_Go.py
Yelp: get_prompt_eval_yelp.py

You have to change SETTINGS to your preferred methods.

Probing Study / Sentiment Classification

To generate the ROC plots (Figs. 3 and 6-9) for the probing study (Sec. 4.3), we provide one script per dataset:

Usage:

conda activate steering
python scripts/probing_study/probing_study_goemo.py

In the scripts you have to define the setting you want to evaluate. The keywords here are VECTOR_TYPE and COMPARISON_TYPE. The three combinations are:

VECTOR_TYPE == "training_based": Create the ROC plots for the trained steering vectors
VECTOR_TYPE == "activations" and COMPARISON_TYPE == "fair": Create the ROC plots for the activation vectors for which a trained steering vector exists
VECTOR_TYPE == "activations" and COMPARISON_TYPE == "all": Use all activation vectors to create the ROC plot (can take up to an hour to compute)

In the case of "all" activations, we do not use every vector for the Yelp Review dataset but subsample to 10k activation vectors, because loading all of them at once exhausted our memory. For Shakespeare and GoEmotion this isn't necessary, because they are smaller datasets.
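
A reproducible subsample of this kind can be drawn as in the sketch below. How the probing scripts actually pick the 10k vectors (seed, balance across classes) is not stated here, so the seed, the uniform sampling, and the function name subsample are assumptions.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(items: Sequence[T], max_items: int = 10_000, seed: int = 0) -> List[T]:
    """Return at most `max_items` randomly chosen items, reproducibly."""
    if len(items) <= max_items:
        return list(items)
    # A dedicated Random instance keeps the global random state untouched.
    return random.Random(seed).sample(items, max_items)
```

To keep memory usage low, it is better to subsample the list of file paths or sample ids first and load only the selected vectors afterwards.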

Citation

  @inproceedings{konen-etal-2024-style,
    title = "Style Vectors for Steering Generative Large Language Models",
    author = {Konen, Kai  and
      Jentzsch, Sophie  and
      Diallo, Diaoul{\'e}  and
      Sch{\"u}tt, Peer  and
      Bensch, Oliver  and
      El Baff, Roxanne  and
      Opitz, Dominik  and
      Hecking, Tobias},
    editor = "Graham, Yvette  and
      Purver, Matthew",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2024",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-eacl.52",
    pages = "782--802",
  }
