
Feature Request: Transformer Debugger - Debugging and controlling the behavior of transformer-based LLM models. #1513

Open
abhaskumarsinha opened this issue Mar 14, 2024 · 3 comments
Labels
type:feature New feature or request

Comments

@abhaskumarsinha

abhaskumarsinha commented Mar 14, 2024

Short Description

Transformer Debugger (TDB) is a tool developed by OpenAI's Superalignment team with the goal of supporting investigations into specific behaviors of small language models. The tool combines automated interpretability techniques with sparse autoencoders.

TDB enables rapid exploration before needing to write code, with the ability to intervene in the forward pass and see how it affects a particular behavior. It can be used to answer questions like, "Why does the model output token A instead of token B for this prompt?" or "Why does attention head H attend to token T for this prompt?" It does so by identifying specific components (neurons, attention heads, autoencoder latents) that contribute to the behavior, showing automatically generated explanations of what causes those components to activate most strongly, and tracing connections between components to help discover circuits.
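
To make the "token A vs. token B" framing concrete, below is a minimal, self-contained toy sketch in plain NumPy. It is not TDB itself and is not tied to any KerasNLP API; the one-layer attention model, shapes, and random weights are invented purely for illustration. It shows the basic idea of intervening in the forward pass (zero-ablating one attention head) and measuring how the logit difference between two candidate tokens changes.

```python
# Toy illustration (not TDB): attribute the "A vs. B" preference of a tiny
# one-layer attention model to individual heads by zero-ablating each head
# and measuring the change in the logit difference. All weights are random.
import numpy as np

rng = np.random.default_rng(0)

SEQ, D_MODEL, N_HEADS, VOCAB = 4, 8, 2, 10
D_HEAD = D_MODEL // N_HEADS

# Random "trained" weights for a single attention layer plus unembedding.
W_q = rng.normal(size=(N_HEADS, D_MODEL, D_HEAD))
W_k = rng.normal(size=(N_HEADS, D_MODEL, D_HEAD))
W_v = rng.normal(size=(N_HEADS, D_MODEL, D_HEAD))
W_o = rng.normal(size=(N_HEADS, D_HEAD, D_MODEL))
W_unembed = rng.normal(size=(D_MODEL, VOCAB))

x = rng.normal(size=(SEQ, D_MODEL))  # residual stream for a toy "prompt"


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def forward(x, ablate_head=None):
    """One attention layer + unembedding; optionally zero-ablate one head."""
    resid = x.copy()
    for h in range(N_HEADS):
        if h == ablate_head:
            continue  # intervention: drop this head's contribution entirely
        q, k, v = x @ W_q[h], x @ W_k[h], x @ W_v[h]
        attn = softmax(q @ k.T / np.sqrt(D_HEAD))
        resid = resid + attn @ v @ W_o[h]
    return resid[-1] @ W_unembed  # logits at the final position


TOKEN_A, TOKEN_B = 3, 7
base = forward(x)
print("baseline logit diff (A - B):", base[TOKEN_A] - base[TOKEN_B])

# Ablate each head in turn and see how much of the A-vs-B preference it
# accounted for (a crude stand-in for the component-level attributions
# that TDB surfaces automatically).
for h in range(N_HEADS):
    ablated = forward(x, ablate_head=h)
    effect = (base[TOKEN_A] - base[TOKEN_B]) - (ablated[TOKEN_A] - ablated[TOKEN_B])
    print(f"head {h}: effect on logit diff = {effect:+.3f}")
```

In a real model, TDB additionally attaches automatically generated explanations to the components it identifies; the sketch above only captures the ablate-and-compare step.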

Paper
https://arxiv.org/pdf/2211.00593v1.pdf

Existing Implementations

Other Information
This tool could be a very useful guide for people working on the interpretability of LLMs. There are already a lot of LLMs in KerasNLP, and engineers might find it very helpful when deploying them to ensure the safety, reliability, interpretability, and controllability of the models available here.

@SamanehSaadat
Member

Hi @abhaskumarsinha

Thanks for the suggestion! How do you envision Transformer Debugger being incorporated into KerasNLP? Does it require integration with a tool in our library, or do we just need to create a guide?

@SamanehSaadat
Member

Note that there is an ongoing effort to integrate the Learning Interpretability Tool (LIT) with KerasNLP. #1521 is an example of adding the .score() function for interpretability use cases.

@abhaskumarsinha
Author

Hello @SamanehSaadat

> Thanks for the suggestion! How do you envision Transformer Debugger being incorporated into KerasNLP? Does it require integration with a tool in our library, or do we just need to create a guide?

I believe we should reserve a whole directory for interpretability tools here: https://github.com/keras-team/keras-nlp/tree/master/keras_nlp. We would need to incorporate the whole thing, but that's a time-consuming goal.

Here's a one-minute video on how that works: https://www.youtube.com/watch?v=5D_GiJv7O-M

> Note that there is an ongoing effort to integrate the Learning Interpretability Tool (LIT) with KerasNLP. #1521 is an example of adding the .score() function for interpretability use cases.

Thank you for pointing that out. I'm not an expert here, but LIT sounds like a very general approach, while TDB is specific to LLMs. TDB could be a lengthy feature to implement, so I'm happy to contribute if there is any need.
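
For context on the kind of per-token scoring that interpretability tooling (LIT, or a .score()-style API as in #1521) builds on, here is a rough sketch that computes per-token log-probabilities by calling a KerasNLP causal LM directly. It does not use the API from #1521; it assumes Keras 3 with the TensorFlow backend, and the preset name and sequence length are illustrative choices, not part of that PR.

```python
# Sketch: per-token log-probability scores from a KerasNLP causal LM.
# Does NOT use the #1521 .score() API; it calls the model directly.
# Assumes Keras 3 with the TensorFlow backend.
import keras
import keras_nlp

preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en", sequence_length=32
)
model = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=None  # we preprocess manually below
)

# The preprocessor packs (inputs, labels, padding mask): the labels are the
# inputs shifted by one position, which is exactly what we want to score.
x, y, sample_weight = preprocessor(["The capital of France is Paris"])

logits = model(x)                                   # (batch, seq_len, vocab)
log_probs = keras.ops.log_softmax(logits, axis=-1)

# Log-probability the model assigned to each actual next token.
target_log_probs = keras.ops.take_along_axis(
    log_probs, keras.ops.expand_dims(y, axis=-1), axis=-1
)[..., 0]

# Zero out padding positions so only real tokens contribute to the scores.
scores = target_log_probs * keras.ops.cast(sample_weight, "float32")
print(scores)
```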
