Open problems in activation engineering

TurnTrout, woog, lisathiergart, Monte M and Ulisse Mini

24 Jul 2023 19:46 UTC

LW: 43 AF: 24

AI, Open Problems, Interpretability (ML & AI), Activation Engineering

"Steering GPT-2-XL by adding an activation vector" introduced:

"activation engineering… techniques which steer models by modifying their activations. As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime."

These results were recently complemented by "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model", which doubled TruthfulQA performance by adding a similarly computed activation vector to forward passes!
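
To make "adding an activation vector to forward passes" concrete, here is a minimal sketch of the idea, not the code from either piece of work. It assumes a toy NumPy stand-in for a transformer with a single residual-stream vector per input (real models have one per token position), and the names `forward`, `steering_vector`, `LAYER`, and `coeff` are all hypothetical. The steering vector is the scaled difference between the residual streams of two contrasting inputs at one layer, and it is added back at that same layer when running a new input.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_LAYERS = 16, 4

# Toy stand-in for a transformer (an assumption, not a real model):
# each "layer" adds a nonlinear update to a single residual-stream vector.
WEIGHTS = [rng.normal(scale=0.3, size=(D_MODEL, D_MODEL)) for _ in range(N_LAYERS)]


def forward(x, inject_layer=None, steering=None):
    """Run the toy model and cache the residual stream entering each layer.

    If `inject_layer` and `steering` are given, `steering` is added to the
    residual stream just before that layer runs.
    """
    resid = np.array(x, dtype=float)
    cache = []
    for i, w in enumerate(WEIGHTS):
        if steering is not None and i == inject_layer:
            resid = resid + steering
        cache.append(resid.copy())
        resid = resid + np.tanh(resid @ w)
    return resid, cache


def steering_vector(x_plus, x_minus, layer, coeff):
    """Scaled difference of the residual streams of two contrasting inputs."""
    _, cache_plus = forward(x_plus)
    _, cache_minus = forward(x_minus)
    return coeff * (cache_plus[layer] - cache_minus[layer])


LAYER = 2
x_plus = rng.normal(size=D_MODEL)   # stands in for one prompt of a contrast pair
x_minus = rng.normal(size=D_MODEL)  # stands in for the opposite prompt
x = rng.normal(size=D_MODEL)        # the input we want to steer

vec = steering_vector(x_plus, x_minus, LAYER, coeff=5.0)
base_out, base_cache = forward(x)
steered_out, steered_cache = forward(x, inject_layer=LAYER, steering=vec)

print("shift in final output:", np.linalg.norm(steered_out - base_out))
```

With a real model, the same pattern would presumably be implemented with forward hooks on the chosen layer; the only overhead at runtime is one extra vector addition.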

We think that activation engineering has a bunch of low-hanging fruit for steering and understanding models. A few open problems from the list:

Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture something meaningful?
Take a circuit studied in the existing literature on GPT-2, or find another one using ACDC. Targeting the nodes in these circuits, can you learn anything more about them and, more generally, about how activation additions interact with circuits?
What’s the mechanism by which adding a steering vector with too large a coefficient breaks the model? (Credit: Thomas Kwa; see also @Ulisse Mini’s initial data/explanation.)
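
As a starting point for the first problem above, here is a minimal sketch of the PCA step, assuming the residual stream activations have already been collected into an array of shape `(n_samples, d_model)`. The data here is synthetic, with one planted high-variance direction, so it says nothing about what PCA finds in a real model. The helper names `principal_directions` and `add_direction` are hypothetical.

```python
import numpy as np


def principal_directions(acts, k):
    """Return the top-k principal directions (unit rows) of `acts`
    and the fraction of variance each one explains."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions, sorted by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s**2 / (acts.shape[0] - 1)
    return vt[:k], (variance / variance.sum())[:k]


def add_direction(resid, direction, coeff):
    """Activation addition along a single direction."""
    return resid + coeff * direction


# Synthetic "activations": small isotropic noise plus one planted direction.
rng = np.random.default_rng(0)
n_samples, d_model = 500, 32
planted = rng.normal(size=d_model)
planted /= np.linalg.norm(planted)
acts = rng.normal(scale=0.1, size=(n_samples, d_model))
acts += rng.normal(scale=3.0, size=(n_samples, 1)) * planted

dirs, ratio = principal_directions(acts, k=3)
steered = add_direction(acts[0], dirs[0], coeff=4.0)

print("variance explained by top direction:", ratio[0])
print("|cosine| with planted direction:", abs(dirs[0] @ planted))
```

Because each direction has unit norm, `coeff` is directly comparable to the norm of the residual stream, which may be a convenient axis for sweeping the coefficient when probing the "too large a coefficient" question.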

If you want to work on activation engineering, come by the Slack server to coordinate research projects and propose new ideas.

What links here?

My Trial Period as an Independent Alignment Researcher by Bart Bussmann (8 Aug 2023 14:16 UTC; 32 points)

Hoagy 28 Jul 2023 3:08 UTC
LW: 5 AF: 4
"Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture something meaningful?"
It’s not PCA, but we’ve been using sparse coding to find important directions in activation space (see original sparse coding post, quantitative results, qualitative results).
We’ve found that these directions are, on average, more interpretable than neurons. I understand that @Logan Riggs and Julie Steele have also found some effect when using them as directions for activation patching, e.g. using a “this direction activates on curse words” direction to make text more aggressive. If people are interested in exploring this further, let me know, say hi at our EleutherAI channel, or check out the repo :)
TurnTrout 14 Sep 2023 17:32 UTC
LW: 2 AF: 2
(The original post was supposed to also have @Monte M as a coauthor; fixed my oversight.)