Activation Engineering

Tag. Last edit: 29 Aug 2023 3:05 UTC by David Udell

Activation engineering is the direct manipulation of activation vectors inside a trained machine learning model. It is a potential way to steer a model’s behavior.

Activation engineering can be contrasted with two other strategies for steering models: fine-tuning a model toward the desired behavior, and crafting prompts that elicit a particular response.
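To make the contrast concrete, here is a minimal sketch of one form of activation engineering, adding a steering vector to a hidden layer. It assumes a toy two-layer network with hand-picked weights; the names `W1`, `W2`, `hidden`, `forward`, `steering_vector`, and `coeff` are hypothetical and are not taken from any of the posts listed below. The steering vector is built from the difference between the hidden activations of two contrasting inputs, which is one common choice rather than the only one.

```python
# Toy illustration only: a tiny fixed-weight network, not a real language model.
# All names and weights here are made up for the example.

def relu(xs):
    return [max(0.0, x) for x in xs]

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # input (2) -> hidden (3)
W2 = [[1.0, -1.0, 0.5]]                     # hidden (3) -> output (1)

def hidden(x):
    # Hidden-layer activations for input x.
    return relu(matvec(W1, x))

def forward(x, steering=None, coeff=1.0):
    # Run the network; optionally add coeff * steering to the hidden activations.
    h = hidden(x)
    if steering is not None:
        h = [hi + coeff * si for hi, si in zip(h, steering)]
    return matvec(W2, h)[0]

# Build a steering vector from a contrast pair of inputs.
h_pos = hidden([1.0, 0.0])
h_neg = hidden([0.0, 1.0])
steering_vector = [p - n for p, n in zip(h_pos, h_neg)]

x = [0.5, 0.5]
baseline = forward(x)                               # unmodified forward pass
steered = forward(x, steering_vector, coeff=2.0)    # same input, same weights
print(baseline, steered)
```

The weights `W1` and `W2` are never updated (unlike fine-tuning) and the input `x` is the same in both calls (unlike prompting); only the intermediate activations differ.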

Modulating sycophancy in an RLHF model via activation steering

Nina Rimsky, 9 Aug 2023 7:06 UTC
52 points
11 comments, 12 min read, LW link

Extracting and Evaluating Causal Direction in LLMs’ Activations

14 Dec 2022 14:33 UTC
29 points
5 comments, 11 min read, LW link

Steering GPT-2-XL by adding an activation vector

13 May 2023 18:42 UTC
387 points
94 comments, 50 min read, LW link

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

31 Mar 2023 19:20 UTC
101 points
17 comments, 11 min read, LW link

Reducing sycophancy and improving honesty via activation steering

Nina Rimsky, 28 Jul 2023 2:46 UTC
109 points
14 comments, 9 min read, LW link

ActAdd: Steering Language Models without Optimization

6 Sep 2023 17:21 UTC
91 points
3 comments, 2 min read, LW link

Understanding Counterbalanced Subtractions for Better Activation Additions

ojorgensen, 17 Aug 2023 13:53 UTC
21 points
0 comments, 14 min read, LW link

[ASoT] GPT2 Steering & The Tuned Lens

Ulisse Mini, 1 Jul 2023 14:12 UTC
23 points
0 comments, 2 min read, LW link

Red-teaming language models via activation engineering

Nina Rimsky, 26 Aug 2023 5:52 UTC
59 points
5 comments, 9 min read, LW link

Activation additions in a simple MNIST network

Garrett Baker, 18 May 2023 2:49 UTC
26 points
0 comments, 2 min read, LW link

Understanding and visualizing sycophancy datasets

Nina Rimsky, 16 Aug 2023 5:34 UTC
45 points
0 comments, 6 min read, LW link

Activation additions in a small residual network

Garrett Baker, 22 May 2023 20:28 UTC
22 points
4 comments, 3 min read, LW link

Open problems in activation engineering

24 Jul 2023 19:46 UTC
39 points
2 comments, 1 min read, LW link

Decoding intermediate activations in llama-2-7b

Nina Rimsky, 21 Jul 2023 5:35 UTC
34 points
0 comments, 4 min read, LW link