
Activation Engineering

Last edit: 29 Aug 2023 3:05 UTC by David Udell

Activation engineering is the direct manipulation of activation vectors inside a trained machine learning model. It is a potential way to steer a model’s behavior.

Activation engineering can be contrasted with other strategies for steering models: fine-tuning the model toward desired behavior, and crafting prompts that elicit a particular response.
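The core idea can be shown with a toy example. The sketch below is purely illustrative (the two-layer "model", the inputs, and the scale factor are all made up, not taken from any of the posts listed here): record the hidden activation on a contrastive pair of inputs, take their difference as a steering vector, and add it to the hidden state during a later forward pass.

```python
# Minimal sketch of activation steering ("activation addition") on a toy
# two-layer network. All numbers and layer definitions are illustrative.

def layer1(x):
    # Toy first layer: a fixed linear map producing the hidden activation.
    return [2 * x[0] + x[1], x[0] - x[1]]

def layer2(h):
    # Toy second layer: collapse the hidden state to a single output.
    return h[0] + 3 * h[1]

def forward(x, steering_vector=None, scale=1.0):
    h = layer1(x)
    if steering_vector is not None:
        # The intervention: add a vector directly to the hidden activation.
        h = [hi + scale * si for hi, si in zip(h, steering_vector)]
    return layer2(h)

# Build a steering vector as the difference between the hidden activations
# on a "positive" and a "negative" input (a contrastive pair).
h_pos = layer1([1.0, 0.0])
h_neg = layer1([0.0, 1.0])
steering = [p - n for p, n in zip(h_pos, h_neg)]

baseline = forward([0.5, 0.5])
steered = forward([0.5, 0.5], steering_vector=steering, scale=0.5)
print(baseline, steered)
```

In a real transformer the same intervention is typically done with a forward hook on a residual-stream layer, with the contrastive pair being two prompts rather than two toy vectors; several of the posts below describe variants of exactly this recipe.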

Steering GPT-2-XL by adding an activation vector

13 May 2023 18:42 UTC
418 points
97 comments · 50 min read · LW link

Modulating sycophancy in an RLHF model via activation steering

Nina Rimsky · 9 Aug 2023 7:06 UTC
65 points
20 comments · 12 min read · LW link

Reducing sycophancy and improving honesty via activation steering

Nina Rimsky · 28 Jul 2023 2:46 UTC
116 points
14 comments · 9 min read · LW link

Extracting and Evaluating Causal Direction in LLMs’ Activations

14 Dec 2022 14:33 UTC
29 points
5 comments · 11 min read · LW link

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

31 Mar 2023 19:20 UTC
101 points
17 comments · 11 min read · LW link

Evaluating hidden directions on the utility dataset: classification, steering and removal

25 Sep 2023 17:19 UTC
25 points
3 comments · 7 min read · LW link

ActAdd: Steering Language Models without Optimization

6 Sep 2023 17:21 UTC
105 points
3 comments · 2 min read · LW link
(arxiv.org)

Understanding Counterbalanced Subtractions for Better Activation Additions

ojorgensen · 17 Aug 2023 13:53 UTC
21 points
0 comments · 14 min read · LW link

Decoding intermediate activations in llama-2-7b

Nina Rimsky · 21 Jul 2023 5:35 UTC
35 points
3 comments · 4 min read · LW link

Paper: Understanding and Controlling a Maze-Solving Policy Network

13 Oct 2023 1:38 UTC
69 points
0 comments · 1 min read · LW link
(arxiv.org)

Implementing activation steering

Annah · 5 Feb 2024 17:51 UTC
59 points
5 comments · 7 min read · LW link

[Question] What’s the theory of impact for activation vectors?

Chris_Leong · 11 Feb 2024 7:34 UTC
57 points
12 comments · 1 min read · LW link

Comparing representation vectors between llama 2 base and chat

Nina Rimsky · 28 Oct 2023 22:54 UTC
35 points
5 comments · 2 min read · LW link

Activation additions in a simple MNIST network

Garrett Baker · 18 May 2023 2:49 UTC
26 points
0 comments · 2 min read · LW link

Open problems in activation engineering

24 Jul 2023 19:46 UTC
43 points
2 comments · 1 min read · LW link
(coda.io)

Activation additions in a small residual network

Garrett Baker · 22 May 2023 20:28 UTC
22 points
4 comments · 3 min read · LW link

Red-teaming language models via activation engineering

Nina Rimsky · 26 Aug 2023 5:52 UTC
65 points
6 comments · 9 min read · LW link

Understanding and visualizing sycophancy datasets

Nina Rimsky · 16 Aug 2023 5:34 UTC
45 points
0 comments · 6 min read · LW link

Sparse Coding, for Mechanistic Interpretability and Activation Engineering

David Udell · 23 Sep 2023 19:16 UTC
42 points
7 comments · 34 min read · LW link

Understanding and controlling a maze-solving policy network

11 Mar 2023 18:59 UTC
312 points
22 comments · 23 min read · LW link

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

likenneth · 11 Jun 2023 5:38 UTC
195 points
4 comments · 1 min read · LW link
(arxiv.org)

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
119 points
29 comments · 8 min read · LW link
(arxiv.org)

[ASoT] GPT2 Steering & The Tuned Lens

Ulisse Mini · 1 Jul 2023 14:12 UTC
23 points
0 comments · 2 min read · LW link

Classifying representations of sparse autoencoders (SAEs)

Annah · 17 Nov 2023 13:54 UTC
15 points
6 comments · 2 min read · LW link

Features and Adversaries in MemoryDT

20 Oct 2023 7:32 UTC
30 points
6 comments · 25 min read · LW link

Auto-matching hidden layers in Pytorch LLMs

chanind · 19 Feb 2024 12:40 UTC
2 points
0 comments · 3 min read · LW link

Investigating Bias Representations in LLMs via Activation Steering

DawnLu · 15 Jan 2024 19:39 UTC
29 points
4 comments · 5 min read · LW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley · 5 Jan 2024 8:46 UTC
35 points
4 comments · 2 min read · LW link

How well do truth probes generalise?

mishajw · 24 Feb 2024 14:12 UTC
86 points
11 comments · 9 min read · LW link