RSS

MATS Program

TagLast edit: 21 Oct 2023 17:46 UTC by Ryan Kidd

The ML Alignment & Theory Scholars (MATS) Program is an educational seminar and independent research program that aims to provide talented scholars with talks, workshops, and research mentorship in the field of AI alignment, and connect them with the Berkeley AI safety research community.

SERI MATS Pro­gram—Win­ter 2022 Cohort

8 Oct 2022 19:09 UTC
72 points
12 comments4 min readLW link

SolidGoldMag­ikarp (plus, prompt gen­er­a­tion)

5 Feb 2023 22:02 UTC
668 points
205 comments12 min readLW link

Un­der­stand­ing and con­trol­ling a maze-solv­ing policy network

11 Mar 2023 18:59 UTC
313 points
23 comments23 min readLW link

Pro­ject pro­posal: Test­ing the IBP defi­ni­tion of agent

9 Aug 2022 1:09 UTC
21 points
4 comments2 min readLW link

SERI MATS—Sum­mer 2023 Cohort

8 Apr 2023 15:32 UTC
71 points
25 comments4 min readLW link

SERI ML Align­ment The­ory Schol­ars Pro­gram 2022

27 Apr 2022 0:43 UTC
67 points
6 comments3 min readLW link

Soft op­ti­miza­tion makes the value tar­get bigger

Jeremy Gillen2 Jan 2023 16:06 UTC
117 points
20 comments12 min readLW link

How MATS ad­dresses “mass move­ment build­ing” concerns

Ryan Kidd4 May 2023 0:55 UTC
62 points
9 comments3 min readLW link

Finite Fac­tored Sets in Pictures

Magdalena Wache11 Dec 2022 18:49 UTC
174 points
35 comments12 min readLW link

Tak­ing the pa­ram­e­ters which seem to mat­ter and ro­tat­ing them un­til they don’t

Garrett Baker26 Aug 2022 18:26 UTC
120 points
48 comments1 min readLW link

Nor­ma­tive vs De­scrip­tive Models of Agency

mattmacdermott2 Feb 2023 20:28 UTC
26 points
5 comments4 min readLW link

In­fra-Bayesian haggling

hannagabor20 May 2024 12:23 UTC
18 points
0 comments20 min readLW link

Pre­dic­tions for shard the­ory mechanis­tic in­ter­pretabil­ity results

1 Mar 2023 5:16 UTC
105 points
10 comments5 min readLW link

My MATS Sum­mer 2023 experience

James Chua20 Mar 2024 11:26 UTC
28 points
0 comments3 min readLW link
(jameschua.net)

Neu­ral Tan­gent Ker­nel Distillation

5 Oct 2022 18:11 UTC
76 points
20 comments8 min readLW link

Mo­du­lat­ing syco­phancy in an RLHF model via ac­ti­va­tion steering

Nina Rimsky9 Aug 2023 7:06 UTC
67 points
20 comments12 min readLW link

[Paper] AI Sand­bag­ging: Lan­guage Models can Strate­gi­cally Un­der­perform on Evaluations

13 Jun 2024 10:04 UTC
70 points
10 comments2 min readLW link
(arxiv.org)

At­ten­tion SAEs Scale to GPT-2 Small

3 Feb 2024 6:50 UTC
76 points
4 comments8 min readLW link

A dis­til­la­tion of Evan Hub­inger’s train­ing sto­ries (for SERI MATS)

Daphne_W18 Jul 2022 3:38 UTC
15 points
1 comment10 min readLW link

Case Stud­ies in Re­v­erse-Eng­ineer­ing Sparse Au­toen­coder Fea­tures by Us­ing MLP Linearization

14 Jan 2024 2:06 UTC
23 points
0 comments42 min readLW link

Clar­ify­ing mesa-optimization

21 Mar 2023 15:53 UTC
37 points
6 comments10 min readLW link

Ta­lent Needs of Tech­ni­cal AI Safety Teams

24 May 2024 0:36 UTC
109 points
64 comments14 min readLW link

In­for­ma­tion the­o­retic model anal­y­sis may not lend much in­sight, but we may have been do­ing them wrong!

Garrett Baker24 Jul 2022 0:42 UTC
7 points
0 comments10 min readLW link

More find­ings on Me­moriza­tion and dou­ble descent

Marius Hobbhahn1 Feb 2023 18:26 UTC
53 points
2 comments19 min readLW link

More find­ings on max­i­mal data dimension

Marius Hobbhahn2 Feb 2023 18:33 UTC
27 points
1 comment11 min readLW link

Can We Align a Self-Im­prov­ing AGI?

Peter S. Park30 Aug 2022 0:14 UTC
8 points
5 comments11 min readLW link

Race Along Rashomon Ridge

7 Jul 2022 3:20 UTC
50 points
15 comments8 min readLW link

[ASoT] Policy Tra­jec­tory Visualization

Ulisse Mini7 Feb 2023 0:13 UTC
9 points
2 comments1 min readLW link

MATS Sum­mer 2023 Retrospective

1 Dec 2023 23:29 UTC
77 points
34 comments26 min readLW link

Broad Bas­ins and Data Compression

8 Aug 2022 20:33 UTC
33 points
6 comments7 min readLW link

Qual­ities that al­ign­ment men­tors value in ju­nior researchers

Akash14 Feb 2023 23:27 UTC
88 points
14 comments3 min readLW link

Swap and Scale

Stephen Fowler9 Sep 2022 22:41 UTC
17 points
3 comments1 min readLW link

In­ter­ven­ing in the Resi­d­ual Stream

MadHatter22 Feb 2023 6:29 UTC
30 points
1 comment9 min readLW link

Ap­ply for MATS Win­ter 2023-24!

21 Oct 2023 2:27 UTC
104 points
6 comments5 min readLW link

How to Catch an AI Liar: Lie De­tec­tion in Black-Box LLMs by Ask­ing Un­re­lated Questions

28 Sep 2023 18:53 UTC
183 points
37 comments3 min readLW link

What sorts of sys­tems can be de­cep­tive?

Andrei Alexandru31 Oct 2022 22:00 UTC
16 points
0 comments7 min readLW link

My Ad­vice for In­com­ing SERI MATS Scholars

Johannes C. Mayer3 Jan 2023 19:25 UTC
56 points
1 comment4 min readLW link

Au­dit­ing games for high-level interpretability

Paul Colognese1 Nov 2022 10:44 UTC
34 points
1 comment7 min readLW link

Con­tent and Take­aways from SERI MATS Train­ing Pro­gram with John Wentworth

RohanS24 Dec 2022 4:17 UTC
28 points
3 comments12 min readLW link

Sparse Au­toen­coders Work on At­ten­tion Layer Outputs

16 Jan 2024 0:26 UTC
82 points
9 comments18 min readLW link

The Ground Truth Prob­lem (Or, Why Eval­u­at­ing In­ter­pretabil­ity Meth­ods Is Hard)

Jessica Rumbelow17 Nov 2022 11:06 UTC
27 points
2 comments2 min readLW link

Uncer­tainty in all its flavours

Cleo Nardo9 Jan 2024 16:21 UTC
27 points
6 comments35 min readLW link

[ASoT] Reflec­tivity in Nar­row AI

Ulisse Mini21 Nov 2022 0:51 UTC
6 points
1 comment1 min readLW link

Be­havi­oural statis­tics for a maze-solv­ing agent

20 Apr 2023 22:26 UTC
46 points
11 comments10 min readLW link

[Closed] Agent Foun­da­tions track in MATS

Vanessa Kosoy31 Oct 2023 8:12 UTC
54 points
1 comment1 min readLW link
(www.matsprogram.org)

Balanc­ing Se­cu­rity Mind­set with Col­lab­o­ra­tive Re­search: A Proposal

MadHatter1 Nov 2023 0:46 UTC
9 points
3 comments4 min readLW link

MATS AI Safety Strat­egy Curriculum

7 Mar 2024 19:59 UTC
67 points
2 comments16 min readLW link

Mechanis­ti­cally Elic­it­ing La­tent Be­hav­iors in Lan­guage Models

30 Apr 2024 18:51 UTC
191 points
37 comments45 min readLW link

Game The­ory with­out Argmax [Part 2]

Cleo Nardo11 Nov 2023 16:02 UTC
31 points
14 comments13 min readLW link

Con­se­quen­tial­ists: One-Way Pat­tern Traps

David Udell16 Jan 2023 20:48 UTC
54 points
3 comments14 min readLW link

Re­ward hack­ing be­hav­ior can gen­er­al­ize across tasks

28 May 2024 16:33 UTC
73 points
5 comments21 min readLW link

Steer­ing Llama-2 with con­trastive ac­ti­va­tion additions

2 Jan 2024 0:47 UTC
122 points
29 comments8 min readLW link
(arxiv.org)

Con­di­tion­ing Gen­er­a­tive Models for Alignment

Jozdien18 Jul 2022 7:11 UTC
58 points
8 comments20 min readLW link

What Makes an Idea Un­der­stand­able? On Ar­chi­tec­turally and Cul­turally Nat­u­ral Ideas.

16 Aug 2022 2:09 UTC
21 points
2 comments16 min readLW link

Bounded com­plex­ity of solv­ing ELK and its implications

Rubi J. Hudson19 Jul 2022 6:56 UTC
11 points
4 comments18 min readLW link

How com­plex are my­opic imi­ta­tors?

Vivek Hebbar8 Feb 2022 12:00 UTC
26 points
1 comment15 min readLW link

My SERI MATS Application

Daniel Paleka30 May 2022 2:04 UTC
16 points
0 comments8 min readLW link

How (not) to choose a re­search project

9 Aug 2022 0:26 UTC
79 points
11 comments7 min readLW link

Team Shard Sta­tus Report

David Udell9 Aug 2022 5:33 UTC
38 points
8 comments3 min readLW link

Find­ing Skele­tons on Rashomon Ridge

24 Jul 2022 22:31 UTC
30 points
2 comments7 min readLW link

Ex­ter­nal­ized rea­son­ing over­sight: a re­search di­rec­tion for lan­guage model alignment

tamera3 Aug 2022 12:03 UTC
130 points
23 comments6 min readLW link

Trans­lat­ing be­tween La­tent Spaces

30 Jul 2022 3:25 UTC
27 points
2 comments8 min readLW link

Shard The­ory: An Overview

David Udell11 Aug 2022 5:44 UTC
162 points
34 comments10 min readLW link

How Do We Align an AGI Without Get­ting So­cially Eng­ineered? (Hint: Box It)

10 Aug 2022 18:14 UTC
28 points
30 comments11 min readLW link

Iden­ti­fi­ca­tion of Nat­u­ral Modularity

Stephen Fowler25 Jun 2022 15:05 UTC
15 points
3 comments7 min readLW link

How trans­parency changed over time

ViktoriaMalyasova30 Jul 2022 4:36 UTC
21 points
0 comments6 min readLW link

How In­ter­pretabil­ity can be Impactful

Connall Garrod18 Jul 2022 0:06 UTC
18 points
0 comments37 min readLW link

Why you might ex­pect ho­mo­ge­neous take-off: ev­i­dence from ML research

Andrei Alexandru17 Jul 2022 20:31 UTC
24 points
0 comments10 min readLW link

Train­ing goals for large lan­guage models

Johannes Treutlein18 Jul 2022 7:09 UTC
28 points
5 comments19 min readLW link

Notes on Learn­ing the Prior

Spencer Becker-Kahn15 Jul 2022 17:28 UTC
22 points
2 comments25 min readLW link

De­cep­tion?! I ain’t got time for that!

Paul Colognese18 Jul 2022 0:06 UTC
55 points
5 comments13 min readLW link

In­for­ma­tion Loss --> Basin flatness

Vivek Hebbar21 May 2022 12:58 UTC
62 points
31 comments7 min readLW link

[Short ver­sion] In­for­ma­tion Loss --> Basin flatness

Vivek Hebbar21 May 2022 12:59 UTC
12 points
0 comments1 min readLW link

Find­ing Goals in the World Model

22 Aug 2022 18:06 UTC
59 points
8 comments13 min readLW link

The Shard The­ory Align­ment Scheme

David Udell25 Aug 2022 4:52 UTC
47 points
32 comments2 min readLW link

The Core of the Align­ment Prob­lem is...

17 Aug 2022 20:07 UTC
74 points
10 comments9 min readLW link

Mesa-op­ti­miza­tion for goals defined only within a train­ing en­vi­ron­ment is dangerous

Rubi J. Hudson17 Aug 2022 3:56 UTC
6 points
2 comments4 min readLW link

A brief note on Sim­plic­ity Bias

Spencer Becker-Kahn14 Aug 2022 2:05 UTC
19 points
0 comments4 min readLW link

In­ner Align­ment via Superpowers

30 Aug 2022 20:01 UTC
37 points
13 comments4 min readLW link

Be­havi­our Man­i­folds and the Hes­sian of the To­tal Loss—Notes and Criticism

Spencer Becker-Kahn3 Sep 2022 0:15 UTC
35 points
5 comments6 min readLW link

Fram­ing AI Childhoods

David Udell6 Sep 2022 23:40 UTC
37 points
8 comments4 min readLW link

Search­ing for Mo­du­lar­ity in Large Lan­guage Models

8 Sep 2022 2:25 UTC
44 points
3 comments14 min readLW link

Try­ing to find the un­der­ly­ing struc­ture of com­pu­ta­tional systems

Matthias G. Mayer13 Sep 2022 21:16 UTC
17 points
9 comments4 min readLW link

The­o­ret­i­cal Neu­ro­science For Align­ment Theory

Cameron Berg7 Dec 2021 21:50 UTC
62 points
18 comments23 min readLW link

The Nat­u­ral Ab­strac­tion Hy­poth­e­sis: Im­pli­ca­tions and Evidence

CallumMcDougall14 Dec 2021 23:14 UTC
39 points
9 comments19 min readLW link

Mo­ti­va­tions, Nat­u­ral Selec­tion, and Cur­ricu­lum Engineering

Oliver Sourbut16 Dec 2021 1:07 UTC
16 points
0 comments42 min readLW link

Un­der­stand­ing and con­trol­ling auto-in­duced dis­tri­bu­tional shift

L Rudolf L13 Dec 2021 14:59 UTC
33 points
4 comments16 min readLW link

Why I’m Work­ing On Model Ag­nos­tic Interpretability

Jessica Rumbelow11 Nov 2022 9:24 UTC
27 points
9 comments2 min readLW link

A Short Dialogue on the Mean­ing of Re­ward Functions

19 Nov 2022 21:04 UTC
45 points
0 comments3 min readLW link

Guardian AI (Misal­igned sys­tems are all around us.)

Jessica Rumbelow25 Nov 2022 15:55 UTC
15 points
6 comments2 min readLW link

Is the “Valley of Con­fused Ab­strac­tions” real?

jacquesthibs5 Dec 2022 13:36 UTC
19 points
11 comments2 min readLW link

Fore­sight for AGI Safety Strat­egy: Miti­gat­ing Risks and Iden­ti­fy­ing Golden Opportunities

jacquesthibs5 Dec 2022 16:09 UTC
28 points
6 comments8 min readLW link

Work­ing to­wards AI al­ign­ment is better

Johannes C. Mayer9 Dec 2022 15:39 UTC
8 points
2 comments2 min readLW link

Proper scor­ing rules don’t guaran­tee pre­dict­ing fixed points

16 Dec 2022 18:22 UTC
68 points
8 comments21 min readLW link

Get­ting up to Speed on the Speed Prior in 2022

robertzk28 Dec 2022 7:49 UTC
36 points
5 comments65 min readLW link

But is it re­ally in Rome? An in­ves­ti­ga­tion of the ROME model edit­ing technique

jacquesthibs30 Dec 2022 2:40 UTC
103 points
1 comment18 min readLW link

Re­sults from a sur­vey on tool use and work­flows in al­ign­ment research

19 Dec 2022 15:19 UTC
79 points
2 comments19 min readLW link

[Question] How is ARC plan­ning to use ELK?

jacquesthibs15 Dec 2022 20:11 UTC
24 points
5 comments1 min readLW link

Some Notes on the math­e­mat­ics of Toy Au­toen­cod­ing Problems

Spencer Becker-Kahn22 Dec 2022 17:21 UTC
17 points
1 comment12 min readLW link

The Align­ment Problems

Martín Soto12 Jan 2023 22:29 UTC
19 points
0 comments4 min readLW link

Disen­tan­gling Shard The­ory into Atomic Claims

Leon Lang13 Jan 2023 4:23 UTC
85 points
6 comments18 min readLW link

Neu­ral net­works gen­er­al­ize be­cause of this one weird trick

Jesse Hoogland18 Jan 2023 0:10 UTC
167 points
28 comments53 min readLW link
(www.jessehoogland.com)

Ex­per­i­ment Idea: RL Agents Evad­ing Learned Shutdownability

Leon Lang16 Jan 2023 22:46 UTC
31 points
7 comments17 min readLW link
(docs.google.com)

[RFC] Pos­si­ble ways to ex­pand on “Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Su­per­vi­sion”.

25 Jan 2023 19:03 UTC
47 points
6 comments12 min readLW link

Stop-gra­di­ents lead to fixed point predictions

28 Jan 2023 22:47 UTC
36 points
2 comments24 min readLW link

Spooky ac­tion at a dis­tance in the loss landscape

28 Jan 2023 0:22 UTC
61 points
4 comments7 min readLW link
(www.jessehoogland.com)

In­ter­view: Ap­pli­ca­tions w/​ Alice Rigg

jacobhaimes19 Dec 2023 19:03 UTC
12 points
0 comments1 min readLW link
(into-ai-safety.github.io)

Gra­di­ent sur­fing: the hid­den role of regularization

Jesse Hoogland6 Feb 2023 3:50 UTC
37 points
9 comments14 min readLW link
(www.jessehoogland.com)

SolidGoldMag­ikarp II: tech­ni­cal de­tails and more re­cent findings

6 Feb 2023 19:09 UTC
109 points
45 comments13 min readLW link

A cir­cuit for Python doc­strings in a 4-layer at­ten­tion-only transformer

20 Feb 2023 19:35 UTC
92 points
8 comments21 min readLW link

The shal­low re­al­ity of ‘deep learn­ing the­ory’

Jesse Hoogland22 Feb 2023 4:16 UTC
34 points
11 comments3 min readLW link
(www.jessehoogland.com)

A Neu­ral Net­work un­der­go­ing Gra­di­ent-based Train­ing as a Com­plex System

Spencer Becker-Kahn19 Feb 2023 22:08 UTC
22 points
1 comment19 min readLW link

Search­ing for a model’s con­cepts by their shape – a the­o­ret­i­cal framework

23 Feb 2023 20:14 UTC
51 points
0 comments19 min readLW link

Why are coun­ter­fac­tu­als elu­sive?

Martín Soto3 Mar 2023 20:13 UTC
14 points
6 comments2 min readLW link

A mechanis­tic ex­pla­na­tion for SolidGoldMag­ikarp-like to­kens in GPT2

MadHatter26 Feb 2023 1:10 UTC
61 points
14 comments6 min readLW link

[Ap­pendix] Nat­u­ral Ab­strac­tions: Key Claims, The­o­rems, and Critiques

16 Mar 2023 16:38 UTC
48 points
0 comments13 min readLW link

Nat­u­ral Ab­strac­tions: Key claims, The­o­rems, and Critiques

16 Mar 2023 16:37 UTC
206 points
20 comments45 min readLW link

Us­ing PICT against Pas­taGPT Jailbreaking

Quentin FEUILLADE--MONTIXI9 Feb 2023 4:30 UTC
17 points
0 comments9 min readLW link

We In­spected Every Head In GPT-2 Small us­ing SAEs So You Don’t Have To

6 Mar 2024 5:03 UTC
57 points
0 comments12 min readLW link

How im­por­tant is AI hack­ing as LLMs ad­vance?

Artyom Karpov29 Jan 2024 18:41 UTC
1 point
0 comments6 min readLW link

Un­der­stand­ing SAE Fea­tures with the Logit Lens

11 Mar 2024 0:16 UTC
55 points
0 comments14 min readLW link

Im­ple­ment­ing ac­ti­va­tion steering

Annah5 Feb 2024 17:51 UTC
62 points
5 comments7 min readLW link

Ophiol­ogy (or, how the Mamba ar­chi­tec­ture works)

9 Apr 2024 19:31 UTC
66 points
8 comments10 min readLW link

End-to-end hack­ing with lan­guage models

tchauvin5 Apr 2024 15:06 UTC
23 points
0 comments8 min readLW link

Transcoders en­able fine-grained in­ter­pretable cir­cuit anal­y­sis for lan­guage models

30 Apr 2024 17:58 UTC
63 points
14 comments17 min readLW link

Towards Mul­ti­modal In­ter­pretabil­ity: Learn­ing Sparse In­ter­pretable Fea­tures in Vi­sion Transformers

hugofry29 Apr 2024 20:57 UTC
76 points
7 comments11 min readLW link

MATS Win­ter 2023-24 Retrospective

11 May 2024 0:09 UTC
80 points
28 comments49 min readLW link

Lan­guage Models Model Us

eggsyntax17 May 2024 21:00 UTC
153 points
50 comments7 min readLW link

Fine-tun­ing is not suffi­cient for ca­pa­bil­ity elicitation

Theodore Chapman14 Jun 2024 18:50 UTC
36 points
1 comment9 min readLW link

Em­piri­cal risk min­i­miza­tion is fun­da­men­tally confused

Jesse Hoogland22 Mar 2023 16:58 UTC
32 points
5 comments1 min readLW link

Ap­prox­i­ma­tion is ex­pen­sive, but the lunch is cheap

19 Apr 2023 14:19 UTC
68 points
3 comments16 min readLW link

Fixed points in mor­tal pop­u­la­tion games

ViktoriaMalyasova14 Mar 2023 7:10 UTC
24 points
0 comments12 min readLW link
(www.lesswrong.com)

A mostly crit­i­cal re­view of in­fra-Bayesianism

matolcsid28 Feb 2023 18:37 UTC
104 points
9 comments29 min readLW link

Perfor­mance guaran­tees in clas­si­cal learn­ing the­ory and in­fra-Bayesianism

matolcsid28 Feb 2023 18:37 UTC
9 points
4 comments31 min readLW link

Non-Uni­tary Quan­tum Logic—SERI MATS Re­search Sprint

Yegreg16 Feb 2023 19:31 UTC
27 points
0 comments7 min readLW link

An open let­ter to SERI MATS pro­gram organisers

Roman Leventov20 Apr 2023 16:34 UTC
25 points
26 comments4 min readLW link

Poly­se­man­tic At­ten­tion Head in a 4-Layer Transformer

9 Nov 2023 16:16 UTC
51 points
0 comments6 min readLW link

Game The­ory with­out Argmax [Part 1]

Cleo Nardo11 Nov 2023 15:59 UTC
64 points
17 comments19 min readLW link

Clas­sify­ing rep­re­sen­ta­tions of sparse au­toen­coders (SAEs)

Annah17 Nov 2023 13:54 UTC
15 points
6 comments2 min readLW link

Re­search agenda: Su­per­vis­ing AIs im­prov­ing AIs

29 Apr 2023 17:09 UTC
76 points
5 comments19 min readLW link

Find­ing Neu­rons in a Haystack: Case Stud­ies with Sparse Probing

3 May 2023 13:30 UTC
33 points
5 comments2 min readLW link
(arxiv.org)

Con­di­tions for math­e­mat­i­cal equiv­alence of Stochas­tic Gra­di­ent Des­cent and Nat­u­ral Selection

Oliver Sourbut9 May 2022 21:38 UTC
70 points
19 comments8 min readLW link1 review
(www.oliversourbut.net)

Some real ex­am­ples of gra­di­ent hacking

Oliver Sourbut22 Nov 2021 0:11 UTC
15 points
8 comments2 min readLW link

Some Sum­maries of Agent Foun­da­tions Work

mattmacdermott15 May 2023 16:09 UTC
58 points
1 comment13 min readLW link

Boomerang—pro­to­col to dis­solve some com­mit­ment races

Filip Sondej30 May 2023 16:21 UTC
37 points
10 comments8 min readLW link

In­fra-Bayesian Logic

5 Jul 2023 19:16 UTC
15 points
2 comments1 min readLW link

Quan­ti­ta­tive cruxes in Alignment

Martín Soto2 Jul 2023 20:38 UTC
19 points
0 comments23 min readLW link

Sources of ev­i­dence in Alignment

Martín Soto2 Jul 2023 20:38 UTC
20 points
0 comments11 min readLW link

Ac­ti­va­tion adding ex­per­i­ments with llama-7b

Nina Rimsky16 Jul 2023 4:17 UTC
51 points
1 comment3 min readLW link

Au­toIn­ter­pre­ta­tion Finds Sparse Cod­ing Beats Alternatives

Hoagy17 Jul 2023 1:41 UTC
55 points
1 comment7 min readLW link

Ac­ti­va­tion adding ex­per­i­ments with FLAN-T5

Nina Rimsky13 Jul 2023 23:32 UTC
21 points
5 comments7 min readLW link

De­cod­ing in­ter­me­di­ate ac­ti­va­tions in llama-2-7b

Nina Rimsky21 Jul 2023 5:35 UTC
37 points
3 comments4 min readLW link

Un­der­stand­ing and Align­ing a Hu­man-like In­duc­tive Bias with Cog­ni­tive Science: a Re­view of Re­lated Liter­a­ture

Claire Short29 Jul 2023 6:10 UTC
23 points
0 comments12 min readLW link

Re­duc­ing syco­phancy and im­prov­ing hon­esty via ac­ti­va­tion steering

Nina Rimsky28 Jul 2023 2:46 UTC
117 points
16 comments9 min readLW link

De­com­pos­ing in­de­pen­dent gen­er­al­iza­tions in neu­ral net­works via Hes­sian analysis

14 Aug 2023 17:04 UTC
81 points
4 comments1 min readLW link

Un­der­stand­ing and vi­su­al­iz­ing syco­phancy datasets

Nina Rimsky16 Aug 2023 5:34 UTC
45 points
0 comments6 min readLW link

Large Lan­guage Models will be Great for Censorship

Ethan Edwards21 Aug 2023 19:03 UTC
183 points
14 comments8 min readLW link
(ethanedwards.substack.com)

The Low-Hang­ing Fruit Prior and sloped valleys in the loss landscape

23 Aug 2023 21:12 UTC
79 points
1 comment13 min readLW link

In­vuln­er­a­ble In­com­plete Prefer­ences: A For­mal Statement

Sami Petersen30 Aug 2023 21:59 UTC
124 points
32 comments35 min readLW link

Red-team­ing lan­guage mod­els via ac­ti­va­tion engineering

Nina Rimsky26 Aug 2023 5:52 UTC
68 points
6 comments9 min readLW link

An ad­ver­sar­ial ex­am­ple for Direct Logit At­tri­bu­tion: mem­ory man­age­ment in gelu-4l

30 Aug 2023 17:36 UTC
17 points
0 comments8 min readLW link
(arxiv.org)

An In­ter­pretabil­ity Illu­sion for Ac­ti­va­tion Patch­ing of Ar­bi­trary Subspaces

29 Aug 2023 1:04 UTC
75 points
4 comments1 min readLW link

Tak­ing fea­tures out of su­per­po­si­tion with sparse au­toen­coders more quickly with in­formed initialization

Pierre Peigné23 Sep 2023 16:21 UTC
30 points
8 comments5 min readLW link

Eval­u­at­ing hid­den di­rec­tions on the util­ity dataset: clas­sifi­ca­tion, steer­ing and removal

25 Sep 2023 17:19 UTC
25 points
3 comments7 min readLW link

[Paper] All’s Fair In Love And Love: Copy Sup­pres­sion in GPT-2 Small

13 Oct 2023 18:32 UTC
82 points
4 comments8 min readLW link

On In­ter­pretabil­ity’s Robustness

WCargo18 Oct 2023 13:18 UTC
11 points
0 comments4 min readLW link

Model­ling Deception

Garrett Baker18 Jul 2022 21:21 UTC
15 points
0 comments7 min readLW link

Abram Dem­ski’s ELK thoughts and pro­posal—distillation

Rubi J. Hudson19 Jul 2022 6:57 UTC
16 points
8 comments16 min readLW link
No comments.