
MATS Program

Last edit: 18 Mar 2026 19:49 UTC by Ryan Kidd

The ML Alignment & Theory Scholars (MATS) Program is an independent research and educational seminar program that provides emerging researchers with mentorship, talks and workshops, research support, and connections to the SF Bay Area and London AI safety research communities.

SolidGoldMagikarp (plus, prompt generation)

5 Feb 2023 22:02 UTC
675 points
208 comments · 12 min read · LW link · 1 review

SERI MATS Program—Winter 2022 Cohort

8 Oct 2022 19:09 UTC
72 points
12 comments · 4 min read · LW link

Understanding and controlling a maze-solving policy network

11 Mar 2023 18:59 UTC
335 points
28 comments · 23 min read · LW link

Project proposal: Testing the IBP definition of agent

9 Aug 2022 1:09 UTC
21 points
4 comments · 2 min read · LW link

How MATS addresses “mass movement building” concerns

Ryan Kidd · 4 May 2023 0:55 UTC
63 points
9 comments · 3 min read · LW link

Soft optimization makes the value target bigger

Jeremy Gillen · 2 Jan 2023 16:06 UTC
123 points
20 comments · 12 min read · LW link

SERI ML Alignment Theory Scholars Program 2022

27 Apr 2022 0:43 UTC
69 points
6 comments · 3 min read · LW link

SERI MATS—Summer 2023 Cohort

8 Apr 2023 15:32 UTC
71 points
25 comments · 4 min read · LW link

Finite Factored Sets in Pictures

Magdalena Wache · 11 Dec 2022 18:49 UTC
186 points
35 comments · 12 min read · LW link

Talk: AI safety fieldbuilding at MATS

Ryan Kidd · 23 Jun 2024 23:06 UTC
26 points
2 comments · 10 min read · LW link

Taking the parameters which seem to matter and rotating them until they don’t

Garrett Baker · 26 Aug 2022 18:26 UTC
120 points
48 comments · 1 min read · LW link

Sycophancy Towards Researchers Drives Performative Misalignment

18 Mar 2026 4:59 UTC
5 points
1 comment · 21 min read · LW link

Predictions for shard theory mechanistic interpretability results

1 Mar 2023 5:16 UTC
105 points
10 comments · 5 min read · LW link

MATS Spring 2024 Extension Retrospective

12 Feb 2025 22:43 UTC
27 points
1 comment · 15 min read · LW link

Modulating sycophancy in an RLHF model via activation steering

Nina Panickssery · 9 Aug 2023 7:06 UTC
69 points
20 comments · 12 min read · LW link

Neural Tangent Kernel Distillation

5 Oct 2022 18:11 UTC
79 points
20 comments · 8 min read · LW link

My MATS Summer 2023 experience

James Chua · 20 Mar 2024 11:26 UTC
30 points
0 comments · 3 min read · LW link
(jameschua.net)

In-context learning alone can induce weird generalisation

25 Feb 2026 2:46 UTC
66 points
1 comment · 8 min read · LW link

Applying to MATS: What the Program Is Like, and Who It’s For

17 Jan 2026 0:25 UTC
24 points
1 comment · 5 min read · LW link

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide · 22 Jul 2024 18:45 UTC
118 points
20 comments · 12 min read · LW link

Normative vs Descriptive Models of Agency

mattmacdermott · 2 Feb 2023 20:28 UTC
26 points
5 comments · 4 min read · LW link

Infra-Bayesian haggling

hannagabor · 20 May 2024 12:23 UTC
30 points
1 comment · 20 min read · LW link · 1 review

Models have linear representations of what tasks they like

OscarGilg · 5 Mar 2026 18:44 UTC
53 points
16 comments · 11 min read · LW link

I found >800 orthogonal “write code” steering vectors

15 Jul 2024 19:06 UTC
112 points
20 comments · 7 min read · LW link
(jacobgw.com)

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

14 Oct 2025 0:53 UTC
144 points
15 comments · 10 min read · LW link

Self-explaining SAE features

5 Aug 2024 22:20 UTC
62 points
13 comments · 10 min read · LW link

What’s the Point of the Math?

Ashe Vazquez Nuñez · 5 Feb 2026 11:30 UTC
45 points
3 comments · 5 min read · LW link

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
125 points
29 comments · 8 min read · LW link
(arxiv.org)

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

21 Sep 2023 15:30 UTC
159 points
8 comments · 5 min read · LW link

Petri: An open-source auditing tool to accelerate AI safety research

Sam Marks · 7 Oct 2025 20:39 UTC
77 points
0 comments · 1 min read · LW link
(alignment.anthropic.com)

Balancing Security Mindset with Collaborative Research: A Proposal

MadHatter · 1 Nov 2023 0:46 UTC
9 points
3 comments · 4 min read · LW link

Qualities that alignment mentors value in junior researchers

Orpheus16 · 14 Feb 2023 23:27 UTC
88 points
14 comments · 3 min read · LW link

The Geometry of Feelings and Nonsense in Large Language Models

27 Sep 2024 17:49 UTC
61 points
10 comments · 4 min read · LW link

MATS is hiring!

8 Apr 2025 20:45 UTC
8 points
0 comments · 6 min read · LW link

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Oct 2025 22:02 UTC
175 points
37 comments · 2 min read · LW link

Talent Needs of Technical AI Safety Teams

24 May 2024 0:36 UTC
125 points
65 comments · 14 min read · LW link

Mechanistically Eliciting Latent Behaviors in Language Models

30 Apr 2024 18:51 UTC
225 points
44 comments · 45 min read · LW link · 1 review

Apply for Alignment Mentorship from TurnTrout and Alex Cloud

26 Dec 2025 17:20 UTC
41 points
0 comments · 2 min read · LW link
(turntrout.com)

Debating with More Persuasive LLMs Leads to More Truthful Answers

7 Feb 2024 21:28 UTC
89 points
14 comments · 9 min read · LW link
(arxiv.org)

Steering GPT-2-XL by adding an activation vector

13 May 2023 18:42 UTC
441 points
98 comments · 50 min read · LW link · 1 review

Apply for MATS Winter 2023-24!

21 Oct 2023 2:27 UTC
104 points
6 comments · 5 min read · LW link

Apply to MATS 9.0!

Ryan Kidd · 10 Sep 2025 18:04 UTC
47 points
0 comments · 1 min read · LW link

Distillation Robustifies Unlearning

13 Jun 2025 13:45 UTC
236 points
43 comments · 8 min read · LW link
(arxiv.org)

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai · 16 Apr 2024 21:16 UTC
437 points
102 comments · 12 min read · LW link · 1 review

Model Organisms for Emergent Misalignment

16 Jun 2025 15:46 UTC
118 points
19 comments · 5 min read · LW link

MATS AI Safety Strategy Curriculum

7 Mar 2024 19:59 UTC
74 points
2 comments · 16 min read · LW link

Introduction to inaccessible information

Ryan Kidd · 9 Dec 2021 1:28 UTC
27 points
6 comments · 8 min read · LW link

[Paper] How does information access affect LLM monitors’ ability to detect sabotage?

11 Feb 2026 21:25 UTC
26 points
0 comments · 6 min read · LW link

MATS Summer 2023 Retrospective

1 Dec 2023 23:29 UTC
78 points
34 comments · 26 min read · LW link

Showing SAE Latents Are Not Atomic Using Meta-SAEs

24 Aug 2024 0:56 UTC
73 points
10 comments · 20 min read · LW link

Behavioural statistics for a maze-solving agent

20 Apr 2023 22:26 UTC
46 points
11 comments · 10 min read · LW link

Current activation oracles are hard to use

3 Mar 2026 19:33 UTC
77 points
3 comments · 16 min read · LW link

Apply to MATS 7.0!

21 Sep 2024 0:23 UTC
32 points
0 comments · 5 min read · LW link

Defending Against Model Weight Exfiltration Through Inference Verification

15 Dec 2025 15:26 UTC
120 points
15 comments · 8 min read · LW link

Concept Poisoning: Probing LLMs without probes

5 Aug 2025 17:00 UTC
60 points
5 comments · 13 min read · LW link

Clarifying mesa-optimization

21 Mar 2023 15:53 UTC
38 points
6 comments · 10 min read · LW link

Alignment faking CTFs: Apply to my MATS stream

joshc · 4 Apr 2025 16:29 UTC
61 points
0 comments · 4 min read · LW link

Broad Basins and Data Compression

8 Aug 2022 20:33 UTC
33 points
6 comments · 7 min read · LW link

MATS mentor selection

10 Jan 2025 3:12 UTC
44 points
12 comments · 6 min read · LW link

Training a Reward Hacker Despite Perfect Labels

14 Aug 2025 23:57 UTC
139 points
45 comments · 4 min read · LW link

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

2 Jul 2024 13:17 UTC
86 points
7 comments · 12 min read · LW link

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
255 points
96 comments · 10 min read · LW link · 1 review

Trends in Economic Inputs to AI

Jeffrey Heninger · 11 Sep 2025 21:51 UTC
87 points
6 comments · 12 min read · LW link

Game Theory without Argmax [Part 2]

Cleo Nardo · 11 Nov 2023 16:02 UTC
31 points
14 comments · 13 min read · LW link

[ASoT] Policy Trajectory Visualization

Ulisse Mini · 7 Feb 2023 0:13 UTC
9 points
2 comments · 1 min read · LW link

MATS Applications + Research Directions I’m Currently Excited About

Neel Nanda · 6 Feb 2025 11:03 UTC
73 points
7 comments · 8 min read · LW link

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

14 Jan 2024 2:06 UTC
24 points
0 comments · 42 min read · LW link

Auditing games for high-level interpretability

Paul Colognese · 1 Nov 2022 10:44 UTC
33 points
1 comment · 7 min read · LW link

What Makes an Idea Understandable? On Architecturally and Culturally Natural Ideas.

16 Aug 2022 2:09 UTC
21 points
2 comments · 16 min read · LW link

Attention SAEs Scale to GPT-2 Small

3 Feb 2024 6:50 UTC
78 points
4 comments · 8 min read · LW link

Experiments with an alternative method to promote sparsity in sparse autoencoders

Eoin Farrell · 15 Apr 2024 18:21 UTC
29 points
7 comments · 12 min read · LW link

MATS Alumni Impact Analysis

30 Sep 2024 2:35 UTC
62 points
7 comments · 11 min read · LW link

A distillation of Evan Hubinger’s training stories (for SERI MATS)

Daphne_W · 18 Jul 2022 3:38 UTC
15 points
1 comment · 10 min read · LW link

Can We Align a Self-Improving AGI?

Peter S. Park · 30 Aug 2022 0:14 UTC
8 points
5 comments · 11 min read · LW link

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

17 Aug 2024 1:16 UTC
54 points
0 comments · 5 min read · LW link

Ctrl-Z: Controlling AI Agents via Resampling

16 Apr 2025 16:21 UTC
126 points
0 comments · 20 min read · LW link

Stitching SAEs of different sizes

13 Jul 2024 17:19 UTC
39 points
12 comments · 12 min read · LW link

Race Along Rashomon Ridge

7 Jul 2022 3:20 UTC
52 points
16 comments · 9 min read · LW link

My Advice for Incoming SERI MATS Scholars

Johannes C. Mayer · 3 Jan 2023 19:25 UTC
58 points
6 comments · 4 min read · LW link

MATS AI Safety Strategy Curriculum v2

7 Oct 2024 22:44 UTC
43 points
6 comments · 13 min read · LW link

Uncertainty in all its flavours

Cleo Nardo · 9 Jan 2024 16:21 UTC
34 points
6 comments · 35 min read · LW link

The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Jessica Rumbelow · 17 Nov 2022 11:06 UTC
27 points
2 comments · 2 min read · LW link

Intervening in the Residual Stream

MadHatter · 22 Feb 2023 6:29 UTC
30 points
1 comment · 9 min read · LW link

Steering RL Training: Benchmarking Interventions Against Reward Hacking

29 Dec 2025 21:55 UTC
58 points
10 comments · 19 min read · LW link

Swap and Scale

Stephen Fowler · 9 Sep 2022 22:41 UTC
17 points
3 comments · 1 min read · LW link

Best-of-N Jailbreaking

14 Dec 2024 4:58 UTC
79 points
5 comments · 2 min read · LW link
(arxiv.org)

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

25 Feb 2025 17:39 UTC
335 points
92 comments · 4 min read · LW link

Paper: Prompt Optimization Makes Misalignment Legible

12 Feb 2026 19:45 UTC
47 points
7 comments · 10 min read · LW link

Information theoretic model analysis may not lend much insight, but we may have been doing them wrong!

Garrett Baker · 24 Jul 2022 0:42 UTC
7 points
0 comments · 10 min read · LW link

What sorts of systems can be deceptive?

Andrei Alexandru · 31 Oct 2022 22:00 UTC
17 points
0 comments · 7 min read · LW link

Understanding Agency through Markov Blankets

Ashe Vazquez Nuñez · 12 Jan 2026 19:32 UTC
25 points
2 comments · 3 min read · LW link

Conditioning Generative Models for Alignment

Jozdien · 18 Jul 2022 7:11 UTC
60 points
8 comments · 20 min read · LW link

Consequentialists: One-Way Pattern Traps

David Udell · 16 Jan 2023 20:48 UTC
59 points
3 comments · 14 min read · LW link

More findings on maximal data dimension

Marius Hobbhahn · 2 Feb 2023 18:33 UTC
27 points
1 comment · 11 min read · LW link

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

6 Dec 2024 22:19 UTC
177 points
15 comments · 11 min read · LW link · 1 review
(arxiv.org)

Reasoning Models Struggle to Control Their Chains of Thought

5 Mar 2026 22:37 UTC
74 points
9 comments · 3 min read · LW link

[Research Note] Optimizing The Final Output Can Obfuscate CoT

30 Jul 2025 21:26 UTC
201 points
23 comments · 6 min read · LW link

Content and Takeaways from SERI MATS Training Program with John Wentworth

RohanS · 24 Dec 2022 4:17 UTC
28 points
3 comments · 12 min read · LW link

Forecasting Frontier Language Model Agent Capabilities

24 Feb 2025 16:51 UTC
35 points
0 comments · 5 min read · LW link
(www.apolloresearch.ai)

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
85 points
9 comments · 18 min read · LW link

OthelloGPT learned a bag of heuristics

2 Jul 2024 9:12 UTC
111 points
10 comments · 9 min read · LW link

Crafting Polysemantic Transformer Benchmarks with Known Circuits

23 Aug 2024 22:03 UTC
17 points
0 comments · 25 min read · LW link

Apply to MATS 8.0!

20 Mar 2025 2:17 UTC
64 points
5 comments · 4 min read · LW link

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

23 Aug 2024 18:52 UTC
43 points
8 comments · 16 min read · LW link

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

5 Sep 2025 12:11 UTC
54 points
2 comments · 7 min read · LW link

Takeaways From Our Recent Work on SAE Probing

3 Mar 2025 19:50 UTC
30 points
4 comments · 5 min read · LW link

[Closed] Agent Foundations track in MATS

Vanessa Kosoy · 31 Oct 2023 8:12 UTC
54 points
1 comment · 1 min read · LW link
(www.matsprogram.org)

More findings on Memorization and double descent

Marius Hobbhahn · 1 Feb 2023 18:26 UTC
53 points
2 comments · 19 min read · LW link

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

13 Jun 2024 10:04 UTC
84 points
10 comments · 2 min read · LW link
(arxiv.org)

[Paper] Output Supervision Can Obfuscate the CoT

20 Nov 2025 22:41 UTC
78 points
3 comments · 5 min read · LW link
(arxiv.org)

[ASoT] Reflectivity in Narrow AI

Ulisse Mini · 21 Nov 2022 0:51 UTC
6 points
1 comment · 1 min read · LW link

Reward hacking behavior can generalize across tasks

28 May 2024 16:33 UTC
85 points
5 comments · 21 min read · LW link

Text Compression Can Help Secure Model Weights

Roy Rinberg · 4 Mar 2026 23:30 UTC
42 points
12 comments · 10 min read · LW link

MATS 8.0 Research Projects

9 Sep 2025 1:29 UTC
22 points
0 comments · 1 min read · LW link
(substack.com)

models have some pretty funny attractor states

12 Feb 2026 21:14 UTC
266 points
38 comments · 18 min read · LW link

Automating LLM Auditing with Developmental Interpretability

4 Sep 2024 15:50 UTC
19 points
0 comments · 3 min read · LW link

A Framework for Eval Awareness

LAThomson · 23 Jan 2026 10:16 UTC
36 points
5 comments · 8 min read · LW link

Can Models be Evaluation Aware Without Explicit Verbalization?

8 Nov 2025 18:26 UTC
26 points
10 comments · 8 min read · LW link

[Question] How is ARC planning to use ELK?

jacquesthibs · 15 Dec 2022 20:11 UTC
24 points
5 comments · 1 min read · LW link

Training goals for large language models

Johannes Treutlein · 18 Jul 2022 7:09 UTC
28 points
5 comments · 19 min read · LW link

Attention Output SAEs Improve Circuit Analysis

21 Jun 2024 12:56 UTC
33 points
3 comments · 19 min read · LW link

Exploration hacking: can reasoning models subvert RL?

30 Jul 2025 22:02 UTC
22 points
4 comments · 9 min read · LW link

Can LLMs learn Steganographic Reasoning via RL?

11 Apr 2025 16:33 UTC
30 points
3 comments · 6 min read · LW link

3 Challenges and 2 Hopes for the Safety of Unsupervised Elicitation

27 Feb 2026 17:25 UTC
21 points
0 comments · 10 min read · LW link

Performance guarantees in classical learning theory and infra-Bayesianism

David Matolcsi · 28 Feb 2023 18:37 UTC
9 points
4 comments · 31 min read · LW link

A Short Dialogue on the Meaning of Reward Functions

19 Nov 2022 21:04 UTC
45 points
0 comments · 3 min read · LW link

[Research log] The board of Alphabet would stop DeepMind to save the world

Lucie Philippon · 16 Jul 2024 4:59 UTC
6 points
0 comments · 4 min read · LW link

Why are counterfactuals elusive?

Martín Soto · 3 Mar 2023 20:13 UTC
14 points
6 comments · 2 min read · LW link

Understanding and Aligning a Human-like Inductive Bias with Cognitive Science: a Review of Related Literature

Claire Short · 29 Jul 2023 6:10 UTC
27 points
0 comments · 12 min read · LW link

Bounded complexity of solving ELK and its implications

Rubi J. Hudson · 19 Jul 2022 6:56 UTC
11 points
4 comments · 18 min read · LW link

Results from a survey on tool use and workflows in alignment research

19 Dec 2022 15:19 UTC
79 points
2 comments · 19 min read · LW link

Research agenda: Supervising AIs improving AIs

29 Apr 2023 17:09 UTC
76 points
5 comments · 19 min read · LW link

Modelling Deception

Garrett Baker · 18 Jul 2022 21:21 UTC
15 points
0 comments · 7 min read · LW link

Externalized reasoning oversight: a research direction for language model alignment

tamera · 3 Aug 2022 12:03 UTC
140 points
23 comments · 6 min read · LW link

The Natural Abstraction Hypothesis: Implications and Evidence

CallumMcDougall · 14 Dec 2021 23:14 UTC
44 points
9 comments · 19 min read · LW link

Game Theory without Argmax [Part 1]

Cleo Nardo · 11 Nov 2023 15:59 UTC
78 points
18 comments · 19 min read · LW link

How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)

10 Aug 2022 18:14 UTC
28 points
30 comments · 11 min read · LW link

[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques

16 Mar 2023 16:38 UTC
48 points
0 comments · 13 min read · LW link

Conditions for mathematical equivalence of Stochastic Gradient Descent and Natural Selection

Oliver Sourbut · 9 May 2022 21:38 UTC
73 points
19 comments · 8 min read · LW link · 1 review
(www.oliversourbut.net)

A mostly critical review of infra-Bayesianism

David Matolcsi · 28 Feb 2023 18:37 UTC
109 points
9 comments · 29 min read · LW link

Sources of evidence in Alignment

Martín Soto · 2 Jul 2023 20:38 UTC
22 points
0 comments · 11 min read · LW link

Finding Neurons in a Haystack: Case Studies with Sparse Probing

3 May 2023 13:30 UTC
33 points
6 comments · 2 min read · LW link · 1 review
(arxiv.org)

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery · 28 Jul 2023 2:46 UTC
122 points
18 comments · 9 min read · LW link · 1 review

Framing AI Childhoods

David Udell · 6 Sep 2022 23:40 UTC
37 points
8 comments · 4 min read · LW link

How complex are myopic imitators?

Vivek Hebbar · 8 Feb 2022 12:00 UTC
26 points
1 comment · 15 min read · LW link

Notes on Learning the Prior

carboniferous_umbraculum · 15 Jul 2022 17:28 UTC
25 points
2 comments · 25 min read · LW link

A Bunch of Matryoshka SAEs

4 Apr 2025 14:53 UTC
29 points
0 comments · 8 min read · LW link

Searching for a model’s concepts by their shape – a theoretical framework

23 Feb 2023 20:14 UTC
51 points
0 comments · 19 min read · LW link

Large Language Models will be Great for Censorship

Ethan Edwards · 21 Aug 2023 19:03 UTC
185 points
14 comments · 8 min read · LW link
(ethanedwards.substack.com)

[Job Ad] MATS is hiring!

9 Oct 2024 2:17 UTC
10 points
0 comments · 5 min read · LW link

Polysemantic Attention Head in a 4-Layer Transformer

9 Nov 2023 16:16 UTC
51 points
0 comments · 6 min read · LW link

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

28 Sep 2023 18:53 UTC
187 points
39 comments · 3 min read · LW link · 1 review

Steering Evaluation-Aware Models to Act Like They Are Deployed

30 Oct 2025 15:03 UTC
61 points
12 comments · 18 min read · LW link

Interview: Applications w/ Alice Rigg

jacobhaimes · 19 Dec 2023 19:03 UTC
12 points
0 comments · 1 min read · LW link
(into-ai-safety.github.io)

The many paths to permanent disempowerment even with shutdownable AIs (MATS project summary for feedback)

GideonF · 29 Jul 2025 23:20 UTC
64 points
8 comments · 9 min read · LW link

Evaluating Prediction in Acausal Mixed-Motive Settings

Tim Chan · 31 Aug 2025 22:58 UTC
14 points
0 comments · 6 min read · LW link

Eliciting secret knowledge from language models

2 Oct 2025 20:57 UTC
68 points
3 comments · 2 min read · LW link
(arxiv.org)

Red-teaming language models via activation engineering

Nina Panickssery · 26 Aug 2023 5:52 UTC
69 points
6 comments · 9 min read · LW link

Power Laws Are Not Enough

CarolusRenniusVitellius · 19 Feb 2026 4:31 UTC
10 points
3 comments · 4 min read · LW link
(charlesr-w.github.io)

Trying to find the underlying structure of computational systems

Matthias G. Mayer · 13 Sep 2022 21:16 UTC
21 points
9 comments · 4 min read · LW link

But is it really in Rome? An investigation of the ROME model editing technique

jacquesthibs · 30 Dec 2022 2:40 UTC
105 points
2 comments · 18 min read · LW link

GPT-2 Sometimes Fails at IOI

Ronak_Mehta · 14 Aug 2024 23:24 UTC
13 points
0 comments · 2 min read · LW link
(ronakrm.github.io)

Identification of Natural Modularity

Stephen Fowler · 25 Jun 2022 15:05 UTC
15 points
3 comments · 7 min read · LW link

Determining the power of investors over Frontier AI Labs is strategically important to reduce x-risk

Lucie Philippon · 25 Jul 2024 1:12 UTC
18 points
7 comments · 2 min read · LW link

How Interpretability can be Impactful

Connall Garrod · 18 Jul 2022 0:06 UTC
19 points
0 comments · 37 min read · LW link

MATS Winter 2023-24 Retrospective

11 May 2024 0:09 UTC
90 points
28 comments · 49 min read · LW link

Why I’m Working On Model Agnostic Interpretability

Jessica Rumbelow · 11 Nov 2022 9:24 UTC
27 points
9 comments · 2 min read · LW link

Bitter Lessons from Distillation Robustifies Unlearning

Bruce W. Lee · 28 Nov 2025 1:31 UTC
27 points
3 comments · 7 min read · LW link
(www.lesswrong.com)

AutoInterpretation Finds Sparse Coding Beats Alternatives

Hoagy · 17 Jul 2023 1:41 UTC
57 points
1 comment · 7 min read · LW link

A Conceptual Framework for Exploration Hacking

12 Feb 2026 16:33 UTC
25 points
2 comments · 9 min read · LW link

Translating between Latent Spaces

30 Jul 2022 3:25 UTC
27 points
2 comments · 8 min read · LW link

A mechanistic explanation for SolidGoldMagikarp-like tokens in GPT2

MadHatter · 26 Feb 2023 1:10 UTC
61 points
14 comments · 6 min read · LW link

Post-hoc reasoning in chain of thought

Kyle Cox · 5 Feb 2025 18:58 UTC
19 points
0 comments · 11 min read · LW link

Intricacies of Feature Geometry in Large Language Models

7 Dec 2024 18:10 UTC
72 points
2 comments · 12 min read · LW link

End-to-end hacking with language models

tchauvin · 5 Apr 2024 15:06 UTC
29 points
0 comments · 8 min read · LW link

Unfaithful chain-of-thought as nudged reasoning

22 Jul 2025 22:35 UTC
54 points
3 comments · 10 min read · LW link

The shallow reality of ‘deep learning theory’

Jesse Hoogland · 22 Feb 2023 4:16 UTC
35 points
11 comments · 3 min read · LW link
(www.jessehoogland.com)

Inner Alignment via Superpowers

30 Aug 2022 20:01 UTC
37 points
13 comments · 4 min read · LW link

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution

Kola Ayonrinde · 30 Oct 2024 22:50 UTC
27 points
0 comments · 12 min read · LW link

Some Notes on the mathematics of Toy Autoencoding Problems

carboniferous_umbraculum · 22 Dec 2022 17:21 UTC
18 points
1 comment · 12 min read · LW link

Abram Demski’s ELK thoughts and proposal—distillation

Rubi J. Hudson · 19 Jul 2022 6:57 UTC
19 points
8 comments · 16 min read · LW link

Some Summaries of Agent Foundations Work

mattmacdermott · 15 May 2023 16:09 UTC
63 points
1 comment · 13 min read · LW link

Decomposing independent generalizations in neural networks via Hessian analysis

14 Aug 2023 17:04 UTC
86 points
4 comments · 1 min read · LW link

Finding Skeletons on Rashomon Ridge

24 Jul 2022 22:31 UTC
30 points
2 comments · 7 min read · LW link

Guardian AI (Misaligned systems are all around us.)

Jessica Rumbelow · 25 Nov 2022 15:55 UTC
15 points
6 comments · 2 min read · LW link

Towards data-centric interpretability with sparse autoencoders

15 Aug 2025 20:10 UTC
53 points
2 comments · 18 min read · LW link

Natural Abstractions: Key Claims, Theorems, and Critiques

16 Mar 2023 16:37 UTC
248 points
26 comments · 45 min read · LW link · 3 reviews

The Low-Hanging Fruit Prior and sloped valleys in the loss landscape

23 Aug 2023 21:12 UTC
84 points
1 comment · 13 min read · LW link

MATS Models

johnswentworth · 9 Jul 2022 0:14 UTC
95 points
5 comments · 16 min read · LW link

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

7 Nov 2024 15:39 UTC
51 points
7 comments · 11 min read · LW link

Feature Hedging: Another way correlated features break SAEs

25 Mar 2025 14:33 UTC
23 points
0 comments · 18 min read · LW link

My SERI MATS Application

Daniel Paleka · 30 May 2022 2:04 UTC
16 points
0 comments · 8 min read · LW link

On Interpretability’s Robustness

WCargo · 18 Oct 2023 13:18 UTC
11 points
0 comments · 4 min read · LW link

On Meta-Level Adversarial Evaluations of (White-Box) Alignment Auditing

Oliver Daniels · 10 Feb 2026 17:06 UTC
26 points
5 comments · 3 min read · LW link

Thought Anchors: Which LLM Reasoning Steps Matter?

2 Jul 2025 20:16 UTC
35 points
6 comments · 6 min read · LW link
(www.thought-anchors.com)

Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus

Oliver Daniels · 25 Feb 2026 19:43 UTC
77 points
5 comments · 8 min read · LW link

Empirical risk minimization is fundamentally confused

Jesse Hoogland · 22 Mar 2023 16:58 UTC
32 points
8 comments · 1 min read · LW link

A Neural Network undergoing Gradient-based Training as a Complex System

carboniferous_umbraculum · 19 Feb 2023 22:08 UTC
22 points
1 comment · 19 min read · LW link

Is the “Valley of Confused Abstractions” real?

jacquesthibs · 5 Dec 2022 13:36 UTC
20 points
11 comments · 2 min read · LW link

When fine-tuning fails to elicit GPT-3.5’s chess abilities

Theodore Chapman · 14 Jun 2024 18:50 UTC
42 points
3 comments · 9 min read · LW link

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon Lang · 16 Jan 2023 22:46 UTC
31 points
7 comments · 17 min read · LW link
(docs.google.com)

Studying Mechanistic of Alignment Faking in Llama-3.1-405B

Amina Keldibek · 25 Nov 2025 11:21 UTC
10 points
0 comments · 11 min read · LW link

SAE Probing: What is it good for?

1 Nov 2024 19:23 UTC
34 points
0 comments · 11 min read · LW link

How important is AI hacking as LLMs advance?

Artem Karpov · 29 Jan 2024 18:41 UTC
1 point
0 comments · 6 min read · LW link

Training Agents to Self-Report Misbehavior

25 Feb 2026 17:50 UTC
26 points
0 comments · 8 min read · LW link

[RFC] Possible ways to expand on “Discovering Latent Knowledge in Language Models Without Supervision”.

25 Jan 2023 19:03 UTC
48 points
6 comments · 12 min read · LW link

Do models know when they are being evaluated?

17 Feb 2025 23:13 UTC
57 points
9 comments · 12 min read · LW link

Getting up to Speed on the Speed Prior in 2022

robertzk · 28 Dec 2022 7:49 UTC
36 points
5 comments · 65 min read · LW link

Revealing alignment faking with a single prompt

Florian_Dietz · 29 Jan 2025 21:01 UTC
9 points
5 comments · 4 min read · LW link

Boomerang—protocol to dissolve some commitment races

Filip Sondej · 30 May 2023 16:21 UTC
37 points
10 comments · 8 min read · LW link

Disentangling Shard Theory into Atomic Claims

Leon Lang · 13 Jan 2023 4:23 UTC
86 points
6 comments · 18 min read · LW link

Among Us: A Sandbox for Agentic Deception

5 Apr 2025 6:24 UTC
114 points
7 comments · 7 min read · LW link

BatchTopK: A Simple Improvement for TopK-SAEs

20 Jul 2024 2:20 UTC
61 points
0 comments · 4 min read · LW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné · 23 Sep 2023 16:21 UTC
30 points
8 comments · 5 min read · LW link

Test your interpretability techniques by de-censoring Chinese models

15 Jan 2026 16:33 UTC
90 points
14 comments · 20 min read · LW link

Understanding SAE Features with the Logit Lens

11 Mar 2024 0:16 UTC
71 points
2 comments · 14 min read · LW link

Can We Change the Goals of a Toy RL Agent?

15 Jun 2025 20:34 UTC
20 points
0 comments · 9 min read · LW link

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

29 Aug 2023 1:04 UTC
77 points
4 comments · 1 min read · LW link

Why you might expect homogeneous take-off: evidence from ML research

Andrei Alexandru · 17 Jul 2022 20:31 UTC
24 points
0 comments · 10 min read · LW link

Team Shard Status Report

David Udell · 9 Aug 2022 5:33 UTC
38 points
8 comments · 3 min read · LW link

What We Learned Trying to Diff Base and Chat Models (And Why It Matters)

30 Jun 2025 17:17 UTC
106 points
2 comments · 7 min read · LW link

Gradient surfing: the hidden role of regularization

Jesse Hoogland · 6 Feb 2023 3:50 UTC
37 points
9 comments · 14 min read · LW link
(www.jessehoogland.com)

Finding Goals in the World Model

22 Aug 2022 18:06 UTC
59 points
8 comments · 13 min read · LW link

My experience applying to MATS 6.0

mic · 18 Jul 2024 19:02 UTC
19 points
3 comments · 5 min read · LW link

Some real examples of gradient hacking

Oliver Sourbut · 22 Nov 2021 0:11 UTC
17 points
8 comments · 2 min read · LW link

Scaling Sparse Feature Circuit Finding to Gemma 9B

10 Jan 2025 11:08 UTC
88 points
11 comments · 17 min read · LW link

[Interim research report] Evaluating the Goal-Directedness of Language Models

18 Jul 2024 18:19 UTC
40 points
4 comments · 11 min read · LW link

Ophiology (or, how the Mamba architecture works)

9 Apr 2024 19:31 UTC
67 points
10 comments · 10 min read · LW link

SolidGoldMagikarp II: technical details and more recent findings

6 Feb 2023 19:09 UTC
114 points
45 comments · 13 min read · LW link

Where do AI Safety Fellows go? Analyzing a dataset of 600+ alumni

Christopher_Clay · 2 Jan 2026 18:14 UTC
12 points
1 comment · 5 min read · LW link
(forum.effectivealtruism.org)

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

9 Mar 2026 18:50 UTC
30 points
2 comments · 5 min read · LW link

Using PICT against PastaGPT Jailbreaking

Quentin FEUILLADE--MONTIXI · 9 Feb 2023 4:30 UTC
26 points
0 comments · 9 min read · LW link

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Florian_Dietz · 10 Mar 2025 16:07 UTC
49 points
7 comments · 9 min read · LW link

Implementing activation steering

Annah · 5 Feb 2024 17:51 UTC
76 points
8 comments · 7 min read · LW link

Infra-Bayesian Logic

5 Jul 2023 19:16 UTC
15 points
2 comments · 1 min read · LW link

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

hugofry · 29 Apr 2024 20:57 UTC
94 points
9 comments · 11 min read · LW link

Fixed points in mortal population games

ViktoriaMalyasova · 14 Mar 2023 7:10 UTC
31 points
0 comments · 12 min read · LW link
(www.lesswrong.com)

Tips On Empirical Research Slides

8 Jan 2025 5:06 UTC
105 points
4 comments · 6 min read · LW link

Proper scoring rules don’t guarantee predicting fixed points

16 Dec 2022 18:22 UTC
80 points
8 comments · 21 min read · LW link

Domain-specific SAEs

jacob_drori · 7 Oct 2024 20:15 UTC
28 points
2 comments · 5 min read · LW link

Classifying representations of sparse autoencoders (SAEs)

Annah · 17 Nov 2023 13:54 UTC
15 points
6 comments · 2 min read · LW link

Shard Theory: An Overview

David Udell · 11 Aug 2022 5:44 UTC
167 points
34 comments · 10 min read · LW link

Neural networks generalize because of this one weird trick

Jesse Hoogland · 18 Jan 2023 0:10 UTC
208 points
35 comments · 15 min read · LW link · 1 review
(www.jessehoogland.com)

OpenAI finetuning metrics: What is going on with the loss curves?

24 Nov 2025 18:29 UTC
41 points
5 comments · 2 min read · LW link

Can Aha Moments be Fake? Identifying True and Decorative Thinking Steps in CoT

Jiachen Zhao · 23 Feb 2026 11:51 UTC
24 points
0 comments · 10 min read · LW link
(arxiv.org)

The Core of the Align­ment Prob­lem is...

17 Aug 2022 20:07 UTC
76 points
10 comments9 min readLW link

Search­ing for Mo­du­lar­ity in Large Lan­guage Models

8 Sep 2022 2:25 UTC
44 points
3 comments14 min readLW link

Prin­ci­pled In­ter­pretabil­ity of Re­ward Hack­ing in Closed Fron­tier Models

1 Jan 2026 16:37 UTC
24 points
0 comments23 min readLW link

Understanding and visualizing sycophancy datasets

Nina Panickssery16 Aug 2023 5:34 UTC
47 points
0 comments6 min readLW link

Discovering Backdoor Triggers

19 Aug 2025 6:24 UTC
57 points
4 comments13 min readLW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

6 Mar 2024 5:03 UTC
63 points
0 comments12 min readLW link

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

27 Feb 2026 3:20 UTC
51 points
1 comment78 min readLW link

Motivations, Natural Selection, and Curriculum Engineering

Oliver Sourbut16 Dec 2021 1:07 UTC
16 points
0 comments42 min readLW link

The slingshot helps with learning

Wilson Wu31 Oct 2024 23:18 UTC
33 points
0 comments8 min readLW link

How (not) to choose a research project

9 Aug 2022 0:26 UTC
80 points
11 comments7 min readLW link

Behaviour Manifolds and the Hessian of the Total Loss—Notes and Criticism

carboniferous_umbraculum 3 Sep 2022 0:15 UTC
35 points
5 comments6 min readLW link

Addressing Decision Theory’s Simulation Problem

Ashe Vazquez Nuñez3 Feb 2026 7:02 UTC
11 points
0 comments3 min readLW link

Approximation is expensive, but the lunch is cheap

19 Apr 2023 14:19 UTC
77 points
3 comments16 min readLW link

Towards Sub-agent Dynamics and Conflict

Ashe Vazquez Nuñez25 Jan 2026 5:27 UTC
13 points
1 comment3 min readLW link

Ambiguous out-of-distribution generalization on an algorithmic task

13 Feb 2025 18:24 UTC
84 points
6 comments11 min readLW link

How to Design Environments for Understanding Model Motives

2 Mar 2026 7:14 UTC
42 points
0 comments10 min readLW link

Quantitative cruxes in Alignment

Martín Soto2 Jul 2023 20:38 UTC
19 points
0 comments23 min readLW link

Statistical suggestions for mech interp research and beyond

Paul Bogdan6 Aug 2025 12:45 UTC
65 points
4 comments15 min readLW link

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

24 Oct 2025 17:21 UTC
23 points
1 comment5 min readLW link

Stop-gradients lead to fixed point predictions

28 Jan 2023 22:47 UTC
37 points
2 comments24 min readLW link

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

22 Jul 2024 16:17 UTC
69 points
0 comments16 min readLW link

[Short version] Information Loss --> Basin flatness

Vivek Hebbar21 May 2022 12:59 UTC
12 points
0 comments1 min readLW link

A brief note on Simplicity Bias

carboniferous_umbraculum 14 Aug 2022 2:05 UTC
20 points
0 comments4 min readLW link

Invulnerable Incomplete Preferences: A Formal Statement

SCP30 Aug 2023 21:59 UTC
139 points
39 comments35 min readLW link

Activation adding experiments with FLAN-T5

Nina Panickssery13 Jul 2023 23:32 UTC
21 points
5 comments7 min readLW link

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

13 Jan 2026 20:40 UTC
52 points
0 comments18 min readLW link

Do LLMs know what they’re capable of? Why this matters for AI safety, and initial findings

13 Jul 2025 19:54 UTC
53 points
5 comments18 min readLW link

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

30 Aug 2023 17:36 UTC
17 points
0 comments8 min readLW link
(arxiv.org)

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
75 points
14 comments17 min readLW link

Edge Cases in AI Alignment

Florian_Dietz24 Mar 2025 9:27 UTC
19 points
3 comments4 min readLW link

Understanding and controlling auto-induced distributional shift

L Rudolf L13 Dec 2021 14:59 UTC
33 points
4 comments16 min readLW link

Language Models Model Us

eggsyntax17 May 2024 21:00 UTC
159 points
56 comments7 min readLW link1 review

Evaluating hidden directions on the utility dataset: classification, steering and removal

25 Sep 2023 17:19 UTC
25 points
3 comments7 min readLW link

Irrationality as a Defense Mechanism for Reward-hacking

Ashe Vazquez Nuñez18 Jan 2026 3:57 UTC
47 points
7 comments4 min readLW link

Activation adding experiments with llama-7b

Nina Panickssery16 Jul 2023 4:17 UTC
51 points
1 comment3 min readLW link

Early Experiments in Human Auditing for AI Control

23 Jan 2025 1:34 UTC
28 points
1 comment7 min readLW link

The Alignment Problems

Martín Soto12 Jan 2023 22:29 UTC
20 points
0 comments4 min readLW link

A circuit for Python docstrings in a 4-layer attention-only transformer

20 Feb 2023 19:35 UTC
96 points
8 comments21 min readLW link

Eliciting base models with simple unsupervised techniques

23 Jan 2026 18:06 UTC
34 points
2 comments8 min readLW link

Information Loss --> Basin flatness

Vivek Hebbar21 May 2022 12:58 UTC
62 points
31 comments7 min readLW link

Decoding intermediate activations in llama-2-7b

Nina Panickssery21 Jul 2023 5:35 UTC
39 points
3 comments4 min readLW link

Working towards AI alignment is better

Johannes C. Mayer9 Dec 2022 15:39 UTC
8 points
2 comments2 min readLW link

Building Black-box Scheming Monitors

29 Jul 2025 17:41 UTC
45 points
18 comments11 min readLW link

The Shard Theory Alignment Scheme

David Udell25 Aug 2022 4:52 UTC
47 points
32 comments2 min readLW link

Theoretical Neuroscience For Alignment Theory

Cameron Berg7 Dec 2021 21:50 UTC
66 points
18 comments23 min readLW link

Mesa-optimization for goals defined only within a training environment is dangerous

Rubi J. Hudson17 Aug 2022 3:56 UTC
6 points
2 comments4 min readLW link

How transparency changed over time

ViktoriaMalyasova30 Jul 2022 4:36 UTC
21 points
0 comments6 min readLW link

Early Signs of Steganographic Capabilities in Frontier LLMs

4 Jul 2025 16:36 UTC
33 points
5 comments2 min readLW link

An open letter to SERI MATS program organisers

Roman Leventov20 Apr 2023 16:34 UTC
26 points
26 comments4 min readLW link

Spooky action at a distance in the loss landscape

28 Jan 2023 0:22 UTC
62 points
4 comments7 min readLW link
(www.jessehoogland.com)

Deception?! I ain’t got time for that!

Paul Colognese18 Jul 2022 0:06 UTC
55 points
5 comments13 min readLW link

Improving Model-Written Evals for AI Safety Benchmarking

15 Oct 2024 18:25 UTC
30 points
0 comments18 min readLW link

Foresight for AGI Safety Strategy: Mitigating Risks and Identifying Golden Opportunities

jacquesthibs5 Dec 2022 16:09 UTC
28 points
6 comments8 min readLW link

[Paper] All’s Fair In Love And Love: Copy Suppression in GPT-2 Small

13 Oct 2023 18:32 UTC
82 points
4 comments8 min readLW link

Bridging the VLM and mech interp communities for multimodal interpretability

Sonia Joseph28 Oct 2024 14:41 UTC
19 points
5 comments15 min readLW link

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

10 Feb 2026 17:29 UTC
17 points
0 comments1 min readLW link
(arxiv.org)

Apply to MATS Summer 2026!

18 Dec 2025 1:51 UTC
31 points
0 comments1 min readLW link

Non-Unitary Quantum Logic—SERI MATS Research Sprint

Yegreg16 Feb 2023 19:31 UTC
27 points
0 comments7 min readLW link