Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
496 points
85 comments · 10 min read · LW link · 3 reviews

Review: Planecrash

L Rudolf L · 27 Dec 2024 14:18 UTC
374 points
58 comments · 22 min read · LW link · 2 reviews
(nosetgauge.substack.com)

What Goes Without Saying

sarahconstantin · 20 Dec 2024 18:00 UTC
355 points
29 comments · 5 min read · LW link · 1 review
(sarahconstantin.substack.com)

Biological risk from the mirror world

jasoncrawford · 12 Dec 2024 19:07 UTC
336 points
39 comments · 7 min read · LW link · 1 review
(newsletter.rootsofprogress.org)

The Field of AI Alignment: A Postmortem, and What To Do About It

johnswentworth · 26 Dec 2024 18:48 UTC
322 points
176 comments · 8 min read · LW link · 3 reviews

By default, capital will matter more than ever after AGI

L Rudolf L · 28 Dec 2024 17:52 UTC
309 points
108 comments · 16 min read · LW link · 2 reviews
(nosetgauge.substack.com)

Orienting to 3 year AGI timelines

Nikola Jurkovic · 22 Dec 2024 1:15 UTC
298 points
63 comments · 8 min read · LW link · 2 reviews

A Three-Layer Model of LLM Psychology

Jan_Kulveit · 26 Dec 2024 16:49 UTC
250 points
17 comments · 8 min read · LW link · 2 reviews

Understanding Shapley Values with Venn Diagrams

Carson L · 6 Dec 2024 21:56 UTC
218 points
40 comments · 4 min read · LW link · 1 review
(medium.com)

Frontier Models are Capable of In-context Scheming

5 Dec 2024 22:11 UTC
211 points
24 comments · 7 min read · LW link

Communications in Hard Mode (My new job at MIRI)

tanagrabeast · 13 Dec 2024 20:13 UTC
209 points
25 comments · 5 min read · LW link

Shallow review of technical AI safety, 2024

29 Dec 2024 12:01 UTC
202 points
35 comments · 41 min read · LW link

When Is Insurance Worth It?

kqr · 19 Dec 2024 19:07 UTC
179 points
72 comments · 4 min read · LW link · 1 review
(entropicthoughts.com)

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

6 Dec 2024 22:19 UTC
177 points
15 comments · 11 min read · LW link · 1 review
(arxiv.org)

o1: A Technical Primer

Jesse Hoogland · 9 Dec 2024 19:09 UTC
172 points
19 comments · 9 min read · LW link
(www.youtube.com)

“Alignment Faking” frame is somewhat fake

Jan_Kulveit · 20 Dec 2024 9:51 UTC
166 points
16 comments · 6 min read · LW link · 1 review

Subskills of “Listening to Wisdom”

Raemon · 9 Dec 2024 3:01 UTC
165 points
33 comments · 42 min read · LW link · 1 review

The “Think It Faster” Exercise

Raemon · 11 Dec 2024 19:14 UTC
156 points
36 comments · 13 min read · LW link · 1 review

What o3 Becomes by 2028

Vladimir_Nesov · 22 Dec 2024 12:37 UTC
154 points
15 comments · 5 min read · LW link

o3

Zach Stein-Perlman · 20 Dec 2024 18:30 UTC
154 points
164 comments · 1 min read · LW link

Hire (or Become) a Thinking Assistant

Raemon · 23 Dec 2024 3:58 UTC
141 points
50 comments · 8 min read · LW link · 1 review

The Dangers of Mir­rored Life

12 Dec 2024 20:58 UTC
121 points
9 comments · 29 min read · LW link
(www.asimov.press)

A breakdown of AI capability levels focused on AI R&D labor acceleration

ryan_greenblatt · 22 Dec 2024 20:56 UTC
120 points
11 comments · 6 min read · LW link

AIs Will Increasingly Attempt Shenanigans

Zvi · 16 Dec 2024 15:20 UTC
119 points
2 comments · 26 min read · LW link
(thezvi.wordpress.com)

The Dream Machine

sarahconstantin · 5 Dec 2024 0:00 UTC
117 points
6 comments · 12 min read · LW link
(sarahconstantin.substack.com)

Ablations for “Frontier Models are Capable of In-context Scheming”

17 Dec 2024 23:58 UTC
116 points
1 comment · 2 min read · LW link

The o1 System Card Is Not About o1

Zvi · 13 Dec 2024 20:30 UTC
116 points
5 comments · 16 min read · LW link
(thezvi.wordpress.com)

Why I’m Moving from Mechanistic to Prosaic Interpretability

Daniel Tan · 30 Dec 2024 6:35 UTC
115 points
34 comments · 5 min read · LW link

How to replicate and extend our alignment faking demo

Fabien Roger · 19 Dec 2024 21:44 UTC
114 points
5 comments · 2 min read · LW link
(alignment.anthropic.com)

Sorry for the downtime, looks like we got DDosd

habryka · 2 Dec 2024 4:14 UTC
112 points
13 comments · 1 min read · LW link

The nihilism of NeurIPS

charlieoneill · 20 Dec 2024 23:58 UTC
107 points
6 comments · 4 min read · LW link

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

3 Dec 2024 21:19 UTC
107 points
8 comments · 41 min read · LW link

A shortcoming of concrete demonstrations as AGI risk advocacy

Steven Byrnes · 11 Dec 2024 16:48 UTC
106 points
27 comments · 2 min read · LW link

Takes on “Alignment Faking in Large Language Models”

Joe Carlsmith · 18 Dec 2024 18:22 UTC
105 points
7 comments · 62 min read · LW link

2024 Unofficial LessWrong Census/Survey

Screwtape · 2 Dec 2024 5:30 UTC
103 points
51 comments · 1 min read · LW link · 2 reviews

[Question] What are the strongest arguments for very short timelines?

Kaj_Sotala · 23 Dec 2024 9:38 UTC
102 points
79 comments · 1 min read · LW link

🇫🇷 Announcing CeSIA: The French Center for AI Safety

Charbel-Raphaël · 20 Dec 2024 14:17 UTC
101 points
2 comments · 8 min read · LW link

Matryoshka Sparse Autoencoders

Noa Nabeshima · 14 Dec 2024 2:52 UTC
98 points
15 comments · 11 min read · LW link

MIRI’s 2024 End-of-Year Update

Rob Bensinger · 3 Dec 2024 4:33 UTC
98 points
2 comments · 4 min read · LW link

Is “VNM-agent” one of several options, for what minds can grow up into?

AnnaSalamon · 30 Dec 2024 6:36 UTC
97 points
55 comments · 2 min read · LW link

Parable of the vanilla ice cream curse (and how it would prevent a car from starting!)

Mati_Roy · 8 Dec 2024 6:57 UTC
92 points
21 comments · 3 min read · LW link

Should you be worried about H5N1?

gw · 5 Dec 2024 21:11 UTC
89 points
2 comments · 5 min read · LW link
(www.georgeyw.com)

AIs Will Increasingly Fake Alignment

Zvi · 24 Dec 2024 13:00 UTC
89 points
0 comments · 52 min read · LW link
(thezvi.wordpress.com)

Circling as practice for “just be yourself”

Kaj_Sotala · 16 Dec 2024 7:40 UTC
87 points
6 comments · 4 min read · LW link
(kajsotala.fi)

Testing which LLM architectures can do hidden serial reasoning

Filip Sondej · 16 Dec 2024 13:48 UTC
84 points
9 comments · 4 min read · LW link

Effective Evil’s AI Misalignment Plan

lsusr · 15 Dec 2024 7:39 UTC
83 points
9 comments · 3 min read · LW link

Some arguments against a land value tax

Matthew Barnett · 29 Dec 2024 15:17 UTC
83 points
45 comments · 15 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
82 points
6 comments · 2 min read · LW link
(www.neuronpedia.org)

Remap your caps lock key

bilalchughtai · 15 Dec 2024 14:03 UTC
82 points
21 comments · 1 min read · LW link

Best-of-N Jailbreaking

14 Dec 2024 4:58 UTC
79 points
5 comments · 2 min read · LW link
(arxiv.org)