Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
496 points
85 comments · 10 min read · LW link · 3 reviews

Review: Planecrash

L Rudolf L · 27 Dec 2024 14:18 UTC
374 points
58 comments · 22 min read · LW link · 2 reviews
(nosetgauge.substack.com)

What Goes Without Saying

sarahconstantin · 20 Dec 2024 18:00 UTC
355 points
29 comments · 5 min read · LW link · 1 review
(sarahconstantin.substack.com)

Biological risk from the mirror world

jasoncrawford · 12 Dec 2024 19:07 UTC
336 points
39 comments · 7 min read · LW link · 1 review
(newsletter.rootsofprogress.org)

The Field of AI Alignment: A Postmortem, and What To Do About It

johnswentworth · 26 Dec 2024 18:48 UTC
322 points
176 comments · 8 min read · LW link · 3 reviews

By default, capital will matter more than ever after AGI

L Rudolf L · 28 Dec 2024 17:52 UTC
309 points
108 comments · 16 min read · LW link · 2 reviews
(nosetgauge.substack.com)

Orienting to 3 year AGI timelines

Nikola Jurkovic · 22 Dec 2024 1:15 UTC
298 points
63 comments · 8 min read · LW link · 2 reviews

A Three-Layer Model of LLM Psychology

Jan_Kulveit · 26 Dec 2024 16:49 UTC
250 points
17 comments · 8 min read · LW link · 2 reviews

Understanding Shapley Values with Venn Diagrams

Carson L · 6 Dec 2024 21:56 UTC
218 points
40 comments · 4 min read · LW link · 1 review
(medium.com)

Frontier Models are Capable of In-context Scheming

5 Dec 2024 22:11 UTC
211 points
24 comments · 7 min read · LW link

Communications in Hard Mode (My new job at MIRI)

tanagrabeast · 13 Dec 2024 20:13 UTC
209 points
25 comments · 5 min read · LW link

Shallow review of technical AI safety, 2024

29 Dec 2024 12:01 UTC
202 points
35 comments · 41 min read · LW link

When Is Insurance Worth It?

kqr · 19 Dec 2024 19:07 UTC
179 points
72 comments · 4 min read · LW link · 1 review
(entropicthoughts.com)

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

6 Dec 2024 22:19 UTC
177 points
15 comments · 11 min read · LW link · 1 review
(arxiv.org)

o1: A Technical Primer

Jesse Hoogland · 9 Dec 2024 19:09 UTC
172 points
19 comments · 9 min read · LW link
(www.youtube.com)

“Alignment Faking” frame is somewhat fake

Jan_Kulveit · 20 Dec 2024 9:51 UTC
166 points
16 comments · 6 min read · LW link · 1 review

Subskills of “Listening to Wisdom”

Raemon · 9 Dec 2024 3:01 UTC
165 points
33 comments · 42 min read · LW link · 1 review

The “Think It Faster” Exercise

Raemon · 11 Dec 2024 19:14 UTC
156 points
36 comments · 13 min read · LW link · 1 review

What o3 Becomes by 2028

Vladimir_Nesov · 22 Dec 2024 12:37 UTC
154 points
15 comments · 5 min read · LW link

o3

Zach Stein-Perlman · 20 Dec 2024 18:30 UTC
154 points
164 comments · 1 min read · LW link

Hire (or Become) a Thinking Assistant

Raemon · 23 Dec 2024 3:58 UTC
141 points
50 comments · 8 min read · LW link · 1 review

The Dangers of Mir­rored Life

12 Dec 2024 20:58 UTC
121 points
9 comments · 29 min read · LW link
(www.asimov.press)

A breakdown of AI capability levels focused on AI R&D labor acceleration

ryan_greenblatt · 22 Dec 2024 20:56 UTC
120 points
11 comments · 6 min read · LW link

AIs Will Increasingly Attempt Shenanigans

Zvi · 16 Dec 2024 15:20 UTC
119 points
2 comments · 26 min read · LW link
(thezvi.wordpress.com)

The Dream Machine

sarahconstantin · 5 Dec 2024 0:00 UTC
117 points
6 comments · 12 min read · LW link
(sarahconstantin.substack.com)

Ablations for “Frontier Models are Capable of In-context Scheming”

17 Dec 2024 23:58 UTC
116 points
1 comment · 2 min read · LW link

The o1 System Card Is Not About o1

Zvi · 13 Dec 2024 20:30 UTC
116 points
5 comments · 16 min read · LW link
(thezvi.wordpress.com)

Why I’m Moving from Mechanistic to Prosaic Interpretability

Daniel Tan · 30 Dec 2024 6:35 UTC
115 points
34 comments · 5 min read · LW link

How to replicate and extend our alignment faking demo

Fabien Roger · 19 Dec 2024 21:44 UTC
114 points
5 comments · 2 min read · LW link
(alignment.anthropic.com)

Sorry for the downtime, looks like we got DDosd

habryka · 2 Dec 2024 4:14 UTC
112 points
13 comments · 1 min read · LW link

The nihilism of NeurIPS

charlieoneill · 20 Dec 2024 23:58 UTC
107 points
6 comments · 4 min read · LW link

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

3 Dec 2024 21:19 UTC
107 points
8 comments · 41 min read · LW link

A shortcoming of concrete demonstrations as AGI risk advocacy

Steven Byrnes · 11 Dec 2024 16:48 UTC
106 points
27 comments · 2 min read · LW link

Takes on “Alignment Faking in Large Language Models”

Joe Carlsmith · 18 Dec 2024 18:22 UTC
105 points
7 comments · 62 min read · LW link

2024 Unofficial LessWrong Census/Survey

Screwtape · 2 Dec 2024 5:30 UTC
103 points
51 comments · 1 min read · LW link · 2 reviews

[Question] What are the strongest arguments for very short timelines?

Kaj_Sotala · 23 Dec 2024 9:38 UTC
102 points
79 comments · 1 min read · LW link

🇫🇷 Announcing CeSIA: The French Center for AI Safety

Charbel-Raphaël · 20 Dec 2024 14:17 UTC
101 points
2 comments · 8 min read · LW link

Matryoshka Sparse Autoencoders

Noa Nabeshima · 14 Dec 2024 2:52 UTC
98 points
15 comments · 11 min read · LW link

MIRI’s 2024 End-of-Year Update

Rob Bensinger · 3 Dec 2024 4:33 UTC
98 points
2 comments · 4 min read · LW link

Is “VNM-agent” one of several options, for what minds can grow up into?

AnnaSalamon · 30 Dec 2024 6:36 UTC
97 points
55 comments · 2 min read · LW link

Parable of the vanilla ice cream curse (and how it would prevent a car from starting!)

Mati_Roy · 8 Dec 2024 6:57 UTC
92 points
21 comments · 3 min read · LW link

Should you be worried about H5N1?

gw · 5 Dec 2024 21:11 UTC
89 points
2 comments · 5 min read · LW link
(www.georgeyw.com)

AIs Will Increasingly Fake Alignment

Zvi · 24 Dec 2024 13:00 UTC
89 points
0 comments · 52 min read · LW link
(thezvi.wordpress.com)

Circling as practice for “just be yourself”

Kaj_Sotala · 16 Dec 2024 7:40 UTC
87 points
6 comments · 4 min read · LW link
(kajsotala.fi)

Testing which LLM architectures can do hidden serial reasoning

Filip Sondej · 16 Dec 2024 13:48 UTC
84 points
9 comments · 4 min read · LW link

Effective Evil’s AI Misalignment Plan

lsusr · 15 Dec 2024 7:39 UTC
83 points
9 comments · 3 min read · LW link

Some arguments against a land value tax

Matthew Barnett · 29 Dec 2024 15:17 UTC
83 points
45 comments · 15 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
82 points
6 comments · 2 min read · LW link
(www.neuronpedia.org)

Remap your caps lock key

bilalchughtai · 15 Dec 2024 14:03 UTC
82 points
21 comments · 1 min read · LW link

Best-of-N Jailbreaking

14 Dec 2024 4:58 UTC
79 points
5 comments · 2 min read · LW link
(arxiv.org)