Will Jesus Christ return in an election year?

Eric Neyman · 24 Mar 2025 16:50 UTC
405 points
59 comments · 4 min read · LW link
(ericneyman.wordpress.com)

A Bear Case: My Predictions Regarding AI Progress

Thane Ruthenis · 5 Mar 2025 16:41 UTC
377 points
163 comments · 9 min read · LW link

Recent AI model progress feels mostly like bullshit

lc · 24 Mar 2025 19:28 UTC
356 points
85 comments · 8 min read · LW link
(zeropath.com)

Policy for LLM Writing on LessWrong

jimrandomh · 24 Mar 2025 21:41 UTC
334 points
71 comments · 2 min read · LW link

Tracing the Thoughts of a Large Language Model

Adam Jermyn · 27 Mar 2025 17:20 UTC
305 points
24 comments · 10 min read · LW link
(www.anthropic.com)

Good Research Takes are Not Sufficient for Good Strategic Takes

Neel Nanda · 22 Mar 2025 10:13 UTC
292 points
28 comments · 4 min read · LW link
(www.neelnanda.io)

Trojan Sky

Richard_Ngo · 11 Mar 2025 3:14 UTC
252 points
39 comments · 12 min read · LW link
(www.narrativeark.xyz)

METR: Measuring AI Ability to Complete Long Tasks

Zach Stein-Perlman · 19 Mar 2025 16:00 UTC
241 points
106 comments · 5 min read · LW link
(metr.org)

Explaining British Naval Dominance During the Age of Sail

Arjun Panickssery · 28 Mar 2025 5:47 UTC
206 points
16 comments · 4 min read · LW link
(arjunpanickssery.substack.com)

Why White-Box Redteaming Makes Me Feel Weird

Zygi Straznickas · 16 Mar 2025 18:54 UTC
206 points
36 comments · 3 min read · LW link

Intention to Treat

Alicorn · 20 Mar 2025 20:01 UTC
200 points
5 comments · 2 min read · LW link

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

17 Mar 2025 19:11 UTC
184 points
9 comments · 6 min read · LW link

OpenAI: Detecting misbehavior in frontier reasoning models

Daniel Kokotajlo · 11 Mar 2025 2:17 UTC
183 points
26 comments · 4 min read · LW link
(openai.com)

So how well is Claude playing Pokémon?

Julian Bradshaw · 7 Mar 2025 5:54 UTC
171 points
76 comments · 5 min read · LW link

On the Rationality of Deterring ASI

Dan H · 5 Mar 2025 16:11 UTC
168 points
34 comments · 4 min read · LW link
(nationalsecurity.ai)

Reducing LLM deception at scale with self-other overlap fine-tuning

13 Mar 2025 19:09 UTC
162 points
46 comments · 6 min read · LW link

I make several million dollars per year and have hundreds of thousands of followers—what is the straightest line path to utilizing these resources to reduce existential-level AI threats?

shrimpy · 16 Mar 2025 16:52 UTC
161 points
26 comments · 1 min read · LW link

Self-fulfilling misalignment data might be poisoning our AI models

TurnTrout · 2 Mar 2025 19:51 UTC
154 points
29 comments · 1 min read · LW link
(turntrout.com)

Statistical Challenges with Making Super IQ babies

Jan Christian Refsgaard · 2 Mar 2025 20:26 UTC
154 points
26 comments · 9 min read · LW link

Conceptual Rounding Errors

Jan_Kulveit · 26 Mar 2025 19:00 UTC
151 points
15 comments · 3 min read · LW link
(boundedlyrational.substack.com)

The Most Forbidden Technique

Zvi · 12 Mar 2025 13:20 UTC
150 points
9 comments · 17 min read · LW link
(thezvi.wordpress.com)

Methods for strong human germline engineering

TsviBT · 3 Mar 2025 8:13 UTC
149 points
29 comments · 108 min read · LW link

The Hidden Cost of Our Lies to AI

Nicholas Andresen · 6 Mar 2025 5:03 UTC
145 points
18 comments · 7 min read · LW link
(substack.com)

The Milton Friedman Model of Policy Change

JohnofCharleston · 4 Mar 2025 0:38 UTC
143 points
17 comments · 4 min read · LW link

Auditing language models for hidden objectives

13 Mar 2025 19:18 UTC
141 points
15 comments · 13 min read · LW link

[Question] How Much Are LLMs Actually Boosting Real-World Programmer Productivity?

Thane Ruthenis · 4 Mar 2025 16:23 UTC
141 points
52 comments · 3 min read · LW link

Anthropic, and taking “technical philosophy” more seriously

Raemon · 13 Mar 2025 1:48 UTC
139 points
29 comments · 11 min read · LW link

The Pando Problem: Rethinking AI Individuality

Jan_Kulveit · 28 Mar 2025 21:03 UTC
133 points
14 comments · 13 min read · LW link

Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases

Fabien Roger · 11 Mar 2025 11:52 UTC
127 points
23 comments · 11 min read · LW link
(alignment.anthropic.com)

How I’ve run major projects

benkuhn · 16 Mar 2025 18:40 UTC
127 points
10 comments · 8 min read · LW link
(www.benkuhn.net)

Do models say what they learn?

22 Mar 2025 15:19 UTC
126 points
12 comments · 13 min read · LW link

[Question] when will LLMs become human-level bloggers?

nostalgebraist · 9 Mar 2025 21:10 UTC
125 points
34 comments · 6 min read · LW link

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

26 Mar 2025 19:07 UTC
113 points
15 comments · 29 min read · LW link
(deepmindsafetyresearch.medium.com)

2024 Unofficial LessWrong Survey Results

Screwtape · 14 Mar 2025 22:29 UTC
110 points
28 comments · 48 min read · LW link

How I talk to those above me

Maxwell Peterson · 30 Mar 2025 6:54 UTC
104 points
16 comments · 8 min read · LW link

AI Control May Increase Existential Risk

Jan_Kulveit · 11 Mar 2025 14:30 UTC
101 points
13 comments · 1 min read · LW link

Third-wave AI safety needs sociopolitical thinking

Richard_Ngo · 27 Mar 2025 0:55 UTC
99 points
23 comments · 26 min read · LW link

What the Headlines Miss About the Latest Decision in the Musk vs. OpenAI Lawsuit

garrison · 6 Mar 2025 19:49 UTC
98 points
0 comments · 6 min read · LW link
(garrisonlovely.substack.com)

Vacuum Decay: Expert Survey Results

JessRiedel · 13 Mar 2025 18:31 UTC
96 points
26 comments · 13 min read · LW link

Towards a scale-free theory of intelligent agency

Richard_Ngo · 21 Mar 2025 1:39 UTC
96 points
45 comments · 13 min read · LW link
(www.mindthefuture.info)

Elite Coordination via the Consensus of Power

Richard_Ngo · 19 Mar 2025 6:56 UTC
92 points
15 comments · 12 min read · LW link
(www.mindthefuture.info)

How I force LLMs to generate correct code

claudio · 21 Mar 2025 14:40 UTC
91 points
7 comments · 5 min read · LW link

We should start looking for scheming “in the wild”

Marius Hobbhahn · 6 Mar 2025 13:49 UTC
91 points
4 comments · 5 min read · LW link

What goals will AIs have? A list of hypotheses

Daniel Kokotajlo · 3 Mar 2025 20:08 UTC
88 points
20 comments · 18 min read · LW link

Elon Musk May Be Transitioning to Bipolar Type I

Cyborg25 · 11 Mar 2025 17:45 UTC
86 points
22 comments · 4 min read · LW link

Open problems in emergent misalignment

1 Mar 2025 9:47 UTC
83 points
17 comments · 7 min read · LW link

OpenAI #11: America Action Plan

Zvi · 18 Mar 2025 12:50 UTC
83 points
3 comments · 6 min read · LW link
(thezvi.wordpress.com)

Mistral Large 2 (123B) seems to exhibit alignment faking

27 Mar 2025 15:39 UTC
81 points
4 comments · 13 min read · LW link

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

18 Mar 2025 14:48 UTC
80 points
12 comments · 5 min read · LW link

Eukaryote Skips Town—Why I’m leaving DC

eukaryote · 26 Mar 2025 17:16 UTC
80 points
1 comment · 6 min read · LW link
(eukaryotewritesblog.com)