
There is way too much serendipity

Malmesbury

19 Jan 2024 19:37 UTC

396 points

58 comments 7 min read LW link 1 review

Gentleness and the artificial Other

Joe Carlsmith

2 Jan 2024 18:21 UTC

321 points

34 comments 11 min read LW link 1 review

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer and Ethan Perez

12 Jan 2024 19:51 UTC

310 points

95 comments 3 min read LW link

(arxiv.org)

The case for ensuring that powerful AIs are controlled

ryan_greenblatt and Buck

24 Jan 2024 16:11 UTC

275 points

74 comments 28 min read LW link 1 review

MIRI 2024 Mission and Strategy Update

Malo

5 Jan 2024 0:20 UTC

223 points

44 comments 8 min read LW link

The impossible problem of due process

mingyuan

16 Jan 2024 5:18 UTC

222 points

71 comments 14 min read LW link 3 reviews

Toward A Mathematical Framework for Computation in Superposition

Dmitry Vaintrob, jake_mendel and Kaarel

18 Jan 2024 21:06 UTC

213 points

19 comments 63 min read LW link

This might be the last AI Safety Camp

Remmelt and Linda Linsefors

24 Jan 2024 9:33 UTC

198 points

34 comments 1 min read LW link

Introducing Alignment Stress-Testing at Anthropic

evhub

12 Jan 2024 23:51 UTC

182 points

23 comments 2 min read LW link

Making every researcher seek grants is a broken model

jasoncrawford

26 Jan 2024 16:06 UTC

178 points

42 comments 4 min read LW link 1 review

(rootsofprogress.org)

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

Jeremy Gillen and peterbarnett

26 Jan 2024 7:22 UTC

161 points

64 comments 57 min read LW link 2 reviews

What’s up with LLMs representing XORs of arbitrary features?

Sam Marks

3 Jan 2024 19:44 UTC

159 points

64 comments 16 min read LW link

Apologizing is a Core Rationalist Skill

johnswentworth

2 Jan 2024 17:47 UTC

156 points

42 comments 5 min read LW link

What good is G-factor if you’re dumped in the woods? A field report from a camp counselor.

Hastings

12 Jan 2024 13:17 UTC

156 points

24 comments 1 min read LW link

Deep atheism and AI risk

Joe Carlsmith

4 Jan 2024 18:58 UTC

155 points

24 comments 27 min read LW link 2 reviews

Notice When People Are Directionally Correct

Chris_Leong

14 Jan 2024 14:12 UTC

153 points

15 comments 2 min read LW link

The case for training frontier AIs on Sumerian-only corpus

Alexandre Variengien, Charbel-Raphaël and Jonathan Claybrough

15 Jan 2024 16:40 UTC

143 points

16 comments 3 min read LW link

Processor clock speeds are not how fast AIs think

Ege Erdil

29 Jan 2024 14:39 UTC

142 points

55 comments 2 min read LW link

Steering Llama-2 with contrastive activation additions

Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub and TurnTrout

2 Jan 2024 0:47 UTC

125 points

29 comments 8 min read LW link

(arxiv.org)

A Shutdown Problem Proposal

johnswentworth and David Lorell

21 Jan 2024 18:12 UTC

125 points

61 comments 6 min read LW link

An even deeper atheism

Joe Carlsmith

11 Jan 2024 17:28 UTC

125 points

48 comments 15 min read LW link 1 review

Gender Exploration

sapphire

14 Jan 2024 18:57 UTC

122 points

27 comments 5 min read LW link 1 review

(open.substack.com)

Why I take short timelines seriously

NicholasKees

28 Jan 2024 22:27 UTC

122 points

29 comments 4 min read LW link

The case for more ambitious language model evals

Jozdien

30 Jan 2024 0:01 UTC

118 points

30 comments 5 min read LW link

Catching AIs red-handed

ryan_greenblatt and Buck

5 Jan 2024 17:43 UTC

113 points

28 comments 17 min read LW link 1 review

Four visions of Transformative AI success

Steven Byrnes

17 Jan 2024 20:45 UTC

113 points

22 comments 15 min read LW link

Practically A Book Review: Appendix to “Nonlinear’s Evidence: Debunking False and Misleading Claims” (ThingOfThings)

tailcalled

3 Jan 2024 17:07 UTC

111 points

25 comments 2 min read LW link

(thingofthings.substack.com)

Being nicer than Clippy

Joe Carlsmith

16 Jan 2024 19:44 UTC

110 points

32 comments 27 min read LW link

‘ petertodd’’s last stand: The final days of open GPT-3 research

mwatkins

22 Jan 2024 18:47 UTC

109 points

16 comments 45 min read LW link

2023 in AI predictions

jessicata

1 Jan 2024 5:23 UTC

109 points

35 comments 5 min read LW link

Deceptive AI ≠ Deceptively-aligned AI

Steven Byrnes

7 Jan 2024 16:55 UTC

107 points

19 comments 6 min read LW link

Almost everyone I’ve met would be well-served thinking more about what to focus on

Henrik Karlsson

5 Jan 2024 21:01 UTC

98 points

9 comments 11 min read LW link 1 review

(www.henrikkarlsson.xyz)

On the abolition of man

Joe Carlsmith

18 Jan 2024 18:17 UTC

98 points

19 comments 41 min read LW link 1 review

RAND report finds no effect of current LLMs on viability of bioterrorism attacks

StellaAthena

25 Jan 2024 19:17 UTC

94 points

14 comments 1 min read LW link

(www.rand.org)

The Aspiring Rationalist Congregation

maia

10 Jan 2024 22:52 UTC

91 points

23 comments 10 min read LW link

Epistemic Hell

rogersbacon

27 Jan 2024 17:13 UTC

85 points

20 comments 14 min read LW link

Sparse Autoencoders Work on Attention Layer Outputs

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

16 Jan 2024 0:26 UTC

85 points

9 comments 18 min read LW link

Palworld development blog post

bhauth

28 Jan 2024 5:56 UTC

84 points

13 comments 1 min read LW link

(note.com)

[Repost] The Copenhagen Interpretation of Ethics

mesaoptimizer

25 Jan 2024 15:20 UTC

83 points

4 comments 5 min read LW link

(web.archive.org)

Some Vacation Photos

johnswentworth

4 Jan 2024 17:15 UTC

83 points

0 comments 1 min read LW link

An Introduction To The Mandelbrot Set That Doesn’t Mention Complex Numbers

Yitz

17 Jan 2024 9:48 UTC

82 points

11 comments 9 min read LW link

Survey of 2,778 AI authors: six parts in pictures

KatjaGrace

6 Jan 2024 4:43 UTC

80 points

1 comment 2 min read LW link

Universal Love Integration Test: Hitler

Raemon

10 Jan 2024 23:55 UTC

77 points

65 comments 9 min read LW link

The True Story of How GPT-2 Became Maximally Lewd

Writer and Jai

18 Jan 2024 21:03 UTC

74 points

7 comments 6 min read LW link

(youtu.be)

When “yang” goes wrong

Joe Carlsmith

8 Jan 2024 16:35 UTC

74 points

6 comments 13 min read LW link

We need a Science of Evals

Marius Hobbhahn and Jérémy Scheurer

22 Jan 2024 20:30 UTC

74 points

13 comments 9 min read LW link

InterLab – a toolkit for experiments with multi-agent interactions

Tomáš Gavenčiak, Ada Böhm and Jan_Kulveit

22 Jan 2024 18:23 UTC

69 points

0 comments 8 min read LW link

(acsresearch.org)

Bayesian updating in real life is mostly about understanding your hypotheses

Max H

1 Jan 2024 0:10 UTC

68 points

4 comments 11 min read LW link

[Question] Will quantum randomness affect the 2028 election?

Thomas Kwa and habryka

24 Jan 2024 22:54 UTC

66 points

52 comments 1 min read LW link

OpenAI’s Preparedness Framework: Praise & Recommendations

Orpheus16

2 Jan 2024 16:20 UTC

66 points

1 comment 7 min read LW link