Reliable Sources: The Story of David Gerard

TracingWoodgrains

10 Jul 2024 19:50 UTC

411 points

56 comments, 43 min read, LW link, 2 reviews

Universal Basic Income and Poverty

Eliezer Yudkowsky

26 Jul 2024 7:23 UTC

363 points

150 comments, 9 min read, LW link, 1 review

80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly)

Raemon

3 Jul 2024 20:34 UTC

274 points

71 comments, 3 min read, LW link

Self-Other Overlap: A Neglected Approach to AI Alignment

Marc Carauleanu, Mike Vaiana, Kvee, Diogo de Lucena, Cameron Berg and Trent Hodgeson

30 Jul 2024 16:22 UTC

247 points

53 comments, 12 min read, LW link, 2 reviews

Towards more cooperative AI safety strategies

Richard_Ngo

16 Jul 2024 4:36 UTC

236 points

134 comments, 4 min read, LW link, 1 review

Optimistic Assumptions, Longterm Planning, and “Cope”

Raemon

17 Jul 2024 22:14 UTC

229 points

47 comments, 7 min read, LW link, 1 review

Superbabies: Putting The Pieces Together

sarahconstantin

11 Jul 2024 20:40 UTC

221 points

42 comments, 10 min read, LW link, 3 reviews

(sarahconstantin.substack.com)

This is already your second chance

Malmesbury

28 Jul 2024 17:13 UTC

201 points

13 comments, 8 min read, LW link

Safety consultations for AI lab employees

Zach Stein-Perlman

27 Jul 2024 15:00 UTC

183 points

6 comments, 1 min read, LW link

Decomposing Agency — capabilities without desires

owencb and Raymond Douglas

11 Jul 2024 9:38 UTC

157 points

33 comments, 12 min read, LW link, 1 review

(strangecities.substack.com)

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda

7 Jul 2024 17:39 UTC

145 points

17 comments, 25 min read, LW link, 1 review

On saying “Thank you” instead of “I’m Sorry”

Michael Cohn

8 Jul 2024 3:13 UTC

138 points

16 comments, 3 min read, LW link

“AI achieves silver-medal standard solving International Mathematical Olympiad problems”

gjm

25 Jul 2024 15:58 UTC

133 points

38 comments, 2 min read, LW link

(deepmind.google)

Pantheon Interface

Niki Dupuis and Sofia Vanhanen

8 Jul 2024 19:03 UTC

129 points

22 comments, 6 min read, LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex and Nicholas Goldowsky-Dill

18 Jul 2024 14:15 UTC

127 points

18 comments, 18 min read, LW link

What are you getting paid in?

Austin Chen

17 Jul 2024 19:23 UTC

120 points

16 comments, 4 min read, LW link, 1 review

(www.approachwithalacrity.com)

Dialogue introduction to Singular Learning Theory

Olli Järviniemi

8 Jul 2024 16:58 UTC

119 points

16 comments, 8 min read, LW link, 1 review

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide

22 Jul 2024 18:45 UTC

118 points

20 comments, 12 min read, LW link

I found >800 orthogonal “write code” steering vectors

Jacob G-W and TurnTrout

15 Jul 2024 19:06 UTC

114 points

20 comments, 7 min read, LW link

(jacobgw.com)

Most smart and skilled people are outside of the EA/rationalist community: an analysis

titotal

12 Jul 2024 12:13 UTC

112 points

39 comments, 14 min read, LW link

(open.substack.com)

Introduction to French AI Policy

Lucie Philippon

4 Jul 2024 3:39 UTC

112 points

12 comments, 6 min read, LW link

You should go to ML conferences

Jan_Kulveit

24 Jul 2024 11:47 UTC

112 points

13 comments, 4 min read, LW link

OthelloGPT learned a bag of heuristics

Jennifer Lin, JackS, Adam Karvonen and Can

2 Jul 2024 9:12 UTC

111 points

10 comments, 9 min read, LW link

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

L Rudolf L, bilalchughtai, Jan Betley, kaivu, Jérémy Scheurer, Mikita Balesni, Alex Meinke, Owain_Evans and Marius Hobbhahn

8 Jul 2024 22:24 UTC

109 points

40 comments, 5 min read, LW link, 1 review

A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication

johnswentworth and David Lorell

26 Jul 2024 0:33 UTC

107 points

8 comments, 13 min read, LW link, 1 review

A simple model of math skill

Alex_Altair

21 Jul 2024 18:57 UTC

107 points

17 comments, 8 min read, LW link

Poker is a bad game for teaching epistemics. Figgie is a better one.

rossry

8 Jul 2024 6:05 UTC

106 points

47 comments, 11 min read, LW link

(blog.rossry.net)

Transformer Circuit Faithfulness Metrics Are Not Robust

Joseph Miller, bilalchughtai and William_S

12 Jul 2024 3:47 UTC

104 points

5 comments, 7 min read, LW link

(arxiv.org)

Against Aschenbrenner: How ‘Situational Awareness’ constructs a narrative that undermines safety and threatens humanity

GideonF

15 Jul 2024 18:37 UTC

104 points

17 comments, 21 min read, LW link

(forum.effectivealtruism.org)

Covert Malicious Finetuning

Tony Wang and dannyhalawi

2 Jul 2024 2:41 UTC

103 points

4 comments, 3 min read, LW link

Reflections on Less Online

Error

7 Jul 2024 3:49 UTC

92 points

15 comments, 18 min read, LW link

New page: Integrity

Zach Stein-Perlman

10 Jul 2024 15:00 UTC

91 points

3 comments, 1 min read, LW link

AI #73: Openly Evil AI

Zvi

18 Jul 2024 14:40 UTC

89 points

20 comments, 52 min read, LW link

(thezvi.wordpress.com)

Re: Anthropic’s suggested SB-1047 amendments

RobertM

27 Jul 2024 22:32 UTC

87 points

13 comments, 9 min read, LW link

(www.documentcloud.org)

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

keith_wynroe and Lee Sharkey

2 Jul 2024 13:17 UTC

87 points

7 comments, 12 min read, LW link

Fluent, Cruxy Predictions

Raemon

10 Jul 2024 18:00 UTC

86 points

18 comments, 14 min read, LW link, 1 review

A simple case for extreme inner misalignment

Richard_Ngo

13 Jul 2024 15:40 UTC

86 points

41 comments, 7 min read, LW link

Scalable oversight as a quantitative rather than qualitative problem

Buck

6 Jul 2024 17:42 UTC

86 points

11 comments, 3 min read, LW link

3C’s: A Recipe For Mathing Concepts

johnswentworth and David Lorell

3 Jul 2024 1:06 UTC

84 points

6 comments, 7 min read, LW link

Consider the humble rock (or: why the dumb thing kills you)

Ouro

4 Jul 2024 13:54 UTC

79 points

12 comments, 4 min read, LW link, 1 review

D&D.Sci Scenario Index

aphyer and abstractapplic

23 Jul 2024 2:00 UTC

78 points

1 comment, 3 min read, LW link, 1 review

On the CrowdStrike Incident

Zvi

22 Jul 2024 12:40 UTC

75 points

14 comments, 17 min read, LW link

(thezvi.wordpress.com)

Interpreting Preference Models w/ Sparse Autoencoders

Logan Riggs and Jannik Brinkmann

1 Jul 2024 21:35 UTC

75 points

12 comments, 9 min read, LW link

LK-99 in retrospect

bhauth

7 Jul 2024 2:06 UTC

74 points

21 comments, 3 min read, LW link

(www.bhauth.com)

Multiplex Gene Editing: Where Are We Now?

sarahconstantin

16 Jul 2024 20:50 UTC

73 points

6 comments, 7 min read, LW link

(sarahconstantin.substack.com)

Yoshua Bengio: Reasoning through arguments against taking AI safety seriously

Kvee

11 Jul 2024 23:53 UTC

72 points

3 comments, 1 min read, LW link

(yoshuabengio.org)

Friendship is transactional, unconditional friendship is insurance

Ruby

17 Jul 2024 22:52 UTC

70 points

25 comments, 2 min read, LW link, 1 review

A framework for thinking about AI power-seeking

Joe Carlsmith

24 Jul 2024 22:41 UTC

70 points

15 comments, 16 min read, LW link

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

Axel Højmark, fidgetsinner, Arjun Panickssery, Marius Hobbhahn and Jérémy Scheurer

22 Jul 2024 16:17 UTC

69 points

0 comments, 16 min read, LW link

Indecision and internalized authority figures

Kaj_Sotala

6 Jul 2024 10:10 UTC

69 points

1 comment, 2 min read, LW link

(kajsotala.fi)