Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds · 5 Oct 2023 21:01 UTC
286 points
21 comments · 2 min read · LW link
(transformer-circuits.pub)

Book Review: Going Infinite

Zvi · 24 Oct 2023 15:00 UTC
240 points
109 comments · 97 min read · LW link
(thezvi.wordpress.com)

Alignment Implications of LLM Successes: a Debate in One Act

Zack_M_Davis · 21 Oct 2023 15:22 UTC
238 points
50 comments · 13 min read · LW link

Announcing MIRI’s new CEO and leadership team

Gretta Duleba · 10 Oct 2023 19:22 UTC
220 points
52 comments · 3 min read · LW link

Thoughts on responsible scaling policies and regulation

paulfchristiano · 24 Oct 2023 22:21 UTC
214 points
33 comments · 6 min read · LW link

We’re Not Ready: thoughts on “pausing” and responsible scaling policies

HoldenKarnofsky · 27 Oct 2023 15:19 UTC
199 points
33 comments · 8 min read · LW link

Labs should be explicit about why they are building AGI

peterbarnett · 17 Oct 2023 21:09 UTC
187 points
16 comments · 1 min read · LW link

Announcing Timaeus

22 Oct 2023 11:59 UTC
186 points
15 comments · 4 min read · LW link

AI as a science, and three obstacles to alignment strategies

So8res · 25 Oct 2023 21:00 UTC
175 points
79 comments · 11 min read · LW link

Architects of Our Own Demise: We Should Stop Developing AI

Roko · 26 Oct 2023 0:36 UTC
174 points
74 comments · 3 min read · LW link

President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence

Tristan Williams · 30 Oct 2023 11:15 UTC
170 points
39 comments · 1 min read · LW link
(www.whitehouse.gov)

Thomas Kwa’s MIRI research experience

2 Oct 2023 16:42 UTC
169 points
52 comments · 1 min read · LW link

RSPs are pauses done right

evhub · 14 Oct 2023 4:06 UTC
164 points
70 comments · 7 min read · LW link

Evaluating the historical value misspecification argument

Matthew Barnett · 5 Oct 2023 18:34 UTC
162 points
140 comments · 7 min read · LW link

Holly Elmore and Rob Miles dialogue on AI Safety Advocacy

20 Oct 2023 21:04 UTC
157 points
30 comments · 27 min read · LW link

Announcing Dialogues

Ben Pace · 7 Oct 2023 2:57 UTC
154 points
51 comments · 4 min read · LW link

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

12 Oct 2023 19:58 UTC
148 points
29 comments · 14 min read · LW link

Will no one rid me of this turbulent pest?

Metacelsus · 14 Oct 2023 15:27 UTC
148 points
23 comments · 10 min read · LW link
(denovo.substack.com)

Comp Sci in 2027 (Short story by Eliezer Yudkowsky)

sudo · 29 Oct 2023 23:09 UTC
141 points
22 comments · 10 min read · LW link
(nitter.net)

Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI · 7 Oct 2023 23:30 UTC
136 points
8 comments · 4 min read · LW link

At 87, Pearl is still able to change his mind

rotatingpaguro · 18 Oct 2023 4:46 UTC
136 points
15 comments · 5 min read · LW link

Response to Quintin Pope’s Evolution Provides No Evidence For the Sharp Left Turn

Zvi · 5 Oct 2023 11:39 UTC
129 points
29 comments · 9 min read · LW link

Graphical tensor notation for interpretability

Jordan Taylor · 4 Oct 2023 8:04 UTC
129 points
11 comments · 19 min read · LW link

Don’t Dismiss Simple Alignment Approaches

Chris_Leong · 7 Oct 2023 0:35 UTC
127 points
8 comments · 4 min read · LW link

The 99% principle for personal problems

Kaj_Sotala · 2 Oct 2023 8:20 UTC
125 points
20 comments · 2 min read · LW link
(kajsotala.fi)

Goodhart’s Law in Reinforcement Learning

16 Oct 2023 0:54 UTC
125 points
22 comments · 7 min read · LW link

Stampy’s AI Safety Info soft launch

5 Oct 2023 22:13 UTC
120 points
9 comments · 2 min read · LW link

unRLHF—Efficiently undoing LLM safeguards

12 Oct 2023 19:58 UTC
117 points
15 comments · 20 min read · LW link

Revealing Intentionality In Language Models Through AdaVAE Guided Sampling

jdp · 20 Oct 2023 7:32 UTC
117 points
14 comments · 22 min read · LW link

I Would Have Solved Alignment, But I Was Worried That Would Advance Timelines

307th · 20 Oct 2023 16:37 UTC
115 points
32 comments · 9 min read · LW link

Responsible Scaling Policies Are Risk Management Done Wrong

simeon_c · 25 Oct 2023 23:46 UTC
114 points
33 comments · 22 min read · LW link
(www.navigatingrisks.ai)

A new intro to Quantum Physics, with the math fixed

titotal · 29 Oct 2023 15:11 UTC
112 points
22 comments · 17 min read · LW link
(titotal.substack.com)

The Witching Hour

Richard_Ngo · 10 Oct 2023 0:19 UTC
110 points
0 comments · 10 min read · LW link
(www.narrativeark.xyz)

Apply for MATS Winter 2023-24!

21 Oct 2023 2:27 UTC
106 points
6 comments · 5 min read · LW link

Charbel-Raphaël and Lucius discuss Interpretability

30 Oct 2023 5:50 UTC
104 points
7 comments · 21 min read · LW link

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

23 Oct 2023 16:37 UTC
101 points
3 comments · 8 min read · LW link

TOMORROW: the largest AI Safety protest ever!

Holly_Elmore · 20 Oct 2023 18:15 UTC
101 points
25 comments · 2 min read · LW link

What’s up with “Responsible Scaling Policies”?

29 Oct 2023 4:17 UTC
99 points
8 comments · 20 min read · LW link

What’s Hard About The Shutdown Problem

johnswentworth · 20 Oct 2023 21:13 UTC
98 points
31 comments · 4 min read · LW link

Truthseeking when your disagreements lie in moral philosophy

10 Oct 2023 0:00 UTC
98 points
4 comments · 4 min read · LW link
(acesounderglass.com)

I don’t find the lie detection results that surprising (by an author of the paper)

JanB · 4 Oct 2023 17:10 UTC
97 points
8 comments · 3 min read · LW link

[Question] Lying to chess players for alignment

Zane · 25 Oct 2023 17:47 UTC
96 points
54 comments · 1 min read · LW link

Value systematization: how values become coherent (and misaligned)

Richard_Ngo · 27 Oct 2023 19:06 UTC
95 points
47 comments · 13 min read · LW link

Symbol/Referent Confusions in Language Model Alignment Experiments

johnswentworth · 26 Oct 2023 19:49 UTC
93 points
44 comments · 6 min read · LW link

Trying to understand John Wentworth’s research agenda

20 Oct 2023 0:05 UTC
92 points
11 comments · 12 min read · LW link

Linkpost: They Studied Dishonesty. Was Their Work a Lie?

Linch · 2 Oct 2023 8:10 UTC
91 points
12 comments · 2 min read · LW link
(www.newyorker.com)

Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper

Neel Nanda · 23 Oct 2023 22:38 UTC
91 points
12 comments · 9 min read · LW link

Linkpost: A Post Mortem on the Gino Case

Linch · 24 Oct 2023 6:50 UTC
89 points
7 comments · 2 min read · LW link
(www.theorgplumber.com)

Techno-humanism is techno-optimism for the 21st century

Richard_Ngo · 27 Oct 2023 18:37 UTC
88 points
5 comments · 14 min read · LW link
(www.mindthefuture.info)

Improving the Welfare of AIs: A Nearcasted Proposal

ryan_greenblatt · 30 Oct 2023 14:51 UTC
87 points
5 comments · 20 min read · LW link