1 Apr 2024 7:33 UTC

589 points

185 comments 11 min read LW link

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai 16 Apr 2024 21:16 UTC

442 points

103 comments 12 min read LW link 1 review

Thoughts on seed oil

dynomight 20 Apr 2024 12:29 UTC

367 points

131 comments 17 min read LW link 1 review

(dynomight.net)

[April Fools’ Day] Introducing Open Asteroid Impact

Linch 1 Apr 2024 8:14 UTC

357 points

35 comments 1 min read LW link 3 reviews

(openasteroidimpact.org)

Express interest in an “FHI of the West”

habryka 18 Apr 2024 3:32 UTC

268 points

41 comments 3 min read LW link

Refusal in LLMs is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda

27 Apr 2024 11:13 UTC

260 points

96 comments 10 min read LW link 1 review

Paul Christiano named as US AI Safety Institute Head of AI Safety

Joel Burget 16 Apr 2024 16:22 UTC

257 points

61 comments 1 min read LW link

(www.commerce.gov)

Funny Anecdote of Eliezer From His Sister

Noah Birnbaum 22 Apr 2024 22:05 UTC

235 points

7 comments 2 min read LW link

[Question] Examples of Highly Counterfactual Discoveries?

johnswentworth 23 Apr 2024 22:19 UTC

204 points

116 comments 1 min read LW link

On Not Pulling The Ladder Up Behind You

Screwtape 26 Apr 2024 21:58 UTC

201 points

24 comments 9 min read LW link 2 reviews

OMMC Announces RIP

Adam Scholl and aysja

1 Apr 2024 23:20 UTC

194 points

6 comments 2 min read LW link 1 review

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer

johnswentworth and David Lorell

18 Apr 2024 0:27 UTC

190 points

21 comments 7 min read LW link

FHI (Future of Humanity Institute) has shut down (2005–2024)

gwern 17 Apr 2024 13:54 UTC

176 points

22 comments 1 min read LW link

(www.futureofhumanityinstitute.org)

Reconsider the anti-cavity bacteria if you are Asian

Lao Mein 15 Apr 2024 7:02 UTC

174 points

43 comments 4 min read LW link

Ironing Out the Squiggles

Zack_M_Davis 29 Apr 2024 16:13 UTC

171 points

37 comments 11 min read LW link

Priors and Prejudice

MathiasKB 22 Apr 2024 15:00 UTC

157 points

32 comments 7 min read LW link 1 review

Daniel Dennett has died (1942-2024)

kave 19 Apr 2024 16:17 UTC

151 points

5 comments 1 min read LW link

(dailynous.com)

When is a mind me?

Rob Bensinger 17 Apr 2024 5:56 UTC

148 points

134 comments 15 min read LW link

LLMs for Alignment Research: a safety priority?

abramdemski 4 Apr 2024 20:03 UTC

148 points

25 comments 11 min read LW link

A Dozen Ways to Get More Dakka

Davidmanheim 8 Apr 2024 4:45 UTC

143 points

12 comments 3 min read LW link 1 review

My experience using financial commitments to overcome akrasia

Will_Howard 15 Apr 2024 22:57 UTC

141 points

39 comments 18 min read LW link 1 review

Simple probes can catch sleeper agents

Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub

23 Apr 2024 21:10 UTC

131 points

21 comments 1 min read LW link

(www.anthropic.com)

Carl Sagan, nuking the moon, and not nuking the moon

eukaryote 13 Apr 2024 4:08 UTC

125 points

8 comments 6 min read LW link

(eukaryotewritesblog.com)

RTFB: On the New Proposed CAIP AI Bill

Zvi 10 Apr 2024 18:30 UTC

119 points

14 comments 34 min read LW link

(thezvi.wordpress.com)

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks 18 Apr 2024 16:17 UTC

117 points

10 comments 12 min read LW link

[Question] What convincing warning shot could help prevent extinction from AI?

Charbel-Raphaël and cozyfractal

13 Apr 2024 18:09 UTC

114 points

23 comments 2 min read LW link

A Selection of Randomly Selected SAE Features

CallumMcDougall and Joseph Bloom

1 Apr 2024 9:09 UTC

109 points

2 comments 4 min read LW link

The first future and the best future

KatjaGrace 25 Apr 2024 6:40 UTC

107 points

12 comments 1 min read LW link

(worldspiritsockpuppet.com)

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey 3 Apr 2024 12:34 UTC

97 points

23 comments 22 min read LW link

MIRI’s April 2024 Newsletter

Harlan 12 Apr 2024 23:38 UTC

95 points

0 comments 3 min read LW link

(intelligence.org)

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

hugofry 29 Apr 2024 20:57 UTC

94 points

9 comments 11 min read LW link

Partial value takeover without world takeover

KatjaGrace 5 Apr 2024 6:20 UTC

92 points

26 comments 3 min read LW link 1 review

(worldspiritsockpuppet.com)

Constructability: Plainly-coded AGIs may be feasible in the near future

Épiphanie Gédéon and Charbel-Raphaël

27 Apr 2024 16:04 UTC

91 points

15 comments 13 min read LW link

Rejecting Television

Declan Molony 23 Apr 2024 4:59 UTC

91 points

10 comments 6 min read LW link

The Inner Ring by C. S. Lewis

Saul Munn 24 Apr 2024 22:48 UTC

87 points

8 comments 13 min read LW link 1 review

(www.lewissociety.org)

Essay competition on the Automation of Wisdom and Philosophy — $25k in prizes

owencb and AI Impacts

16 Apr 2024 10:10 UTC

84 points

17 comments 8 min read LW link

(blog.aiimpacts.org)

Coherence of Caches and Agents

johnswentworth 1 Apr 2024 23:04 UTC

80 points

13 comments 11 min read LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

19 Apr 2024 19:06 UTC

80 points

10 comments 8 min read LW link

A couple productivity tips for overthinkers

Steven Byrnes 20 Apr 2024 16:05 UTC

79 points

13 comments 4 min read LW link

Best in Class Life Improvement

sapphire 4 Apr 2024 1:51 UTC

77 points

20 comments 1 min read LW link

Creating unrestricted AI Agents with Command R+

Simon Lermen 16 Apr 2024 14:52 UTC

77 points

13 comments 5 min read LW link

Mid-conditional love

KatjaGrace 17 Apr 2024 4:00 UTC

76 points

21 comments 2 min read LW link

(worldspiritsockpuppet.com)

Announcing Suffering For Good

Garrett Baker 1 Apr 2024 17:08 UTC

76 points

5 comments 1 min read LW link

AISC9 has ended and there will be an AISC10

Linda Linsefors 29 Apr 2024 10:53 UTC

75 points

4 comments 2 min read LW link

A gentle introduction to mechanistic anomaly detection

Erik Jenner 3 Apr 2024 23:06 UTC

74 points

2 comments 11 min read LW link

Duct Tape security

Isaac King 26 Apr 2024 18:57 UTC

74 points

12 comments 5 min read LW link

How We Picture Bayesian Agents

johnswentworth and David Lorell

8 Apr 2024 18:12 UTC

73 points

14 comments 7 min read LW link

Prompts for Big-Picture Planning

Raemon 13 Apr 2024 3:04 UTC

73 points

3 comments 3 min read LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

19 Apr 2024 19:06 UTC

73 points

0 comments 3 min read LW link

A Gentle Introduction to Risk Frameworks Beyond Forecasting

pendingsurvival 11 Apr 2024 18:03 UTC

73 points

10 comments 27 min read LW link