Freedom and Privacy of Thought Architectures

SebastianG 20 Jul 2024 21:43 UTC

5 points

2 comments 1 min read LW link

Why Georgism Lost Its Popularity

Zero Contradictions 20 Jul 2024 15:08 UTC

49 points

55 comments 1 min read LW link

(zerocontradictions.net)

Only Fools Avoid Hindsight Bias

Kevin Dorst 20 Jul 2024 13:42 UTC

−11 points

5 comments 6 min read LW link

(kevindorst.substack.com)

A more systematic case for inner misalignment

Richard_Ngo 20 Jul 2024 5:03 UTC

31 points

4 comments 5 min read LW link

BatchTopK: A Simple Improvement for TopK-SAEs

Bart Bussmann, Patrick Leask and Neel Nanda

20 Jul 2024 2:20 UTC

62 points

0 comments 4 min read LW link

Krona Compare

jefftk 20 Jul 2024 1:10 UTC

10 points

0 comments 2 min read LW link

(www.jefftk.com)

(Approximately) Deterministic Natural Latents

johnswentworth and David Lorell

19 Jul 2024 23:02 UTC

45 points

1 comment 4 min read LW link

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

Lidor Banuel Dabbah and Aviel Boag

19 Jul 2024 20:32 UTC

59 points

6 comments 16 min read LW link

JumpReLU SAEs + Early Access to Gemma 2 SAEs

Senthooran Rajamanoharan, Tom Lieberum, nps29, Arthur Conmy, Vikrant Varma, János Kramár and Neel Nanda

19 Jul 2024 16:10 UTC

55 points

10 comments 1 min read LW link

(storage.googleapis.com)

Truth is Universal: Robust Detection of Lies in LLMs

Lennart Buerger 19 Jul 2024 14:07 UTC

24 points

4 comments 2 min read LW link

(arxiv.org)

Sustainability of Digital Life Form Societies

Hiroshi Yamakawa 19 Jul 2024 13:59 UTC

19 points

1 comment 20 min read LW link

Romae Industriae

Maxwell Tabarrok 19 Jul 2024 13:03 UTC

36 points

2 comments 7 min read LW link

(www.maximum-progress.com)

[Question] Have people given up on iterated distillation and amplification?

Chris_Leong 19 Jul 2024 12:23 UTC

20 points

1 comment 1 min read LW link

How do we know that “good research” is good? (aka “direct evaluation” vs “eigen-evaluation”)

Ruby 19 Jul 2024 0:31 UTC

49 points

21 comments 6 min read LW link

Linkpost: Surely you can be serious

kave 18 Jul 2024 22:18 UTC

64 points

8 comments 1 min read LW link

(www.experimental-history.com)

My experience applying to MATS 6.0

mic 18 Jul 2024 19:02 UTC

19 points

3 comments 5 min read LW link

[Question] What are the actual arguments in favor of computationalism as a theory of identity?

sunwillrise 18 Jul 2024 18:44 UTC

16 points

27 comments 5 min read LW link

Yet Another Critique of “Luxury Beliefs”

ymeskhout 18 Jul 2024 18:37 UTC

6 points

9 comments 9 min read LW link

(www.ymeskhout.com)

[Interim research report] Evaluating the Goal-Directedness of Language Models

Rauno Arike, Elizabeth Donoway and Marius Hobbhahn

18 Jul 2024 18:19 UTC

40 points

4 comments 11 min read LW link

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

Karolis Jucys, george_adams and Sonia Joseph

18 Jul 2024 17:02 UTC

9 points

0 comments 1 min read LW link

(arxiv.org)

Activation Engineering Theories of Impact

Jakub K. Nowak🔸 18 Jul 2024 16:44 UTC

6 points

1 comment 2 min read LW link

[Question] Me & My Clone

SimonBaars 18 Jul 2024 16:25 UTC

27 points

22 comments 1 min read LW link

AI #73: Openly Evil AI

Zvi 18 Jul 2024 14:40 UTC

89 points

20 comments 52 min read LW link

(thezvi.wordpress.com)

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex and Nicholas Goldowsky-Dill

18 Jul 2024 14:15 UTC

127 points

18 comments 18 min read LW link

SAEs (usually) Transfer Between Base and Chat Models

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

18 Jul 2024 10:29 UTC

67 points

0 comments 10 min read LW link

[Question] Should we exclude alignment research from LLM training datasets?

Ben Millwood 18 Jul 2024 10:27 UTC

3 points

5 comments 1 min read LW link

Keeping content out of LLM training datasets

Ben Millwood 18 Jul 2024 10:27 UTC

4 points

0 comments 5 min read LW link

The Assassination of Trump’s Ear is Evidence for Time-Travel

elv 18 Jul 2024 7:01 UTC

−9 points

5 comments 5 min read LW link

Friendship is transactional, unconditional friendship is insurance

Ruby 17 Jul 2024 22:52 UTC

70 points

25 comments 2 min read LW link 1 review

D&D.Sci: Whom Shall You Call? [Evaluation and Ruleset]

abstractapplic 17 Jul 2024 22:34 UTC

17 points

5 comments 5 min read LW link

Optimistic Assumptions, Longterm Planning, and “Cope”

Raemon 17 Jul 2024 22:14 UTC

229 points

47 comments 7 min read LW link 1 review

Baking vs Patissing vs Cooking, the HPS explanation

adamShimi 17 Jul 2024 20:29 UTC

30 points

16 comments 3 min read LW link

(epistemologicalfascinations.substack.com)

Launching the Respiratory Outlook 2024/25 Forecasting Series

ChristianWilliams 17 Jul 2024 19:51 UTC

5 points

0 comments 1 min read LW link

(www.metaculus.com)

What are you getting paid in?

Austin Chen 17 Jul 2024 19:23 UTC

120 points

16 comments 4 min read LW link 1 review

(www.approachwithalacrity.com)

Individually incentivized safe Pareto improvements in open-source bargaining

Nicolas Macé, Anthony DiGiovanni and JesseClifton

17 Jul 2024 18:26 UTC

41 points

3 comments 17 min read LW link

Profit and Value

kwang 17 Jul 2024 18:06 UTC

22 points

3 comments 6 min read LW link

(open.substack.com)

So You’ve Learned To Teleport by Tom Scott

landscape_kiwi 17 Jul 2024 18:04 UTC

4 points

0 comments 1 min read LW link

(www.youtube.com)

How does generalized accessibility compare to targeted accessibility?

ErioirE 17 Jul 2024 17:07 UTC

3 points

0 comments 2 min read LW link

Housing Roundup #9: Restricting Supply

Zvi 17 Jul 2024 12:50 UTC

25 points

8 comments 44 min read LW link

(thezvi.wordpress.com)

We ran an AI safety conference in Tokyo. It went really well. Come next year!

Blaine 17 Jul 2024 6:55 UTC

46 points

1 comment 6 min read LW link

Agency in Politics

Martin Sustrik 17 Jul 2024 5:30 UTC

35 points

2 comments 3 min read LW link

(250bpm.substack.com)

Arrakis—A toolkit to conduct, track and visualize mechanistic interpretability experiments.

Yash Srivastava 17 Jul 2024 2:02 UTC

3 points

2 comments 5 min read LW link

Announcing Open Philanthropy’s AI governance and policy RFP

Julian Hazell 17 Jul 2024 2:02 UTC

25 points

0 comments 1 min read LW link

(www.openphilanthropy.org)

Turning Your Back On Traffic

jefftk 17 Jul 2024 1:00 UTC

37 points

7 comments 1 min read LW link

(www.jefftk.com)

[Question] Opinions on Eureka Labs

jmh 17 Jul 2024 0:16 UTC

6 points

2 comments 1 min read LW link

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson 16 Jul 2024 22:44 UTC

46 points

27 comments 5 min read LW link

Multiplex Gene Editing: Where Are We Now?

sarahconstantin 16 Jul 2024 20:50 UTC

73 points

6 comments 7 min read LW link

(sarahconstantin.substack.com)

Recursion in AI is scary. But let’s talk solutions.

Oleg Trott 16 Jul 2024 20:34 UTC

5 points

10 comments 2 min read LW link

How to wash your hands precisely and thoroughly

dkl9 16 Jul 2024 18:29 UTC

12 points

0 comments 1 min read LW link

(dkl9.net)

Francois Chollet inadvertently limits his claim on ARC-AGI

Noosphere89 16 Jul 2024 17:32 UTC

12 points

4 comments 1 min read LW link 1 review

(x.com)