Mental causation is not load-bearing

jessicata 7 Jun 2026 20:43 UTC

38 points

4 comments · 10 min read

How Far Apart Does a Model Think Its Tokens Are?

Brendan Long 7 Jun 2026 20:20 UTC

47 points

9 comments · 10 min read

(www.brendanlong.com)

Autopilot Thinking

XelaP 7 Jun 2026 20:20 UTC

10 points

4 comments · 6 min read

Secret Loyalties Likely Raise Remote-Influenceability

Kaustubh Kislay 7 Jun 2026 17:51 UTC

13 points

0 comments · 6 min read

From One Piece to One Pace - Vision and mission in coordination of agents

a unemployed pastor- de S Brito 7 Jun 2026 17:07 UTC

2 points

0 comments · 4 min read

Neglected Basics of AI Alignment

Quirinus_Quirrell 7 Jun 2026 9:02 UTC

28 points

2 comments · 6 min read

The Hats of LessOnline

AprilSR 7 Jun 2026 8:57 UTC

15 points

2 comments · 3 min read

(aprilsr.substack.com)

Can activation verbalizers surface an internal chain of thought?

oakhu and ryan_greenblatt

7 Jun 2026 4:24 UTC

122 points

0 comments · 16 min read

Frontier Models Still Lag Behind Humans at Robust Belief-State Tracking

Lukas Frei 6 Jun 2026 23:54 UTC

13 points

6 comments · 5 min read

Coming Around To Political Donations

jefftk 6 Jun 2026 21:30 UTC

59 points

8 comments · 2 min read

(www.jefftk.com)

Analysis of Metastable States in the Transformer Activation Space

Zach Baker 6 Jun 2026 21:30 UTC

10 points

0 comments · 20 min read

The Diamond Lemma

Isaac Newton 6 Jun 2026 21:15 UTC

21 points

0 comments · 7 min read

(archimedeanmonoid.substack.com)

Iliad is Hiring

Peter Jean 6 Jun 2026 21:08 UTC

13 points

0 comments · 1 min read

Against Corrigibility

peralice 6 Jun 2026 20:28 UTC

66 points

17 comments · 12 min read

The Residual Stream Has a Geometry of Time

Fodenthal 6 Jun 2026 19:57 UTC

23 points

0 comments · 8 min read

Exponential Solitude

PeterMaui 6 Jun 2026 19:49 UTC

5 points

1 comment · 9 min read

Freud heard a rumor that Science existed, and had a wonderful dream

Bruce Middleton 6 Jun 2026 14:47 UTC

8 points

8 comments · 6 min read

Coalitional Darwinism and the Instrumental Utility of Individuality

CarolusRenniusVitellius 6 Jun 2026 12:53 UTC

25 points

5 comments · 17 min read

(charlesr-w.github.io)

Why Software Automation Is Hard

silentbob 6 Jun 2026 8:56 UTC

114 points

20 comments · 12 min read

What if Anthropic unilaterally paused capabilities development right now?

Karl von Wendt 6 Jun 2026 7:39 UTC

61 points

15 comments · 3 min read

Optimisation over non-stationary distributions creates weirder minds

Samuel Ratnam and Pjain

6 Jun 2026 0:05 UTC

36 points

8 comments · 4 min read

[Question] Does robotics capabilities research accelerate AGI timelines?

Master Chief 5 Jun 2026 23:32 UTC

4 points

3 comments · 1 min read

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

dgros 5 Jun 2026 22:43 UTC

15 points

0 comments · 11 min read

Two More Methods for Consistency Training and Some New Ways to Apply It

David Africa, Sukrati_Gautam, Neil Shah and arav-dhoot

5 Jun 2026 21:06 UTC

18 points

0 comments · 7 min read

Revisiting GSM-Symbolic: models seem to reason okay, actually

Sturb 5 Jun 2026 20:54 UTC

24 points

0 comments · 5 min read

Accepting Death & Adult Responsibility

Unreal 5 Jun 2026 19:23 UTC

−19 points

10 comments · 4 min read

The Masochistic Prior

Modulo.Roland 5 Jun 2026 19:05 UTC

12 points

2 comments · 2 min read

(substack.com)

Beyond the lexical personality traits: What is the structure of personality?

tailcalled 5 Jun 2026 19:05 UTC

60 points

1 comment · 5 min read

Do not try to write your first research publication as a single author

Mikhail Mironov 5 Jun 2026 18:31 UTC

12 points

0 comments · 5 min read

Do We Want a Superintelligent People-Pleaser?

GenericHousewife_B 5 Jun 2026 18:07 UTC

1 point

0 comments · 6 min read

Explaining SAE Features With Foreign Natural Language Autoencoders

fzaffino 5 Jun 2026 17:51 UTC

17 points

1 comment · 8 min read

SecureBio Detection is Hiring Software Engineers

jefftk 5 Jun 2026 16:50 UTC

33 points

2 comments · 1 min read

(www.jefftk.com)

One Year of PauseAI UK

Joseph Miller and PauseAI UK

5 Jun 2026 16:41 UTC

94 points

7 comments · 11 min read

(pauseai.uk)

Learnings from starting an AI safety research team

draganover and Erin Robertson

5 Jun 2026 16:27 UTC

101 points

7 comments · 6 min read

Preparing for Warning Shots to Catalyze International Cooperation on AGI Risks

Mark Kagach, EliasSchlie, Thomas Van Damme and JustinShovelain

5 Jun 2026 15:49 UTC

40 points

1 comment · 5 min read

My research: a computational cognitive neuroscience perspective on alignment

Seth Herd 5 Jun 2026 14:19 UTC

52 points

0 comments · 18 min read

Editing is Easy, but Revision is Hard

IanWS 5 Jun 2026 11:58 UTC

5 points

0 comments · 3 min read

(write.ianwsperber.com)

OpenAI Offers A New Policy Blueprint

Zvi 5 Jun 2026 11:41 UTC

31 points

3 comments · 7 min read

(thezvi.wordpress.com)

[Paper] Dictionary Learning Identifiability for Understanding SAEs

William Dorrell 5 Jun 2026 0:28 UTC

12 points

0 comments · 3 min read

What Does Abliteration Actually Cost?

christian-mc 5 Jun 2026 0:28 UTC

3 points

0 comments · 4 min read

Lunar bombardment of earth is practical

anithite 4 Jun 2026 23:25 UTC

27 points

0 comments · 4 min read

Endurance: Shackleton’s Incredible Voyage Review

nomagicpill 4 Jun 2026 22:19 UTC

6 points

0 comments · 11 min read

Rent from oil: a goldmine

TerriLeaf 4 Jun 2026 21:05 UTC

15 points

5 comments · 5 min read

Book of Cron Job

suchow 4 Jun 2026 18:58 UTC

4 points

0 comments · 1 min read

(www.nature.com)

(Mis)generalization of Helpful-Only Fine-tuning

Omar Khursheed, Baram Sosis and Fabien Roger

4 Jun 2026 18:40 UTC

55 points

7 comments · 11 min read

Defeating Introspection Adapters (and Why Threat Models Matter)

Nick Merrill and zekem

4 Jun 2026 18:39 UTC

10 points

0 comments · 5 min read

Building Better Activation Oracles

ceselder, Jan Bauer, Niclas Luick, Adam Karvonen and Neel Nanda

4 Jun 2026 18:34 UTC

62 points

1 comment · 7 min read

What Separates an Optimizer From Something We Merely Describe as Optimizing?

stewart leland jansen 4 Jun 2026 18:30 UTC

3 points

2 comments · 1 min read

Rohin Shah on AGI Safety

anaguma 4 Jun 2026 16:57 UTC

38 points

2 comments · 90 min read

(80000hours.org)

Training Deliberative Monitors for Black-Box Scheming Detection

aksh-n, adityasinha, Victor Gillioz, Simon Storf, Kilian Merkelbach, richbc, Axel Højmark and Marius Hobbhahn

4 Jun 2026 16:43 UTC

33 points

6 comments · 6 min read