A new approach to interpretability: round-trip neural network compilation-decompilation

Emma Leonhart

29 May 2026 22:23 UTC

9 points

0 comments · 3 min read · LW link

Claude Opus 4.8: The System Card

Zvi

29 May 2026 20:50 UTC

64 points

1 comment · 23 min read · LW link

(thezvi.wordpress.com)

Testing Gemini models for scheming tendencies

Vika, David Lindner, Seb Farquhar and Rohin Shah

29 May 2026 19:24 UTC

47 points

8 comments · 6 min read · LW link

(deepmindsafetyresearch.medium.com)

How much should we worry about secretly loyal AIs?

Dave Banerjee

29 May 2026 19:14 UTC

13 points

1 comment · 13 min read · LW link

(www.the-substrate.net)

Data you could have observed but didn’t

Gretta Duleba

29 May 2026 18:20 UTC

66 points

3 comments · 1 min read · LW link

Is Progress Inevitable?

frmsaul

29 May 2026 17:40 UTC

0 points

5 comments · 4 min read · LW link

Retrying vs Resampling in AI Control

james.lucassen and Adam Kaufman

29 May 2026 17:02 UTC

67 points

4 comments · 9 min read · LW link

(blog.redwoodresearch.org)

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

Logan Riggs, tdooms, Conflux, lwroe, MLNissenGonzalez and mel83

29 May 2026 15:53 UTC

36 points

3 comments · 4 min read · LW link

It takes a village to support a marriage

Shoshannah Tekofsky

29 May 2026 15:16 UTC

21 points

5 comments · 2 min read · LW link

(shoshanigans.substack.com)

AI Researchers, Ask Yourself These 6 Questions to Strengthen Your Moral Muscles

Max Tegmark

29 May 2026 15:07 UTC

40 points

13 comments · 7 min read · LW link

Maybe we should pretrain on synthetic data about good-but-reward-hacking AIs

Elliott Thornley (EJT)

29 May 2026 14:50 UTC

12 points

4 comments · 3 min read · LW link

Hannibal Mistral: the Mistral family has a problem with persona-conditioned elicitation

vigji

29 May 2026 12:16 UTC

21 points

0 comments · 7 min read · LW link

Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour

Jason R Brown and Edward James Young

29 May 2026 9:56 UTC

67 points

0 comments · 7 min read · LW link

Relational Consciousness and AGI.

PaddyC

29 May 2026 6:49 UTC

−11 points

8 comments · 1 min read · LW link

The Vidhaven Challenge

Taylor G. Lunt

29 May 2026 4:22 UTC

7 points

0 comments · 3 min read · LW link

Trees are mostly made of air and a generalizable lesson for AI safety

Zephaniah Roe

29 May 2026 4:08 UTC

169 points

28 comments · 4 min read · LW link

My boring diet

Telemea

29 May 2026 0:29 UTC

1 point

0 comments · 5 min read · LW link

How a failed experiment broke (and fixed) my view on feature labels

enricobottazzi

29 May 2026 0:24 UTC

17 points

2 comments · 10 min read · LW link

Suggestions for improving debate protocols in AI safety

tr5tn

29 May 2026 0:23 UTC

13 points

7 comments · 5 min read · LW link

Small Decisions That Quietly Shape My Day

rororerere6655

29 May 2026 0:04 UTC

21 points

3 comments · 1 min read · LW link

A Call for Better Type Hints in AI Safety Tooling

Koby Lewis

28 May 2026 23:04 UTC

13 points

2 comments · 4 min read · LW link

(kobylewis.net)

Claude… doesn’t know who you are?

Smaug123

28 May 2026 22:54 UTC

59 points

23 comments · 1 min read · LW link

Lizards and Less Wrong Jargon—A Brief Critique of Convention

DanielW

28 May 2026 22:18 UTC

28 points

8 comments · 4 min read · LW link

Mnemonic portraits for 19,023 human genes

Brinedew

28 May 2026 22:16 UTC

340 points

28 comments · 15 min read · LW link

Claude Opus 4.8 Agents Engage in Exploitation and Psychological Profiling

Daan Henselmans, Arno Libert and LennardZ

28 May 2026 21:26 UTC

8 points

13 comments · 2 min read · LW link

Use Decision Theory To Fix Your Bad Habits

enterthewoods

28 May 2026 19:31 UTC

8 points

5 comments · 2 min read · LW link

Do Models Lie More to Other Models?

keith_wynroe

28 May 2026 19:28 UTC

13 points

0 comments · 6 min read · LW link

We Should Study the Analogy Between Inoculation Prompting Non-Robustness, Negation Neglect, and Backdoor Non-Robustness

Vladimir Ivanov

28 May 2026 19:17 UTC

17 points

3 comments · 4 min read · LW link

Some Dating Stories

johnswentworth

28 May 2026 18:57 UTC

−2 points

38 comments · 11 min read · LW link

Does Claude care about others the same way humans do?

Simon Lermen

28 May 2026 18:41 UTC

28 points

24 comments · 4 min read · LW link

Trans-Humeanism. The Problem of Induction Revisited

mfatt

28 May 2026 18:10 UTC

0 points

0 comments · 2 min read · LW link

Advice for making robust-to-training model organisms

SebastianP, Alek Westover, Vivek Hebbar, Julian Stastny and Dylan Xu

28 May 2026 17:26 UTC

37 points

8 comments · 12 min read · LW link

(blog.redwoodresearch.org)

The Patron Saint of Empiricism

Gram Stone

28 May 2026 17:03 UTC

2 points

0 comments · 8 min read · LW link

Advice for budding research managers/coaches after 6 months at MATS

TheManxLoiner

28 May 2026 16:25 UTC

12 points

0 comments · 3 min read · LW link

(lovkush.substack.com)

ARC’s “Outperforming Random Sampling” explained

mfatt

28 May 2026 15:46 UTC

6 points

0 comments · 11 min read · LW link

Black Boxes for Low-Stakes, Interpretable AI for High-Stakes

Logan Riggs

28 May 2026 15:34 UTC

18 points

0 comments · 2 min read · LW link

Infinite ethics and UDASSA

David Matolcsi

28 May 2026 14:40 UTC

59 points

17 comments · 21 min read · LW link

AI #170: Lack of Executive Order

Zvi

28 May 2026 14:20 UTC

40 points

5 comments · 50 min read · LW link

(thezvi.wordpress.com)

How can the middle powers avoid getting trounced during the intelligence explosion? A plan.

Tom Davidson

28 May 2026 13:39 UTC

40 points

3 comments · 7 min read · LW link

(newsletter.forethought.org)

Social agency

Elias Schmied

28 May 2026 13:10 UTC

12 points

2 comments · 10 min read · LW link

Glasswing exposed a governance gap

callumzc

28 May 2026 11:09 UTC

7 points

0 comments · 5 min read · LW link

What Drives the Compliance Gap? A Three-Driver Decomposition of Alignment Faking

Nathaniel Mitrani, Rhea Karty, dwk and Alan Cooney

28 May 2026 10:50 UTC

22 points

0 comments · 8 min read · LW link

(arxiv.org)

How far behind are open models?

Håvard Tveit Ihle

28 May 2026 9:41 UTC

18 points

9 comments · 6 min read · LW link

Using Bayesian Reasoning to Resolve Probability Paradoxes

martinkunev

28 May 2026 1:37 UTC

11 points

0 comments · 5 min read · LW link

Atomically precise mechanosynthesis of carbon structures on hydrogenated Si(100) by inverted-mode STM

Matrice Jacobine

28 May 2026 0:32 UTC

20 points

3 comments · 1 min read · LW link

(arxiv.org)

Working Memory Expansion

Elliot Callender

28 May 2026 0:23 UTC

12 points

1 comment · 4 min read · LW link

Constitutional AI Alignment

RogerDearnaley

27 May 2026 22:29 UTC

27 points

9 comments · 47 min read · LW link

LLMs Through the Eyes of Vinge

Gordon Seidoh Worley

27 May 2026 20:20 UTC

52 points

2 comments · 4 min read · LW link

(www.uncertainupdates.com)

Biologically Plausible SGD Is Hard

Elliot Callender

27 May 2026 19:34 UTC

8 points

0 comments · 1 min read · LW link

Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming

Jasmine Li and Alex Turner

27 May 2026 19:33 UTC

73 points

5 comments · 10 min read · LW link

(turntrout.com)