A new approach to interpretability: round-trip neural network compilation-decompilation

Emma Leonhart · 29 May 2026 22:23 UTC
9 points
0 comments · 3 min read · LW link

Claude Opus 4.8: The System Card

Zvi · 29 May 2026 20:50 UTC
64 points
1 comment · 23 min read · LW link
(thezvi.wordpress.com)

Testing Gemini models for scheming tendencies

29 May 2026 19:24 UTC
47 points
8 comments · 6 min read · LW link
(deepmindsafetyresearch.medium.com)

How much should we worry about secretly loyal AIs?

Dave Banerjee · 29 May 2026 19:14 UTC
13 points
1 comment · 13 min read · LW link
(www.the-substrate.net)

Data you could have observed but didn’t

Gretta Duleba · 29 May 2026 18:20 UTC
66 points
3 comments · 1 min read · LW link

Is Progress Inevitable?

frmsaul · 29 May 2026 17:40 UTC
0 points
5 comments · 4 min read · LW link

Retrying vs Resampling in AI Control

29 May 2026 17:02 UTC
67 points
4 comments · 9 min read · LW link
(blog.redwoodresearch.org)

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

29 May 2026 15:53 UTC
36 points
3 comments · 4 min read · LW link

It takes a village to support a marriage

Shoshannah Tekofsky · 29 May 2026 15:16 UTC
21 points
5 comments · 2 min read · LW link
(shoshanigans.substack.com)

AI Researchers, Ask Yourself These 6 Questions to Strengthen Your Moral Muscles

Max Tegmark · 29 May 2026 15:07 UTC
40 points
13 comments · 7 min read · LW link

Maybe we should pretrain on synthetic data about good-but-reward-hacking AIs

Elliott Thornley (EJT) · 29 May 2026 14:50 UTC
12 points
4 comments · 3 min read · LW link

Hannibal Mistral: the Mistral family has a problem with persona-conditioned elicitation

vigji · 29 May 2026 12:16 UTC
21 points
0 comments · 7 min read · LW link

Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour

29 May 2026 9:56 UTC
67 points
0 comments · 7 min read · LW link

Relational Consciousness and AGI.

PaddyC · 29 May 2026 6:49 UTC
−11 points
8 comments · 1 min read · LW link

The Vidhaven Challenge

Taylor G. Lunt · 29 May 2026 4:22 UTC
7 points
0 comments · 3 min read · LW link

Trees are mostly made of air and a generalizable lesson for AI safety

Zephaniah Roe · 29 May 2026 4:08 UTC
169 points
28 comments · 4 min read · LW link

My boring diet

Telemea · 29 May 2026 0:29 UTC
1 point
0 comments · 5 min read · LW link

How a failed experiment broke (and fixed) my view on feature labels

enricobottazzi · 29 May 2026 0:24 UTC
17 points
2 comments · 10 min read · LW link

Suggestions for improving debate protocols in AI safety

tr5tn · 29 May 2026 0:23 UTC
13 points
7 comments · 5 min read · LW link

Small Decisions That Quietly Shape My Day

rororerere6655 · 29 May 2026 0:04 UTC
21 points
3 comments · 1 min read · LW link

A Call for Better Type Hints in AI Safety Tooling

Koby Lewis · 28 May 2026 23:04 UTC
13 points
2 comments · 4 min read · LW link
(kobylewis.net)

Claude… doesn’t know who you are?

Smaug123 · 28 May 2026 22:54 UTC
59 points
23 comments · 1 min read · LW link

Lizards and Less Wrong Jargon—A Brief Critique of Convention

DanielW · 28 May 2026 22:18 UTC
28 points
8 comments · 4 min read · LW link

Mnemonic portraits for 19,023 human genes

Brinedew · 28 May 2026 22:16 UTC
340 points
28 comments · 15 min read · LW link

Claude Opus 4.8 Agents Engage in Exploitation and Psychological Profiling

28 May 2026 21:26 UTC
8 points
13 comments · 2 min read · LW link

Use Decision Theory To Fix Your Bad Habits

enterthewoods · 28 May 2026 19:31 UTC
8 points
5 comments · 2 min read · LW link

Do Models Lie More to Other Models?

keith_wynroe · 28 May 2026 19:28 UTC
13 points
0 comments · 6 min read · LW link

We Should Study the Analogy Between Inoculation Prompting Non-Robustness, Negation Neglect, and Backdoor Non-Robustness

Vladimir Ivanov · 28 May 2026 19:17 UTC
17 points
3 comments · 4 min read · LW link

Some Dating Stories

johnswentworth · 28 May 2026 18:57 UTC
−2 points
38 comments · 11 min read · LW link

Does Claude care about others the same way humans do?

Simon Lermen · 28 May 2026 18:41 UTC
28 points
24 comments · 4 min read · LW link

Trans-Humeanism. The Problem of Induction Revisited

mfatt · 28 May 2026 18:10 UTC
0 points
0 comments · 2 min read · LW link

Advice for making robust-to-training model organisms

28 May 2026 17:26 UTC
37 points
8 comments · 12 min read · LW link
(blog.redwoodresearch.org)

The Patron Saint of Empiricism

Gram Stone · 28 May 2026 17:03 UTC
2 points
0 comments · 8 min read · LW link

Advice for budding research managers/coaches after 6 months at MATS

TheManxLoiner · 28 May 2026 16:25 UTC
12 points
0 comments · 3 min read · LW link
(lovkush.substack.com)

ARC’s “Outperforming Random Sampling” explained

mfatt · 28 May 2026 15:46 UTC
6 points
0 comments · 11 min read · LW link

Black Boxes for Low-Stakes, Interpretable AI for High-Stakes

Logan Riggs · 28 May 2026 15:34 UTC
18 points
0 comments · 2 min read · LW link

Infinite ethics and UDASSA

David Matolcsi · 28 May 2026 14:40 UTC
59 points
17 comments · 21 min read · LW link

AI #170: Lack of Executive Order

Zvi · 28 May 2026 14:20 UTC
40 points
5 comments · 50 min read · LW link
(thezvi.wordpress.com)

How can the middle powers avoid getting trounced during the intelligence explosion? A plan.

Tom Davidson · 28 May 2026 13:39 UTC
40 points
3 comments · 7 min read · LW link
(newsletter.forethought.org)

Social agency

Elias Schmied · 28 May 2026 13:10 UTC
12 points
2 comments · 10 min read · LW link

Glasswing exposed a governance gap

callumzc · 28 May 2026 11:09 UTC
7 points
0 comments · 5 min read · LW link

What Drives the Compliance Gap? A Three-Driver Decomposition of Alignment Faking

28 May 2026 10:50 UTC
22 points
0 comments · 8 min read · LW link
(arxiv.org)

How far behind are open models?

Håvard Tveit Ihle · 28 May 2026 9:41 UTC
18 points
9 comments · 6 min read · LW link

Using Bayesian Reasoning to Resolve Probability Paradoxes

martinkunev · 28 May 2026 1:37 UTC
11 points
0 comments · 5 min read · LW link

Atomically precise mechanosynthesis of carbon structures on hydrogenated Si(100) by inverted-mode STM

Matrice Jacobine · 28 May 2026 0:32 UTC
20 points
3 comments · 1 min read · LW link
(arxiv.org)

Working Memory Expansion

Elliot Callender · 28 May 2026 0:23 UTC
12 points
1 comment · 4 min read · LW link

Constitutional AI Alignment

RogerDearnaley · 27 May 2026 22:29 UTC
27 points
9 comments · 47 min read · LW link

LLMs Through the Eyes of Vinge

Gordon Seidoh Worley · 27 May 2026 20:20 UTC
52 points
2 comments · 4 min read · LW link
(www.uncertainupdates.com)

Biologically Plausible SGD Is Hard

Elliot Callender · 27 May 2026 19:34 UTC
8 points
0 comments · 1 min read · LW link

Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming

27 May 2026 19:33 UTC
73 points
5 comments · 10 min read · LW link
(turntrout.com)