Toward a Better Evaluations Ecosystem

Benjamin Arnav
5 May 2026 22:29 UTC
24 points
0 comments
5 min read

Model Spec Midtraining: Improving How Alignment Training Generalizes

5 May 2026 21:55 UTC
71 points
7 comments
7 min read
(alignment.anthropic.com)

Positive Feedback Only

Florian_Dietz
5 May 2026 21:28 UTC
18 points
0 comments
8 min read

What if LLMs are mostly crystallized intelligence?

deep
5 May 2026 20:50 UTC
45 points
10 comments
9 min read
(expectedsurprise.substack.com)

Decision theory doesn’t prove that useful strong AIs will doom us all

deep
5 May 2026 20:47 UTC
8 points
0 comments
9 min read
(expectedsurprise.substack.com)

Psychopathy: The Mechanics

Dawn Drescher
5 May 2026 20:26 UTC
2 points
0 comments
10 min read
(impartial-priorities.org)

A Federal Inmate Asks: Was My Prosecution Rational?

seth_tins
5 May 2026 19:56 UTC
11 points
2 comments
5 min read

The AI Ad-Hoc Prior Restraint Era Begins

Zvi
5 May 2026 19:30 UTC
63 points
5 comments
10 min read
(thezvi.wordpress.com)

Your rights when flying to Europe

Yair Halberstadt
5 May 2026 19:17 UTC
92 points
14 comments
5 min read

[Linkpost] Interpreting Language Model Parameters

5 May 2026 17:37 UTC
162 points
2 comments
2 min read
(www.goodfire.ai)

Motivated reasoning, confirmation bias, and AI risk theory

Seth Herd
5 May 2026 15:56 UTC
66 points
18 comments
41 min read

The Best Argument Against Deontology Is About Suitcases

Bentham's Bulldog
5 May 2026 15:24 UTC
−1 points
11 comments
19 min read

Codesign for Legibility (to AI and Everyone Else)

Adam Chlipala
5 May 2026 13:46 UTC
1 point
0 comments
7 min read

Dawn of the “national security” tier of AI

Mitchell_Porter
5 May 2026 9:40 UTC
16 points
3 comments
1 min read

Forbidden Backrooms: Self-Chat with a Refusal-Abliterated LLM

AlliedToasters
5 May 2026 7:55 UTC
9 points
0 comments
5 min read

Training Model to Predict Its Own Generalization: A Preliminary Study

Tianyi (Alex) Qiu
5 May 2026 5:50 UTC
17 points
0 comments
7 min read

Are you looking up?

Craig Green
5 May 2026 3:03 UTC
42 points
2 comments
8 min read
(open.substack.com)

Alarming Scheduling

jefftk
5 May 2026 2:40 UTC
26 points
9 comments
1 min read
(www.jefftk.com)

Don’t solve people’s problems for them

Declan Molony
5 May 2026 0:02 UTC
10 points
13 comments
4 min read

April 2026 Links

nomagicpill
4 May 2026 23:17 UTC
8 points
1 comment
11 min read

My speech at the PauseAI Capitol protest April 13

maia
4 May 2026 22:51 UTC
30 points
1 comment
2 min read

Irretrievability; or, Murphy’s Curse of Oneshotness upon ASI

Eliezer Yudkowsky
4 May 2026 22:11 UTC
367 points
132 comments
22 min read

It’s nice of you to worry about me, but I really do have a life

Viliam
4 May 2026 21:14 UTC
332 points
61 comments
4 min read

Psychopathy: The Self

Dawn Drescher
4 May 2026 21:13 UTC
20 points
0 comments
15 min read
(impartial-priorities.org)

Verbalized Eval Awareness Inflates Measured Safety

4 May 2026 20:02 UTC
44 points
0 comments
29 min read

Housing Roundup #15: The War Against Renters

Zvi
4 May 2026 20:01 UTC
34 points
5 comments
14 min read
(thezvi.wordpress.com)

The AI In­dus­trial Ex­plo­sion — Part 1: Max­i­mum growth rates with cur­rent pro­duc­tion methods

djbinder
4 May 2026 15:32 UTC
106 points
11 comments
12 min read
(defensesindepth.bio)

Apply for ARBOx4 [deadline May 8th]

4 May 2026 15:28 UTC
13 points
0 comments
3 min read

Taking woo seriously but not literally

Kaj_Sotala
4 May 2026 13:36 UTC
123 points
27 comments
23 min read
(kajsotala.substack.com)

The Threat of AI Crimes Are Under-Appreciated

Joshua Krook
4 May 2026 11:33 UTC
3 points
8 comments
3 min read

Psychopathy: The Shaping

Dawn Drescher
4 May 2026 7:58 UTC
3 points
0 comments
8 min read
(impartial-priorities.org)

Conflict 2.0: Leaving behind shame/fault, right/wrong

honeybee
4 May 2026 3:07 UTC
6 points
0 comments
4 min read

Enneagram Epicycles

Gordon Seidoh Worley
4 May 2026 3:00 UTC
9 points
1 comment
4 min read
(www.uncertainupdates.com)

Auto-review of agent actions without synchronous human oversight

papetoast
4 May 2026 2:12 UTC
6 points
0 comments
1 min read
(alignment.openai.com)

ASI motives and the ontonormative goods (re IABIED’s core argument)

Zsolt Tanko
3 May 2026 23:38 UTC
4 points
4 comments
4 min read

How did ‘large’ language models get that way? The role of Transformers and Pretraining in GPT

Oliver Sourbut
3 May 2026 21:35 UTC
16 points
0 comments
7 min read
(www.oliversourbut.net)

Dairy cows make their misery expensive (but their calves can’t)

Elizabeth
3 May 2026 19:20 UTC
159 points
1 comment
6 min read
(acesounderglass.com)

[Question] Looking for papers on general formalizations of “agency”

lovagrus
3 May 2026 18:32 UTC
12 points
1 comment
2 min read

Why I made Engineering Enigmas

kqr
3 May 2026 18:04 UTC
13 points
0 comments
3 min read

Deontological bars should reference the actor’s beliefs

TFD
3 May 2026 15:09 UTC
8 points
6 comments
3 min read

We don’t learn numbers from set cardinality

azergante
3 May 2026 11:33 UTC
4 points
15 comments
3 min read

MHC Interp #1: Previous-Token Heads Become Attention Sinks Under Manifold-Constrained Hyper-Connections

Realmbird
3 May 2026 11:06 UTC
21 points
3 comments
5 min read

The Repugnant Lifespan Conclusion

XelaP
3 May 2026 9:22 UTC
12 points
20 comments
3 min read

Pursuing the target

Adam Zerner
3 May 2026 7:59 UTC
30 points
1 comment
2 min read

Paraphrasing Is (At Best) a Partial Defence Against Steganography in LLMs

3 May 2026 7:53 UTC
14 points
0 comments
8 min read

LLMs Choose the Safer Gamble Yet Price the Riskier One Higher

Jonathan Dang
3 May 2026 7:51 UTC
12 points
0 comments
4 min read

Bypassing Refusal Behavior in Qwen Models via Activation Steering

Talib Mirza
3 May 2026 6:07 UTC
1 point
0 comments
2 min read

Notes on equanimity from the inside

nonplus
2 May 2026 23:42 UTC
15 points
1 comment
4 min read

Psychopathy: The Substrate

Dawn Drescher
2 May 2026 22:48 UTC
5 points
0 comments
8 min read
(impartial-priorities.org)

Measuring the ability of Opus 4.5 to fool narrow classifiers

2 May 2026 22:43 UTC
31 points
0 comments
8 min read