ControlAI 2025 Impact Report

27 Mar 2026 18:10 UTC
64 points
0 comments, 4 min read, LW link
(controlai.com)

Test your best methods on our hard CoT interp tasks

26 Mar 2026 19:24 UTC
51 points
2 comments, 19 min read, LW link

A Toy Environment For Exploring Reasoning About Reward

25 Mar 2026 20:29 UTC
55 points
7 comments, 3 min read, LW link

Metagaming matters for training, evaluation, and oversight

18 Mar 2026 21:26 UTC
56 points
5 comments, 1 min read, LW link
(alignment.openai.com)

“Act-based approval-directed agents”, for IDA skeptics

Steven Byrnes, 18 Mar 2026 18:47 UTC
55 points
8 comments, 5 min read, LW link

New RFP on Interpretability from Schmidt Sciences

Peter Hase, 17 Mar 2026 16:08 UTC
15 points
0 comments, 6 min read, LW link
(schmidtsciences.smapply.io)

Power Steering: Behavior Steering via Layer-to-Layer Jacobian Singular Vectors

Omar Ayyub, 13 Mar 2026 3:55 UTC
20 points
0 comments, 17 min read, LW link

Operationalizing FDT

Vivek Hebbar, 13 Mar 2026 0:12 UTC
84 points
11 comments, 6 min read, LW link

How well do models follow their constitutions?

12 Mar 2026 0:07 UTC
94 points
5 comments, 26 min read, LW link

The Refined Counterfactual Prisoner’s Dilemma: An Attempt to Explode Decision-Theoretic Consequentialism

Chris_Leong, 11 Mar 2026 12:32 UTC
18 points
20 comments, 2 min read, LW link

AIs will be used in “unhinged” configurations

Arthur Conmy, 11 Mar 2026 11:19 UTC
56 points
3 comments, 4 min read, LW link

The case for satiating cheaply-satisfied AI preferences

Alex Mallen, 10 Mar 2026 18:09 UTC
102 points
7 comments, 23 min read, LW link

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

9 Mar 2026 18:50 UTC
30 points
2 comments, 5 min read, LW link

Payorian cooperation is easy with Kripke frames

transhumanist_atom_understander, 9 Mar 2026 0:29 UTC
70 points
7 comments, 8 min read, LW link

Can governments quickly and cheaply slow AI training?

joshc, 7 Mar 2026 19:11 UTC
58 points
7 comments, 14 min read, LW link

How to Design Environments for Understanding Model Motives

2 Mar 2026 7:14 UTC
42 points
0 comments, 10 min read, LW link

Schelling Goodness, and Shared Morality as a Goal

Andrew_Critch, 28 Feb 2026 4:25 UTC
121 points
60 comments, 41 min read, LW link

Coherent Care

abramdemski, 27 Feb 2026 21:59 UTC
41 points
2 comments, 16 min read, LW link

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

27 Feb 2026 3:20 UTC
51 points
1 comment, 78 min read, LW link

The persona selection model

Sam Marks, 23 Feb 2026 22:56 UTC
168 points
52 comments, 43 min read, LW link
(alignment.anthropic.com)