
Rotations in Superposition

15 Dec 2025 14:58 UTC
54 points
0 comments · 11 min read · LW link

My AGI safety research—2025 review, ’26 plans

Steven Byrnes · 11 Dec 2025 17:05 UTC
127 points
4 comments · 12 min read · LW link

6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

Steven Byrnes · 3 Dec 2025 18:37 UTC
287 points
54 comments · 17 min read · LW link

Auditing Games for Sandbagging [paper]

9 Dec 2025 18:37 UTC
101 points
4 comments · 10 min read · LW link

Alignment remains a hard, unsolved problem

evhub · 27 Nov 2025 8:45 UTC
339 points
94 comments · 14 min read · LW link

An Ambitious Vision for Interpretability

leogao · 5 Dec 2025 22:57 UTC
163 points
7 comments · 4 min read · LW link

We need a field of Reward Function Design

Steven Byrnes · 8 Dec 2025 19:15 UTC
106 points
8 comments · 5 min read · LW link

The behavioral selection model for predicting AI motivations

4 Dec 2025 18:46 UTC
157 points
21 comments · 16 min read · LW link

Reward Function Design: a starter pack

Steven Byrnes · 8 Dec 2025 19:15 UTC
77 points
7 comments · 16 min read · LW link

Natural emergent misalignment from reward hacking in production RL

21 Nov 2025 20:00 UTC
258 points
32 comments · 9 min read · LW link

A Pragmatic Vision for Interpretability

1 Dec 2025 13:05 UTC
133 points
35 comments · 27 min read · LW link

Legible vs. Illegible AI Safety Problems

Wei Dai · 4 Nov 2025 21:39 UTC
361 points
93 comments · 2 min read · LW link

Relitigating the Race to Build Friendly AI

Wei Dai · 3 Dec 2025 11:34 UTC
67 points
43 comments · 3 min read · LW link

Abstract advice to researchers tackling the difficult core problems of AGI alignment

TsviBT · 22 Nov 2025 0:53 UTC
129 points
10 comments · 8 min read · LW link

[Paper] Does Self-Evaluation Enable Wireheading in Language Models?

David Africa · 8 Dec 2025 16:03 UTC
25 points
2 comments · 2 min read · LW link

Alignment will happen by default. What’s next?

Adrià Garriga-alonso · 25 Nov 2025 7:14 UTC
98 points
128 comments · 6 min read · LW link
(open.substack.com)

How Can Interpretability Researchers Help AGI Go Well?

1 Dec 2025 13:05 UTC
62 points
1 comment · 14 min read · LW link

Evaluation as a (Cooperation-Enabling?) Tool

VojtaKovarik · 10 Dec 2025 18:54 UTC
13 points
0 comments · 28 min read · LW link

Please, Don’t Roll Your Own Metaethics

Wei Dai · 12 Nov 2025 22:17 UTC
152 points
64 comments · 2 min read · LW link

Reasoning Models Sometimes Output Illegible Chains of Thought

Jozdien · 24 Nov 2025 18:24 UTC
82 points
8 comments · 6 min read · LW link