In (highly contingent!) defense of interpretability-in-the-loop ML training

Steven Byrnes · 6 Feb 2026 16:32 UTC
68 points
3 comments · 3 min read · LW link

Increasing AI Strategic Competence as a Safety Approach

Wei Dai · 3 Feb 2026 1:08 UTC
45 points
8 comments · 1 min read · LW link

“Features” aren’t always the true computational primitives of a model, but that might be fine anyways

LawrenceC · 2 Feb 2026 18:41 UTC
18 points
0 comments · 5 min read · LW link

Are there lessons from high-reliability engineering for AGI safety?

Steven Byrnes · 2 Feb 2026 15:26 UTC
90 points
6 comments · 8 min read · LW link

Fitness-Seekers: Generalizing the Reward-Seeking Threat Model

Alex Mallen · 29 Jan 2026 19:42 UTC
83 points
4 comments · 17 min read · LW link

AlgZoo: uninterpreted models with fewer than 1,500 parameters

Jacob_Hilton · 26 Jan 2026 17:30 UTC
176 points
7 comments · 10 min read · LW link
(www.alignment.org)

New version of “Intro to Brain-Like-AGI Safety”

Steven Byrnes · 23 Jan 2026 16:21 UTC
58 points
1 comment · 19 min read · LW link

When should we train against a scheming monitor?

Mary Phuong · 21 Jan 2026 20:48 UTC
22 points
4 comments · 5 min read · LW link

No instrumental convergence without AI psychology

TurnTrout · 20 Jan 2026 22:16 UTC
68 points
7 comments · 6 min read · LW link
(turntrout.com)

Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training

RogerDearnaley · 19 Jan 2026 21:24 UTC
102 points
12 comments · 11 min read · LW link
(arxiv.org)

Could LLM alignment research reduce x-risk if the first takeover-capable AI is not an LLM?

Tim Hua · 19 Jan 2026 18:09 UTC
24 points
1 comment · 6 min read · LW link

Desiderata of good problems to hand off to AIs

Jozdien · 19 Jan 2026 16:55 UTC
29 points
1 comment · 4 min read · LW link

Gradual Paths to Collective Flourishing

Nora_Ammann · 19 Jan 2026 7:52 UTC
37 points
10 comments · 13 min read · LW link

Test your interpretability techniques by de-censoring Chinese models

15 Jan 2026 16:33 UTC
85 points
10 comments · 20 min read · LW link

Apply to Vanessa’s mentorship at PIBBSS

Vanessa Kosoy · 14 Jan 2026 9:15 UTC
39 points
0 comments · 2 min read · LW link

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

13 Jan 2026 20:40 UTC
48 points
0 comments · 18 min read · LW link

Brief Explorations in LLM Value Rankings

12 Jan 2026 18:16 UTC
37 points
1 comment · 11 min read · LW link

Practical challenges of control monitoring in frontier AI deployments

12 Jan 2026 16:45 UTC
19 points
0 comments · 1 min read · LW link
(arxiv.org)

My 2003 Post on the Evolutionary Argument for AI Misalignment

Wei Dai · 6 Jan 2026 20:45 UTC
37 points
7 comments · 2 min read · LW link

How hard is it to inoculate against misalignment generalization?

Jozdien · 6 Jan 2026 17:30 UTC
42 points
4 comments · 14 min read · LW link