Mechanistic estimation for expectations of random products

15 May 2026 16:50 UTC
14 points
0 comments
5 min read
(www.alignment.org)

The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

14 May 2026 17:05 UTC
59 points
1 comment
3 min read

Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)

Steven Byrnes
11 May 2026 17:48 UTC
113 points
50 comments
16 min read

Clarifying the role of the behavioral selection model

Alex Mallen
10 May 2026 19:41 UTC
17 points
0 comments
4 min read

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

7 May 2026 20:21 UTC
211 points
29 comments
8 min read

Mechanistic estimation for wide random MLPs

Jacob_Hilton
7 May 2026 16:20 UTC
75 points
5 comments
5 min read
(www.alignment.org)

[Linkpost] Interpreting Language Model Parameters

5 May 2026 17:37 UTC
159 points
2 comments
2 min read
(www.goodfire.ai)

Motivated reasoning, confirmation bias, and AI risk theory

Seth Herd
5 May 2026 15:56 UTC
49 points
13 comments
41 min read

Exploration Hacking: Can LLMs Learn to Resist RL Training?

1 May 2026 20:54 UTC
23 points
0 comments
8 min read

Risk from fitness-seeking AIs: mechanisms and mitigations

Alex Mallen
1 May 2026 17:42 UTC
99 points
0 comments
32 min read

Research Sabotage in ML Codebases

30 Apr 2026 0:26 UTC
62 points
3 comments
6 min read

Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers

28 Apr 2026 18:00 UTC
55 points
2 comments
7 min read

Sleeper Agent Backdoor Results Are Messy

28 Apr 2026 1:55 UTC
81 points
4 comments
7 min read

Language models know what matters and the foundations of ethics better than you

Michele Campolo
27 Apr 2026 14:00 UTC
6 points
2 comments
90 min read

From nothing to important actions: agents that act morally

Michele Campolo
27 Apr 2026 13:59 UTC
1 point
0 comments
22 min read

The other paper that killed deep learning theory

LawrenceC
27 Apr 2026 6:57 UTC
83 points
6 comments
8 min read

The paper that killed deep learning theory

LawrenceC
26 Apr 2026 6:55 UTC
186 points
10 comments
6 min read

Quick Paper Review: “There Will Be a Scientific Theory of Deep Learning”

LawrenceC
25 Apr 2026 6:55 UTC
108 points
3 comments
6 min read

A “Lay” Introduction to “On the Complexity of Neural Computation in Superposition”

LawrenceC
22 Apr 2026 2:26 UTC
33 points
0 comments
5 min read

Preventing extinction from ASI on a $50M yearly budget

21 Apr 2026 16:39 UTC
212 points
54 comments
21 min read
(controlai.com)