Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

7 May 2026 20:21 UTC
194 points
27 comments · 8 min read · LW link

A Conflict Between AI Alignment and Philosophical Competence

Wei Dai · 27 Dec 2025 21:32 UTC
70 points
14 comments · 2 min read · LW link

1. The CAST Strategy

Max Harms · 7 Jun 2024 22:29 UTC
58 points
25 comments · 38 min read · LW link

What failure looks like

paulfchristiano · 17 Mar 2019 20:18 UTC
453 points
55 comments · 8 min read · LW link · 2 reviews

Optimization Concepts in the Game of Life

16 Oct 2021 20:51 UTC
74 points
16 comments · 10 min read · LW link

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

9 Mar 2026 18:50 UTC
38 points
3 comments · 5 min read · LW link

Mechanistic estimation for wide random MLPs

Jacob_Hilton · 7 May 2026 16:20 UTC
73 points
3 comments · 5 min read · LW link
(www.alignment.org)

Motivated reasoning, confirmation bias, and AI risk theory

Seth Herd · 5 May 2026 15:56 UTC
44 points
11 comments · 41 min read · LW link

the case for CoT unfaithfulness is overstated

nostalgebraist · 29 Sep 2024 22:07 UTC
271 points
45 comments · 11 min read · LW link · 1 review

Sleeper Agent Backdoor Results Are Messy

28 Apr 2026 1:55 UTC
79 points
4 comments · 7 min read · LW link

Interpretability Will Not Reliably Find Deceptive AI

Neel Nanda · 4 May 2025 16:32 UTC
336 points
69 comments · 7 min read · LW link

How I stopped being sure LLMs are just making up their internal experience (but the topic is still confusing)

Kaj_Sotala · 13 Dec 2025 12:38 UTC
202 points
68 comments · 29 min read · LW link

Steering Language Models with Weight Arithmetic

11 Nov 2025 16:30 UTC
87 points
6 comments · 5 min read · LW link

Utility Maximization = Description Length Minimization

johnswentworth · 18 Feb 2021 18:04 UTC
226 points
54 comments · 6 min read · LW link

Probability is Real, and Value is Complex

abramdemski · 20 Jul 2018 5:24 UTC
82 points
23 comments · 6 min read · LW link

[Linkpost] Interpreting Language Model Parameters

5 May 2026 17:37 UTC
153 points
2 comments · 2 min read · LW link
(www.goodfire.ai)

What 2026 looks like

Daniel Kokotajlo · 6 Aug 2021 16:14 UTC
617 points
172 comments · 16 min read · LW link · 1 review

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

23 Dec 2023 2:44 UTC
109 points
12 comments · 22 min read · LW link · 2 reviews

Simplex Progress Report—July 2025

28 Jul 2025 21:58 UTC
111 points
3 comments · 15 min read · LW link

[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

Steven Byrnes · 2 Mar 2022 15:26 UTC
84 points
23 comments · 17 min read · LW link