Activation space interpretability may be doomed

8 Jan 2025 12:49 UTC
146 points
31 comments · 8 min read · LW link

How AI Takeover Might Happen in 2 Years

joshc · 7 Feb 2025 17:10 UTC
299 points
72 comments · 29 min read · LW link
(x.com)

Research directions Open Phil wants to fund in technical AI safety

8 Feb 2025 1:40 UTC
115 points
19 comments · 58 min read · LW link
(www.openphilanthropy.org)

Gradual Disempowerment, Shell Games and Flinches

Jan_Kulveit · 2 Feb 2025 14:47 UTC
125 points
35 comments · 6 min read · LW link

Attribution-based parameter decomposition

25 Jan 2025 13:12 UTC
105 points
17 comments · 4 min read · LW link
(publications.apolloresearch.ai)

Detecting Strategic Deception Using Linear Probes

6 Feb 2025 15:46 UTC
100 points
7 comments · 2 min read · LW link
(arxiv.org)

What 2026 looks like

Daniel Kokotajlo · 6 Aug 2021 16:14 UTC
528 points
158 comments · 16 min read · LW link · 1 review

[Question] Seriously, what goes wrong with “reward the agent when it makes you smile”?

TurnTrout · 11 Aug 2022 22:22 UTC
87 points
43 comments · 2 min read · LW link

Ten people on the inside

Buck · 28 Jan 2025 16:41 UTC
131 points
27 comments · 4 min read · LW link

The case for unlearning that removes information from LLM weights

Fabien Roger · 14 Oct 2024 14:08 UTC
96 points
16 comments · 6 min read · LW link

MATS Applications + Research Directions I’m Currently Excited About

Neel Nanda · 6 Feb 2025 11:03 UTC
69 points
2 comments · 8 min read · LW link

“Sharp Left Turn” discourse: An opinionated review

Steven Byrnes · 28 Jan 2025 18:47 UTC
173 points
22 comments · 31 min read · LW link

UDT shows that decision theory is more puzzling than ever

Wei Dai · 13 Sep 2023 12:26 UTC
218 points
56 comments · 1 min read · LW link

In Defense of Open-Minded UDT

abramdemski · 12 Aug 2024 18:27 UTC
77 points
28 comments · 11 min read · LW link

Simple versus Short: Higher-order degeneracy and error-correction

Daniel Murfet · 11 Mar 2024 7:52 UTC
109 points
8 comments · 13 min read · LW link

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
244 points
95 comments · 10 min read · LW link

An overview of 11 proposals for building safe advanced AI

evhub · 29 May 2020 20:38 UTC
214 points
36 comments · 38 min read · LW link · 2 reviews

Tips and Code for Empirical Research Workflows

20 Jan 2025 22:31 UTC
78 points
10 comments · 20 min read · LW link

Frontier Models are Capable of In-context Scheming

5 Dec 2024 22:11 UTC
203 points
24 comments · 7 min read · LW link

Updatelessness doesn’t solve most problems

Martín Soto · 8 Feb 2024 17:30 UTC
130 points
45 comments · 12 min read · LW link