The Shutdown Problem: Three Theorems

EJT · 23 Oct 2023 21:00 UTC
70 points
21 comments · 1 min read · LW link
(philpapers.org)

The Shutdown Problem: Incomplete Preferences as a Solution

EJT · 23 Feb 2024 16:01 UTC
43 points
2 comments · 41 min read · LW link

What 2026 looks like

Daniel Kokotajlo · 6 Aug 2021 16:14 UTC
465 points
150 comments · 16 min read · LW link · 1 review

QAPR 5: grokking is maybe not *that* big a deal?

Quintin Pope · 23 Jul 2023 20:14 UTC
114 points
15 comments · 9 min read · LW link

Instrumental deception and manipulation in LLMs—a case study

Olli Järviniemi · 24 Feb 2024 2:07 UTC
24 points
3 comments · 12 min read · LW link

Why does generalization work?

Martín Soto · 20 Feb 2024 17:51 UTC
39 points
10 comments · 4 min read · LW link

Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI · 7 Oct 2023 23:30 UTC
136 points
8 comments · 4 min read · LW link

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

8 Aug 2023 1:30 UTC
304 points
26 comments · 18 min read · LW link

AI Timelines

10 Nov 2023 5:28 UTC
234 points
73 comments · 51 min read · LW link

A list of core AI safety problems and how I hope to solve them

davidad · 26 Aug 2023 15:12 UTC
157 points
23 comments · 5 min read · LW link

Critiques of the AI control agenda

Jozdien · 14 Feb 2024 19:25 UTC
47 points
14 comments · 9 min read · LW link

When do “brains beat brawn” in Chess? An experiment

titotal · 28 Jun 2023 13:33 UTC
293 points
79 comments · 7 min read · LW link
(titotal.substack.com)

Brainstorm of things that could force an AI team to burn their lead

So8res · 24 Jul 2022 23:58 UTC
134 points
7 comments · 13 min read · LW link

A Case for the Least Forgiving Take On Alignment

Thane Ruthenis · 2 May 2023 21:34 UTC
98 points
82 comments · 22 min read · LW link

Survey for alignment researchers!

2 Feb 2024 20:41 UTC
71 points
11 comments · 1 min read · LW link

Extinction Risks from AI: Invisible to Science?

21 Feb 2024 18:07 UTC
24 points
7 comments · 1 min read · LW link
(philpapers.org)

How LLMs are and are not myopic

janus · 25 Jul 2023 2:19 UTC
122 points
14 comments · 8 min read · LW link

Difficulty classes for alignment properties

Jozdien · 20 Feb 2024 9:08 UTC
26 points
5 comments · 2 min read · LW link

A Bird’s Eye View of the ML Field [Pragmatic AI Safety #2]

9 May 2022 17:18 UTC
162 points
6 comments · 35 min read · LW link

Weak vs Quantitative Extinction-level Goodhart’s Law

21 Feb 2024 17:38 UTC
17 points
1 comment · 2 min read · LW link