Analogies between scaling labs and misaligned superintelligent AI

scasper · 21 Feb 2024 19:29 UTC
52 points
2 comments · 4 min read · LW link

Extinction Risks from AI: Invisible to Science?

21 Feb 2024 18:07 UTC
22 points
6 comments · 1 min read · LW link
(philpapers.org)

Extinction-level Goodhart’s Law as a Property of the Environment

21 Feb 2024 17:56 UTC
18 points
0 comments · 10 min read · LW link

Which Model Properties are Necessary for Evaluating an Argument?

21 Feb 2024 17:52 UTC
15 points
0 comments · 7 min read · LW link

Dynamics Crucial to AI Risk Seem to Make for Complicated Models

21 Feb 2024 17:54 UTC
15 points
1 comment · 9 min read · LW link

Weak vs Quantitative Extinction-level Goodhart’s Law

21 Feb 2024 17:38 UTC
14 points
0 comments · 2 min read · LW link

Why does generalization work?

Martín Soto · 20 Feb 2024 17:51 UTC
36 points
9 comments · 4 min read · LW link

Difficulty classes for alignment properties

Jozdien · 20 Feb 2024 9:08 UTC
23 points
4 comments · 2 min read · LW link

Protocol evaluations: good analogies vs control

Fabien Roger · 19 Feb 2024 18:00 UTC
29 points
8 comments · 11 min read · LW link

Fixing Feature Suppression in SAEs

16 Feb 2024 18:32 UTC
69 points
2 comments · 10 min read · LW link

Self-Awareness: Taxonomy and eval suite proposal

Daniel Kokotajlo · 17 Feb 2024 1:47 UTC
52 points
0 comments · 11 min read · LW link

The Pointer Resolution Problem

Jozdien · 16 Feb 2024 21:25 UTC
39 points
18 comments · 3 min read · LW link

Updatelessness doesn’t solve most problems

Martín Soto · 8 Feb 2024 17:30 UTC
116 points
34 comments · 12 min read · LW link

Critiques of the AI control agenda

Jozdien · 14 Feb 2024 19:25 UTC
47 points
11 comments · 9 min read · LW link

The case for ensuring that powerful AIs are controlled

24 Jan 2024 16:11 UTC
234 points
66 comments · 28 min read · LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

7 Feb 2024 21:28 UTC
86 points
13 comments · 9 min read · LW link
(arxiv.org)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
279 points
94 comments · 3 min read · LW link
(arxiv.org)

How to train your own “Sleeper Agents”

evhub · 7 Feb 2024 0:31 UTC
88 points
4 comments · 1 min read · LW link

Retrospective: PIBBSS Fellowship 2023

16 Feb 2024 17:48 UTC
29 points
0 comments · 8 min read · LW link

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

26 Jan 2024 7:22 UTC
148 points
59 comments · 57 min read · LW link