
Attempts at Forwarding Speed Priors

24 Sep 2022 5:49 UTC
18 points
0 comments · 18 min read · LW link

Interpreting Neural Networks through the Polytope Lens

23 Sep 2022 17:58 UTC
106 points
8 comments · 33 min read · LW link

Methodological Therapy: An Agenda For Tackling Research Bottlenecks

22 Sep 2022 18:41 UTC
47 points
2 comments · 9 min read · LW link

Toy Models of Superposition

evhub · 21 Sep 2022 23:48 UTC
54 points
2 comments · 5 min read · LW link
(transformer-circuits.pub)

Announcing AISIC 2022 - the AI Safety Israel Conference, October 19-20

Davidmanheim · 21 Sep 2022 19:32 UTC
12 points
0 comments · 1 min read · LW link

Nearcast-based “deployment problem” analysis

HoldenKarnofsky · 21 Sep 2022 18:52 UTC
61 points
2 comments · 26 min read · LW link

Towards deconfusing wireheading and reward maximization

leogao · 21 Sep 2022 0:36 UTC
47 points
7 comments · 4 min read · LW link

Doing oversight from the very start of training seems hard

peterbarnett · 20 Sep 2022 17:21 UTC
14 points
3 comments · 3 min read · LW link

PIBBSS (AI alignment) is hiring for a Project Manager

Nora_Ammann · 19 Sep 2022 13:54 UTC
9 points
0 comments · 1 min read · LW link

The Inter-Agent Facet of AI Alignment

Michael Oesterle · 18 Sep 2022 20:39 UTC
12 points
1 comment · 5 min read · LW link

Summaries: Alignment Fundamentals Curriculum

Leon Lang · 18 Sep 2022 13:08 UTC
43 points
3 comments · 1 min read · LW link
(docs.google.com)

Inner alignment: what are we pointing at?

lcmgcd · 18 Sep 2022 11:09 UTC
7 points
1 comment · 1 min read · LW link

Refine’s Third Blog Post Day/Week

adamShimi · 17 Sep 2022 17:03 UTC
17 points
0 comments · 1 min read · LW link

Prize and fast track to alignment research at ALTER

Vanessa Kosoy · 17 Sep 2022 16:58 UTC
60 points
4 comments · 3 min read · LW link

Takeaways from our robust injury classifier project [Redwood Research]

DMZ · 17 Sep 2022 3:55 UTC
120 points
9 comments · 6 min read · LW link

Refine Blogpost Day #3: The shortforms I did write

Self-Embedded Agent · 16 Sep 2022 21:03 UTC
20 points
0 comments · 1 min read · LW link

Levels of goals and alignment

zeshen · 16 Sep 2022 16:44 UTC
25 points
4 comments · 6 min read · LW link

ordering capability thresholds

carado · 16 Sep 2022 16:36 UTC
24 points
0 comments · 4 min read · LW link
(carado.moe)

Representational Tethers: Tying AI Latents To Human Ones

Paul Bricman · 16 Sep 2022 14:45 UTC
26 points
0 comments · 16 min read · LW link

AGI safety researchers should focus (only/mostly) on deceptive alignment

Marius Hobbhahn · 15 Sep 2022 13:38 UTC
38 points
12 comments · 12 min read · LW link