Let’s think about slowing down AI

KatjaGrace · 22 Dec 2022 17:40 UTC
543 points
183 comments · 38 min read · LW link · 3 reviews
(aiimpacts.org)

Staring into the abyss as a core life skill

benkuhn · 22 Dec 2022 15:30 UTC
320 points
20 comments · 12 min read · LW link · 1 review
(www.benkuhn.net)

Models Don’t “Get Reward”

Sam Ringer · 30 Dec 2022 10:37 UTC
306 points
61 comments · 5 min read · LW link · 1 review

A challenge for AGI organizations, and a challenge for readers

1 Dec 2022 23:11 UTC
301 points
33 comments · 2 min read · LW link

Sazen

[DEACTIVATED] Duncan Sabien · 21 Dec 2022 7:54 UTC
274 points
83 comments · 12 min read · LW link · 2 reviews

AI alignment is distinct from its near-term applications

paulfchristiano · 13 Dec 2022 7:10 UTC
254 points
21 comments · 2 min read · LW link
(ai-alignment.com)

How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

Collin · 15 Dec 2022 18:22 UTC
243 points
39 comments · 16 min read · LW link · 1 review

Jailbreaking ChatGPT on Release Day

Zvi · 2 Dec 2022 13:10 UTC
242 points
77 comments · 6 min read · LW link · 1 review
(thezvi.wordpress.com)

The Plan - 2022 Update

johnswentworth · 1 Dec 2022 20:43 UTC
239 points
37 comments · 8 min read · LW link · 1 review

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

3 Dec 2022 0:58 UTC
195 points
35 comments · 20 min read · LW link · 1 review

What AI Safety Materials Do ML Researchers Find Compelling?

28 Dec 2022 2:03 UTC
175 points
34 comments · 2 min read · LW link

The next decades might be wild

Marius Hobbhahn · 15 Dec 2022 16:10 UTC
175 points
42 comments · 41 min read · LW link · 1 review

Finite Factored Sets in Pictures

Magdalena Wache · 11 Dec 2022 18:49 UTC
174 points
35 comments · 12 min read · LW link

Using GPT-Eliezer against ChatGPT Jailbreaking

6 Dec 2022 19:54 UTC
170 points
85 comments · 9 min read · LW link

Things that can kill you quickly: What everyone should know about first aid

jasoncrawford · 27 Dec 2022 16:23 UTC
166 points
21 comments · 2 min read · LW link · 1 review
(jasoncrawford.org)

Logical induction for software engineers

Alex Flint · 3 Dec 2022 19:55 UTC
160 points
8 comments · 27 min read · LW link · 1 review

A Year of AI Increasing AI Progress

ThomasW · 30 Dec 2022 2:09 UTC
148 points
3 comments · 2 min read · LW link

Updating my AI timelines

Matthew Barnett · 5 Dec 2022 20:46 UTC
143 points
50 comments · 2 min read · LW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout · 2 Dec 2022 2:43 UTC
139 points
22 comments · 47 min read · LW link · 3 reviews

[Question] How to Convince my Son that Drugs are Bad

concerned_dad · 17 Dec 2022 18:47 UTC
139 points
84 comments · 2 min read · LW link

Shard Theory in Nine Theses: a Distillation and Critical Appraisal

LawrenceC · 19 Dec 2022 22:52 UTC
138 points
30 comments · 18 min read · LW link

[Interim research report] Taking features out of superposition with sparse autoencoders

13 Dec 2022 15:41 UTC
137 points
22 comments · 22 min read · LW link · 2 reviews

K-complexity is silly; use cross-entropy instead

So8res · 20 Dec 2022 23:06 UTC
136 points
53 comments · 4 min read · LW link · 2 reviews

Shared reality: a key driver of human behavior

kdbscott · 24 Dec 2022 19:35 UTC
126 points
25 comments · 4 min read · LW link

Re-Examining LayerNorm

Eric Winsor · 1 Dec 2022 22:20 UTC
124 points
12 comments · 5 min read · LW link

Did ChatGPT just gaslight me?

ThomasW · 1 Dec 2022 5:41 UTC
123 points
45 comments · 9 min read · LW link
(aiwatchtower.substack.com)

[Question] Why The Focus on Expected Utility Maximisers?

DragonGod · 27 Dec 2022 15:49 UTC
116 points
84 comments · 3 min read · LW link

The case against AI alignment

andrew sauer · 24 Dec 2022 6:57 UTC
115 points
110 comments · 5 min read · LW link

Deconfusing Direct vs Amortised Optimization

beren · 2 Dec 2022 11:30 UTC
107 points
17 comments · 10 min read · LW link

Trying to disambiguate different questions about whether RLHF is “good”

Buck · 14 Dec 2022 4:03 UTC
106 points
47 comments · 7 min read · LW link · 1 review

Language models are nearly AGIs but we don’t notice it because we keep shifting the bar

philosophybear · 30 Dec 2022 5:15 UTC
105 points
13 comments · 7 min read · LW link

Slightly against aligning with neo-luddites

Matthew Barnett · 26 Dec 2022 22:46 UTC
104 points
31 comments · 4 min read · LW link

[Linkpost] The Story Of VaccinateCA

hath · 9 Dec 2022 23:54 UTC
103 points
4 comments · 10 min read · LW link
(www.worksinprogress.co)

Applied Linear Algebra Lecture Series

johnswentworth · 22 Dec 2022 6:57 UTC
102 points
7 comments · 1 min read · LW link

But is it really in Rome? An investigation of the ROME model editing technique

jacquesthibs · 30 Dec 2022 2:40 UTC
102 points
1 comment · 18 min read · LW link

Thoughts on AGI organizations and capabilities work

7 Dec 2022 19:46 UTC
102 points
17 comments · 5 min read · LW link

200 Concrete Open Problems in Mechanistic Interpretability: Introduction

Neel Nanda · 28 Dec 2022 21:06 UTC
102 points
0 comments · 10 min read · LW link

Finding gliders in the game of life

paulfchristiano · 1 Dec 2022 20:40 UTC
101 points
7 comments · 16 min read · LW link
(ai-alignment.com)

Bad at Arithmetic, Promising at Math

cohenmacaulay · 18 Dec 2022 5:40 UTC
100 points
19 comments · 20 min read · LW link · 1 review

Discovering Language Model Behaviors with Model-Written Evaluations

20 Dec 2022 20:08 UTC
100 points
34 comments · 1 min read · LW link
(www.anthropic.com)

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike · 5 Dec 2022 22:51 UTC
98 points
15 comments · 1 min read · LW link
(aligned.substack.com)

The LessWrong 2021 Review: Intellectual Circle Expansion

1 Dec 2022 21:17 UTC
95 points
55 comments · 8 min read · LW link

Revisiting algorithmic progress

13 Dec 2022 1:39 UTC
94 points
15 comments · 2 min read · LW link · 1 review
(arxiv.org)

Towards Hodge-podge Alignment

Cleo Nardo · 19 Dec 2022 20:12 UTC
91 points
30 comments · 9 min read · LW link

Setting the Zero Point

[DEACTIVATED] Duncan Sabien · 9 Dec 2022 6:06 UTC
90 points
43 comments · 20 min read · LW link · 1 review

Consider using reversible automata for alignment research

Alex_Altair · 11 Dec 2022 1:00 UTC
88 points
30 comments · 2 min read · LW link

Can we efficiently distinguish different mechanisms?

paulfchristiano · 27 Dec 2022 0:20 UTC
88 points
30 comments · 16 min read · LW link
(ai-alignment.com)

Local Memes Against Geometric Rationality

Scott Garrabrant · 21 Dec 2022 3:53 UTC
85 points
3 comments · 6 min read · LW link

You can still fetch the coffee today if you’re dead tomorrow

davidad · 9 Dec 2022 14:06 UTC
84 points
19 comments · 5 min read · LW link

A hundredth of a bit of extra entropy

Adam Scherlis · 24 Dec 2022 21:12 UTC
83 points
4 comments · 3 min read · LW link