The Waluigi Effect (mega-post)

Cleo Nardo · 3 Mar 2023 3:22 UTC
648 points
188 comments · 16 min read · LW link

My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”

Quintin Pope · 21 Mar 2023 0:06 UTC
363 points
233 comments · 39 min read · LW link · 1 review

Shutting Down the Lightcone Offices

14 Mar 2023 22:47 UTC
339 points
103 comments · 17 min read · LW link · 2 reviews

Understanding and controlling a maze-solving policy network

11 Mar 2023 18:59 UTC
334 points
28 comments · 23 min read · LW link

The Parable of the King and the Random Process

moridinamael · 1 Mar 2023 22:18 UTC
315 points
26 comments · 6 min read · LW link · 3 reviews

Pausing AI Developments Isn’t Enough. We Need to Shut it All Down by Eliezer Yudkowsky

jacquesthibs · 29 Mar 2023 23:16 UTC
294 points
297 comments · 3 min read · LW link
(time.com)

Discussion with Nate Soares on a key alignment difficulty

HoldenKarnofsky · 13 Mar 2023 21:20 UTC
276 points
43 comments · 22 min read · LW link · 1 review

Deep Deceptiveness

So8res · 21 Mar 2023 2:51 UTC
268 points
60 comments · 14 min read · LW link · 1 review

“Carefully Bootstrapped Alignment” is organizationally hard

Raemon · 17 Mar 2023 18:00 UTC
266 points
23 comments · 11 min read · LW link · 1 review

Natural Abstractions: Key Claims, Theorems, and Critiques

16 Mar 2023 16:37 UTC
247 points
26 comments · 45 min read · LW link · 3 reviews

The salt in pasta water fallacy

Thomas Sepulchre · 27 Mar 2023 14:53 UTC
244 points
52 comments · 3 min read · LW link · 2 reviews

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Beth Barnes · 19 Mar 2023 0:25 UTC
233 points
54 comments · 8 min read · LW link
(evals.alignment.org)

Actually, Othello-GPT Has A Linear Emergent World Representation

Neel Nanda · 29 Mar 2023 22:13 UTC
213 points
26 comments · 19 min read · LW link
(neelnanda.io)

An AI risk argument that resonates with NYTimes readers

Julian Bradshaw · 12 Mar 2023 23:09 UTC
212 points
14 comments · 1 min read · LW link

Acausal normalcy

Andrew_Critch · 3 Mar 2023 23:34 UTC
204 points
40 comments · 8 min read · LW link · 1 review

GPT-4 Plugs In

Zvi · 27 Mar 2023 12:10 UTC
198 points
47 comments · 6 min read · LW link
(thezvi.wordpress.com)

Why Not Just… Build Weak AI Tools For AI Alignment Research?

johnswentworth · 5 Mar 2023 0:12 UTC
187 points
18 comments · 6 min read · LW link

ChatGPT (and now GPT4) is very easily distracted from its rules

dmcs · 15 Mar 2023 17:55 UTC
180 points
42 comments · 1 min read · LW link

A rough and incomplete review of some of John Wentworth’s research

So8res · 28 Mar 2023 18:52 UTC
176 points
18 comments · 18 min read · LW link

Anthropic’s Core Views on AI Safety

Zac Hatfield-Dodds · 9 Mar 2023 16:55 UTC
173 points
39 comments · 2 min read · LW link
(www.anthropic.com)

Why I’m not into the Free Energy Principle

Steven Byrnes · 2 Mar 2023 19:27 UTC
170 points
55 comments · 9 min read · LW link · 1 review

A stylized dialogue on John Wentworth’s claims about markets and optimization

So8res · 25 Mar 2023 22:32 UTC
169 points
22 comments · 8 min read · LW link

What Discovering Latent Knowledge Did and Did Not Find

Fabien Roger · 13 Mar 2023 19:29 UTC
166 points
17 comments · 11 min read · LW link

Towards understanding-based safety evaluations

evhub · 15 Mar 2023 18:18 UTC
164 points
16 comments · 5 min read · LW link

POC || GTFO culture as partial antidote to alignment wordcelism

lc · 15 Mar 2023 10:21 UTC
162 points
17 comments · 7 min read · LW link · 2 reviews

Inside the mind of a superhuman Go model: How does Leela Zero read ladders?

Haoxing Du · 1 Mar 2023 1:47 UTC
159 points
8 comments · 30 min read · LW link

Why Not Just Outsource Alignment Research To An AI?

johnswentworth · 9 Mar 2023 21:49 UTC
159 points
50 comments · 9 min read · LW link · 1 review

What would a compute monitoring plan look like? [Linkpost]

Orpheus16 · 26 Mar 2023 19:33 UTC
158 points
10 comments · 4 min read · LW link
(arxiv.org)

AI: Practical Advice for the Worried

Zvi · 1 Mar 2023 12:30 UTC
156 points
49 comments · 16 min read · LW link · 2 reviews
(thezvi.wordpress.com)

GPT-4

nz · 14 Mar 2023 17:02 UTC
151 points
150 comments · 1 min read · LW link
(openai.com)

Comments on OpenAI’s “Planning for AGI and beyond”

So8res · 3 Mar 2023 23:01 UTC
149 points
2 comments · 14 min read · LW link

Dan Luu on “You can only communicate one top priority”

Raemon · 18 Mar 2023 18:55 UTC
149 points
18 comments · 3 min read · LW link
(twitter.com)

Remarks 1–18 on GPT (compressed)

Cleo Nardo · 20 Mar 2023 22:27 UTC
147 points
35 comments · 31 min read · LW link

The Translucent Thoughts Hypotheses and Their Implications

Fabien Roger · 9 Mar 2023 16:30 UTC
142 points
7 comments · 19 min read · LW link

Speed running everyone through the bad alignment bingo. $5k bounty for a LW conversational agent

ArthurB · 9 Mar 2023 9:26 UTC
140 points
33 comments · 2 min read · LW link

Against LLM Reductionism

Erich_Grunewald · 8 Mar 2023 15:52 UTC
140 points
17 comments · 18 min read · LW link
(www.erichgrunewald.com)

Conceding a short timelines bet early

Matthew Barnett · 16 Mar 2023 21:49 UTC
134 points
17 comments · 1 min read · LW link

Good News, Everyone!

jbash · 25 Mar 2023 13:48 UTC
133 points
23 comments · 2 min read · LW link

We have to Upgrade

Jed McCaleb · 23 Mar 2023 17:53 UTC
131 points
35 comments · 2 min read · LW link

High Status Eschews Quantification of Performance

niplav · 19 Mar 2023 22:14 UTC
128 points
36 comments · 5 min read · LW link

[Linkpost] Some high-level thoughts on the DeepMind alignment team’s strategy

7 Mar 2023 11:55 UTC
128 points
13 comments · 5 min read · LW link
(drive.google.com)

FLI open letter: Pause giant AI experiments

Zach Stein-Perlman · 29 Mar 2023 4:04 UTC
126 points
123 comments · 2 min read · LW link
(futureoflife.org)

How bad a future do ML researchers expect?

KatjaGrace · 9 Mar 2023 4:50 UTC
122 points
8 comments · 2 min read · LW link
(aiimpacts.org)

Manifold: If okay AGI, why?

Eliezer Yudkowsky · 25 Mar 2023 22:43 UTC
121 points
37 comments · 1 min read · LW link
(manifold.markets)

ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so

Christopher King · 15 Mar 2023 0:29 UTC
116 points
22 comments · 2 min read · LW link

Parasitic Language Games: maintaining ambiguity to hide conflict while burning the commons

Hazard · 12 Mar 2023 5:25 UTC
116 points
18 comments · 13 min read · LW link

GPT can write Quines now (GPT-4)

Andrew_Critch · 14 Mar 2023 19:18 UTC
112 points
30 comments · 1 min read · LW link

“Publish or Perish” (a quick note on why you should try to make your work legible to existing academic communities)

David Scott Krueger (formerly: capybaralet) · 18 Mar 2023 19:01 UTC
112 points
49 comments · 1 min read · LW link · 1 review

Here, have a calmness video

Kaj_Sotala · 16 Mar 2023 10:00 UTC
112 points
15 comments · 2 min read · LW link
(www.youtube.com)

“Liquidity” vs “solvency” in bank runs (and some notes on Silicon Valley Bank)

rossry · 12 Mar 2023 9:16 UTC
108 points
27 comments · 12 min read · LW link