The Waluigi Effect (mega-post)

Cleo Nardo · 3 Mar 2023 3:22 UTC
615 points · 187 comments · 16 min read

My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”

Quintin Pope · 21 Mar 2023 0:06 UTC
362 points · 218 comments · 39 min read

Shutting Down the Lightcone Offices

14 Mar 2023 22:47 UTC
337 points · 93 comments · 17 min read

Understanding and controlling a maze-solving policy network

11 Mar 2023 18:59 UTC
312 points · 22 comments · 23 min read

Pausing AI Developments Isn’t Enough. We Need to Shut it All Down by Eliezer Yudkowsky

jacquesthibs · 29 Mar 2023 23:16 UTC
298 points · 296 comments · 3 min read
(time.com)

The Parable of the King and the Random Process

moridinamael · 1 Mar 2023 22:18 UTC
286 points · 22 comments · 6 min read

“Carefully Bootstrapped Alignment” is organizationally hard

Raemon · 17 Mar 2023 18:00 UTC
258 points · 22 comments · 11 min read

Discussion with Nate Soares on a key alignment difficulty

HoldenKarnofsky · 13 Mar 2023 21:20 UTC
250 points · 37 comments · 22 min read

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Beth Barnes · 19 Mar 2023 0:25 UTC
233 points · 54 comments · 8 min read
(evals.alignment.org)

Deep Deceptiveness

So8res · 21 Mar 2023 2:51 UTC
231 points · 58 comments · 14 min read

Actually, Othello-GPT Has A Linear Emergent World Representation

Neel Nanda · 29 Mar 2023 22:13 UTC
210 points · 24 comments · 19 min read
(neelnanda.io)

Natural Abstractions: Key claims, Theorems, and Critiques

16 Mar 2023 16:37 UTC
206 points · 20 comments · 45 min read

An AI risk argument that resonates with NYTimes readers

Julian Bradshaw · 12 Mar 2023 23:09 UTC
202 points · 14 comments · 1 min read

GPT-4 Plugs In

Zvi · 27 Mar 2023 12:10 UTC
198 points · 47 comments · 6 min read
(thezvi.wordpress.com)

Anthropic’s Core Views on AI Safety

Zac Hatfield-Dodds · 9 Mar 2023 16:55 UTC
181 points · 39 comments · 2 min read
(www.anthropic.com)

ChatGPT (and now GPT4) is very easily distracted from its rules

dmcs · 15 Mar 2023 17:55 UTC
178 points · 41 comments · 1 min read

A rough and incomplete review of some of John Wentworth’s research

So8res · 28 Mar 2023 18:52 UTC
175 points · 17 comments · 18 min read

Acausal normalcy

Andrew_Critch · 3 Mar 2023 23:34 UTC
167 points · 30 comments · 8 min read

What Discovering Latent Knowledge Did and Did Not Find

Fabien Roger · 13 Mar 2023 19:29 UTC
164 points · 16 comments · 11 min read

A stylized dialogue on John Wentworth’s claims about markets and optimization

So8res · 25 Mar 2023 22:32 UTC
159 points · 21 comments · 8 min read

What would a compute monitoring plan look like? [Linkpost]

Akash · 26 Mar 2023 19:33 UTC
157 points · 9 comments · 4 min read
(arxiv.org)

The salt in pasta water fallacy

Thomas Sepulchre · 27 Mar 2023 14:53 UTC
157 points · 38 comments · 3 min read

Why Not Just… Build Weak AI Tools For AI Alignment Research?

johnswentworth · 5 Mar 2023 0:12 UTC
156 points · 17 comments · 6 min read

AI: Practical Advice for the Worried

Zvi · 1 Mar 2023 12:30 UTC
153 points · 43 comments · 16 min read
(thezvi.wordpress.com)

Towards understanding-based safety evaluations

evhub · 15 Mar 2023 18:18 UTC
152 points · 16 comments · 5 min read

GPT-4

nz · 14 Mar 2023 17:02 UTC
150 points · 149 comments · 1 min read
(openai.com)

Comments on OpenAI’s “Planning for AGI and beyond”

So8res · 3 Mar 2023 23:01 UTC
148 points · 2 comments · 14 min read

Remarks 1–18 on GPT (compressed)

Cleo Nardo · 20 Mar 2023 22:27 UTC
146 points · 34 comments · 31 min read

Inside the mind of a superhuman Go model: How does Leela Zero read ladders?

Haoxing Du · 1 Mar 2023 1:47 UTC
146 points · 8 comments · 30 min read

Dan Luu on “You can only communicate one top priority”

Raemon · 18 Mar 2023 18:55 UTC
145 points · 17 comments · 3 min read
(twitter.com)

POC || GTFO culture as partial antidote to alignment wordcelism

lc · 15 Mar 2023 10:21 UTC
144 points · 10 comments · 7 min read

Speed running everyone through the bad alignment bingo. $5k bounty for a LW conversational agent

ArthurB · 9 Mar 2023 9:26 UTC
139 points · 32 comments · 2 min read

Why I’m not into the Free Energy Principle

Steven Byrnes · 2 Mar 2023 19:27 UTC
137 points · 47 comments · 9 min read

Against LLM Reductionism

Erich_Grunewald · 8 Mar 2023 15:52 UTC
137 points · 16 comments · 18 min read
(www.erichgrunewald.com)

Good News, Everyone!

jbash · 25 Mar 2023 13:48 UTC
133 points · 23 comments · 2 min read

Conceding a short timelines bet early

Matthew Barnett · 16 Mar 2023 21:49 UTC
132 points · 16 comments · 1 min read

[Linkpost] Some high-level thoughts on the DeepMind alignment team’s strategy

7 Mar 2023 11:55 UTC
128 points · 13 comments · 5 min read
(drive.google.com)

The Translucent Thoughts Hypotheses and Their Implications

Fabien Roger · 9 Mar 2023 16:30 UTC
126 points · 6 comments · 19 min read

FLI open letter: Pause giant AI experiments

Zach Stein-Perlman · 29 Mar 2023 4:04 UTC
126 points · 123 comments · 2 min read
(futureoflife.org)

Why Not Just Outsource Alignment Research To An AI?

johnswentworth · 9 Mar 2023 21:49 UTC
126 points · 47 comments · 9 min read

We have to Upgrade

Jed McCaleb · 23 Mar 2023 17:53 UTC
125 points · 35 comments · 2 min read

High Status Eschews Quantification of Performance

niplav · 19 Mar 2023 22:14 UTC
124 points · 36 comments · 5 min read

How bad a future do ML researchers expect?

KatjaGrace · 9 Mar 2023 4:50 UTC
121 points · 7 comments · 2 min read
(aiimpacts.org)

ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so

Christopher King · 15 Mar 2023 0:29 UTC
116 points · 22 comments · 2 min read

Manifold: If okay AGI, why?

Eliezer Yudkowsky · 25 Mar 2023 22:43 UTC
116 points · 37 comments · 1 min read
(manifold.markets)

GPT can write Quines now (GPT-4)

Andrew_Critch · 14 Mar 2023 19:18 UTC
111 points · 30 comments · 1 min read

Here, have a calmness video

Kaj_Sotala · 16 Mar 2023 10:00 UTC
111 points · 15 comments · 2 min read
(www.youtube.com)

The Overton Window widens: Examples of AI risk in the media

Akash · 23 Mar 2023 17:10 UTC
107 points · 24 comments · 6 min read

“Liquidity” vs “solvency” in bank runs (and some notes on Silicon Valley Bank)

rossry · 12 Mar 2023 9:16 UTC
107 points · 27 comments · 12 min read

Want to predict/explain/control the output of GPT-4? Then learn about the world, not about transformers.

Cleo Nardo · 16 Mar 2023 3:08 UTC
105 points · 26 comments · 5 min read