Lessons from Convergent Evolution for AI Alignment

27 Mar 2023 16:25 UTC
33 points
5 comments · 8 min read · LW link

A stylized dialogue on John Wentworth’s claims about markets and optimization

So8res · 25 Mar 2023 22:32 UTC
115 points
12 comments · 8 min read · LW link

My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”

Quintin Pope · 21 Mar 2023 0:06 UTC
318 points
173 comments · 33 min read · LW link

$500 Bounty/Contest: Explain Infra-Bayes In The Language Of Game Theory

johnswentworth · 25 Mar 2023 17:29 UTC
77 points
6 comments · 2 min read · LW link

Deep Deceptiveness

So8res · 21 Mar 2023 2:51 UTC
190 points
44 comments · 14 min read · LW link

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Beth Barnes · 19 Mar 2023 0:25 UTC
219 points
43 comments · 8 min read · LW link
(evals.alignment.org)

The Waluigi Effect (mega-post)

Cleo Nardo · 3 Mar 2023 3:22 UTC
575 points
169 comments · 16 min read · LW link

Understanding and controlling a maze-solving policy network

11 Mar 2023 18:59 UTC
286 points
17 comments · 22 min read · LW link

Truth and Advantage: Response to a draft of “AI safety seems hard to measure”

So8res · 22 Mar 2023 3:36 UTC
87 points
9 comments · 5 min read · LW link

Natural Abstractions: Key claims, Theorems, and Critiques

16 Mar 2023 16:37 UTC
169 points
13 comments · 45 min read · LW link

[Question] What happens with logical induction when...

Donald Hobson · 26 Mar 2023 18:31 UTC
18 points
2 comments · 1 min read · LW link

Discussion with Nate Soares on a key alignment difficulty

HoldenKarnofsky · 13 Mar 2023 21:20 UTC
208 points
26 comments · 22 min read · LW link

SolidGoldMagikarp (plus, prompt generation)

5 Feb 2023 22:02 UTC
646 points
194 comments · 12 min read · LW link

“Publish or Perish” (a quick note on why you should try to make your work legible to existing academic communities)

David Scott Krueger (formerly: capybaralet) · 18 Mar 2023 19:01 UTC
93 points
43 comments · 1 min read · LW link

Towards understanding-based safety evaluations

evhub · 15 Mar 2023 18:18 UTC
121 points
9 comments · 5 min read · LW link

Shell games

TsviBT · 19 Mar 2023 10:43 UTC
78 points
8 comments · 4 min read · LW link

What Discovering Latent Knowledge Did and Did Not Find

Fabien Roger · 13 Mar 2023 19:29 UTC
135 points
10 comments · 11 min read · LW link

Anthropic’s Core Views on AI Safety

Zac Hatfield-Dodds · 9 Mar 2023 16:55 UTC
178 points
38 comments · 2 min read · LW link
(www.anthropic.com)

Descriptive vs. specifiable values

TsviBT · 26 Mar 2023 9:10 UTC
13 points
1 comment · 2 min read · LW link

the QACI alignment plan: table of contents

carado · 21 Mar 2023 20:22 UTC
49 points
0 comments · 2 min read · LW link
(carado.moe)