
AI

Artificial Intelligence is the study of creating intelligence in algorithms. On LessWrong, the primary focus of AI discussion is to ensure that as humanity builds increasingly powerful AI systems, the outcome will be good. The central concern is that a powerful enough AI, if not designed and implemented with sufficient understanding, would optimize something unintended by its creators and pose an existential threat to the future of humanity. This is known as the AI alignment problem.

Common terms in this space are superintelligence, AI Alignment, AI Safety, Friendly AI, Transformative AI, human-level intelligence, AI Governance, and Beneficial AI. This entry and the associated tag roughly encompass all of these topics: anything that is part of the broad cluster of understanding AI and its future impacts on our civilization deserves this tag.

AI Alignment

There are narrow conceptions of alignment, where you’re trying to get the AI to do something like cure Alzheimer’s disease without destroying the rest of the world. And there are much more ambitious notions of alignment, where you’re trying to get it to do the right thing and achieve a happy intergalactic civilization.

But both the narrow and the ambitious notions of alignment have in common that you’re trying to have the AI do that thing rather than making a lot of paperclips.

See also General Intelligence.

Basic Alignment Theory

AIXI
Corrigibility
Decision Theory
Embedded Agency
Fixed Point Theorems
Goodhart’s Law
Inner Alignment
Instrumental Convergence
Logical Induction
Mesa-Optimization
Myopia
Newcomb’s Problem
Optimization
Orthogonality Thesis
Outer Alignment
Solomonoff Induction
Utility Functions

Engineering Alignment

AI Boxing (Containment)
Debate
Factored Cognition
Humans Consulting HCH
Impact Measures
Iterated Amplification
Value Learning

Strategy

AI Progress
AI Risk
AI Services (CAIS)
AI Takeoff
AI Timelines

Other

Centre for Human-Compatible AI
Future of Humanity Institute
GPT
Machine Intelligence Research Institute
OpenAI
Ought
Research Agendas

There’s No Fire Alarm for Artificial General Intelligence

Eliezer Yudkowsky
13 Oct 2017 21:38 UTC
293 points
67 comments · 25 min read · LW link

An overview of 11 proposals for building safe advanced AI

evhub
29 May 2020 20:38 UTC
153 points
29 comments · 38 min read · LW link

Superintelligence FAQ

Scott Alexander
20 Sep 2016 19:00 UTC
35 points
5 comments · 27 min read · LW link

Embedded Agents

29 Oct 2018 19:53 UTC
200 points
41 comments · 1 min read · LW link · 6 nominations · 2 reviews

What failure looks like

paulfchristiano
17 Mar 2019 20:18 UTC
230 points
44 comments · 8 min read · LW link

Challenges to Christiano’s capability amplification proposal

Eliezer Yudkowsky
19 May 2018 18:18 UTC
180 points
53 comments · 23 min read · LW link · 3 nominations · 1 review

Risks from Learned Optimization: Introduction

31 May 2019 23:44 UTC
135 points
32 comments · 12 min read · LW link

The Rocket Alignment Problem

Eliezer Yudkowsky
4 Oct 2018 0:38 UTC
175 points
41 comments · 15 min read · LW link · 6 nominations · 2 reviews

Embedded Agency (full-text version)

15 Nov 2018 19:49 UTC
119 points
9 comments · 54 min read · LW link

AI Alignment 2018-19 Review

rohinmshah
28 Jan 2020 2:19 UTC
143 points
6 comments · 35 min read · LW link

Goodhart Taxonomy

Scott Garrabrant
30 Dec 2017 16:38 UTC
207 points
33 comments · 10 min read · LW link

A space of proposals for building safe advanced AI

Richard_Ngo
10 Jul 2020 16:58 UTC
42 points
0 comments · 4 min read · LW link

That Alien Message

Eliezer Yudkowsky
22 May 2008 5:55 UTC
179 points
171 comments · 10 min read · LW link

Robustness to Scale

Scott Garrabrant
21 Feb 2018 22:55 UTC
177 points
21 comments · 2 min read · LW link · 3 nominations · 1 review

Chris Olah’s views on AGI safety

evhub
1 Nov 2019 20:13 UTC
157 points
34 comments · 12 min read · LW link

[AN #96]: Buck and I discuss/argue about AI Alignment

rohinmshah
22 Apr 2020 17:20 UTC
17 points
4 comments · 10 min read · LW link
(mailchi.mp)

Matt Botvinick on the spontaneous emergence of learning algorithms

Adam Scholl
12 Aug 2020 7:47 UTC
137 points
88 comments · 5 min read · LW link

Coherence arguments do not imply goal-directed behavior

rohinmshah
3 Dec 2018 3:26 UTC
79 points
65 comments · 7 min read · LW link · 2 nominations · 3 reviews

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

stuhlmueller
21 Jul 2020 20:06 UTC
86 points
40 comments · 3 min read · LW link

Alignment By Default

johnswentworth
12 Aug 2020 18:54 UTC
99 points
79 comments · 11 min read · LW link

AlphaGo Zero and the Foom Debate

Eliezer Yudkowsky
21 Oct 2017 2:18 UTC
291 points
16 comments · 3 min read · LW link

Tradeoff between desirable properties for baseline choices in impact measures

Vika
4 Jul 2020 11:56 UTC
37 points
24 comments · 5 min read · LW link

Discontinuous progress in history: an update

KatjaGrace
14 Apr 2020 0:00 UTC
163 points
20 comments · 31 min read · LW link
(aiimpacts.org)

Replication Dynamics Bridge to RL in Thermodynamic Limit

Zachary Robertson
18 May 2020 1:02 UTC
6 points
1 comment · 2 min read · LW link

The ground of optimization

alexflint
20 Jun 2020 0:38 UTC
128 points
62 comments · 27 min read · LW link

Modelling Continuous Progress

SDM
23 Jun 2020 18:06 UTC
28 points
3 comments · 7 min read · LW link

Classification of AI alignment research: deconfusion, “good enough” non-superintelligent AI alignment, superintelligent AI alignment

crabman
14 Jul 2020 22:48 UTC
34 points
25 comments · 3 min read · LW link

Collection of GPT-3 results

Kaj_Sotala
18 Jul 2020 20:04 UTC
84 points
24 comments · 1 min read · LW link
(twitter.com)

An Untrollable Mathematician Illustrated

abramdemski
20 Mar 2018 0:00 UTC
275 points
38 comments · 1 min read · LW link · 2 nominations · 1 review

Con­di­tions for Mesa-Optimization

1 Jun 2019 20:52 UTC
59 points
44 comments12 min readLW link

Thoughts on Hu­man Models

21 Feb 2019 9:10 UTC
126 points
24 comments10 min readLW link

In­ner al­ign­ment in the brain

steve2152
22 Apr 2020 13:14 UTC
69 points
9 comments15 min readLW link

Prob­lem re­lax­ation as a tactic

TurnTrout
22 Apr 2020 23:44 UTC
103 points
8 comments7 min readLW link

[Question] How should po­ten­tial AI al­ign­ment re­searchers gauge whether the field is right for them?

TurnTrout
6 May 2020 12:24 UTC
20 points
5 comments1 min readLW link

Speci­fi­ca­tion gam­ing: the flip side of AI ingenuity

6 May 2020 23:51 UTC
46 points
8 comments6 min readLW link

Les­sons from Isaac: Pit­falls of Reason

adamShimi
8 May 2020 20:44 UTC
10 points
0 comments8 min readLW link

Cor­rigi­bil­ity as out­side view

TurnTrout
8 May 2020 21:56 UTC
39 points
11 comments4 min readLW link

[Question] How to choose a PhD with AI Safety in mind

Ariel Kwiatkowski
15 May 2020 22:19 UTC
9 points
1 comment1 min readLW link

Re­ward func­tions and up­dat­ing as­sump­tions can hide a mul­ti­tude of sins

Stuart_Armstrong
18 May 2020 15:18 UTC
16 points
2 comments9 min readLW link

Pos­si­ble take­aways from the coro­n­avirus pan­demic for slow AI takeoff

Vika
31 May 2020 17:51 UTC
128 points
35 comments3 min readLW link

Fo­cus: you are al­lowed to be bad at ac­com­plish­ing your goals

adamShimi
3 Jun 2020 21:04 UTC
20 points
17 comments3 min readLW link

Re­ply to Paul Chris­ti­ano on Inac­cessible Information

alexflint
5 Jun 2020 9:10 UTC
76 points
15 comments6 min readLW link

Our take on CHAI’s re­search agenda in un­der 1500 words

alexflint
17 Jun 2020 12:24 UTC
95 points
19 comments5 min readLW link

[Question] Ques­tion on GPT-3 Ex­cel Demo

Zhitao Hou
22 Jun 2020 20:31 UTC
0 points
3 comments1 min readLW link

Dy­namic in­con­sis­tency of the in­ac­tion and ini­tial state baseline

Stuart_Armstrong
7 Jul 2020 12:02 UTC
30 points
8 comments2 min readLW link

Cortés, Pizarro, and Afonso as Prece­dents for Takeover

Daniel Kokotajlo
1 Mar 2020 3:49 UTC
117 points
66 comments11 min readLW link

[Question] What prob­lem would you like to see Re­in­force­ment Learn­ing ap­plied to?

Julian Schrittwieser
8 Jul 2020 2:40 UTC
45 points
4 comments1 min readLW link

[Question] To what ex­tent is GPT-3 ca­pa­ble of rea­son­ing?

TurnTrout
20 Jul 2020 17:10 UTC
62 points
74 comments16 min readLW link

Repli­cat­ing the repli­ca­tion crisis with GPT-3?

skybrian
22 Jul 2020 21:20 UTC
30 points
9 comments1 min readLW link

Can you get AGI from a Trans­former?

steve2152
23 Jul 2020 15:27 UTC
65 points
16 comments11 min readLW link

Writ­ing with GPT-3

Jacobian
24 Jul 2020 15:22 UTC
41 points
0 comments4 min readLW link

In­ner Align­ment: Ex­plain like I’m 12 Edition

Rafael Harth
1 Aug 2020 15:24 UTC
113 points
13 comments12 min readLW link

Devel­op­men­tal Stages of GPTs

orthonormal
26 Jul 2020 22:03 UTC
119 points
71 comments7 min readLW link

Gen­er­al­iz­ing the Power-Seek­ing Theorems

TurnTrout
27 Jul 2020 0:28 UTC
39 points
2 comments6 min readLW link

Are we in an AI over­hang?

Andy Jones
27 Jul 2020 12:48 UTC
224 points
87 comments4 min readLW link

[Question] What spe­cific dan­gers arise when ask­ing GPT-N to write an Align­ment Fo­rum post?

Matthew Barnett
28 Jul 2020 2:56 UTC
41 points
14 comments1 min readLW link

[Question] Prob­a­bil­ity that other ar­chi­tec­tures will scale as well as Trans­form­ers?

Daniel Kokotajlo
28 Jul 2020 19:36 UTC
22 points
4 comments1 min readLW link

What a 20-year-lead in mil­i­tary tech might look like

Daniel Kokotajlo
29 Jul 2020 20:10 UTC
64 points
41 comments16 min readLW link

[Question] What if memes are com­mon in highly ca­pa­ble minds?

Daniel Kokotajlo
30 Jul 2020 20:45 UTC
29 points
7 comments2 min readLW link

Three men­tal images from think­ing about AGI de­bate & corrigibility

steve2152
3 Aug 2020 14:29 UTC
49 points
35 comments4 min readLW link

Solv­ing Key Align­ment Prob­lems Group

elriggs
3 Aug 2020 19:30 UTC
20 points
7 comments2 min readLW link

How eas­ily can we sep­a­rate a friendly AI in de­sign space from one which would bring about a hy­per­ex­is­ten­tial catas­tro­phe?

Anirandis
10 Sep 2020 0:40 UTC
18 points
20 comments2 min readLW link

My com­pu­ta­tional frame­work for the brain

steve2152
14 Sep 2020 14:19 UTC
81 points
11 comments12 min readLW link

[Question] Where is hu­man level on text pre­dic­tion? (GPTs task)

Daniel Kokotajlo
20 Sep 2020 9:00 UTC
20 points
17 comments1 min readLW link

Needed: AI in­fo­haz­ard policy

Vanessa Kosoy
21 Sep 2020 15:26 UTC
43 points
16 comments2 min readLW link

[AN #94]: AI al­ign­ment as trans­la­tion be­tween hu­mans and machines

rohinmshah
8 Apr 2020 17:10 UTC
11 points
0 comments7 min readLW link
(mailchi.mp)

[Question] What are the rel­a­tive speeds of AI ca­pa­bil­ities and AI safety?

NunoSempere
24 Apr 2020 18:21 UTC
8 points
2 comments1 min readLW link

Seek­ing Power is Often Prov­ably In­stru­men­tally Con­ver­gent in MDPs

5 Dec 2019 2:33 UTC
116 points
25 comments11 min readLW link
(arxiv.org)

“Don’t even think about hell”

emmab
2 May 2020 8:06 UTC
6 points
2 comments1 min readLW link

[Question] AI Box­ing for Hard­ware-bound agents (aka the China al­ign­ment prob­lem)

Logan Zoellner
8 May 2020 15:50 UTC
11 points
27 comments10 min readLW link

Could We Give an AI a Solu­tion?

Liam Goddard
15 May 2020 21:38 UTC
3 points
2 comments2 min readLW link

Point­ing to a Flower

johnswentworth
18 May 2020 18:54 UTC
51 points
18 comments9 min readLW link

Learn­ing and ma­nipu­lat­ing learning

Stuart_Armstrong
19 May 2020 13:02 UTC
40 points
4 comments10 min readLW link

[Question] Why aren’t we test­ing gen­eral in­tel­li­gence dis­tri­bu­tion?

Bob Jacobs
26 May 2020 16:07 UTC
24 points
7 comments1 min readLW link

OpenAI an­nounces GPT-3

gwern
29 May 2020 1:49 UTC
65 points
23 comments1 min readLW link
(arxiv.org)

GPT-3: a dis­ap­point­ing paper

nostalgebraist
29 May 2020 19:06 UTC
54 points
37 comments8 min readLW link

In­tro­duc­tion to Ex­is­ten­tial Risks from Ar­tifi­cial In­tel­li­gence, for an EA audience

JoshuaFox
2 Jun 2020 8:30 UTC
9 points
1 comment1 min readLW link

Prepar­ing for “The Talk” with AI projects

Daniel Kokotajlo
13 Jun 2020 23:01 UTC
61 points
16 comments3 min readLW link

[Question] What are the high-level ap­proaches to AI al­ign­ment?

G Gordon Worley III
16 Jun 2020 17:10 UTC
13 points
13 comments1 min readLW link

Re­sults of $1,000 Or­a­cle con­test!

Stuart_Armstrong
17 Jun 2020 17:44 UTC
55 points
2 comments1 min readLW link

[Question] Like­li­hood of hy­per­ex­is­ten­tial catas­tro­phe from a bug?

Anirandis
18 Jun 2020 16:23 UTC
11 points
27 comments1 min readLW link

AI Benefits Post 1: In­tro­duc­ing “AI Benefits”

Cullen_OKeefe
22 Jun 2020 16:59 UTC
11 points
3 comments3 min readLW link

Goals and short descriptions

Michele Campolo
2 Jul 2020 17:41 UTC
14 points
8 comments5 min readLW link

Re­search ideas to study hu­mans with AI Safety in mind

Riccardo Volpato
3 Jul 2020 16:01 UTC
22 points
2 comments5 min readLW link

AI Benefits Post 3: Direct and Indi­rect Ap­proaches to AI Benefits

Cullen_OKeefe
6 Jul 2020 18:48 UTC
8 points
0 comments2 min readLW link

An­titrust-Com­pli­ant AI In­dus­try Self-Regulation

Cullen_OKeefe
7 Jul 2020 20:53 UTC
9 points
3 comments1 min readLW link
(cullenokeefe.com)

Should AI Be Open?

Scott Alexander
17 Dec 2015 8:25 UTC
12 points
2 comments13 min readLW link

Meta Pro­gram­ming GPT: A route to Su­per­in­tel­li­gence?

dmtea
11 Jul 2020 14:51 UTC
10 points
7 comments4 min readLW link

The Dilemma of Worse Than Death Scenarios

arkaeik
10 Jul 2018 9:18 UTC
3 points
17 comments4 min readLW link

[Question] What are the mostly likely ways AGI will emerge?

Craig Quiter
14 Jul 2020 0:58 UTC
3 points
7 comments1 min readLW link

AI Benefits Post 4: Out­stand­ing Ques­tions on Select­ing Benefits

Cullen_OKeefe
14 Jul 2020 17:26 UTC
4 points
4 comments5 min readLW link

Solv­ing Math Prob­lems by Relay

17 Jul 2020 15:32 UTC
87 points
26 comments7 min readLW link

AI Benefits Post 5: Out­stand­ing Ques­tions on Govern­ing Benefits

Cullen_OKeefe
21 Jul 2020 16:46 UTC
4 points
0 comments4 min readLW link

[Question] Why is pseudo-al­ign­ment “worse” than other ways ML can fail to gen­er­al­ize?

nostalgebraist
18 Jul 2020 22:54 UTC
43 points
9 comments2 min readLW link

[Question] “Do Noth­ing” util­ity func­tion, 3½ years later?

niplav
20 Jul 2020 11:09 UTC
5 points
3 comments1 min readLW link

[AN #80]: Why AI risk might be solved with­out ad­di­tional in­ter­ven­tion from longtermists

rohinmshah
2 Jan 2020 18:20 UTC
36 points
93 comments10 min readLW link
(mailchi.mp)

Ac­cess to AI: a hu­man right?

dmtea
25 Jul 2020 9:38 UTC
5 points
3 comments2 min readLW link

The Rise of Com­mon­sense Reasoning

DragonGod
27 Jul 2020 19:01 UTC
8 points
0 comments1 min readLW link
(www.reddit.com)

AI and Efficiency

DragonGod
27 Jul 2020 20:58 UTC
9 points
1 comment1 min readLW link
(openai.com)

FHI Re­port: How Will Na­tional Se­cu­rity Con­sid­er­a­tions Affect An­titrust De­ci­sions in AI? An Ex­am­i­na­tion of His­tor­i­cal Precedents

Cullen_OKeefe
28 Jul 2020 18:34 UTC
2 points
0 comments1 min readLW link
(www.fhi.ox.ac.uk)

The “best pre­dic­tor is mal­i­cious op­ti­miser” problem

Donald Hobson
29 Jul 2020 11:49 UTC
14 points
10 comments2 min readLW link

Suffi­ciently Ad­vanced Lan­guage Models Can Do Re­in­force­ment Learning

Zachary Robertson
2 Aug 2020 15:32 UTC
23 points
7 comments7 min readLW link

[Question] What are the most im­por­tant pa­pers/​post/​re­sources to read to un­der­stand more of GPT-3?

adamShimi
2 Aug 2020 20:53 UTC
25 points
4 comments1 min readLW link

[Question] What should an Ein­stein-like figure in Ma­chine Learn­ing do?

Razied
5 Aug 2020 23:52 UTC
3 points
3 comments1 min readLW link

Book re­view: Ar­chi­tects of In­tel­li­gence by Martin Ford (2018)

ofer
11 Aug 2020 17:30 UTC
15 points
0 comments2 min readLW link

[Question] Will OpenAI’s work un­in­ten­tion­ally in­crease ex­is­ten­tial risks re­lated to AI?

adamShimi
11 Aug 2020 18:16 UTC
56 points
48 comments1 min readLW link

Blog post: A tale of two re­search communities

alenglander
12 Aug 2020 20:41 UTC
14 points
0 comments4 min readLW link

Map­ping Out Alignment

15 Aug 2020 1:02 UTC
42 points
0 comments5 min readLW link

My Un­der­stand­ing of Paul Chris­ti­ano’s Iter­ated Am­plifi­ca­tion AI Safety Re­search Agenda

Chi Nguyen
15 Aug 2020 20:02 UTC
94 points
11 comments39 min readLW link

GPT-3, be­lief, and consistency

skybrian
16 Aug 2020 23:12 UTC
19 points
7 comments2 min readLW link

[Question] What pre­cisely do we mean by AI al­ign­ment?

G Gordon Worley III
9 Dec 2018 2:23 UTC
29 points
8 comments1 min readLW link

Thoughts on the Fea­si­bil­ity of Pro­saic AGI Align­ment?

iamthouthouarti
21 Aug 2020 23:25 UTC
8 points
10 comments1 min readLW link

[Question] Fore­cast­ing Thread: AI Timelines

22 Aug 2020 2:33 UTC
112 points
84 comments2 min readLW link

Learn­ing hu­man prefer­ences: black-box, white-box, and struc­tured white-box access

Stuart_Armstrong
24 Aug 2020 11:42 UTC
23 points
9 comments6 min readLW link

Proofs Sec­tion 2.3 (Up­dates, De­ci­sion The­ory)

Diffractor
27 Aug 2020 7:49 UTC
7 points
0 comments31 min readLW link

Proofs Sec­tion 2.2 (Iso­mor­phism to Ex­pec­ta­tions)

Diffractor
27 Aug 2020 7:52 UTC
7 points
0 comments46 min readLW link

Proofs Sec­tion 2.1 (The­o­rem 1, Lem­mas)

Diffractor
27 Aug 2020 7:54 UTC
7 points
0 comments36 min readLW link

Proofs Sec­tion 1.1 (Ini­tial re­sults to LF-du­al­ity)

Diffractor
27 Aug 2020 7:59 UTC
6 points
0 comments20 min readLW link

Proofs Sec­tion 1.2 (Mix­tures, Up­dates, Push­for­wards)

Diffractor
27 Aug 2020 7:57 UTC
7 points
0 comments14 min readLW link

Ba­sic In­framea­sure Theory

Diffractor
27 Aug 2020 8:02 UTC
15 points
5 comments25 min readLW link

Belief Func­tions And De­ci­sion Theory

Diffractor
27 Aug 2020 8:00 UTC
7 points
0 comments39 min readLW link

Tech­ni­cal model re­fine­ment formalism

Stuart_Armstrong
27 Aug 2020 11:54 UTC
9 points
0 comments6 min readLW link

Pong from pix­els with­out read­ing “Pong from Pix­els”

naimenz
29 Aug 2020 17:26 UTC
16 points
1 comment7 min readLW link

Reflec­tions on AI Timelines Fore­cast­ing Thread

Amandango
1 Sep 2020 1:42 UTC
48 points
6 comments5 min readLW link

on “learn­ing to sum­ma­rize”

nostalgebraist
12 Sep 2020 3:20 UTC
22 points
13 comments8 min readLW link
(nostalgebraist.tumblr.com)

[Question] The uni­ver­sal­ity of com­pu­ta­tion and mind de­sign space

alanf
12 Sep 2020 14:58 UTC
1 point
7 comments1 min readLW link

Clar­ify­ing “What failure looks like” (part 1)

Sam Clarke
20 Sep 2020 20:40 UTC
49 points
10 comments17 min readLW link

An Ortho­dox Case Against Utility Functions

abramdemski
7 Apr 2020 19:18 UTC
118 points
49 comments8 min readLW link

2018 AI Align­ment Liter­a­ture Re­view and Char­ity Comparison

Larks
18 Dec 2018 4:46 UTC
195 points
26 comments62 min readLW link2 nominations1 review

Real­ism about rationality

Richard_Ngo
16 Sep 2018 10:46 UTC
178 points
139 comments4 min readLW link3 nominations3 reviews
(thinkingcomplete.blogspot.com)

De­bate on In­stru­men­tal Con­ver­gence be­tween LeCun, Rus­sell, Ben­gio, Zador, and More

Ben Pace
4 Oct 2019 4:08 UTC
186 points
49 comments15 min readLW link

The Parable of Pre­dict-O-Matic

abramdemski
15 Oct 2019 0:49 UTC
195 points
19 comments14 min readLW link

“How con­ser­va­tive” should the par­tial max­imisers be?

Stuart_Armstrong
13 Apr 2020 15:50 UTC
20 points
8 comments2 min readLW link

[AN #95]: A frame­work for think­ing about how to make AI go well

rohinmshah
15 Apr 2020 17:10 UTC
20 points
2 comments10 min readLW link
(mailchi.mp)

AI Align­ment Pod­cast: An Overview of Tech­ni­cal AI Align­ment in 2018 and 2019 with Buck Sh­legeris and Ro­hin Shah

Palus Astra
16 Apr 2020 0:50 UTC
46 points
27 comments89 min readLW link

Open ques­tion: are min­i­mal cir­cuits dae­mon-free?

paulfchristiano
5 May 2018 22:40 UTC
128 points
69 comments2 min readLW link2 nominations1 review

Disen­tan­gling ar­gu­ments for the im­por­tance of AI safety

Richard_Ngo
21 Jan 2019 12:41 UTC
127 points
23 comments8 min readLW link

[AI Align­ment Fo­rum] Database Main­te­nance Today

habryka
16 Apr 2020 19:11 UTC
9 points
0 comments1 min readLW link

In­te­grat­ing Hid­den Vari­ables Im­proves Approximation

johnswentworth
16 Apr 2020 21:43 UTC
15 points
4 comments1 min readLW link

AI Ser­vices as a Re­search Paradigm

VojtaKovarik
20 Apr 2020 13:00 UTC
27 points
12 comments4 min readLW link
(docs.google.com)

Databases of hu­man be­havi­our and prefer­ences?

Stuart_Armstrong
21 Apr 2020 18:06 UTC
10 points
9 comments1 min readLW link

Critch on ca­reer ad­vice for ju­nior AI-x-risk-con­cerned researchers

Rob Bensinger
12 May 2018 2:13 UTC
213 points
25 comments4 min readLW link

Refram­ing Impact

TurnTrout
20 Sep 2019 19:03 UTC
90 points
11 comments3 min readLW link

De­scrip­tion vs simu­lated prediction

Richard Korzekwa
22 Apr 2020 16:40 UTC
27 points
0 comments5 min readLW link
(aiimpacts.org)

Deep­Mind team on speci­fi­ca­tion gaming

JoshuaFox
23 Apr 2020 8:01 UTC
29 points
2 comments1 min readLW link
(deepmind.com)

[Question] Does Agent-like Be­hav­ior Im­ply Agent-like Ar­chi­tec­ture?

Scott Garrabrant
23 Aug 2019 2:01 UTC
45 points
7 comments1 min readLW link

Risks from Learned Op­ti­miza­tion: Con­clu­sion and Re­lated Work

7 Jun 2019 19:53 UTC
65 points
4 comments6 min readLW link

De­cep­tive Alignment

5 Jun 2019 20:16 UTC
64 points
11 comments17 min readLW link

The In­ner Align­ment Problem

4 Jun 2019 1:20 UTC
73 points
17 comments13 min readLW link

How the MtG Color Wheel Ex­plains AI Safety

Scott Garrabrant
15 Feb 2019 23:42 UTC
68 points
4 comments6 min readLW link

[Question] How does Gra­di­ent Des­cent In­ter­act with Good­hart?

Scott Garrabrant
2 Feb 2019 0:14 UTC
71 points
19 comments4 min readLW link

For­mal Open Prob­lem in De­ci­sion Theory

Scott Garrabrant
29 Nov 2018 3:25 UTC
32 points
11 comments4 min readLW link

The Ubiquitous Con­verse Law­vere Problem

Scott Garrabrant
29 Nov 2018 3:16 UTC
20 points
0 comments2 min readLW link

Embed­ded Curiosities

8 Nov 2018 14:19 UTC
87 points
1 comment2 min readLW link

Sub­sys­tem Alignment

6 Nov 2018 16:16 UTC
121 points
12 comments1 min readLW link

Ro­bust Delegation

4 Nov 2018 16:38 UTC
120 points
10 comments1 min readLW link

Embed­ded World-Models

2 Nov 2018 16:07 UTC
91 points
15 comments1 min readLW link

De­ci­sion Theory

31 Oct 2018 18:41 UTC
102 points
37 comments1 min readLW link

(A → B) → A

Scott Garrabrant
11 Sep 2018 22:38 UTC
46 points
10 comments2 min readLW link

His­tory of the Devel­op­ment of Log­i­cal Induction

Scott Garrabrant
29 Aug 2018 3:15 UTC
94 points
4 comments5 min readLW link

Op­ti­miza­tion Amplifies

Scott Garrabrant
27 Jun 2018 1:51 UTC
102 points
12 comments4 min readLW link2 nominations

What makes coun­ter­fac­tu­als com­pa­rable?

Chris_Leong
24 Apr 2020 22:47 UTC
11 points
6 comments3 min readLW link

New Paper Ex­pand­ing on the Good­hart Taxonomy

Scott Garrabrant
14 Mar 2018 9:01 UTC
50 points
4 comments1 min readLW link
(arxiv.org)

Sources of in­tu­itions and data on AGI

Scott Garrabrant
31 Jan 2018 23:30 UTC
158 points
26 comments3 min readLW link

Corrigibility

paulfchristiano
27 Nov 2018 21:50 UTC
42 points
4 comments6 min readLW link

AI pre­dic­tion case study 5: Omo­hun­dro’s AI drives

Stuart_Armstrong
15 Mar 2013 9:09 UTC
5 points
5 comments8 min readLW link

Toy model: con­ver­gent in­stru­men­tal goals

Stuart_Armstrong
25 Feb 2016 14:03 UTC
8 points
2 comments4 min readLW link

AI-cre­ated pseudo-deontology

Stuart_Armstrong
12 Feb 2015 21:11 UTC
6 points
35 comments1 min readLW link

Eth­i­cal Injunctions

Eliezer Yudkowsky
20 Oct 2008 23:00 UTC
46 points
76 comments9 min readLW link

Mo­ti­vat­ing Ab­strac­tion-First De­ci­sion Theory

johnswentworth
29 Apr 2020 17:47 UTC
41 points
16 comments5 min readLW link

[AN #97]: Are there his­tor­i­cal ex­am­ples of large, ro­bust dis­con­ti­nu­ities?

rohinmshah
29 Apr 2020 17:30 UTC
15 points
0 comments10 min readLW link
(mailchi.mp)

My Up­dat­ing Thoughts on AI policy

Ben Pace
1 Mar 2020 7:06 UTC
22 points
1 comment9 min readLW link

Use­ful Does Not Mean Secure

Ben Pace
30 Nov 2019 2:05 UTC
49 points
12 comments11 min readLW link

[Question] What is the al­ter­na­tive to in­tent al­ign­ment called?

Richard_Ngo
30 Apr 2020 2:16 UTC
10 points
6 comments1 min readLW link

Op­ti­mis­ing So­ciety to Con­strain Risk of War from an Ar­tifi­cial Su­per­in­tel­li­gence

JohnCDraper
30 Apr 2020 10:47 UTC
3 points
0 comments51 min readLW link

[Question] Juke­box: how to up­date from AI imi­tat­ing hu­mans?

Michaël Trazzi
30 Apr 2020 20:50 UTC
10 points
0 comments1 min readLW link

Stan­ford En­cy­clo­pe­dia of Philos­o­phy on AI ethics and superintelligence

Kaj_Sotala
2 May 2020 7:35 UTC
42 points
19 comments7 min readLW link
(plato.stanford.edu)

[Question] How does iter­ated am­plifi­ca­tion ex­ceed hu­man abil­ities?

riceissa
2 May 2020 23:44 UTC
21 points
9 comments2 min readLW link

How uniform is the neo­cor­tex?

zhukeepa
4 May 2020 2:16 UTC
69 points
22 comments11 min readLW link

Scott Garrabrant’s prob­lem on re­cov­er­ing Brouwer as a corol­lary of Lawvere

Rupert
4 May 2020 10:01 UTC
25 points
2 comments2 min readLW link

“AI and Effi­ciency”, OA (44✕ im­prove­ment in CNNs since 2012)

gwern
5 May 2020 16:32 UTC
48 points
0 comments1 min readLW link
(openai.com)

Com­pet­i­tive safety via gra­dated curricula

Richard_Ngo
5 May 2020 18:11 UTC
34 points
5 comments5 min readLW link

Model­ing nat­u­ral­ized de­ci­sion prob­lems in lin­ear logic

jessicata
6 May 2020 0:15 UTC
15 points
2 comments6 min readLW link
(unstableontology.com)

[AN #98]: Un­der­stand­ing neu­ral net train­ing by see­ing which gra­di­ents were helpful

rohinmshah
6 May 2020 17:10 UTC
20 points
3 comments9 min readLW link
(mailchi.mp)

[Question] Is AI safety re­search less par­alleliz­able than AI re­search?

Mati_Roy
10 May 2020 20:43 UTC
9 points
5 comments1 min readLW link

Thoughts on im­ple­ment­ing cor­rigible ro­bust alignment

steve2152
26 Nov 2019 14:06 UTC
26 points
2 comments6 min readLW link

Wire­head­ing is in the eye of the beholder

Stuart_Armstrong
30 Jan 2019 18:23 UTC
26 points
10 comments1 min readLW link

Wire­head­ing as a po­ten­tial prob­lem with the new im­pact measure

Stuart_Armstrong
25 Sep 2018 14:15 UTC
25 points
20 comments4 min readLW link

Wire­head­ing and discontinuity

Michele Campolo
18 Feb 2020 10:49 UTC
22 points
4 comments3 min readLW link

[AN #99]: Dou­bling times for the effi­ciency of AI algorithms

rohinmshah
13 May 2020 17:20 UTC
30 points
0 comments10 min readLW link
(mailchi.mp)

How should AIs up­date a prior over hu­man prefer­ences?

Stuart_Armstrong
15 May 2020 13:14 UTC
17 points
9 comments2 min readLW link

Con­jec­ture Workshop

johnswentworth
15 May 2020 22:41 UTC
34 points
2 comments2 min readLW link

Multi-agent safety

Richard_Ngo
16 May 2020 1:59 UTC
22 points
8 comments5 min readLW link

Allow­ing Ex­ploita­bil­ity in Game Theory

Liam Goddard
17 May 2020 23:19 UTC
2 points
3 comments2 min readLW link

The Mechanis­tic and Nor­ma­tive Struc­ture of Agency

G Gordon Worley III
18 May 2020 16:03 UTC
14 points
4 comments1 min readLW link
(philpapers.org)

“Star­wink” by Alicorn

Zack_M_Davis
18 May 2020 8:17 UTC
40 points
1 comment1 min readLW link
(alicorn.elcenia.com)

[AN #100]: What might go wrong if you learn a re­ward func­tion while acting

rohinmshah
20 May 2020 17:30 UTC
33 points
2 comments12 min readLW link
(mailchi.mp)

Prob­a­bil­ities, weights, sums: pretty much the same for re­ward functions

Stuart_Armstrong
20 May 2020 15:19 UTC
11 points
1 comment2 min readLW link

[Question] Source code size vs learned model size in ML and in hu­mans?

riceissa
20 May 2020 8:47 UTC
11 points
6 comments1 min readLW link

Com­par­ing re­ward learn­ing/​re­ward tam­per­ing formalisms

Stuart_Armstrong
21 May 2020 12:03 UTC
9 points
1 comment3 min readLW link

AGIs as collectives

Richard_Ngo
22 May 2020 20:36 UTC
20 points
23 comments4 min readLW link

[AN #101]: Why we should rigor­ously mea­sure and fore­cast AI progress

rohinmshah
27 May 2020 17:20 UTC
15 points
0 comments10 min readLW link
(mailchi.mp)

AI Safety Dis­cus­sion Days

Linda Linsefors
27 May 2020 16:54 UTC
11 points
1 comment3 min readLW link

Build­ing brain-in­spired AGI is in­finitely eas­ier than un­der­stand­ing the brain

steve2152
2 Jun 2020 14:13 UTC
37 points
7 comments7 min readLW link

Spar­sity and in­ter­pretabil­ity?

1 Jun 2020 13:25 UTC
41 points
3 comments7 min readLW link

GPT-3: A Summary

leogao
2 Jun 2020 18:14 UTC
20 points
0 comments1 min readLW link
(leogao.dev)

Inac­cessible information

paulfchristiano
3 Jun 2020 5:10 UTC
85 points
15 comments14 min readLW link
(ai-alignment.com)

[AN #102]: Meta learn­ing by GPT-3, and a list of full pro­pos­als for AI alignment

rohinmshah
3 Jun 2020 17:20 UTC
38 points
6 comments10 min readLW link
(mailchi.mp)

Feed­back is cen­tral to agency

alexflint
1 Jun 2020 12:56 UTC
29 points
0 comments3 min readLW link

Defin­ing AGI

lsusr
4 Jun 2020 10:59 UTC
7 points
1 comment2 min readLW link

Think­ing About Su­per-Hu­man AI: An Ex­am­i­na­tion of Likely Paths and Ul­ti­mate Constitution

meanderingmoose
4 Jun 2020 23:22 UTC
−3 points
0 comments7 min readLW link

Emer­gence and Con­trol: An ex­am­i­na­tion of our abil­ity to gov­ern the be­hav­ior of in­tel­li­gent systems

meanderingmoose
5 Jun 2020 17:10 UTC
1 point
0 comments6 min readLW link

GAN Discrim­i­na­tors Don’t Gen­er­al­ize?

tryactions
8 Jun 2020 20:36 UTC
18 points
7 comments2 min readLW link

More on dis­am­biguat­ing “dis­con­ti­nu­ity”

alenglander
9 Jun 2020 15:16 UTC
16 points
1 comment3 min readLW link

[AN #103]: ARCHES: an agenda for ex­is­ten­tial safety, and com­bin­ing nat­u­ral lan­guage with deep RL

rohinmshah
10 Jun 2020 17:20 UTC
26 points
1 comment10 min readLW link
(mailchi.mp)

Dutch-Book­ing CDT: Re­vised Argument

abramdemski
11 Jun 2020 23:34 UTC
45 points
6 comments16 min readLW link
(NIL)

[Question] List of pub­lic pre­dic­tions of what GPT-X can or can’t do?

Daniel Kokotajlo
14 Jun 2020 14:25 UTC
20 points
9 comments1 min readLW link

Achiev­ing AI al­ign­ment through de­liber­ate un­cer­tainty in mul­ti­a­gent systems

Florian Dietz
15 Jun 2020 12:19 UTC
3 points
10 comments7 min readLW link

Su­per­ex­po­nen­tial His­toric Growth, by David Roodman

Ben Pace
15 Jun 2020 21:49 UTC
43 points
6 comments5 min readLW link
(www.openphilanthropy.org)

Re­lat­ing HCH and Log­i­cal Induction

abramdemski
16 Jun 2020 22:08 UTC
49 points
4 comments5 min readLW link

Image GPT

Daniel Kokotajlo
18 Jun 2020 11:41 UTC
30 points
27 comments1 min readLW link
(openai.com)

[AN #104]: The per­ils of in­ac­cessible in­for­ma­tion, and what we can learn about AI al­ign­ment from COVID

rohinmshah
18 Jun 2020 17:10 UTC
19 points
5 comments8 min readLW link
(mailchi.mp)

[Question] If AI is based on GPT, how to en­sure its safety?

avturchin
18 Jun 2020 20:33 UTC
20 points
11 comments1 min readLW link

What’s Your Cog­ni­tive Al­gorithm?

Raemon
18 Jun 2020 22:16 UTC
69 points
23 comments13 min readLW link

Rele­vant pre-AGI possibilities

Daniel Kokotajlo
20 Jun 2020 10:52 UTC
30 points
7 comments19 min readLW link
(aiimpacts.org)

Plau­si­ble cases for HRAD work, and lo­cat­ing the crux in the “re­al­ism about ra­tio­nal­ity” debate

riceissa
22 Jun 2020 1:10 UTC
80 points
14 comments10 min readLW link

The In­dex­ing Problem

johnswentworth
22 Jun 2020 19:11 UTC
38 points
2 comments4 min readLW link

[Question] Re­quest­ing feed­back/​ad­vice: what Type The­ory to study for AI safety?

rvnnt
23 Jun 2020 17:03 UTC
7 points
4 comments3 min readLW link

Lo­cal­ity of goals

adamShimi
22 Jun 2020 21:56 UTC
15 points
8 comments6 min readLW link

[Question] What is “In­stru­men­tal Cor­rigi­bil­ity”?

joebernstein
23 Jun 2020 20:24 UTC
3 points
1 comment1 min readLW link

Models, myths, dreams, and Cheshire cat grins

Stuart_Armstrong
24 Jun 2020 10:50 UTC
21 points
7 comments2 min readLW link

[AN #105]: The eco­nomic tra­jec­tory of hu­man­ity, and what we might mean by optimization

rohinmshah
24 Jun 2020 17:30 UTC
24 points
3 comments11 min readLW link
(mailchi.mp)

There’s an Awe­some AI Ethics List and it’s a lit­tle thin

AABoyles
25 Jun 2020 13:43 UTC
13 points
1 comment1 min readLW link
(github.com)

GPT-3 Fic­tion Samples

gwern
25 Jun 2020 16:12 UTC
61 points
18 comments1 min readLW link
(www.gwern.net)

Walk­through: The Trans­former Ar­chi­tec­ture [Part 1/​2]

Matthew Barnett
30 Jul 2019 13:54 UTC
35 points
0 comments6 min readLW link

Ro­bust­ness as a Path to AI Alignment

abramdemski
10 Oct 2017 8:14 UTC
66 points
9 comments9 min readLW link

Rad­i­cal Prob­a­bil­ism [Tran­script]

26 Jun 2020 22:14 UTC
47 points
12 comments6 min readLW link

AI safety via mar­ket making

evhub
26 Jun 2020 23:07 UTC
48 points
31 comments11 min readLW link

[Question] Have gen­eral de­com­posers been for­mal­ized?

Quinn
27 Jun 2020 18:09 UTC
8 points
5 comments1 min readLW link

Gary Mar­cus vs Cor­ti­cal Uniformity

steve2152
28 Jun 2020 18:18 UTC
19 points
0 comments8 min readLW link

Web AI dis­cus­sion Groups

Donald Hobson
30 Jun 2020 11:22 UTC
10 points
0 comments2 min readLW link

Com­par­ing AI Align­ment Ap­proaches to Min­i­mize False Pos­i­tive Risk

G Gordon Worley III
30 Jun 2020 19:34 UTC
6 points
0 comments9 min readLW link

AvE: As­sis­tance via Empowerment

FactorialCode
30 Jun 2020 22:07 UTC
12 points
1 comment1 min readLW link
(arxiv.org)

Evan Hub­inger on In­ner Align­ment, Outer Align­ment, and Pro­pos­als for Build­ing Safe Ad­vanced AI

Palus Astra
1 Jul 2020 17:30 UTC
35 points
4 comments67 min readLW link

[AN #106]: Eval­u­at­ing gen­er­al­iza­tion abil­ity of learned re­ward models

rohinmshah
1 Jul 2020 17:20 UTC
14 points
2 comments11 min readLW link
(mailchi.mp)

The “AI De­bate” Debate

michaelcohen
2 Jul 2020 10:16 UTC
20 points
20 comments3 min readLW link

Idea: Imi­ta­tion/​Value Learn­ing AIXI

Zachary Robertson
3 Jul 2020 17:10 UTC
3 points
6 comments1 min readLW link

Split­ting De­bate up into Two Subsystems

Nandi
3 Jul 2020 20:11 UTC
13 points
5 comments4 min readLW link

AI Un­safety via Non-Zero-Sum Debate

VojtaKovarik
3 Jul 2020 22:03 UTC
15 points
10 comments5 min readLW link

Clas­sify­ing games like the Pri­soner’s Dilemma

philh
4 Jul 2020 17:10 UTC
73 points
22 comments6 min readLW link
(reasonableapproximation.net)

AI-Feyn­man as a bench­mark for what we should be aiming for

Faustus2
4 Jul 2020 9:24 UTC
8 points
1 comment2 min readLW link

Learn­ing the prior

paulfchristiano
5 Jul 2020 21:00 UTC
75 points
18 comments8 min readLW link
(ai-alignment.com)

Bet­ter pri­ors as a safety problem

paulfchristiano
5 Jul 2020 21:20 UTC
64 points
7 comments5 min readLW link
(ai-alignment.com)

[Question] How far is AGI?

Roko Jelavić
5 Jul 2020 17:58 UTC
6 points
5 comments1 min readLW link

Clas­sify­ing speci­fi­ca­tion prob­lems as var­i­ants of Good­hart’s Law

19 Aug 2019 20:40 UTC
71 points
2 comments5 min readLW link

New safety re­search agenda: scal­able agent al­ign­ment via re­ward modeling

Vika
20 Nov 2018 17:29 UTC
35 points
13 comments1 min readLW link
(medium.com)

De­sign­ing agent in­cen­tives to avoid side effects

11 Mar 2019 20:55 UTC
31 points
0 comments2 min readLW link
(medium.com)

Dis­cus­sion on the ma­chine learn­ing ap­proach to AI safety

Vika
1 Nov 2018 20:54 UTC
28 points
3 comments4 min readLW link

Speci­fi­ca­tion gam­ing ex­am­ples in AI

Vika
3 Apr 2018 12:30 UTC
82 points
9 comments1 min readLW link2 nominations2 reviews

[Question] (an­swered: yes) Has any­one writ­ten up a con­sid­er­a­tion of Downs’s “Para­dox of Vot­ing” from the per­spec­tive of MIRI-ish de­ci­sion the­o­ries (UDT, FDT, or even just EDT)?

Jameson Quinn
6 Jul 2020 18:26 UTC
10 points
24 comments1 min readLW link

New Deep­Mind AI Safety Re­search Blog

Vika
27 Sep 2018 16:28 UTC
46 points
0 comments1 min readLW link
(medium.com)

Con­test: $1,000 for good ques­tions to ask to an Or­a­cle AI

Stuart_Armstrong
31 Jul 2019 18:48 UTC
69 points
155 comments3 min readLW link

De­con­fus­ing Hu­man Values Re­search Agenda v1

G Gordon Worley III
23 Mar 2020 16:25 UTC
18 points
12 comments4 min readLW link

[Question] How “hon­est” is GPT-3?

abramdemski
8 Jul 2020 19:38 UTC
72 points
18 comments5 min readLW link

What does it mean to ap­ply de­ci­sion the­ory?

abramdemski
8 Jul 2020 20:31 UTC
42 points
5 comments8 min readLW link

AI Re­search Con­sid­er­a­tions for Hu­man Ex­is­ten­tial Safety (ARCHES)

habryka
9 Jul 2020 2:49 UTC
57 points
7 comments1 min readLW link
(arxiv.org)

The Un­rea­son­able Effec­tive­ness of Deep Learning

Richard_Ngo
30 Sep 2018 15:48 UTC
88 points
5 comments13 min readLW link
(thinkingcomplete.blogspot.com)

mAIry’s room: AI rea­son­ing to solve philo­soph­i­cal problems

Stuart_Armstrong
5 Mar 2019 20:24 UTC
66 points
30 comments6 min readLW link

Failures of an em­bod­ied AIXI

So8res
15 Jun 2014 18:29 UTC
29 points
46 comments12 min readLW link

The Prob­lem with AIXI

Rob Bensinger
18 Mar 2014 1:55 UTC
29 points
78 comments23 min readLW link

Ver­sions of AIXI can be ar­bi­trar­ily stupid

Stuart_Armstrong
10 Aug 2015 13:23 UTC
23 points
59 comments1 min readLW link

Reflec­tive AIXI and Anthropics

Diffractor
24 Sep 2018 2:15 UTC
19 points
13 comments8 min readLW link

AIXI and Ex­is­ten­tial Despair

paulfchristiano
8 Dec 2011 20:03 UTC
16 points
38 comments6 min readLW link

How to make AIXI-tl in­ca­pable of learning

itaibn0
27 Jan 2014 0:05 UTC
4 points
5 comments2 min readLW link

Help re­quest: What is the Kol­mogorov com­plex­ity of com­putable ap­prox­i­ma­tions to AIXI?

AnnaSalamon
5 Dec 2010 10:23 UTC
4 points
9 comments1 min readLW link

“AIXIjs: A Soft­ware Demo for Gen­eral Re­in­force­ment Learn­ing”, As­lanides 2017

gwern
29 May 2017 21:09 UTC
4 points
1 comment1 min readLW link
(arxiv.org)

Can AIXI be trained to do any­thing a hu­man can?

Stuart_Armstrong
20 Oct 2014 13:12 UTC
3 points
9 comments2 min readLW link

Shap­ing eco­nomic in­cen­tives for col­lab­o­ra­tive AGI

Kaj_Sotala
29 Jun 2018 16:26 UTC
47 points
15 comments4 min readLW link

Is the Star Trek Fed­er­a­tion re­ally in­ca­pable of build­ing AI?

Kaj_Sotala
18 Mar 2018 10:30 UTC
30 points
4 comments2 min readLW link
(kajsotala.fi)

Some con­cep­tual high­lights from “Disjunc­tive Sce­nar­ios of Catas­trophic AI Risk”

Kaj_Sotala
12 Feb 2018 12:30 UTC
68 points
4 comments6 min readLW link
(kajsotala.fi)

Mis­con­cep­tions about con­tin­u­ous takeoff

Matthew Barnett
8 Oct 2019 21:31 UTC
75 points
36 comments4 min readLW link

Dist­in­guish­ing defi­ni­tions of takeoff

Matthew Barnett
14 Feb 2020 0:16 UTC
46 points
6 comments6 min readLW link

Book re­view: Ar­tifi­cial In­tel­li­gence Safety and Security

PeterMcCluskey
8 Dec 2018 3:47 UTC
30 points
3 comments8 min readLW link
(www.bayesianinvestor.com)

Why AI may not foom

John_Maxwell
24 Mar 2013 8:11 UTC
23 points
81 comments12 min readLW link

Hu­mans Who Are Not Con­cen­trat­ing Are Not Gen­eral Intelligences

sarahconstantin
25 Feb 2019 20:40 UTC
137 points
29 comments6 min readLW link
(srconstantin.wordpress.com)

The Hacker Learns to Trust

Ben Pace
22 Jun 2019 0:27 UTC
82 points
18 comments8 min readLW link
(medium.com)

Book Re­view: Hu­man Compatible

Scott Alexander
31 Jan 2020 5:20 UTC
77 points
6 comments16 min readLW link
(slatestarcodex.com)

SSC Jour­nal Club: AI Timelines

Scott Alexander
8 Jun 2017 19:00 UTC
7 points
1 comment8 min readLW link

Ar­gu­ments against my­opic training

Richard_Ngo
9 Jul 2020 16:07 UTC
50 points
36 comments12 min readLW link

On mo­ti­va­tions for MIRI’s highly re­li­able agent de­sign research

jessicata
29 Jan 2017 19:34 UTC
19 points
1 comment5 min readLW link

Why is the im­pact penalty time-in­con­sis­tent?

Stuart_Armstrong
9 Jul 2020 17:26 UTC
16 points
1 comment2 min readLW link

My cur­rent take on the Paul-MIRI dis­agree­ment on al­ignabil­ity of messy AI

jessicata
29 Jan 2017 20:52 UTC
17 points
0 comments10 min readLW link

Ben Go­ertzel: The Sin­gu­lar­ity In­sti­tute’s Scary Idea (and Why I Don’t Buy It)

Paul Crowley
30 Oct 2010 9:31 UTC
34 points
442 comments1 min readLW link

An An­a­lytic Per­spec­tive on AI Alignment

DanielFilan
1 Mar 2020 4:10 UTC
54 points
45 comments8 min readLW link
(danielfilan.com)

Mechanis­tic Trans­parency for Ma­chine Learning

DanielFilan
11 Jul 2018 0:34 UTC
55 points
9 comments4 min readLW link

A model I use when mak­ing plans to re­duce AI x-risk

Ben Pace
19 Jan 2018 0:21 UTC
143 points
41 comments6 min readLW link

AI Re­searchers On AI Risk

Scott Alexander
22 May 2015 11:16 UTC
10 points
0 comments16 min readLW link

Mini ad­vent cal­en­dar of Xrisks: Ar­tifi­cial Intelligence

Stuart_Armstrong
7 Dec 2012 11:26 UTC
3 points
5 comments1 min readLW link

For FAI: Is “Molec­u­lar Nan­otech­nol­ogy” putting our best foot for­ward?

leplen
22 Jun 2013 4:44 UTC
60 points
118 comments3 min readLW link

UFAI can­not be the Great Filter

Thrasymachus
22 Dec 2012 11:26 UTC
43 points
92 comments3 min readLW link

Don’t Fear The Filter

Scott Alexander
29 May 2014 0:45 UTC
6 points
5 comments6 min readLW link

The Great Filter is early, or AI is hard

Stuart_Armstrong
29 Aug 2014 16:17 UTC
19 points
76 comments1 min readLW link

Talk: Key Is­sues In Near-Term AI Safety Research

alenglander
10 Jul 2020 18:36 UTC
24 points
1 comment1 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

steve2152
10 Jul 2020 16:49 UTC
25 points
5 comments8 min readLW link

AlphaS­tar: Im­pres­sive for RL progress, not for AGI progress

orthonormal
2 Nov 2019 1:50 UTC
118 points
53 comments2 min readLW link

The Catas­trophic Con­ver­gence Conjecture

TurnTrout
14 Feb 2020 21:16 UTC
40 points
13 comments8 min readLW link

[Question] How well can the GPT ar­chi­tec­ture solve the par­ity task?

FactorialCode
11 Jul 2020 19:02 UTC
18 points
3 comments1 min readLW link

Sun­day July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stu­art_Armstrong

8 Jul 2020 0:27 UTC
19 points
2 comments1 min readLW link

[Link] Word-vec­tor based DL sys­tem achieves hu­man par­ity in ver­bal IQ tests

jacob_cannell
13 Jun 2015 23:38 UTC
8 points
8 comments1 min readLW link

The Power of Intelligence

Eliezer Yudkowsky
1 Jan 2007 20:00 UTC
31 points
3 comments4 min readLW link

Refram­ing Su­per­in­tel­li­gence: Com­pre­hen­sive AI Ser­vices as Gen­eral Intelligence

rohinmshah
8 Jan 2019 7:12 UTC
97 points
70 comments5 min readLW link
(www.fhi.ox.ac.uk)

Com­ments on CAIS

Richard_Ngo
12 Jan 2019 15:20 UTC
75 points
12 comments7 min readLW link

[Question] What are CAIS’ bold­est near/​medium-term pre­dic­tions?

jacobjacob
28 Mar 2019 13:14 UTC
35 points
17 comments1 min readLW link

Drexler on AI Risk

PeterMcCluskey
1 Feb 2019 5:11 UTC
34 points
10 comments9 min readLW link
(www.bayesianinvestor.com)

Six AI Risk/​Strat­egy Ideas

Wei_Dai
27 Aug 2019 0:40 UTC
63 points
15 comments4 min readLW link

New re­port: In­tel­li­gence Ex­plo­sion Microeconomics

Eliezer Yudkowsky
29 Apr 2013 23:14 UTC
45 points
251 comments3 min readLW link

Book re­view: Hu­man Compatible

PeterMcCluskey
19 Jan 2020 3:32 UTC
39 points
2 comments5 min readLW link
(www.bayesianinvestor.com)

Thoughts on “Hu­man-Com­pat­i­ble”

TurnTrout
10 Oct 2019 5:24 UTC
60 points
35 comments5 min readLW link

Book Re­view: The AI Does Not Hate You

PeterMcCluskey
28 Oct 2019 17:45 UTC
28 points
0 comments5 min readLW link
(www.bayesianinvestor.com)

[Link] Book Re­view: ‘The AI Does Not Hate You’ by Tom Chivers (Scott Aaron­son)

eigen
7 Oct 2019 18:16 UTC
22 points
0 comments1 min readLW link

Book Re­view: Life 3.0: Be­ing Hu­man in the Age of Ar­tifi­cial Intelligence

J_Thomas_Moros
18 Jan 2018 17:18 UTC
16 points
0 comments1 min readLW link
(ferocioustruth.com)

Book Re­view: Weapons of Math Destruction

Zvi
4 Jun 2017 21:20 UTC
1 point
0 comments16 min readLW link

DARPA Digi­tal Tu­tor: Four Months to To­tal Tech­ni­cal Ex­per­tise?

JohnBuridan
6 Jul 2020 23:34 UTC
149 points
13 comments7 min readLW link

Paper: Su­per­in­tel­li­gence as a Cause or Cure for Risks of Astro­nom­i­cal Suffering

Kaj_Sotala
3 Jan 2018 14:39 UTC
1 point
6 comments1 min readLW link
(www.informatica.si)

Prevent­ing s-risks via in­dex­i­cal un­cer­tainty, acausal trade and dom­i­na­tion in the multiverse

avturchin
27 Sep 2018 10:09 UTC
7 points
3 comments4 min readLW link

[Link] Suffer­ing-fo­cused AI safety: Why “fail-safe” mea­sures might be par­tic­u­larly promis­ing

David Althaus
21 Jul 2016 20:22 UTC
9 points
5 comments1 min readLW link

Pre­face to CLR’s Re­search Agenda on Co­op­er­a­tion, Con­flict, and TAI

JesseClifton
13 Dec 2019 21:02 UTC
55 points
8 comments2 min readLW link

Sec­tions 1 & 2: In­tro­duc­tion, Strat­egy and Governance

JesseClifton
17 Dec 2019 21:27 UTC
35 points
5 comments14 min readLW link

Sec­tions 3 & 4: Cred­i­bil­ity, Peace­ful Bar­gain­ing Mechanisms

JesseClifton
17 Dec 2019 21:46 UTC
21 points
2 comments12 min readLW link

Sec­tions 5 & 6: Con­tem­po­rary Ar­chi­tec­tures, Hu­mans in the Loop

JesseClifton
20 Dec 2019 3:52 UTC
29 points
4 comments10 min readLW link

Sec­tion 7: Foun­da­tions of Ra­tional Agency

JesseClifton
22 Dec 2019 2:05 UTC
16 points
3 comments8 min readLW link

What counts as defec­tion?

TurnTrout
12 Jul 2020 22:03 UTC
84 points
20 comments4 min readLW link

The “Com­mit­ment Races” problem

Daniel Kokotajlo
23 Aug 2019 1:58 UTC
76 points
16 comments5 min readLW link

Align­ment Newslet­ter #36

rohinmshah
12 Dec 2018 1:10 UTC
22 points
0 comments11 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #47

rohinmshah
4 Mar 2019 4:30 UTC
21 points
0 comments8 min readLW link
(mailchi.mp)

Un­der­stand­ing “Deep Dou­ble Des­cent”

evhub
6 Dec 2019 0:00 UTC
110 points
30 comments5 min readLW link

My cur­rent frame­work for think­ing about AGI timelines

zhukeepa
30 Mar 2020 1:23 UTC
104 points
5 comments3 min readLW link

[LINK] Strong AI Startup Raises $15M

olalonde
21 Aug 2012 20:47 UTC
17 points
13 comments1 min readLW link

An­nounc­ing the AI Align­ment Prize

cousin_it
3 Nov 2017 15:47 UTC
156 points
78 comments1 min readLW link

I’m leav­ing AI al­ign­ment – you bet­ter stay

rmoehn
12 Mar 2020 5:58 UTC
142 points
17 comments4 min readLW link

New pa­per: AGI Agent Safety by Iter­a­tively Im­prov­ing the Utility Function

Koen.Holtman
15 Jul 2020 14:05 UTC
26 points
2 comments6 min readLW link

[Question] How should AI de­bate be judged?

abramdemski
15 Jul 2020 22:20 UTC
48 points
27 comments6 min readLW link

Align­ment pro­pos­als and com­plex­ity classes

evhub
16 Jul 2020 0:27 UTC
48 points
26 comments13 min readLW link

[AN #107]: The con­ver­gent in­stru­men­tal sub­goals of goal-di­rected agents

rohinmshah
16 Jul 2020 6:47 UTC
13 points
1 comment8 min readLW link
(mailchi.mp)

[AN #108]: Why we should scru­ti­nize ar­gu­ments for AI risk

rohinmshah
16 Jul 2020 6:47 UTC
19 points
6 comments12 min readLW link
(mailchi.mp)

En­vi­ron­ments as a bot­tle­neck in AGI development

Richard_Ngo
17 Jul 2020 5:02 UTC
31 points
18 comments6 min readLW link

[Question] Can an agent use in­ter­ac­tive proofs to check the al­ign­ment of suc­ce­sors?

PabloAMC
17 Jul 2020 19:07 UTC
7 points
2 comments1 min readLW link

Les­sons on AI Takeover from the conquistadors

17 Jul 2020 22:35 UTC
53 points
30 comments5 min readLW link

What Would I Do? Self-pre­dic­tion in Sim­ple Algorithms

Scott Garrabrant
20 Jul 2020 4:27 UTC
52 points
13 comments5 min readLW link

Wri­teup: Progress on AI Safety via Debate

5 Feb 2020 21:04 UTC
95 points
17 comments33 min readLW link

Oper­a­tional­iz­ing Interpretability

lifelonglearner
20 Jul 2020 5:22 UTC
21 points
0 comments4 min readLW link

Learn­ing Values in Practice

Stuart_Armstrong
20 Jul 2020 18:38 UTC
24 points
0 comments5 min readLW link

Par­allels Between AI Safety by De­bate and Ev­i­dence Law

Cullen_OKeefe
20 Jul 2020 22:52 UTC
10 points
1 comment2 min readLW link
(cullenokeefe.com)

The Redis­cov­ery of In­te­ri­or­ity in Ma­chine Learning

DanB
21 Jul 2020 5:02 UTC
5 points
4 comments1 min readLW link
(danburfoot.net)

The “AI Dun­geons” Dragon Model is heav­ily path de­pen­dent (test­ing GPT-3 on ethics)

Rafael Harth
21 Jul 2020 12:14 UTC
47 points
9 comments6 min readLW link

How good is hu­man­ity at co­or­di­na­tion?

Buck
21 Jul 2020 20:01 UTC
72 points
43 comments3 min readLW link

Align­ment As A Bot­tle­neck To Use­ful­ness Of GPT-3

johnswentworth
21 Jul 2020 20:02 UTC
93 points
57 comments3 min readLW link

$1000 bounty for OpenAI to show whether GPT3 was “de­liber­ately” pre­tend­ing to be stupi­der than it is

jacobjacob
21 Jul 2020 18:42 UTC
50 points
40 comments2 min readLW link
(twitter.com)

[Preprint] The Com­pu­ta­tional Limits of Deep Learning

G Gordon Worley III
21 Jul 2020 21:25 UTC
9 points
1 comment1 min readLW link
(arxiv.org)

[AN #109]: Teach­ing neu­ral nets to gen­er­al­ize the way hu­mans would

rohinmshah
22 Jul 2020 17:10 UTC
17 points
3 comments9 min readLW link
(mailchi.mp)

Re­search agenda for AI safety and a bet­ter civilization

agilecaveman
22 Jul 2020 6:35 UTC
13 points
2 comments16 min readLW link

Weak HCH ac­cesses EXP

evhub
22 Jul 2020 22:36 UTC
14 points
0 comments3 min readLW link

GPT-3 Gems

TurnTrout
23 Jul 2020 0:46 UTC
26 points
7 comments41 min readLW link

Op­ti­miz­ing ar­bi­trary ex­pres­sions with a lin­ear num­ber of queries to a Log­i­cal In­duc­tion Or­a­cle (Car­toon Guide)

Donald Hobson
23 Jul 2020 21:37 UTC
3 points
2 comments2 min readLW link

[Question] Con­struct a port­fo­lio to profit from AI progress.

deluks917
25 Jul 2020 8:18 UTC
28 points
13 comments1 min readLW link

Think­ing soberly about the con­text and con­se­quences of Friendly AI

Mitchell_Porter
16 Oct 2012 4:33 UTC
12 points
39 comments1 min readLW link

Goal re­ten­tion dis­cus­sion with Eliezer

MaxTegmark
4 Sep 2014 22:23 UTC
61 points
26 comments6 min readLW link

[Question] Where do peo­ple dis­cuss do­ing things with GPT-3?

skybrian
26 Jul 2020 14:31 UTC
3 points
7 comments1 min readLW link

You Can Prob­a­bly Am­plify GPT3 Directly

Zachary Robertson
26 Jul 2020 21:58 UTC
35 points
14 comments6 min readLW link

[up­dated] how does gpt2′s train­ing cor­pus cap­ture in­ter­net dis­cus­sion? not well

nostalgebraist
27 Jul 2020 22:30 UTC
24 points
3 comments2 min readLW link
(nostalgebraist.tumblr.com)

Agen­tic Lan­guage Model Memes

FactorialCode
1 Aug 2020 18:03 UTC
11 points
1 comment2 min readLW link

A com­mu­nity-cu­rated repos­i­tory of in­ter­est­ing GPT-3 stuff

Rudi C
28 Jul 2020 14:16 UTC
8 points
0 comments1 min readLW link
(github.com)

[Question] Does the lot­tery ticket hy­poth­e­sis sug­gest the scal­ing hy­poth­e­sis?

Daniel Kokotajlo
28 Jul 2020 19:52 UTC
13 points
2 comments1 min readLW link

[Question] To what ex­tent are the scal­ing prop­er­ties of Trans­former net­works ex­cep­tional?

abramdemski
28 Jul 2020 20:06 UTC
29 points
1 comment1 min readLW link

[Question] What hap­pens to var­i­ance as neu­ral net­work train­ing is scaled? What does it im­ply about “lot­tery tick­ets”?

abramdemski
28 Jul 2020 20:22 UTC
25 points
2 comments1 min readLW link

[Question] How will in­ter­net fo­rums like LW be able to defend against GPT-style spam?

ChristianKl
28 Jul 2020 20:12 UTC
14 points
18 comments1 min readLW link

Pre­dic­tions for GPT-N

hippke
29 Jul 2020 1:16 UTC
37 points
31 comments1 min readLW link

An­nounce­ment: AI al­ign­ment prize win­ners and next round

cousin_it
15 Jan 2018 14:33 UTC
167 points
68 comments2 min readLW link

Jeff Hawk­ins on neu­ro­mor­phic AGI within 20 years

steve2152
15 Jul 2019 19:16 UTC
164 points
14 comments12 min readLW link

Cas­cades, Cy­cles, In­sight...

Eliezer Yudkowsky
24 Nov 2008 9:33 UTC
17 points
31 comments8 min readLW link

...Re­cur­sion, Magic

Eliezer Yudkowsky
25 Nov 2008 9:10 UTC
16 points
28 comments5 min readLW link

Refer­ences & Re­sources for LessWrong

XiXiDu
10 Oct 2010 14:54 UTC
124 points
106 comments20 min readLW link

[Question] A game de­signed to beat AI?

Long try
17 Mar 2020 3:51 UTC
13 points
29 comments1 min readLW link

Ma­chine Learn­ing Can’t Han­dle Long-Term Time-Series Data

lsusr
5 Jan 2020 3:43 UTC
2 points
10 comments5 min readLW link

Truly Part Of You

Eliezer Yudkowsky
21 Nov 2007 2:18 UTC
98 points
56 comments4 min readLW link

[AN #110]: Learn­ing fea­tures from hu­man feed­back to en­able re­ward learning

rohinmshah
29 Jul 2020 17:20 UTC
13 points
2 comments10 min readLW link
(mailchi.mp)

Struc­tured Tasks for Lan­guage Models

Zachary Robertson
29 Jul 2020 14:17 UTC
5 points
0 comments1 min readLW link

En­gag­ing Se­ri­ously with Short Timelines

deluks917
29 Jul 2020 19:21 UTC
42 points
22 comments3 min readLW link

What Failure Looks Like: Distill­ing the Discussion

Ben Pace
29 Jul 2020 21:49 UTC
67 points
11 comments7 min readLW link

Learn­ing the prior and generalization

evhub
29 Jul 2020 22:49 UTC
33 points
10 comments4 min readLW link

[Question] Is the work on AI al­ign­ment rele­vant to GPT?

Richard_Kennaway
30 Jul 2020 12:23 UTC
12 points
5 comments1 min readLW link

Ver­ifi­ca­tion and Transparency

DanielFilan
8 Aug 2019 1:50 UTC
37 points
6 comments2 min readLW link
(danielfilan.com)

Robin Han­son on Lump­iness of AI Services

DanielFilan
17 Feb 2019 23:08 UTC
16 points
2 comments2 min readLW link
(www.overcomingbias.com)

One Way to Think About ML Transparency

Matthew Barnett
2 Sep 2019 23:27 UTC
26 points
28 comments5 min readLW link

What is In­ter­pretabil­ity?

17 Mar 2020 20:23 UTC
31 points
0 comments11 min readLW link

Re­laxed ad­ver­sar­ial train­ing for in­ner alignment

evhub
10 Sep 2019 23:03 UTC
45 points
10 comments27 min readLW link

Con­clu­sion to ‘Refram­ing Im­pact’

TurnTrout
28 Feb 2020 16:05 UTC
43 points
17 comments2 min readLW link

Bayesian Evolv­ing-to-Extinction

abramdemski
14 Feb 2020 23:55 UTC
39 points
13 comments5 min readLW link

Do Suffi­ciently Ad­vanced Agents Use Logic?

abramdemski
13 Sep 2019 19:53 UTC
41 points
11 comments9 min readLW link

World State is the Wrong Ab­strac­tion for Impact

TurnTrout
1 Oct 2019 21:03 UTC
61 points
17 comments2 min readLW link

At­tain­able Utility Preser­va­tion: Concepts

TurnTrout
17 Feb 2020 5:20 UTC
40 points
18 comments1 min readLW link

At­tain­able Utility Preser­va­tion: Em­piri­cal Results

22 Feb 2020 0:38 UTC
48 points
7 comments9 min readLW link

How Low Should Fruit Hang Be­fore We Pick It?

TurnTrout
25 Feb 2020 2:08 UTC
28 points
9 comments12 min readLW link

At­tain­able Utility Preser­va­tion: Scal­ing to Superhuman

TurnTrout
27 Feb 2020 0:52 UTC
26 points
20 comments8 min readLW link

Rea­sons for Ex­cite­ment about Im­pact of Im­pact Mea­sure Research

TurnTrout
27 Feb 2020 21:42 UTC
29 points
8 comments4 min readLW link

Power as Easily Ex­ploitable Opportunities

TurnTrout
1 Aug 2020 2:14 UTC
26 points
5 comments6 min readLW link

[Question] Would AGIs par­ent young AGIs?

Vishrut Arya
2 Aug 2020 0:57 UTC
3 points
6 comments1 min readLW link

If I were a well-in­ten­tioned AI… I: Image classifier

Stuart_Armstrong
26 Feb 2020 12:39 UTC
35 points
4 comments5 min readLW link

Non-Con­se­quen­tial­ist Co­op­er­a­tion?

abramdemski
11 Jan 2019 9:15 UTC
48 points
15 comments7 min readLW link

Cu­ri­os­ity Killed the Cat and the Asymp­tot­i­cally Op­ti­mal Agent

michaelcohen
20 Feb 2020 17:28 UTC
28 points
15 comments1 min readLW link

If I were a well-in­ten­tioned AI… IV: Mesa-optimising

Stuart_Armstrong
2 Mar 2020 12:16 UTC
26 points
2 comments6 min readLW link

Re­sponse to Oren Etz­ioni’s “How to know if ar­tifi­cial in­tel­li­gence is about to de­stroy civ­i­liza­tion”

Daniel Kokotajlo
27 Feb 2020 18:10 UTC
29 points
5 comments8 min readLW link

Clar­ify­ing Power-Seek­ing and In­stru­men­tal Convergence

TurnTrout
20 Dec 2019 19:59 UTC
42 points
7 comments3 min readLW link

How im­por­tant are MDPs for AGI (Safety)?

michaelcohen
26 Mar 2020 20:32 UTC
14 points
8 comments2 min readLW link

Syn­the­siz­ing am­plifi­ca­tion and debate

evhub
5 Feb 2020 22:53 UTC
39 points
10 comments4 min readLW link

is gpt-3 few-shot ready for real ap­pli­ca­tions?

nostalgebraist
3 Aug 2020 19:50 UTC
31 points
5 comments9 min readLW link
(nostalgebraist.tumblr.com)

In­ter­pretabil­ity in ML: A Broad Overview

lifelonglearner
4 Aug 2020 19:03 UTC
32 points
5 comments15 min readLW link

In­finite Data/​Com­pute Ar­gu­ments in Alignment

johnswentworth
4 Aug 2020 20:21 UTC
42 points
6 comments2 min readLW link

Four Ways An Impact Measure Could Help Alignment

Matthew Barnett
8 Aug 2019 0:10 UTC
21 points
1 comment8 min readLW link

Understanding Recent Impact Measures

Matthew Barnett
7 Aug 2019 4:57 UTC
17 points
6 comments7 min readLW link

A Survey of Early Impact Measures

Matthew Barnett
6 Aug 2019 1:22 UTC
24 points
0 comments8 min readLW link

Optimization Regularization through Time Penalty

Linda Linsefors
1 Jan 2019 13:05 UTC
12 points
4 comments3 min readLW link

Stable Pointers to Value III: Recursive Quantilization

abramdemski
21 Jul 2018 8:06 UTC
20 points
4 comments4 min readLW link

Thoughts on Quantilizers

Stuart_Armstrong
2 Jun 2017 16:24 UTC
2 points
0 comments2 min readLW link

Quantilizers maximize expected utility subject to a conservative cost constraint

jessicata
28 Sep 2015 2:17 UTC
3 points
0 comments5 min readLW link

Quantilal control for finite MDPs

Vanessa Kosoy
12 Apr 2018 9:21 UTC
3 points
0 comments13 min readLW link

The limits of corrigibility

Stuart_Armstrong
10 Apr 2018 10:49 UTC
44 points
9 comments4 min readLW link

Alignment Newsletter #16: 07/23/18

rohinmshah
23 Jul 2018 16:20 UTC
44 points
0 comments12 min readLW link
(mailchi.mp)

Measuring hardware overhang

hippke
5 Aug 2020 19:59 UTC
43 points
6 comments4 min readLW link

[AN #111]: The Circuits hypotheses for deep learning

rohinmshah
5 Aug 2020 17:40 UTC
23 points
0 comments9 min readLW link
(mailchi.mp)

Self-Fulfilling Prophecies Aren’t Always About Self-Awareness

John_Maxwell
18 Nov 2019 23:11 UTC
15 points
7 comments4 min readLW link

The Goodhart Game

John_Maxwell
18 Nov 2019 23:22 UTC
12 points
5 comments5 min readLW link

Why don’t singularitarians bet on the creation of AGI by buying stocks?

John_Maxwell
11 Mar 2020 16:27 UTC
33 points
19 comments4 min readLW link

The Dualist Predict-O-Matic ($100 prize)

John_Maxwell
17 Oct 2019 6:45 UTC
17 points
35 comments5 min readLW link

[Question] What AI safety problems need solving for safe AI research assistants?

John_Maxwell
5 Nov 2019 2:09 UTC
15 points
13 comments1 min readLW link

Refining the Evolutionary Analogy to AI

brglnd
7 Aug 2020 23:13 UTC
9 points
2 comments4 min readLW link

The Fusion Power Generator Scenario

johnswentworth
8 Aug 2020 18:31 UTC
101 points
23 comments3 min readLW link

[Question] How much is known about the “inference rules” of logical induction?

Eigil Rischel
8 Aug 2020 10:45 UTC
10 points
7 comments1 min readLW link

If I were a well-intentioned AI… II: Acting in a world

Stuart_Armstrong
27 Feb 2020 11:58 UTC
20 points
0 comments3 min readLW link

If I were a well-intentioned AI… III: Extremal Goodhart

Stuart_Armstrong
28 Feb 2020 11:24 UTC
20 points
0 comments5 min readLW link

Towards a Formalisation of Logical Counterfactuals

Bunthut
8 Aug 2020 22:14 UTC
6 points
2 comments2 min readLW link

[Question] 10/50/90% chance of GPT-N Transformative AI?

human_generated_text
9 Aug 2020 0:10 UTC
25 points
8 comments1 min readLW link

[Question] Can we expect more value from AI alignment than from an ASI with the goal of running alternate trajectories of our universe?

Maxime Riché
9 Aug 2020 17:17 UTC
1 point
5 comments1 min readLW link

In defense of Oracle (“Tool”) AI research

steve2152
7 Aug 2019 19:14 UTC
20 points
11 comments4 min readLW link

How GPT-N will escape from its AI-box

hippke
12 Aug 2020 19:34 UTC
8 points
9 comments1 min readLW link

Strong implication of preference uncertainty

Stuart_Armstrong
12 Aug 2020 19:02 UTC
20 points
3 comments2 min readLW link

[AN #112]: Engineering a Safer World

rohinmshah
13 Aug 2020 17:20 UTC
22 points
1 comment12 min readLW link
(mailchi.mp)

Room and Board for People Self-Learning ML or Doing Independent ML Research

SamuelKnoche
14 Aug 2020 17:19 UTC
7 points
1 comment1 min readLW link

Talk and Q&A—Dan Hendrycks—Paper: Aligning AI With Shared Human Values. On Discord at Aug 28, 2020 8:00-10:00 AM GMT+8.

wassname
14 Aug 2020 23:57 UTC
1 point
0 comments1 min readLW link

Search versus design

alexflint
16 Aug 2020 16:53 UTC
78 points
39 comments36 min readLW link

Work on Security Instead of Friendliness?

Wei_Dai
21 Jul 2012 18:28 UTC
37 points
107 comments2 min readLW link

Goal-Directedness: What Success Looks Like

adamShimi
16 Aug 2020 18:33 UTC
9 points
0 comments2 min readLW link

[Question] A way to beat superrational/EDT agents?

Abhimanyu Pallavi Sudhir
17 Aug 2020 14:33 UTC
6 points
13 comments1 min readLW link

Learning human preferences: optimistic and pessimistic scenarios

Stuart_Armstrong
18 Aug 2020 13:05 UTC
26 points
6 comments6 min readLW link

Mesa-Search vs Mesa-Control

abramdemski
18 Aug 2020 18:51 UTC
51 points
41 comments7 min readLW link

Why we want unbiased learning processes

Stuart_Armstrong
20 Feb 2018 14:48 UTC
37 points
3 comments3 min readLW link

Intuitive examples of reward function learning?

Stuart_Armstrong
6 Mar 2018 16:54 UTC
22 points
3 comments2 min readLW link

Open-Category Classification

TurnTrout
28 Mar 2018 14:49 UTC
36 points
6 comments10 min readLW link

Looking for adversarial collaborators to test our Debate protocol

Beth Barnes
19 Aug 2020 3:15 UTC
52 points
5 comments1 min readLW link

Walkthrough of ‘Formalizing Convergent Instrumental Goals’

TurnTrout
26 Feb 2018 2:20 UTC
27 points
2 comments10 min readLW link

Ambiguity Detection

TurnTrout
1 Mar 2018 4:23 UTC
33 points
9 comments4 min readLW link

Penalizing Impact via Attainable Utility Preservation

TurnTrout
28 Dec 2018 21:46 UTC
26 points
0 comments3 min readLW link
(arxiv.org)

What You See Isn’t Always What You Want

TurnTrout
13 Sep 2019 4:17 UTC
30 points
12 comments3 min readLW link

[Question] Instrumental Occam?

abramdemski
31 Jan 2020 19:27 UTC
31 points
15 comments1 min readLW link

Compact vs. Wide Models

Vaniver
16 Jul 2018 4:09 UTC
32 points
5 comments3 min readLW link

Alex Irpan: “My AI Timelines Have Sped Up”

Vaniver
19 Aug 2020 16:23 UTC
44 points
20 comments1 min readLW link
(www.alexirpan.com)

[AN #113]: Checking the ethical intuitions of large language models

rohinmshah
19 Aug 2020 17:10 UTC
23 points
0 comments9 min readLW link
(mailchi.mp)

AI safety as featherless bipeds *with broad flat nails*

Stuart_Armstrong
19 Aug 2020 10:22 UTC
35 points
1 comment1 min readLW link

Time Magazine has an article about the Singularity...

Raemon
11 Feb 2011 2:20 UTC
28 points
13 comments1 min readLW link

How rapidly are GPUs improving in price performance?

gallabytes
25 Nov 2018 19:54 UTC
32 points
9 comments1 min readLW link
(mediangroup.org)

Our values are underdefined, changeable, and manipulable

Stuart_Armstrong
2 Nov 2017 11:09 UTC
27 points
6 comments3 min readLW link

[Question] What funding sources exist for technical AI safety research?

johnswentworth
1 Oct 2019 15:30 UTC
27 points
5 comments1 min readLW link

Humans can drive cars

Apprentice
30 Jan 2014 11:55 UTC
33 points
87 comments2 min readLW link

A Less Wrong singularity article?

Kaj_Sotala
17 Nov 2009 14:15 UTC
30 points
215 comments1 min readLW link

The Bayesian Tyrant

abramdemski
20 Aug 2020 0:08 UTC
103 points
14 comments6 min readLW link

Concept Safety: Producing similar AI-human concept spaces

Kaj_Sotala
14 Apr 2015 20:39 UTC
31 points
45 comments8 min readLW link

[LINK] What should a reasonable person believe about the Singularity?

Kaj_Sotala
13 Jan 2011 9:32 UTC
27 points
14 comments2 min readLW link

The many ways AIs behave badly

Stuart_Armstrong
24 Apr 2018 11:40 UTC
25 points
3 comments2 min readLW link

July 2020 gwern.net newsletter

gwern
20 Aug 2020 16:39 UTC
27 points
0 comments1 min readLW link
(www.gwern.net)

Do what we mean vs. do what we say

rohinmshah
30 Aug 2018 22:03 UTC
36 points
14 comments1 min readLW link

[Question] What’s a Decomposable Alignment Topic?

elriggs
21 Aug 2020 22:57 UTC
26 points
16 comments1 min readLW link

Tools versus agents

Stuart_Armstrong
16 May 2012 13:00 UTC
29 points
39 comments5 min readLW link

An unaligned benchmark

paulfchristiano
17 Nov 2018 15:51 UTC
28 points
0 comments9 min readLW link

Following human norms

rohinmshah
20 Jan 2019 23:59 UTC
28 points
10 comments5 min readLW link

nostalgebraist: Recursive Goodhart’s Law

Kaj_Sotala
26 Aug 2020 11:07 UTC
56 points
27 comments1 min readLW link
(nostalgebraist.tumblr.com)

[AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents

rohinmshah
26 Aug 2020 17:20 UTC
21 points
3 comments8 min readLW link
(mailchi.mp)

[Question] How hard would it be to change GPT-3 in a way that allows audio?

ChristianKl
28 Aug 2020 14:42 UTC
8 points
5 comments1 min readLW link

Safe Scrambling?

Hoagy
29 Aug 2020 14:31 UTC
3 points
1 comment2 min readLW link

(Humor) AI Alignment Critical Failure Table

Kaj_Sotala
31 Aug 2020 19:51 UTC
25 points
2 comments1 min readLW link
(sl4.org)

What is ambitious value learning?

rohinmshah
1 Nov 2018 16:20 UTC
48 points
28 comments2 min readLW link

The easy goal inference problem is still hard

paulfchristiano
3 Nov 2018 14:41 UTC
42 points
17 comments4 min readLW link

[AN #115]: AI safety research problems in the AI-GA framework

rohinmshah
2 Sep 2020 17:10 UTC
19 points
16 comments6 min readLW link
(mailchi.mp)

Emotional valence vs RL reward: a video game analogy

steve2152
3 Sep 2020 15:28 UTC
12 points
6 comments4 min readLW link

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

3 Sep 2020 18:27 UTC
60 points
11 comments2 min readLW link

“Learning to Summarize with Human Feedback”—OpenAI

Rekrul
7 Sep 2020 17:59 UTC
56 points
2 comments1 min readLW link

[AN #116]: How to make explanations of neurons compositional

rohinmshah
9 Sep 2020 17:20 UTC
21 points
2 comments9 min readLW link
(mailchi.mp)

Safer sandboxing via collective separation

Richard_Ngo
9 Sep 2020 19:49 UTC
20 points
6 comments4 min readLW link