
AI

Core Tag · Last edit: 4 Oct 2021 19:59 UTC by plex

Artificial Intelligence is the study of creating intelligence in algorithms. On LessWrong, the primary focus of AI discussion is how to ensure that, as humanity builds increasingly powerful AI systems, the outcome will be good. The central concern is that a powerful enough AI, if not designed and implemented with sufficient understanding, would optimize something unintended by its creators and pose an existential threat to the future of humanity. This is known as the AI alignment problem.

Common terms in this space are superintelligence, AI Alignment, AI Safety, Friendly AI, Transformative AI, human-level-intelligence, AI Governance, and Beneficial AI. This entry and the associated tag roughly encompass all of these topics: anything that is part of the broad cluster of understanding AI and its future impacts on our civilization deserves this tag.

AI Alignment

There are narrow conceptions of alignment, where you’re trying to get the AI to do something like cure Alzheimer’s disease without destroying the rest of the world. And there are much more ambitious notions of alignment, where you’re trying to get it to do the right thing and achieve a happy intergalactic civilization.

But both the narrow and the ambitious notions of alignment have in common that you’re trying to have the AI do that thing rather than make a lot of paperclips.
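
Much of the difficulty traces back to Goodhart’s Law (listed under Basic Alignment Theory below): a measurable proxy for what we actually want tends to come apart from it under strong optimization pressure. The toy sketch below is purely illustrative and not taken from any of the linked posts; it scores randomly generated candidates on a noisy proxy for their true value, and shows that the harder you select on the proxy, the more the winning candidate’s proxy score overstates the value you actually cared about.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_candidates(n):
    """Toy model: each candidate has a 'true value' the designers care about,
    and a 'proxy' score that is only a noisy measurement of it -- the thing
    the system actually gets optimized on."""
    true_value = rng.normal(size=n)
    proxy = true_value + rng.normal(size=n)  # specified objective = intended + error
    return true_value, proxy

for pressure in (10, 1_000, 100_000):  # how many options the optimizer searches over
    true_value, proxy = sample_candidates(pressure)
    best = np.argmax(proxy)            # pick whatever scores highest on the proxy
    print(f"search size {pressure:>7}: proxy score {proxy[best]:5.2f}, "
          f"true value {true_value[best]:5.2f}")

# Typical output: the proxy score climbs steadily with optimization pressure,
# while the true value of the selected candidate climbs much more slowly --
# the gap between what was specified and what was intended widens.
```

This is only the mildest (regressional) form of the problem; the Goodhart Taxonomy post listed below distinguishes several stronger variants.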

See also General Intelligence.

Basic Alignment Theory

AIXI
Coherent Extrapolated Volition
Complexity of Value
Corrigibility
Decision Theory
Embedded Agency
Fixed Point Theorems
Goodhart’s Law
Goal-Directedness
Infra-Bayesianism
Inner Alignment
Instrumental Convergence
Intelligence Explosion
Logical Induction
Logical Uncertainty
Mesa-Optimization
Myopia
Newcomb’s Problem
Optimization
Orthogonality Thesis
Outer Alignment
Paperclip Maximizer
Recursive Self-Improvement
Solomonoff Induction
Treacherous Turn
Utility Functions

Engineering Alignment

AI Boxing (Containment)
Conservatism (AI)
Debate (AI safety technique)
Factored Cognition
Humans Consulting HCH
Impact Measures
Inverse Reinforcement Learning
Iterated Amplification
Mild Optimization
Oracle AI
Reward Functions
Tool AI
Transparency / Interpretability
Tripwire
Value Learning

Strategy

AI Governance
AI Risk
AI Services (CAIS)
AI Takeoff
AI Timelines
Computing Overhang
Regulation and AI Risk
Transformative AI

Organizations

AI Safety Camp
Centre for Human-Compatible AI
DeepMind
Future of Humanity Institute
Future of Life Institute
Machine Intelligence Research Institute
OpenAI
Ought

Other

AI Capabilities
GPT
Language Models
Machine Learning
Narrow AI
Neuromorphic AI
Reinforcement Learning
Research Agendas
Superintelligence
Whole Brain Emulation

An overview of 11 proposals for building safe advanced AI

evhub · 29 May 2020 20:38 UTC
181 points
34 comments · 38 min read · LW link · 2 reviews

There’s No Fire Alarm for Artificial General Intelligence

Eliezer Yudkowsky · 13 Oct 2017 21:38 UTC
120 points
69 comments · 25 min read · LW link

Superintelligence FAQ

Scott Alexander · 20 Sep 2016 19:00 UTC
83 points
11 comments · 27 min read · LW link

Risks from Learned Optimization: Introduction

31 May 2019 23:44 UTC
156 points
42 comments · 12 min read · LW link · 3 reviews

Embedded Agents

29 Oct 2018 19:53 UTC
194 points
41 comments · 1 min read · LW link · 2 reviews

What failure looks like

paulfchristiano · 17 Mar 2019 20:18 UTC
283 points
49 comments · 8 min read · LW link · 2 reviews

The Rocket Alignment Problem

Eliezer Yudkowsky · 4 Oct 2018 0:38 UTC
186 points
42 comments · 15 min read · LW link · 2 reviews

Challenges to Christiano’s capability amplification proposal

Eliezer Yudkowsky · 19 May 2018 18:18 UTC
107 points
54 comments · 23 min read · LW link · 1 review

Embedded Agency (full-text version)

15 Nov 2018 19:49 UTC
125 points
11 comments · 54 min read · LW link

A space of proposals for building safe advanced AI

Richard_Ngo · 10 Jul 2020 16:58 UTC
51 points
4 comments · 4 min read · LW link

Biology-Inspired AGI Timelines: The Trick That Never Works

Eliezer Yudkowsky · 1 Dec 2021 22:35 UTC
174 points
139 comments · 65 min read · LW link

larger language models may disappoint you [or, an eternally unfinished draft]

nostalgebraist · 26 Nov 2021 23:08 UTC
221 points
28 comments · 31 min read · LW link

Deepmind’s Gopher—more powerful than GPT-3

hath · 8 Dec 2021 17:06 UTC
86 points
27 comments · 1 min read · LW link
(deepmind.com)

Goodhart Taxonomy

Scott Garrabrant · 30 Dec 2017 16:38 UTC
171 points
33 comments · 10 min read · LW link

AI Alignment 2018-19 Review

Rohin Shah · 28 Jan 2020 2:19 UTC
125 points
6 comments · 35 min read · LW link

Some AI research areas and their relevance to existential safety

Andrew_Critch · 19 Nov 2020 3:18 UTC
183 points
39 comments · 50 min read · LW link · 2 reviews

Moravec’s Paradox Comes From The Availability Heuristic

james.lucassen · 20 Oct 2021 6:23 UTC
31 points
2 comments · 2 min read · LW link
(jlucassen.com)

Inference cost limits the impact of ever larger models

SoerenMind · 23 Oct 2021 10:51 UTC
36 points
28 comments · 2 min read · LW link

[Linkpost] Chinese government’s guidelines on AI

RomanS · 10 Dec 2021 21:10 UTC
61 points
14 comments · 1 min read · LW link

That Alien Message

Eliezer Yudkowsky · 22 May 2008 5:55 UTC
282 points
173 comments · 10 min read · LW link

Epistemological Framing for AI Alignment Research

adamShimi · 8 Mar 2021 22:05 UTC
53 points
7 comments · 9 min read · LW link

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

gwern · 2 Nov 2021 2:32 UTC
134 points
52 comments · 1 min read · LW link
(arxiv.org)

Discussion with Eliezer Yudkowsky on AGI interventions

11 Nov 2021 3:01 UTC
325 points
256 comments · 34 min read · LW link

Shulman and Yudkowsky on AI progress

3 Dec 2021 20:05 UTC
91 points
16 comments · 20 min read · LW link

Future ML Systems Will Be Qualitatively Different

jsteinhardt · 11 Jan 2022 19:50 UTC
107 points
10 comments · 5 min read · LW link
(bounded-regret.ghost.io)

[Linkpost] Tro­janNet: Embed­ding Hid­den Tro­jan Horse Models in Neu­ral Networks

Gunnar_Zarncke11 Feb 2022 1:17 UTC
12 points
1 comment1 min readLW link

Ro­bust­ness to Scale

Scott Garrabrant21 Feb 2018 22:55 UTC
103 points
22 comments2 min readLW link1 review

Chris Olah’s views on AGI safety

evhub1 Nov 2019 20:13 UTC
187 points
38 comments12 min readLW link2 reviews

[AN #96]: Buck and I dis­cuss/​ar­gue about AI Alignment

Rohin Shah22 Apr 2020 17:20 UTC
17 points
4 comments10 min readLW link
(mailchi.mp)

Matt Botv­inick on the spon­ta­neous emer­gence of learn­ing algorithms

Adam Scholl12 Aug 2020 7:47 UTC
149 points
90 comments5 min readLW link

Co­her­ence ar­gu­ments do not en­tail goal-di­rected behavior

Rohin Shah3 Dec 2018 3:26 UTC
93 points
67 comments7 min readLW link3 reviews

Align­ment By Default

johnswentworth12 Aug 2020 18:54 UTC
128 points
92 comments11 min readLW link2 reviews

Book re­view: “A Thou­sand Brains” by Jeff Hawkins

Steven Byrnes4 Mar 2021 5:10 UTC
107 points
18 comments19 min readLW link

Model­ling Trans­for­ma­tive AI Risks (MTAIR) Pro­ject: Introduction

16 Aug 2021 7:12 UTC
82 points
0 comments9 min readLW link

In­fra-Bayesian phys­i­cal­ism: a for­mal the­ory of nat­u­ral­ized induction

Vanessa Kosoy30 Nov 2021 22:25 UTC
84 points
13 comments40 min readLW link

What an ac­tu­ally pes­simistic con­tain­ment strat­egy looks like

lc5 Apr 2022 0:19 UTC
497 points
123 comments6 min readLW link

AlphaGo Zero and the Foom Debate

Eliezer Yudkowsky21 Oct 2017 2:18 UTC
81 points
17 comments3 min readLW link

Trade­off be­tween de­sir­able prop­er­ties for baseline choices in im­pact measures

Vika4 Jul 2020 11:56 UTC
37 points
24 comments5 min readLW link

Com­pe­ti­tion: Am­plify Ro­hin’s Pre­dic­tion on AGI re­searchers & Safety Concerns

stuhlmueller21 Jul 2020 20:06 UTC
80 points
40 comments3 min readLW link

the scal­ing “in­con­sis­tency”: openAI’s new insight

nostalgebraist7 Nov 2020 7:40 UTC
146 points
12 comments9 min readLW link
(nostalgebraist.tumblr.com)

2019 Re­view Rewrite: Seek­ing Power is Often Ro­bustly In­stru­men­tal in MDPs

TurnTrout23 Dec 2020 17:16 UTC
35 points
0 comments4 min readLW link
(www.lesswrong.com)

Boot­strapped Alignment

G Gordon Worley III27 Feb 2021 15:46 UTC
19 points
12 comments2 min readLW link

Mul­ti­modal Neu­rons in Ar­tifi­cial Neu­ral Networks

Kaj_Sotala5 Mar 2021 9:01 UTC
57 points
2 comments2 min readLW link
(distill.pub)

Re­view of “Fun with +12 OOMs of Com­pute”

28 Mar 2021 14:55 UTC
57 points
20 comments8 min readLW link

Draft re­port on ex­is­ten­tial risk from power-seek­ing AI

Joe Carlsmith28 Apr 2021 21:41 UTC
66 points
23 comments1 min readLW link

Rogue AGI Em­bod­ies Valuable In­tel­lec­tual Property

3 Jun 2021 20:37 UTC
69 points
9 comments3 min readLW link

Analo­gies and Gen­eral Pri­ors on Intelligence

20 Aug 2021 21:03 UTC
57 points
12 comments14 min readLW link

We’re already in AI takeoff

Valentine8 Mar 2022 23:09 UTC
106 points
106 comments7 min readLW link

In­ter­pretabil­ity’s Align­ment-Solv­ing Po­ten­tial: Anal­y­sis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
37 points
0 comments59 min readLW link

Dis­con­tin­u­ous progress in his­tory: an update

KatjaGrace14 Apr 2020 0:00 UTC
178 points
25 comments31 min readLW link1 review
(aiimpacts.org)

Repli­ca­tion Dy­nam­ics Bridge to RL in Ther­mo­dy­namic Limit

Zachary Robertson18 May 2020 1:02 UTC
6 points
1 comment2 min readLW link

The ground of optimization

Alex Flint20 Jun 2020 0:38 UTC
202 points
74 comments27 min readLW link1 review

Model­ling Con­tin­u­ous Progress

Sammy Martin23 Jun 2020 18:06 UTC
29 points
3 comments7 min readLW link

Clas­sifi­ca­tion of AI al­ign­ment re­search: de­con­fu­sion, “good enough” non-su­per­in­tel­li­gent AI al­ign­ment, su­per­in­tel­li­gent AI alignment

philip_b14 Jul 2020 22:48 UTC
35 points
25 comments3 min readLW link

Col­lec­tion of GPT-3 results

Kaj_Sotala18 Jul 2020 20:04 UTC
88 points
24 comments1 min readLW link
(twitter.com)

Hiring en­g­ineers and re­searchers to help al­ign GPT-3

paulfchristiano1 Oct 2020 18:54 UTC
206 points
14 comments3 min readLW link

The date of AI Takeover is not the day the AI takes over

Daniel Kokotajlo22 Oct 2020 10:41 UTC
118 points
32 comments2 min readLW link1 review

[Question] What could one do with truly un­limited com­pu­ta­tional power?

Yitz11 Nov 2020 10:03 UTC
30 points
22 comments2 min readLW link

AGI Predictions

21 Nov 2020 3:46 UTC
107 points
36 comments4 min readLW link

[Question] What are the best prece­dents for in­dus­tries failing to in­vest in valuable AI re­search?

Daniel Kokotajlo14 Dec 2020 23:57 UTC
18 points
17 comments1 min readLW link

Ex­trap­o­lat­ing GPT-N performance

Lanrian18 Dec 2020 21:41 UTC
96 points
31 comments25 min readLW link1 review

De­bate up­date: Obfus­cated ar­gu­ments problem

Beth Barnes23 Dec 2020 3:24 UTC
118 points
20 comments16 min readLW link

Liter­a­ture Re­view on Goal-Directedness

18 Jan 2021 11:15 UTC
60 points
21 comments31 min readLW link

[Question] How will OpenAI + GitHub’s Copi­lot af­fect pro­gram­ming?

smountjoy29 Jun 2021 16:42 UTC
55 points
23 comments1 min readLW link

Deep­Mind: Gen­er­ally ca­pa­ble agents emerge from open-ended play

Daniel Kokotajlo27 Jul 2021 14:19 UTC
245 points
53 comments2 min readLW link
(deepmind.com)

Model­ing Risks From Learned Optimization

Ben Cottier12 Oct 2021 20:54 UTC
44 points
0 comments12 min readLW link

Truth­ful AI: Devel­op­ing and gov­ern­ing AI that does not lie

18 Oct 2021 18:37 UTC
81 points
9 comments10 min readLW link

Effi­cien­tZero: How It Works

1a3orn26 Nov 2021 15:17 UTC
251 points
42 comments29 min readLW link

The­o­ret­i­cal Neu­ro­science For Align­ment Theory

Cameron Berg7 Dec 2021 21:50 UTC
62 points
19 comments23 min readLW link

Magna Alta Doctrina

jacob_cannell11 Dec 2021 21:54 UTC
35 points
7 comments28 min readLW link

DL to­wards the un­al­igned Re­cur­sive Self-Op­ti­miza­tion attractor

jacob_cannell18 Dec 2021 2:15 UTC
30 points
22 comments4 min readLW link

Reg­u­lariza­tion Causes Mo­du­lar­ity Causes Generalization

dkirmani1 Jan 2022 23:34 UTC
49 points
7 comments4 min readLW link

It Looks Like You’re Try­ing To Take Over The World

gwern9 Mar 2022 16:35 UTC
366 points
123 comments1 min readLW link
(www.gwern.net)

An Un­trol­lable Math­e­mat­i­cian Illustrated

abramdemski20 Mar 2018 0:00 UTC
154 points
38 comments1 min readLW link1 review

Con­di­tions for Mesa-Optimization

1 Jun 2019 20:52 UTC
68 points
47 comments12 min readLW link

Thoughts on Hu­man Models

21 Feb 2019 9:10 UTC
124 points
32 comments10 min readLW link1 review

In­ner al­ign­ment in the brain

Steven Byrnes22 Apr 2020 13:14 UTC
76 points
16 comments16 min readLW link

Prob­lem re­lax­ation as a tactic

TurnTrout22 Apr 2020 23:44 UTC
109 points
8 comments7 min readLW link

[Question] How should po­ten­tial AI al­ign­ment re­searchers gauge whether the field is right for them?

TurnTrout6 May 2020 12:24 UTC
20 points
5 comments1 min readLW link

Speci­fi­ca­tion gam­ing: the flip side of AI ingenuity

6 May 2020 23:51 UTC
45 points
8 comments6 min readLW link

Les­sons from Isaac: Pit­falls of Reason

adamShimi8 May 2020 20:44 UTC
9 points
0 comments8 min readLW link

Cor­rigi­bil­ity as out­side view

TurnTrout8 May 2020 21:56 UTC
36 points
11 comments4 min readLW link

[Question] How to choose a PhD with AI Safety in mind

Ariel Kwiatkowski15 May 2020 22:19 UTC
9 points
1 comment1 min readLW link

Re­ward func­tions and up­dat­ing as­sump­tions can hide a mul­ti­tude of sins

Stuart_Armstrong18 May 2020 15:18 UTC
16 points
2 comments9 min readLW link

Pos­si­ble take­aways from the coro­n­avirus pan­demic for slow AI takeoff

Vika31 May 2020 17:51 UTC
135 points
36 comments3 min readLW link1 review

Fo­cus: you are al­lowed to be bad at ac­com­plish­ing your goals

adamShimi3 Jun 2020 21:04 UTC
19 points
17 comments3 min readLW link

Re­ply to Paul Chris­ti­ano on Inac­cessible Information

Alex Flint5 Jun 2020 9:10 UTC
77 points
15 comments6 min readLW link

Our take on CHAI’s re­search agenda in un­der 1500 words

Alex Flint17 Jun 2020 12:24 UTC
104 points
19 comments5 min readLW link

[Question] Ques­tion on GPT-3 Ex­cel Demo

Zhitao Hou22 Jun 2020 20:31 UTC
0 points
2 comments1 min readLW link

Dy­namic in­con­sis­tency of the in­ac­tion and ini­tial state baseline

Stuart_Armstrong7 Jul 2020 12:02 UTC
30 points
8 comments2 min readLW link

Cortés, Pizarro, and Afonso as Prece­dents for Takeover

Daniel Kokotajlo1 Mar 2020 3:49 UTC
139 points
75 comments11 min readLW link1 review

[Question] What prob­lem would you like to see Re­in­force­ment Learn­ing ap­plied to?

Julian Schrittwieser8 Jul 2020 2:40 UTC
43 points
4 comments1 min readLW link

Refram­ing Su­per­in­tel­li­gence: Com­pre­hen­sive AI Ser­vices as Gen­eral Intelligence

Rohin Shah8 Jan 2019 7:12 UTC
104 points
74 comments5 min readLW link2 reviews
(www.fhi.ox.ac.uk)

My cur­rent frame­work for think­ing about AGI timelines

zhukeepa30 Mar 2020 1:23 UTC
106 points
5 comments3 min readLW link

[Question] To what ex­tent is GPT-3 ca­pa­ble of rea­son­ing?

TurnTrout20 Jul 2020 17:10 UTC
70 points
74 comments16 min readLW link

Repli­cat­ing the repli­ca­tion crisis with GPT-3?

skybrian22 Jul 2020 21:20 UTC
29 points
10 comments1 min readLW link

Can you get AGI from a Trans­former?

Steven Byrnes23 Jul 2020 15:27 UTC
106 points
39 comments12 min readLW link

Writ­ing with GPT-3

Jacob Falkovich24 Jul 2020 15:22 UTC
42 points
0 comments4 min readLW link

In­ner Align­ment: Ex­plain like I’m 12 Edition

Rafael Harth1 Aug 2020 15:24 UTC
170 points
39 comments13 min readLW link2 reviews

Devel­op­men­tal Stages of GPTs

orthonormal26 Jul 2020 22:03 UTC
140 points
74 comments7 min readLW link1 review

Gen­er­al­iz­ing the Power-Seek­ing Theorems

TurnTrout27 Jul 2020 0:28 UTC
40 points
6 comments4 min readLW link

Are we in an AI over­hang?

Andy Jones27 Jul 2020 12:48 UTC
253 points
106 comments4 min readLW link

[Question] What spe­cific dan­gers arise when ask­ing GPT-N to write an Align­ment Fo­rum post?

Matthew Barnett28 Jul 2020 2:56 UTC
43 points
14 comments1 min readLW link

[Question] Prob­a­bil­ity that other ar­chi­tec­tures will scale as well as Trans­form­ers?

Daniel Kokotajlo28 Jul 2020 19:36 UTC
22 points
4 comments1 min readLW link

What a 20-year-lead in mil­i­tary tech might look like

Daniel Kokotajlo29 Jul 2020 20:10 UTC
67 points
44 comments16 min readLW link

[Question] What if memes are com­mon in highly ca­pa­ble minds?

Daniel Kokotajlo30 Jul 2020 20:45 UTC
34 points
11 comments2 min readLW link

Three men­tal images from think­ing about AGI de­bate & corrigibility

Steven Byrnes3 Aug 2020 14:29 UTC
55 points
35 comments4 min readLW link

Solv­ing Key Align­ment Prob­lems Group

Logan Riggs3 Aug 2020 19:30 UTC
19 points
7 comments2 min readLW link

How eas­ily can we sep­a­rate a friendly AI in de­sign space from one which would bring about a hy­per­ex­is­ten­tial catas­tro­phe?

Anirandis10 Sep 2020 0:40 UTC
19 points
20 comments2 min readLW link

My com­pu­ta­tional frame­work for the brain

Steven Byrnes14 Sep 2020 14:19 UTC
141 points
26 comments13 min readLW link1 review

[Question] Where is hu­man level on text pre­dic­tion? (GPTs task)

Daniel Kokotajlo20 Sep 2020 9:00 UTC
25 points
19 comments1 min readLW link

Needed: AI in­fo­haz­ard policy

Vanessa Kosoy21 Sep 2020 15:26 UTC
55 points
17 comments2 min readLW link

The Col­lid­ing Ex­po­nen­tials of AI

VermillionStuka14 Oct 2020 23:31 UTC
27 points
16 comments5 min readLW link

“Lit­tle glimpses of em­pa­thy” as the foun­da­tion for so­cial emotions

Steven Byrnes22 Oct 2020 11:02 UTC
31 points
1 comment5 min readLW link

In­tro­duc­tion to Carte­sian Frames

Scott Garrabrant22 Oct 2020 13:00 UTC
142 points
28 comments22 min readLW link1 review

“Carte­sian Frames” Talk #2 this Sun­day at 2pm (PT)

Rob Bensinger28 Oct 2020 13:59 UTC
30 points
0 comments1 min readLW link

Does SGD Pro­duce De­cep­tive Align­ment?

Mark Xu6 Nov 2020 23:48 UTC
76 points
4 comments16 min readLW link

[Question] How can I bet on short timelines?

Daniel Kokotajlo7 Nov 2020 12:44 UTC
43 points
16 comments2 min readLW link

Non-Ob­struc­tion: A Sim­ple Con­cept Mo­ti­vat­ing Corrigibility

TurnTrout21 Nov 2020 19:35 UTC
64 points
19 comments19 min readLW link

Carte­sian Frames Definitions

Rob Bensinger8 Nov 2020 12:44 UTC
24 points
0 comments4 min readLW link

Com­mu­ni­ca­tion Prior as Align­ment Strategy

johnswentworth12 Nov 2020 22:06 UTC
36 points
7 comments6 min readLW link

How Rood­man’s GWP model trans­lates to TAI timelines

Daniel Kokotajlo16 Nov 2020 14:05 UTC
21 points
5 comments3 min readLW link

Normativity

abramdemski18 Nov 2020 16:52 UTC
46 points
11 comments9 min readLW link

In­ner Align­ment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC
124 points
38 comments11 min readLW link2 reviews

Con­tin­u­ing the take­offs debate

Richard_Ngo23 Nov 2020 15:58 UTC
67 points
13 comments9 min readLW link

The next AI win­ter will be due to en­ergy costs

hippke24 Nov 2020 16:53 UTC
53 points
6 comments2 min readLW link

Re­cur­sive Quan­tiliz­ers II

abramdemski2 Dec 2020 15:26 UTC
29 points
15 comments13 min readLW link

Su­per­vised learn­ing in the brain, part 4: com­pres­sion /​ filtering

Steven Byrnes5 Dec 2020 17:06 UTC
12 points
0 comments5 min readLW link

Con­ser­vatism in neo­cor­tex-like AGIs

Steven Byrnes8 Dec 2020 16:37 UTC
21 points
5 comments8 min readLW link

Avoid­ing Side Effects in Com­plex Environments

12 Dec 2020 0:34 UTC
62 points
9 comments2 min readLW link
(avoiding-side-effects.github.io)

The Power of Annealing

meanderingmoose14 Dec 2020 11:02 UTC
23 points
6 comments5 min readLW link

[link] The AI Gir­lfriend Se­duc­ing China’s Lonely Men

Kaj_Sotala14 Dec 2020 20:18 UTC
34 points
11 comments1 min readLW link
(www.sixthtone.com)

Oper­a­tional­iz­ing com­pat­i­bil­ity with strat­egy-stealing

evhub24 Dec 2020 22:36 UTC
41 points
6 comments4 min readLW link

De­fus­ing AGI Danger

Mark Xu24 Dec 2020 22:58 UTC
48 points
9 comments9 min readLW link

Multi-di­men­sional re­wards for AGI in­ter­pretabil­ity and control

Steven Byrnes4 Jan 2021 3:08 UTC
11 points
7 comments10 min readLW link

DALL-E by OpenAI

Daniel Kokotajlo5 Jan 2021 20:05 UTC
97 points
22 comments1 min readLW link

Re­view of ‘But ex­actly how com­plex and frag­ile?’

TurnTrout6 Jan 2021 18:39 UTC
53 points
0 comments8 min readLW link

The Case for a Jour­nal of AI Alignment

adamShimi9 Jan 2021 18:13 UTC
44 points
29 comments4 min readLW link

Trans­parency and AGI safety

jylin0411 Jan 2021 18:51 UTC
51 points
12 comments30 min readLW link

Birds, Brains, Planes, and AI: Against Ap­peals to the Com­plex­ity/​Mys­te­ri­ous­ness/​Effi­ciency of the Brain

Daniel Kokotajlo18 Jan 2021 12:08 UTC
181 points
84 comments14 min readLW link

In­fra-Bayesi­anism Unwrapped

adamShimi20 Jan 2021 13:35 UTC
38 points
0 comments24 min readLW link

Op­ti­mal play in hu­man-judged De­bate usu­ally won’t an­swer your question

Joe_Collman27 Jan 2021 7:34 UTC
33 points
12 comments12 min readLW link

Creat­ing AGI Safety Interlocks

Koen.Holtman5 Feb 2021 12:01 UTC
7 points
4 comments8 min readLW link

Timeline of AI safety

riceissa7 Feb 2021 22:29 UTC
59 points
6 comments2 min readLW link
(timelines.issarice.com)

Tour­ne­sol, YouTube and AI Risk

adamShimi12 Feb 2021 18:56 UTC
36 points
13 comments4 min readLW link

In­ter­net En­cy­clo­pe­dia of Philos­o­phy on Ethics of Ar­tifi­cial Intelligence

Kaj_Sotala20 Feb 2021 13:54 UTC
15 points
1 comment4 min readLW link
(iep.utm.edu)

Be­hav­ioral Suffi­cient Statis­tics for Goal-Directedness

adamShimi11 Mar 2021 15:01 UTC
21 points
12 comments9 min readLW link

A sim­ple way to make GPT-3 fol­low instructions

Quintin Pope8 Mar 2021 2:57 UTC
10 points
5 comments4 min readLW link

Towards a Mechanis­tic Un­der­stand­ing of Goal-Directedness

Mark Xu9 Mar 2021 20:17 UTC
41 points
1 comment5 min readLW link

AXRP Epi­sode 5 - In­fra-Bayesi­anism with Vanessa Kosoy

DanielFilan10 Mar 2021 4:30 UTC
28 points
12 comments35 min readLW link

Com­ments on “The Sin­gu­lar­ity is Nowhere Near”

Steven Byrnes16 Mar 2021 23:59 UTC
50 points
6 comments8 min readLW link

Is RL in­volved in sen­sory pro­cess­ing?

Steven Byrnes18 Mar 2021 13:57 UTC
19 points
4 comments5 min readLW link

Against evolu­tion as an anal­ogy for how hu­mans will cre­ate AGI

Steven Byrnes23 Mar 2021 12:29 UTC
35 points
25 comments25 min readLW link

My AGI Threat Model: Misal­igned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC
64 points
40 comments16 min readLW link

Co­her­ence ar­gu­ments im­ply a force for goal-di­rected behavior

KatjaGrace26 Mar 2021 16:10 UTC
87 points
20 comments14 min readLW link
(aiimpacts.org)

Trans­parency Trichotomy

Mark Xu28 Mar 2021 20:26 UTC
22 points
2 comments7 min readLW link

Hard­ware is already ready for the sin­gu­lar­ity. Al­gorithm knowl­edge is the only bar­rier.

Andrew Vlahos30 Mar 2021 22:48 UTC
16 points
3 comments3 min readLW link

Ben Go­ertzel’s “Kinds of Minds”

JoshuaFox11 Apr 2021 12:41 UTC
12 points
4 comments1 min readLW link

Up­dat­ing the Lot­tery Ticket Hypothesis

johnswentworth18 Apr 2021 21:45 UTC
71 points
41 comments2 min readLW link

Three rea­sons to ex­pect long AI timelines

Matthew Barnett22 Apr 2021 18:44 UTC
64 points
29 comments11 min readLW link
(matthewbarnett.substack.com)

Be­ware over-use of the agent model

Alex Flint25 Apr 2021 22:19 UTC
28 points
9 comments5 min readLW link

Agents Over Carte­sian World Models

27 Apr 2021 2:06 UTC
57 points
3 comments27 min readLW link

Less Real­is­tic Tales of Doom

Mark Xu6 May 2021 23:01 UTC
100 points
13 comments4 min readLW link

Challenge: know ev­ery­thing that the best go bot knows about go

DanielFilan11 May 2021 5:10 UTC
48 points
93 comments2 min readLW link
(danielfilan.com)

For­mal In­ner Align­ment, Prospectus

abramdemski12 May 2021 19:57 UTC
91 points
57 comments16 min readLW link

Agency in Con­way’s Game of Life

Alex Flint13 May 2021 1:07 UTC
87 points
77 comments9 min readLW link

Knowl­edge Neu­rons in Pre­trained Transformers

evhub17 May 2021 22:54 UTC
98 points
7 comments2 min readLW link
(arxiv.org)

De­cou­pling de­liber­a­tion from competition

paulfchristiano25 May 2021 18:50 UTC
66 points
16 comments9 min readLW link
(ai-alignment.com)

Power dy­nam­ics as a blind spot or blurry spot in our col­lec­tive world-mod­el­ing, es­pe­cially around AI

Andrew_Critch1 Jun 2021 18:45 UTC
171 points
26 comments6 min readLW link

Game-the­o­retic Align­ment in terms of At­tain­able Utility

8 Jun 2021 12:36 UTC
20 points
2 comments9 min readLW link

Beijing Academy of Artificial Intelligence announces 1.75 trillion parameter model, Wu Dao 2.0

Ozyrus3 Jun 2021 12:07 UTC
23 points
9 comments1 min readLW link
(www.engadget.com)

An In­tu­itive Guide to Garrabrant Induction

Mark Xu3 Jun 2021 22:21 UTC
115 points
18 comments24 min readLW link

Con­ser­va­tive Agency with Mul­ti­ple Stakeholders

TurnTrout8 Jun 2021 0:30 UTC
31 points
0 comments3 min readLW link

Sup­ple­ment to “Big pic­ture of pha­sic dopamine”

Steven Byrnes8 Jun 2021 13:08 UTC
12 points
2 comments9 min readLW link

Look­ing Deeper at Deconfusion

adamShimi13 Jun 2021 21:29 UTC
55 points
13 comments15 min readLW link

[Question] Open prob­lem: how can we quan­tify player al­ign­ment in 2x2 nor­mal-form games?

TurnTrout16 Jun 2021 2:09 UTC
23 points
59 comments1 min readLW link

Re­ward Is Not Enough

Steven Byrnes16 Jun 2021 13:52 UTC
92 points
18 comments10 min readLW link

En­vi­ron­men­tal Struc­ture Can Cause In­stru­men­tal Convergence

TurnTrout22 Jun 2021 22:26 UTC
71 points
44 comments16 min readLW link
(arxiv.org)

AXRP Epi­sode 9 - Finite Fac­tored Sets with Scott Garrabrant

DanielFilan24 Jun 2021 22:10 UTC
56 points
2 comments58 min readLW link

Mus­ings on gen­eral sys­tems alignment

Alex Flint30 Jun 2021 18:16 UTC
31 points
11 comments3 min readLW link

Thoughts on safety in pre­dic­tive learning

Steven Byrnes30 Jun 2021 19:17 UTC
18 points
17 comments19 min readLW link

The More Power At Stake, The Stronger In­stru­men­tal Con­ver­gence Gets For Op­ti­mal Policies

TurnTrout11 Jul 2021 17:36 UTC
45 points
7 comments6 min readLW link

A world in which the al­ign­ment prob­lem seems lower-stakes

TurnTrout8 Jul 2021 2:31 UTC
19 points
17 comments2 min readLW link

Frac­tional progress es­ti­mates for AI timelines and im­plied re­source requirements

15 Jul 2021 18:43 UTC
54 points
6 comments7 min readLW link

Ex­per­i­men­ta­tion with AI-gen­er­ated images (VQGAN+CLIP) | So­larpunk air­ships flee­ing a dragon

Kaj_Sotala15 Jul 2021 11:00 UTC
44 points
4 comments2 min readLW link
(kajsotala.fi)

Seek­ing Power is Con­ver­gently In­stru­men­tal in a Broad Class of Environments

TurnTrout8 Aug 2021 2:02 UTC
41 points
15 comments8 min readLW link

LCDT, A My­opic De­ci­sion Theory

3 Aug 2021 22:41 UTC
49 points
50 comments15 min readLW link

When Most VNM-Co­her­ent Prefer­ence Order­ings Have Con­ver­gent In­stru­men­tal Incentives

TurnTrout9 Aug 2021 17:22 UTC
52 points
4 comments5 min readLW link

Two AI-risk-re­lated game de­sign ideas

Daniel Kokotajlo5 Aug 2021 13:36 UTC
46 points
9 comments5 min readLW link

Re­search agenda update

Steven Byrnes6 Aug 2021 19:24 UTC
53 points
40 comments7 min readLW link

What 2026 looks like

Daniel Kokotajlo6 Aug 2021 16:14 UTC
325 points
64 comments16 min readLW link

Satis­ficers Tend To Seek Power: In­stru­men­tal Con­ver­gence Via Retargetability

TurnTrout18 Nov 2021 1:54 UTC
69 points
8 comments17 min readLW link
(www.overleaf.com)

Dopamine-su­per­vised learn­ing in mam­mals & fruit flies

Steven Byrnes10 Aug 2021 16:13 UTC
16 points
6 comments8 min readLW link

Free course re­view — Reli­able and In­ter­pretable Ar­tifi­cial In­tel­li­gence (ETH Zurich)

Jan Czechowski10 Aug 2021 16:36 UTC
7 points
0 comments3 min readLW link

Tech­ni­cal Pre­dic­tions Re­lated to AI Safety

lsusr13 Aug 2021 0:29 UTC
28 points
12 comments8 min readLW link

Provide feed­back on Open Philan­thropy’s AI al­ign­ment RFP

20 Aug 2021 19:52 UTC
56 points
6 comments1 min readLW link

AI Safety Papers: An App for the TAI Safety Database

ozziegooen21 Aug 2021 2:02 UTC
74 points
12 comments2 min readLW link

Ran­dal Koene on brain un­der­stand­ing be­fore whole brain emulation

Steven Byrnes23 Aug 2021 20:59 UTC
36 points
12 comments3 min readLW link

MIRI/​OP ex­change about de­ci­sion theory

Rob Bensinger25 Aug 2021 22:44 UTC
46 points
7 comments10 min readLW link

Good­hart Ethology

Charlie Steiner17 Sep 2021 17:31 UTC
17 points
4 comments15 min readLW link

[Question] What are good al­ign­ment con­fer­ence pa­pers?

adamShimi28 Aug 2021 13:35 UTC
12 points
2 comments1 min readLW link

Brain-Com­puter In­ter­faces and AI Alignment

niplav28 Aug 2021 19:48 UTC
27 points
5 comments11 min readLW link

Su­per­in­tel­li­gent In­tro­spec­tion: A Counter-ar­gu­ment to the Orthog­o­nal­ity Thesis

AllAmericanBreakfast29 Aug 2021 4:53 UTC
2 points
18 comments4 min readLW link

Align­ment Re­search = Con­cep­tual Align­ment Re­search + Ap­plied Align­ment Research

adamShimi30 Aug 2021 21:13 UTC
35 points
14 comments5 min readLW link

AXRP Epi­sode 11 - At­tain­able Utility and Power with Alex Turner

DanielFilan25 Sep 2021 21:10 UTC
19 points
5 comments52 min readLW link

Is progress in ML-as­sisted the­o­rem-prov­ing benefi­cial?

MakoYass28 Sep 2021 1:54 UTC
10 points
2 comments1 min readLW link

Take­off Speeds and Discontinuities

30 Sep 2021 13:50 UTC
56 points
1 comment15 min readLW link

My take on Vanessa Kosoy’s take on AGI safety

Steven Byrnes30 Sep 2021 12:23 UTC
75 points
10 comments31 min readLW link

[Pre­dic­tion] We are in an Al­gorith­mic Overhang

lsusr29 Sep 2021 23:40 UTC
31 points
14 comments1 min readLW link

In­ter­view with Skynet

lsusr30 Sep 2021 2:20 UTC
49 points
1 comment2 min readLW link

AI learns be­trayal and how to avoid it

Stuart_Armstrong30 Sep 2021 9:39 UTC
30 points
4 comments2 min readLW link

The Dark Side of Cog­ni­tion Hypothesis

Cameron Berg3 Oct 2021 20:10 UTC
19 points
1 comment16 min readLW link

[Question] How to think about and deal with OpenAI

Rafael Harth9 Oct 2021 13:10 UTC
101 points
71 comments1 min readLW link

NVIDIA and Microsoft re­leases 530B pa­ram­e­ter trans­former model, Me­ga­tron-Tur­ing NLG

Ozyrus11 Oct 2021 15:28 UTC
51 points
36 comments1 min readLW link
(developer.nvidia.com)

Post­mod­ern Warfare

lsusr25 Oct 2021 9:02 UTC
60 points
25 comments2 min readLW link

A very crude de­cep­tion eval is already passed

Beth Barnes29 Oct 2021 17:57 UTC
91 points
7 comments2 min readLW link

Study Guide

johnswentworth6 Nov 2021 1:23 UTC
185 points
41 comments16 min readLW link

Re: At­tempted Gears Anal­y­sis of AGI In­ter­ven­tion Dis­cus­sion With Eliezer

lsusr15 Nov 2021 10:02 UTC
20 points
8 comments15 min readLW link

Ngo and Yud­kowsky on al­ign­ment difficulty

15 Nov 2021 20:31 UTC
230 points
143 comments99 min readLW link

Cor­rigi­bil­ity Can Be VNM-Incoherent

TurnTrout20 Nov 2021 0:30 UTC
62 points
24 comments7 min readLW link

Visi­ble Thoughts Pro­ject and Bounty Announcement

So8res30 Nov 2021 0:19 UTC
236 points
104 comments13 min readLW link

In­ter­pret­ing Yud­kowsky on Deep vs Shal­low Knowledge

adamShimi5 Dec 2021 17:32 UTC
96 points
31 comments24 min readLW link

Are there al­ter­na­tive to solv­ing value trans­fer and ex­trap­o­la­tion?

Stuart_Armstrong6 Dec 2021 18:53 UTC
19 points
7 comments5 min readLW link

Con­sid­er­a­tions on in­ter­ac­tion be­tween AI and ex­pected value of the fu­ture

Beth Barnes7 Dec 2021 2:46 UTC
64 points
28 comments4 min readLW link

Some thoughts on why ad­ver­sar­ial train­ing might be useful

Beth Barnes8 Dec 2021 1:28 UTC
9 points
5 comments3 min readLW link

The Plan

johnswentworth10 Dec 2021 23:41 UTC
207 points
76 comments14 min readLW link

Moore’s Law, AI, and the pace of progress

Veedrac11 Dec 2021 3:02 UTC
117 points
39 comments24 min readLW link

Sum­mary of the Acausal At­tack Is­sue for AIXI

Diffractor13 Dec 2021 8:16 UTC
13 points
6 comments4 min readLW link

Con­se­quen­tial­ism & corrigibility

Steven Byrnes14 Dec 2021 13:23 UTC
60 points
27 comments7 min readLW link

Should we rely on the speed prior for safety?

Marc-Everin Carauleanu14 Dec 2021 20:45 UTC
14 points
6 comments5 min readLW link

The Case for Rad­i­cal Op­ti­mism about Interpretability

Quintin Pope16 Dec 2021 23:38 UTC
51 points
15 comments8 min readLW link

Re­searcher in­cen­tives cause smoother progress on bench­marks

ryan_greenblatt21 Dec 2021 4:13 UTC
20 points
4 comments1 min readLW link

Self-Or­ganised Neu­ral Net­works: A sim­ple, nat­u­ral and effi­cient way to intelligence

D𝜋1 Jan 2022 23:24 UTC
40 points
50 comments44 min readLW link

Prizes for ELK proposals

paulfchristiano3 Jan 2022 20:23 UTC
133 points
156 comments7 min readLW link

D𝜋′s Spik­ing Network

lsusr4 Jan 2022 4:08 UTC
49 points
37 comments4 min readLW link

More Is Differ­ent for AI

jsteinhardt4 Jan 2022 19:30 UTC
135 points
21 comments3 min readLW link
(bounded-regret.ghost.io)

In­stru­men­tal Con­ver­gence For Real­is­tic Agent Objectives

TurnTrout22 Jan 2022 0:41 UTC
35 points
8 comments9 min readLW link

What’s Up With Con­fus­ingly Per­va­sive Con­se­quen­tial­ism?

Raemon20 Jan 2022 19:22 UTC
164 points
88 comments4 min readLW link

[In­tro to brain-like-AGI safety] 1. What’s the prob­lem & Why work on it now?

Steven Byrnes26 Jan 2022 15:23 UTC
99 points
16 comments23 min readLW link

Ar­gu­ments about Highly Reli­able Agent De­signs as a Use­ful Path to Ar­tifi­cial In­tel­li­gence Safety

27 Jan 2022 13:13 UTC
27 points
0 comments1 min readLW link
(arxiv.org)

Com­pet­i­tive pro­gram­ming with AlphaCode

Algon2 Feb 2022 16:49 UTC
55 points
37 comments15 min readLW link
(deepmind.com)

Thoughts on AGI safety from the top

jylin042 Feb 2022 20:06 UTC
21 points
2 comments32 min readLW link

Paradigm-build­ing from first prin­ci­ples: Effec­tive al­tru­ism, AGI, and alignment

Cameron Berg8 Feb 2022 16:12 UTC
24 points
5 comments14 min readLW link

[In­tro to brain-like-AGI safety] 3. Two sub­sys­tems: Learn­ing & Steering

Steven Byrnes9 Feb 2022 13:09 UTC
38 points
0 comments24 min readLW link

[In­tro to brain-like-AGI safety] 4. The “short-term pre­dic­tor”

Steven Byrnes16 Feb 2022 13:12 UTC
42 points
11 comments13 min readLW link

ELK Pro­posal: Think­ing Via A Hu­man Imitator

TurnTrout22 Feb 2022 1:52 UTC
26 points
6 comments11 min readLW link

Why I’m co-found­ing Aligned AI

Stuart_Armstrong17 Feb 2022 19:55 UTC
89 points
54 comments3 min readLW link

Im­pli­ca­tions of au­to­mated on­tol­ogy identification

18 Feb 2022 3:30 UTC
68 points
29 comments23 min readLW link

Align­ment re­search exercises

Richard_Ngo21 Feb 2022 20:24 UTC
138 points
14 comments8 min readLW link

[In­tro to brain-like-AGI safety] 5. The “long-term pre­dic­tor”, and TD learning

Steven Byrnes23 Feb 2022 14:44 UTC
27 points
18 comments22 min readLW link

How do new mod­els from OpenAI, Deep­Mind and An­thropic perform on Truth­fulQA?

Owain_Evans26 Feb 2022 12:46 UTC
40 points
3 comments11 min readLW link

Es­ti­mat­ing Brain-Equiv­a­lent Com­pute from Image Recog­ni­tion Al­gorithms

Gunnar_Zarncke27 Feb 2022 2:45 UTC
14 points
4 comments2 min readLW link

[Link] Aligned AI AMA

Stuart_Armstrong1 Mar 2022 12:01 UTC
18 points
0 comments1 min readLW link

[In­tro to brain-like-AGI safety] 6. Big pic­ture of mo­ti­va­tion, de­ci­sion-mak­ing, and RL

Steven Byrnes2 Mar 2022 15:26 UTC
21 points
12 comments16 min readLW link

[Question] Would (my­opic) gen­eral pub­lic good pro­duc­ers sig­nifi­cantly ac­cel­er­ate the de­vel­op­ment of AGI?

MakoYass2 Mar 2022 23:47 UTC
24 points
10 comments1 min readLW link

[In­tro to brain-like-AGI safety] 7. From hard­coded drives to fore­sighted plans: A worked example

Steven Byrnes9 Mar 2022 14:28 UTC
29 points
0 comments9 min readLW link

[In­tro to brain-like-AGI safety] 9. Take­aways from neuro 2/​2: On AGI motivation

Steven Byrnes23 Mar 2022 12:48 UTC
21 points
2 comments23 min readLW link

Hu­mans pre­tend­ing to be robots pre­tend­ing to be human

Richard_Kennaway28 Mar 2022 15:13 UTC
27 points
15 comments1 min readLW link

[In­tro to brain-like-AGI safety] 10. The al­ign­ment problem

Steven Byrnes30 Mar 2022 13:24 UTC
32 points
2 comments21 min readLW link

AXRP Epi­sode 13 - First Prin­ci­ples of AGI Safety with Richard Ngo

DanielFilan31 Mar 2022 5:20 UTC
24 points
0 comments48 min readLW link

Un­con­trol­lable Su­per-Pow­er­ful Explosives

Sammy Martin2 Apr 2022 20:13 UTC
53 points
12 comments5 min readLW link

The case for Do­ing Some­thing Else (if Align­ment is doomed)

Rafael Harth5 Apr 2022 17:52 UTC
69 points
14 comments2 min readLW link

[In­tro to brain-like-AGI safety] 11. Safety ≠ al­ign­ment (but they’re close!)

Steven Byrnes6 Apr 2022 13:39 UTC
23 points
0 comments10 min readLW link

Strate­gic Con­sid­er­a­tions Re­gard­ing Autis­tic/​Literal AI

Chris_Leong6 Apr 2022 14:57 UTC
−1 points
2 comments2 min readLW link

DALL·E 2 by OpenAI

P.6 Apr 2022 14:17 UTC
43 points
48 comments1 min readLW link
(openai.com)

How to train your trans­former

p.b.7 Apr 2022 9:34 UTC
5 points
0 comments8 min readLW link

AMA Con­jec­ture, A New Align­ment Startup

adamShimi9 Apr 2022 9:43 UTC
43 points
40 comments1 min readLW link

Worse than an un­al­igned AGI

shminux10 Apr 2022 3:35 UTC
5 points
12 comments1 min readLW link

[Question] Did OpenAI let GPT out of the box?

ChristianKl16 Apr 2022 14:56 UTC
4 points
12 comments1 min readLW link

In­stru­men­tal Con­ver­gence To Offer Hope?

michael_mjd22 Apr 2022 1:56 UTC
11 points
5 comments3 min readLW link

[In­tro to brain-like-AGI safety] 13. Sym­bol ground­ing & hu­man so­cial instincts

Steven Byrnes27 Apr 2022 13:30 UTC
31 points
11 comments15 min readLW link

[In­tro to brain-like-AGI safety] 14. Con­trol­led AGI

Steven Byrnes11 May 2022 13:17 UTC
22 points
16 comments18 min readLW link

[Question] What’s keep­ing con­cerned ca­pa­bil­ities gain re­searchers from leav­ing the field?

sovran12 May 2022 12:16 UTC
16 points
4 comments1 min readLW link

Read­ing the ethi­cists: A re­view of ar­ti­cles on AI in the jour­nal Science and Eng­ineer­ing Ethics

Charlie Steiner18 May 2022 20:52 UTC
45 points
3 comments14 min readLW link

[AN #94]: AI al­ign­ment as trans­la­tion be­tween hu­mans and machines

Rohin Shah8 Apr 2020 17:10 UTC
11 points
0 comments7 min readLW link
(mailchi.mp)

[Question] What are the rel­a­tive speeds of AI ca­pa­bil­ities and AI safety?

NunoSempere24 Apr 2020 18:21 UTC
8 points
2 comments1 min readLW link

Seek­ing Power is Often Con­ver­gently In­stru­men­tal in MDPs

5 Dec 2019 2:33 UTC
146 points
35 comments16 min readLW link2 reviews
(arxiv.org)

“Don’t even think about hell”

emmab2 May 2020 8:06 UTC
6 points
2 comments1 min readLW link

[Question] AI Box­ing for Hard­ware-bound agents (aka the China al­ign­ment prob­lem)

Logan Zoellner8 May 2020 15:50 UTC
11 points
27 comments10 min readLW link

Point­ing to a Flower

johnswentworth18 May 2020 18:54 UTC
58 points
18 comments9 min readLW link

Learn­ing and ma­nipu­lat­ing learning

Stuart_Armstrong19 May 2020 13:02 UTC
39 points
5 comments10 min readLW link

[Question] Why aren’t we test­ing gen­eral in­tel­li­gence dis­tri­bu­tion?

Bob Jacobs26 May 2020 16:07 UTC
25 points
7 comments1 min readLW link

OpenAI an­nounces GPT-3

gwern29 May 2020 1:49 UTC
67 points
23 comments1 min readLW link
(arxiv.org)

GPT-3: a dis­ap­point­ing paper

nostalgebraist29 May 2020 19:06 UTC
68 points
44 comments8 min readLW link1 review

In­tro­duc­tion to Ex­is­ten­tial Risks from Ar­tifi­cial In­tel­li­gence, for an EA audience

JoshuaFox2 Jun 2020 8:30 UTC
10 points
1 comment1 min readLW link

Prepar­ing for “The Talk” with AI projects

Daniel Kokotajlo13 Jun 2020 23:01 UTC
64 points
16 comments3 min readLW link

[Question] What are the high-level ap­proaches to AI al­ign­ment?

G Gordon Worley III16 Jun 2020 17:10 UTC
12 points
13 comments1 min readLW link

Re­sults of $1,000 Or­a­cle con­test!

Stuart_Armstrong17 Jun 2020 17:44 UTC
56 points
2 comments1 min readLW link

[Question] Like­li­hood of hy­per­ex­is­ten­tial catas­tro­phe from a bug?

Anirandis18 Jun 2020 16:23 UTC
12 points
27 comments1 min readLW link

AI Benefits Post 1: In­tro­duc­ing “AI Benefits”

Cullen_OKeefe22 Jun 2020 16:59 UTC
11 points
3 comments3 min readLW link

Goals and short descriptions

Michele Campolo2 Jul 2020 17:41 UTC
14 points
8 comments5 min readLW link

Re­search ideas to study hu­mans with AI Safety in mind

Riccardo Volpato3 Jul 2020 16:01 UTC
22 points
2 comments5 min readLW link

AI Benefits Post 3: Direct and Indi­rect Ap­proaches to AI Benefits

Cullen_OKeefe6 Jul 2020 18:48 UTC
8 points
0 comments2 min readLW link

An­titrust-Com­pli­ant AI In­dus­try Self-Regulation

Cullen_OKeefe7 Jul 2020 20:53 UTC
9 points
3 comments1 min readLW link
(cullenokeefe.com)

Should AI Be Open?

Scott Alexander17 Dec 2015 8:25 UTC
19 points
3 comments13 min readLW link

Meta Pro­gram­ming GPT: A route to Su­per­in­tel­li­gence?

dmtea11 Jul 2020 14:51 UTC
10 points
7 comments4 min readLW link

The Dilemma of Worse Than Death Scenarios

arkaeik10 Jul 2018 9:18 UTC
5 points
18 comments4 min readLW link

[Question] What are the mostly likely ways AGI will emerge?

Craig Quiter14 Jul 2020 0:58 UTC
3 points
7 comments1 min readLW link

AI Benefits Post 4: Out­stand­ing Ques­tions on Select­ing Benefits

Cullen_OKeefe14 Jul 2020 17:26 UTC
4 points
4 comments5 min readLW link

Solv­ing Math Prob­lems by Relay

17 Jul 2020 15:32 UTC
90 points
26 comments7 min readLW link

AI Benefits Post 5: Out­stand­ing Ques­tions on Govern­ing Benefits

Cullen_OKeefe21 Jul 2020 16:46 UTC
4 points
0 comments4 min readLW link

[Question] Why is pseudo-al­ign­ment “worse” than other ways ML can fail to gen­er­al­ize?

nostalgebraist18 Jul 2020 22:54 UTC
45 points
10 comments2 min readLW link

[Question] “Do Noth­ing” util­ity func­tion, 3½ years later?

niplav20 Jul 2020 11:09 UTC
5 points
3 comments1 min readLW link

[AN #80]: Why AI risk might be solved with­out ad­di­tional in­ter­ven­tion from longtermists

Rohin Shah2 Jan 2020 18:20 UTC
35 points
94 comments10 min readLW link
(mailchi.mp)

Ac­cess to AI: a hu­man right?

dmtea25 Jul 2020 9:38 UTC
5 points
3 comments2 min readLW link

The Rise of Com­mon­sense Reasoning

DragonGod27 Jul 2020 19:01 UTC
8 points
0 comments1 min readLW link
(www.reddit.com)

AI and Efficiency

DragonGod27 Jul 2020 20:58 UTC
9 points
1 comment1 min readLW link
(openai.com)

FHI Re­port: How Will Na­tional Se­cu­rity Con­sid­er­a­tions Affect An­titrust De­ci­sions in AI? An Ex­am­i­na­tion of His­tor­i­cal Precedents

Cullen_OKeefe28 Jul 2020 18:34 UTC
2 points
0 comments1 min readLW link
(www.fhi.ox.ac.uk)

The “best pre­dic­tor is mal­i­cious op­ti­miser” problem

Donald Hobson29 Jul 2020 11:49 UTC
14 points
10 comments2 min readLW link

Suffi­ciently Ad­vanced Lan­guage Models Can Do Re­in­force­ment Learning

Zachary Robertson2 Aug 2020 15:32 UTC
21 points
7 comments7 min readLW link

[Question] What are the most im­por­tant pa­pers/​post/​re­sources to read to un­der­stand more of GPT-3?

adamShimi2 Aug 2020 20:53 UTC
22 points
4 comments1 min readLW link

[Question] What should an Ein­stein-like figure in Ma­chine Learn­ing do?

Razied5 Aug 2020 23:52 UTC
3 points
3 comments1 min readLW link

Book re­view: Ar­chi­tects of In­tel­li­gence by Martin Ford (2018)

ofer11 Aug 2020 17:30 UTC
15 points
0 comments2 min readLW link

[Question] Will OpenAI’s work un­in­ten­tion­ally in­crease ex­is­ten­tial risks re­lated to AI?

adamShimi11 Aug 2020 18:16 UTC
50 points
56 comments1 min readLW link

Blog post: A tale of two re­search communities

Aryeh Englander12 Aug 2020 20:41 UTC
14 points
0 comments4 min readLW link

Map­ping Out Alignment

15 Aug 2020 1:02 UTC
42 points
0 comments5 min readLW link

My Un­der­stand­ing of Paul Chris­ti­ano’s Iter­ated Am­plifi­ca­tion AI Safety Re­search Agenda

Chi Nguyen15 Aug 2020 20:02 UTC
119 points
21 comments39 min readLW link

GPT-3, be­lief, and consistency

skybrian16 Aug 2020 23:12 UTC
18 points
7 comments2 min readLW link

[Question] What pre­cisely do we mean by AI al­ign­ment?

G Gordon Worley III9 Dec 2018 2:23 UTC
27 points
8 comments1 min readLW link

Thoughts on the Fea­si­bil­ity of Pro­saic AGI Align­ment?

iamthouthouarti21 Aug 2020 23:25 UTC
8 points
10 comments1 min readLW link

[Question] Fore­cast­ing Thread: AI Timelines

22 Aug 2020 2:33 UTC
125 points
93 comments2 min readLW link

Learn­ing hu­man prefer­ences: black-box, white-box, and struc­tured white-box access

Stuart_Armstrong24 Aug 2020 11:42 UTC
25 points
9 comments6 min readLW link

Proofs Sec­tion 2.3 (Up­dates, De­ci­sion The­ory)

Diffractor27 Aug 2020 7:49 UTC
7 points
0 comments31 min readLW link

Proofs Sec­tion 2.2 (Iso­mor­phism to Ex­pec­ta­tions)

Diffractor27 Aug 2020 7:52 UTC
7 points
0 comments46 min readLW link

Proofs Sec­tion 2.1 (The­o­rem 1, Lem­mas)

Diffractor27 Aug 2020 7:54 UTC
7 points
0 comments36 min readLW link

Proofs Sec­tion 1.1 (Ini­tial re­sults to LF-du­al­ity)

Diffractor27 Aug 2020 7:59 UTC
6 points
0 comments20 min readLW link

Proofs Sec­tion 1.2 (Mix­tures, Up­dates, Push­for­wards)

Diffractor27 Aug 2020 7:57 UTC
7 points
0 comments14 min readLW link

Ba­sic In­framea­sure Theory

Diffractor27 Aug 2020 8:02 UTC
32 points
12 comments25 min readLW link

Belief Func­tions And De­ci­sion Theory

Diffractor27 Aug 2020 8:00 UTC
13 points
8 comments39 min readLW link

Tech­ni­cal model re­fine­ment formalism

Stuart_Armstrong27 Aug 2020 11:54 UTC
19 points
0 comments6 min readLW link

Pong from pix­els with­out read­ing “Pong from Pix­els”

irmckenzie29 Aug 2020 17:26 UTC
15 points
1 comment7 min readLW link

Reflec­tions on AI Timelines Fore­cast­ing Thread

Amandango1 Sep 2020 1:42 UTC
53 points
7 comments5 min readLW link

on “learn­ing to sum­ma­rize”

nostalgebraist12 Sep 2020 3:20 UTC
25 points
13 comments8 min readLW link
(nostalgebraist.tumblr.com)

[Question] The uni­ver­sal­ity of com­pu­ta­tion and mind de­sign space

alanf12 Sep 2020 14:58 UTC
1 point
7 comments1 min readLW link

Clar­ify­ing “What failure looks like”

Sam Clarke20 Sep 2020 20:40 UTC
87 points
14 comments17 min readLW link

Hu­man Bi­ases that Ob­scure AI Progress

Phylliida Dev25 Sep 2020 0:24 UTC
42 points
2 comments4 min readLW link

[Question] Com­pe­tence vs Alignment

Ariel Kwiatkowski30 Sep 2020 21:03 UTC
6 points
4 comments1 min readLW link

[Question] GPT-3 + GAN

stick10917 Oct 2020 7:58 UTC
4 points
4 comments1 min readLW link

Book Re­view: Re­in­force­ment Learn­ing by Sut­ton and Barto

billmei20 Oct 2020 19:40 UTC
52 points
3 comments10 min readLW link

GPT-X, Paper­clip Max­i­mizer? An­a­lyz­ing AGI and Fi­nal Goals

meanderingmoose22 Oct 2020 14:33 UTC
8 points
1 comment6 min readLW link

Con­tain­ing the AI… In­side a Si­mu­lated Reality

HumaneAutomation31 Oct 2020 16:16 UTC
1 point
5 comments2 min readLW link

Why those who care about catas­trophic and ex­is­ten­tial risk should care about au­tonomous weapons

aaguirre11 Nov 2020 15:22 UTC
60 points
20 comments19 min readLW link

Euro­pean Master’s Pro­grams in Ma­chine Learn­ing, Ar­tifi­cial In­tel­li­gence, and re­lated fields

Master Programs ML/AI14 Nov 2020 15:51 UTC
32 points
8 comments1 min readLW link

Should we post­pone AGI un­til we reach safety?

otto.barten18 Nov 2020 15:43 UTC
25 points
36 comments3 min readLW link

Com­mit­ment and cred­i­bil­ity in mul­ti­po­lar AI scenarios

anni_leskela4 Dec 2020 18:48 UTC
25 points
3 comments18 min readLW link

[Question] AI Win­ter Is Com­ing—How to profit from it?

maximkazhenkov5 Dec 2020 20:23 UTC
10 points
7 comments1 min readLW link

An­nounc­ing the Tech­ni­cal AI Safety Podcast

Quinn7 Dec 2020 18:51 UTC
42 points
6 comments2 min readLW link
(technical-ai-safety.libsyn.com)

All GPT skills are translation

p.b.13 Dec 2020 20:06 UTC
4 points
0 comments2 min readLW link

[Question] Judg­ing AGI Output

meredev14 Dec 2020 12:43 UTC
3 points
0 comments2 min readLW link

Risk Map of AI Systems

15 Dec 2020 9:16 UTC
25 points
3 comments8 min readLW link

AI Align­ment, Philo­soph­i­cal Plu­ral­ism, and the Rele­vance of Non-Western Philosophy

xuan1 Jan 2021 0:08 UTC
30 points
21 comments20 min readLW link

Are we all mis­al­igned?

Mateusz Mazurkiewicz3 Jan 2021 2:42 UTC
11 points
0 comments5 min readLW link

[Question] What do we *re­ally* ex­pect from a well-al­igned AI?

jan betley4 Jan 2021 20:57 UTC
8 points
10 comments1 min readLW link

Eight claims about multi-agent AGI safety

Richard_Ngo7 Jan 2021 13:34 UTC
73 points
18 comments5 min readLW link

Imi­ta­tive Gen­er­al­i­sa­tion (AKA ‘Learn­ing the Prior’)

Beth Barnes10 Jan 2021 0:30 UTC
86 points
14 comments12 min readLW link

Pre­dic­tion can be Outer Aligned at Optimum

Lanrian10 Jan 2021 18:48 UTC
15 points
12 comments11 min readLW link

[Question] Poll: Which vari­ables are most strate­gi­cally rele­vant?

22 Jan 2021 17:17 UTC
32 points
34 comments1 min readLW link

AISU 2021

Linda Linsefors30 Jan 2021 17:40 UTC
28 points
2 comments1 min readLW link

Deep­mind has made a gen­eral in­duc­tor (“Mak­ing sense of sen­sory in­put”)

MakoYass2 Feb 2021 2:54 UTC
48 points
10 comments1 min readLW link
(www.sciencedirect.com)

Coun­ter­fac­tual Plan­ning in AGI Systems

Koen.Holtman3 Feb 2021 13:54 UTC
6 points
0 comments5 min readLW link

[AN #136]: How well will GPT-N perform on down­stream tasks?

Rohin Shah3 Feb 2021 18:10 UTC
21 points
2 comments9 min readLW link
(mailchi.mp)

For­mal Solu­tion to the In­ner Align­ment Problem

michaelcohen18 Feb 2021 14:51 UTC
46 points
123 comments2 min readLW link

TASP Ep 3 - Op­ti­mal Poli­cies Tend to Seek Power

Quinn11 Mar 2021 1:44 UTC
24 points
0 comments1 min readLW link
(technical-ai-safety.libsyn.com)

Phy­lac­tery De­ci­sion Theory

Bunthut2 Apr 2021 20:55 UTC
14 points
6 comments2 min readLW link

Pre­dic­tive Cod­ing has been Unified with Backpropagation

lsusr2 Apr 2021 21:42 UTC
158 points
44 comments2 min readLW link

[Question] What if we could use the the­ory of Mechanism De­sign from Game The­ory as a medium achieve AI Align­ment?

farari74 Apr 2021 12:56 UTC
4 points
0 comments1 min readLW link

Another (outer) al­ign­ment failure story

paulfchristiano7 Apr 2021 20:12 UTC
201 points
37 comments12 min readLW link

A Sys­tem For Evolv­ing In­creas­ingly Gen­eral Ar­tifi­cial In­tel­li­gence From Cur­rent Technologies

Tsang Chung Shu8 Apr 2021 21:37 UTC
1 point
3 comments11 min readLW link

[Question] [time­boxed ex­er­cise] write me your model of AI hu­man-ex­is­ten­tial safety and the al­ign­ment prob­lems in 15 minutes

Quinn4 May 2021 19:10 UTC
6 points
2 comments1 min readLW link

Mostly ques­tions about Dumb AI Kernels

HorizonHeld12 May 2021 22:00 UTC
1 point
1 comment9 min readLW link

Thoughts on Iter­ated Distil­la­tion and Amplification

Waddington11 May 2021 21:32 UTC
9 points
2 comments20 min readLW link

How do we build or­gani­sa­tions that want to build safe AI?

sxae12 May 2021 15:08 UTC
4 points
4 comments9 min readLW link

[Question] Who has ar­gued in de­tail that a cur­rent AI sys­tem is phe­nom­e­nally con­scious?

Robbo14 May 2021 22:03 UTC
3 points
2 comments1 min readLW link

How I Learned to Stop Wor­ry­ing and Love MUM

Waddington20 May 2021 7:57 UTC
2 points
0 comments3 min readLW link

AI Safety Re­search Pro­ject Ideas

Owain_Evans21 May 2021 13:39 UTC
58 points
1 comment3 min readLW link

[Question] How one uses set the­ory for al­ign­ment prob­lem?

Just Learning29 May 2021 0:28 UTC
8 points
6 comments1 min readLW link

Reflec­tion of Hier­ar­chi­cal Re­la­tion­ship via Nuanced Con­di­tion­ing of Game The­ory Ap­proach for AI Devel­op­ment and Utilization

Kyoung-cheol Kim4 Jun 2021 7:20 UTC
2 points
2 comments9 min readLW link

Re­view of “Learn­ing Nor­ma­tivity: A Re­search Agenda”

6 Jun 2021 13:33 UTC
34 points
0 comments6 min readLW link

Hard­ware for Trans­for­ma­tive AI

ViktorThink22 Jun 2021 18:13 UTC
17 points
7 comments2 min readLW link

Alex Turner’s Re­search, Com­pre­hen­sive In­for­ma­tion Gathering

adamShimi23 Jun 2021 9:44 UTC
15 points
3 comments3 min readLW link

Dis­cus­sion: Ob­jec­tive Ro­bust­ness and In­ner Align­ment Terminology

23 Jun 2021 23:25 UTC
67 points
6 comments9 min readLW link

The Lan­guage of Bird

johnswentworth27 Jun 2021 4:44 UTC
44 points
9 comments2 min readLW link

[Question] What are some claims or opinions about multi-multi del­e­ga­tion you’ve seen in the meme­plex that you think de­serve scrutiny?

Quinn27 Jun 2021 17:44 UTC
16 points
6 comments2 min readLW link

An ex­am­i­na­tion of Me­tac­u­lus’ re­solved AI pre­dic­tions and their im­pli­ca­tions for AI timelines

CharlesD20 Jul 2021 9:08 UTC
27 points
0 comments7 min readLW link

[Question] How should my timelines in­fluence my ca­reer choice?

Tom Lieberum3 Aug 2021 10:14 UTC
13 points
10 comments1 min readLW link

What is the prob­lem?

Carlos Ramirez11 Aug 2021 22:33 UTC
7 points
0 comments6 min readLW link

OpenAI Codex: First Impressions

Rishit Vora13 Aug 2021 16:52 UTC
49 points
8 comments4 min readLW link
(sixeleven.in)

[Question] 1h-vol­un­teers needed for a small AI Safety-re­lated re­search pro­ject

PabloAMC16 Aug 2021 17:53 UTC
2 points
0 comments1 min readLW link

Ex­trac­tion of hu­man prefer­ences 👨→🤖

arunraja-hub24 Aug 2021 16:34 UTC
18 points
2 comments5 min readLW link

Call for re­search on eval­u­at­ing al­ign­ment (fund­ing + ad­vice available)

Beth Barnes31 Aug 2021 23:28 UTC
105 points
11 comments5 min readLW link

Ob­sta­cles to gra­di­ent hacking

leogao5 Sep 2021 22:42 UTC
21 points
11 comments4 min readLW link

[Question] Con­di­tional on the first AGI be­ing al­igned cor­rectly, is a good out­come even still likely?

iamthouthouarti6 Sep 2021 17:30 UTC
2 points
1 comment1 min readLW link

Dist­in­guish­ing AI takeover scenarios

8 Sep 2021 16:19 UTC
62 points
11 comments14 min readLW link

Paths To High-Level Ma­chine Intelligence

Daniel_Eth10 Sep 2021 13:21 UTC
66 points
8 comments33 min readLW link

How truth­ful is GPT-3? A bench­mark for lan­guage models

Owain_Evans16 Sep 2021 10:09 UTC
54 points
24 comments6 min readLW link

In­ves­ti­gat­ing AI Takeover Scenarios

Sammy Martin17 Sep 2021 18:47 UTC
27 points
1 comment27 min readLW link

A suffi­ciently para­noid non-Friendly AGI might self-mod­ify it­self to be­come Friendly

RomanS22 Sep 2021 6:29 UTC
5 points
2 comments1 min readLW link

Towards De­con­fus­ing Gra­di­ent Hacking

leogao24 Oct 2021 0:43 UTC
25 points
1 comment12 min readLW link

A brief re­view of the rea­sons multi-ob­jec­tive RL could be im­por­tant in AI Safety Research

Ben Smith29 Sep 2021 17:09 UTC
25 points
7 comments10 min readLW link

Meta learn­ing to gra­di­ent hack

Quintin Pope1 Oct 2021 19:25 UTC
45 points
10 comments3 min readLW link

Pro­posal: Scal­ing laws for RL generalization

flodorner1 Oct 2021 21:32 UTC
14 points
10 comments11 min readLW link

A Frame­work of Pre­dic­tion Technologies

isaduan3 Oct 2021 10:26 UTC
8 points
2 comments9 min readLW link

AI Pre­dic­tion Ser­vices and Risks of War

isaduan3 Oct 2021 10:26 UTC
3 points
2 comments10 min readLW link

Pos­si­ble Wor­lds af­ter Pre­dic­tion Take-off

isaduan3 Oct 2021 10:26 UTC
5 points
0 comments4 min readLW link

[Pro­posal] Method of lo­cat­ing use­ful sub­nets in large models

Quintin Pope13 Oct 2021 20:52 UTC
9 points
0 comments2 min readLW link

Com­men­tary on “AGI Safety From First Prin­ci­ples by Richard Ngo, Septem­ber 2020”

Robert Kralisch14 Oct 2021 15:11 UTC
3 points
0 comments20 min readLW link

The AGI needs to be honest

rokosbasilisk16 Oct 2021 19:24 UTC
2 points
12 comments2 min readLW link

“Re­dun­dant” AI Alignment

Mckay Jensen16 Oct 2021 21:32 UTC
12 points
3 comments1 min readLW link
(quevivasbien.github.io)

[MLSN #1]: ICLR Safety Paper Roundup

Daniel Hendrycks18 Oct 2021 15:19 UTC
59 points
1 comment2 min readLW link

AMA on Truth­ful AI: Owen Cot­ton-Bar­ratt, Owain Evans & co-authors

Owain_Evans22 Oct 2021 16:23 UTC
31 points
15 comments1 min readLW link

Hegel vs. GPT-3

Bezzi27 Oct 2021 5:55 UTC
9 points
21 comments2 min readLW link

Google an­nounces Path­ways: new gen­er­a­tion mul­ti­task AI Architecture

Ozyrus29 Oct 2021 11:55 UTC
6 points
1 comment1 min readLW link
(blog.google)

What is the most evil AI that we could build, to­day?

ThomasJ1 Nov 2021 19:58 UTC
−2 points
14 comments1 min readLW link

Why we need proso­cial agents

Akbir Khan2 Nov 2021 15:19 UTC
6 points
0 comments2 min readLW link

Pos­si­ble re­search di­rec­tions to im­prove the mechanis­tic ex­pla­na­tion of neu­ral networks

delton1379 Nov 2021 2:36 UTC
29 points
8 comments9 min readLW link

What are red flags for Neu­ral Net­work suffer­ing?

Marius Hobbhahn8 Nov 2021 12:51 UTC
24 points
15 comments12 min readLW link

Us­ing Brain-Com­puter In­ter­faces to get more data for AI alignment

Robbo7 Nov 2021 0:00 UTC
37 points
10 comments7 min readLW link

Hard­code the AGI to need our ap­proval in­definitely?

MichaelStJules11 Nov 2021 7:04 UTC
1 point
2 comments1 min readLW link

Stop but­ton: to­wards a causal solution

tailcalled12 Nov 2021 19:09 UTC
23 points
36 comments9 min readLW link

A FLI post­doc­toral grant ap­pli­ca­tion: AI al­ign­ment via causal anal­y­sis and de­sign of agents

PabloAMC13 Nov 2021 1:44 UTC
4 points
0 comments7 min readLW link

What would we do if al­ign­ment were fu­tile?

Grant Demaree14 Nov 2021 8:09 UTC
71 points
43 comments3 min readLW link

At­tempted Gears Anal­y­sis of AGI In­ter­ven­tion Dis­cus­sion With Eliezer

Zvi15 Nov 2021 3:50 UTC
202 points
48 comments16 min readLW link
(thezvi.wordpress.com)

A pos­i­tive case for how we might suc­ceed at pro­saic AI alignment

evhub16 Nov 2021 1:49 UTC
84 points
45 comments6 min readLW link

Su­per in­tel­li­gent AIs that don’t re­quire alignment

Yair Halberstadt16 Nov 2021 19:55 UTC
10 points
2 comments6 min readLW link

Some real ex­am­ples of gra­di­ent hacking

Oliver Sourbut22 Nov 2021 0:11 UTC
7 points
4 comments2 min readLW link

[linkpost] Ac­qui­si­tion of Chess Knowl­edge in AlphaZero

Quintin Pope23 Nov 2021 7:55 UTC
8 points
1 comment1 min readLW link

AI Tracker: mon­i­tor­ing cur­rent and near-fu­ture risks from su­per­scale models

23 Nov 2021 19:16 UTC
62 points
13 comments3 min readLW link
(aitracker.org)

AI Safety Needs Great Engineers

Andy Jones23 Nov 2021 15:40 UTC
76 points
44 comments4 min readLW link

HIRING: In­form and shape a new pro­ject on AI safety at Part­ner­ship on AI

Madhulika Srikumar24 Nov 2021 8:27 UTC
6 points
0 comments1 min readLW link

How to mea­sure FLOP/​s for Neu­ral Net­works em­piri­cally?

Marius Hobbhahn29 Nov 2021 15:18 UTC
15 points
3 comments7 min readLW link

AI Gover­nance Fun­da­men­tals—Cur­ricu­lum and Application

Mauricio30 Nov 2021 2:19 UTC
16 points
0 comments16 min readLW link

Be­hav­ior Clon­ing is Miscalibrated

leogao5 Dec 2021 1:36 UTC
52 points
3 comments3 min readLW link

ML Align­ment The­ory Pro­gram un­der Evan Hubinger

6 Dec 2021 0:03 UTC
79 points
3 comments2 min readLW link

In­for­ma­tion bot­tle­neck for coun­ter­fac­tual corrigibility

tailcalled6 Dec 2021 17:11 UTC
8 points
1 comment7 min readLW link

Model­ing Failure Modes of High-Level Ma­chine Intelligence

6 Dec 2021 13:54 UTC
53 points
1 comment12 min readLW link

[Question] What al­ign­ment-re­lated con­cepts should be bet­ter known in the broader ML com­mu­nity?

Lauro Langosco9 Dec 2021 20:44 UTC
6 points
4 comments1 min readLW link

Un­der­stand­ing Gra­di­ent Hacking

peterbarnett10 Dec 2021 15:58 UTC
27 points
5 comments30 min readLW link

What’s the back­ward-for­ward FLOP ra­tio for Neu­ral Net­works?

13 Dec 2021 8:54 UTC
16 points
7 comments10 min readLW link

Disen­tan­gling Per­spec­tives On Strat­egy-Steal­ing in AI Safety

shawnghu18 Dec 2021 20:13 UTC
20 points
1 comment11 min readLW link

De­mand­ing and De­sign­ing Aligned Cog­ni­tive Architectures

Koen.Holtman21 Dec 2021 17:32 UTC
8 points
5 comments5 min readLW link

Po­ten­tial gears level ex­pla­na­tions of smooth progress

ryan_greenblatt22 Dec 2021 18:05 UTC
4 points
2 comments2 min readLW link

Trans­former Circuits

evhub22 Dec 2021 21:09 UTC
127 points
4 comments3 min readLW link
(transformer-circuits.pub)

Gra­di­ent Hack­ing via Schel­ling Goals

Adam Scherlis28 Dec 2021 20:38 UTC
30 points
4 comments4 min readLW link

Reader-gen­er­ated Essays

Henrik Karlsson3 Jan 2022 8:56 UTC
16 points
0 comments6 min readLW link
(escapingflatland.substack.com)

Brain Effi­ciency: Much More than You Wanted to Know

jacob_cannell6 Jan 2022 3:38 UTC
151 points
78 comments28 min readLW link

Un­der­stand­ing the two-head strat­egy for teach­ing ML to an­swer ques­tions honestly

Adam Scherlis11 Jan 2022 23:24 UTC
26 points
0 comments10 min readLW link

Plan B in AI Safety approach

avturchin13 Jan 2022 12:03 UTC
33 points
9 comments2 min readLW link

Truth­ful LMs as a warm-up for al­igned AGI

Jacob_Hilton17 Jan 2022 16:49 UTC
64 points
14 comments13 min readLW link

How I’m think­ing about GPT-N

delton13717 Jan 2022 17:11 UTC
44 points
21 comments18 min readLW link

Align­ment Prob­lems All the Way Down

peterbarnett22 Jan 2022 0:19 UTC
25 points
7 comments10 min readLW link

[Question] How fea­si­ble/​costly would it be to train a very large AI model on dis­tributed clusters of GPUs?

Anonymous25 Jan 2022 19:20 UTC
6 points
4 comments1 min readLW link

Causal­ity, Trans­for­ma­tive AI and al­ign­ment—part I

Marius Hobbhahn27 Jan 2022 16:18 UTC
11 points
11 comments8 min readLW link

2+2: On­tolog­i­cal Framework

Lyrialtus1 Feb 2022 1:07 UTC
−15 points
2 comments15 min readLW link

QNR prospects are im­por­tant for AI al­ign­ment research

Eric Drexler3 Feb 2022 15:20 UTC
80 points
10 comments11 min readLW link

Paradigm-build­ing: Introduction

Cameron Berg8 Feb 2022 0:06 UTC
24 points
0 comments2 min readLW link

Paradigm-build­ing: The hi­er­ar­chi­cal ques­tion framework

Cameron Berg9 Feb 2022 16:47 UTC
11 points
16 comments3 min readLW link

Ques­tion 1: Pre­dicted ar­chi­tec­ture of AGI learn­ing al­gorithm(s)

Cameron Berg10 Feb 2022 17:22 UTC
9 points
1 comment7 min readLW link

Ques­tion 2: Pre­dicted bad out­comes of AGI learn­ing architecture

Cameron Berg11 Feb 2022 22:23 UTC
5 points
1 comment10 min readLW link

Ques­tion 3: Con­trol pro­pos­als for min­i­miz­ing bad outcomes

Cameron Berg12 Feb 2022 19:13 UTC
5 points
1 comment7 min readLW link

Ques­tion 4: Im­ple­ment­ing the con­trol proposals

Cameron Berg13 Feb 2022 17:12 UTC
6 points
2 comments5 min readLW link

Ques­tion 5: The timeline hyperparameter

Cameron Berg14 Feb 2022 16:38 UTC
5 points
3 comments7 min readLW link

Paradigm-build­ing: Con­clu­sion and prac­ti­cal takeaways

Cameron Berg15 Feb 2022 16:11 UTC
2 points
1 comment2 min readLW link

How com­plex are my­opic imi­ta­tors?

Vivek Hebbar8 Feb 2022 12:00 UTC
21 points
1 comment15 min readLW link

Me­tac­u­lus launches con­test for es­says with quan­ti­ta­tive pre­dic­tions about AI

8 Feb 2022 16:07 UTC
25 points
2 comments1 min readLW link
(www.metaculus.com)

Hy­poth­e­sis: gra­di­ent de­scent prefers gen­eral circuits

Quintin Pope8 Feb 2022 21:12 UTC
38 points
26 comments11 min readLW link

Com­pute Trends Across Three eras of Ma­chine Learning

16 Feb 2022 14:18 UTC
90 points
13 comments2 min readLW link

[Question] Is the com­pe­ti­tion/​co­op­er­a­tion be­tween sym­bolic AI and statis­ti­cal AI (ML) about his­tor­i­cal ap­proach to re­search /​ en­g­ineer­ing, or is it more fun­da­men­tally about what in­tel­li­gent agents “are”?

Edward Hammond17 Feb 2022 23:11 UTC
1 point
1 comment2 min readLW link

HCH and Ad­ver­sar­ial Questions

David Udell19 Feb 2022 0:52 UTC
14 points
7 comments26 min readLW link

Thoughts on Danger­ous Learned Optimization

peterbarnett19 Feb 2022 10:46 UTC
4 points
2 comments4 min readLW link

Rel­a­tivized Defi­ni­tions as a Method to Sidestep the Löbian Obstacle

homotowat27 Feb 2022 6:37 UTC
27 points
4 comments7 min readLW link

What we know about ma­chine learn­ing’s repli­ca­tion crisis

Younes Kamel5 Mar 2022 23:55 UTC
35 points
4 comments6 min readLW link
(youneskamel.substack.com)

Pro­ject­ing com­pute trends in Ma­chine Learning

7 Mar 2022 15:32 UTC
52 points
4 comments6 min readLW link

[Sur­vey] Ex­pec­ta­tions of a Post-ASI Order

Conor Sullivan9 Mar 2022 19:17 UTC
5 points
0 comments1 min readLW link

A Longlist of The­o­ries of Im­pact for Interpretability

Neel Nanda11 Mar 2022 14:55 UTC
101 points
28 comments5 min readLW link

New GPT3 Im­pres­sive Ca­pa­bil­ities—In­struc­tGPT3 [1/​2]

WayZ13 Mar 2022 10:58 UTC
71 points
10 comments7 min readLW link

Phase tran­si­tions and AGI

17 Mar 2022 17:22 UTC
44 points
19 comments9 min readLW link
(www.metaculus.com)

Can we simu­late hu­man evolu­tion to cre­ate a some­what al­igned AGI?

Thomas Kwa28 Mar 2022 22:55 UTC
20 points
7 comments7 min readLW link

Pro­ject In­tro: Selec­tion The­o­rems for Modularity

4 Apr 2022 12:59 UTC
65 points
19 comments16 min readLW link

My agenda for re­search into trans­former ca­pa­bil­ities—Introduction

p.b.5 Apr 2022 21:23 UTC
11 points
1 comment3 min readLW link

Re­search agenda: Can trans­form­ers do sys­tem 2 think­ing?

p.b.6 Apr 2022 13:31 UTC
18 points
0 comments2 min readLW link

PaLM in “Ex­trap­o­lat­ing GPT-N perfor­mance”

Lanrian6 Apr 2022 13:05 UTC
76 points
17 comments2 min readLW link

Re­search agenda—Build­ing a multi-modal chess-lan­guage model

p.b.7 Apr 2022 12:25 UTC
8 points
2 comments2 min readLW link

Is GPT3 a Good Ra­tion­al­ist? - In­struc­tGPT3 [2/​2]

WayZ7 Apr 2022 13:46 UTC
10 points
0 comments7 min readLW link

Play­ing with DALL·E 2

Dave Orr7 Apr 2022 18:49 UTC
161 points
112 comments6 min readLW link

Progress Re­port 4: logit lens redux

Nathan Helm-Burger8 Apr 2022 18:35 UTC
3 points
0 comments2 min readLW link

Hyper­bolic takeoff

Ege Erdil9 Apr 2022 15:57 UTC
17 points
8 comments10 min readLW link
(www.metaculus.com)

Elicit: Lan­guage Models as Re­search Assistants

9 Apr 2022 14:56 UTC
62 points
5 comments13 min readLW link

Is it time to start think­ing about what AI Friendli­ness means?

ZT511 Apr 2022 9:32 UTC
17 points
6 comments3 min readLW link

What more com­pute does for brain-like mod­els: re­sponse to Rohin

Nathan Helm-Burger13 Apr 2022 3:40 UTC
22 points
14 comments11 min readLW link

Align­ment and Deep Learning

Aiyen17 Apr 2022 0:02 UTC
44 points
35 comments8 min readLW link

[$20K in Prizes] AI Safety Ar­gu­ments Competition

26 Apr 2022 16:13 UTC
62 points
492 comments3 min readLW link

SERI ML Align­ment The­ory Schol­ars Pro­gram 2022

27 Apr 2022 0:43 UTC
45 points
5 comments3 min readLW link

[Question] What is a train­ing “step” vs. “epi­sode” in ma­chine learn­ing?

Evan R. Murphy28 Apr 2022 21:53 UTC
9 points
4 comments1 min readLW link

Prize for Align­ment Re­search Tasks

29 Apr 2022 8:57 UTC
58 points
29 comments10 min readLW link

Quick Thoughts on A.I. Governance

NicholasKross30 Apr 2022 14:49 UTC
55 points
8 comments2 min readLW link
(www.thinkingmuchbetter.com)

Open Prob­lems in Nega­tive Side Effect Minimization

6 May 2022 9:37 UTC
12 points
3 comments17 min readLW link

[Linkpost] diffu­sion mag­ne­tizes man­i­folds (DALL-E 2 in­tu­ition build­ing)

paulbricman7 May 2022 11:01 UTC
1 point
0 comments1 min readLW link
(paulbricman.com)

Up­dat­ing Utility Functions

9 May 2022 9:44 UTC
33 points
7 comments8 min readLW link

Con­di­tions for math­e­mat­i­cal equiv­alence of Stochas­tic Gra­di­ent Des­cent and Nat­u­ral Selection

Oliver Sourbut9 May 2022 21:38 UTC
41 points
10 comments10 min readLW link

AI safety should be made more ac­cessible us­ing non text-based media

Massimog10 May 2022 3:14 UTC
2 points
4 comments4 min readLW link

The limits of AI safety via debate

Marius Hobbhahn10 May 2022 13:33 UTC
27 points
7 comments10 min readLW link

In­tro­duc­tion to the se­quence: In­ter­pretabil­ity Re­search for the Most Im­por­tant Century

Evan R. Murphy12 May 2022 19:59 UTC
16 points
0 comments8 min readLW link

Gato as the Dawn of Early AGI

David Udell15 May 2022 6:52 UTC
84 points
29 comments12 min readLW link

Deep­Mind’s gen­er­al­ist AI, Gato: A non-tech­ni­cal explainer

16 May 2022 21:21 UTC
56 points
5 comments6 min readLW link

Gato’s Gen­er­al­i­sa­tion: Pre­dic­tions and Ex­per­i­ments I’d Like to See

Oliver Sourbut18 May 2022 7:15 UTC
39 points
3 comments10 min readLW link

Un­der­stand­ing Gato’s Su­per­vised Re­in­force­ment Learning

Lorenzo Rex18 May 2022 11:08 UTC
2 points
0 comments1 min readLW link
(lorenzopieri.com)

A Story of AI Risk: In­struc­tGPT-N

peterbarnett26 May 2022 23:22 UTC
17 points
0 comments8 min readLW link

Real­ism about rationality

Richard_Ngo16 Sep 2018 10:46 UTC
177 points
145 comments4 min readLW link3 reviews
(thinkingcomplete.blogspot.com)

De­bate on In­stru­men­tal Con­ver­gence be­tween LeCun, Rus­sell, Ben­gio, Zador, and More

Ben Pace4 Oct 2019 4:08 UTC
188 points
57 comments15 min readLW link2 reviews

The Parable of Pre­dict-O-Matic

abramdemski15 Oct 2019 0:49 UTC
280 points
41 comments14 min readLW link2 reviews

2018 AI Align­ment Liter­a­ture Re­view and Char­ity Comparison

Larks18 Dec 2018 4:46 UTC
190 points
26 comments62 min readLW link1 review

An Ortho­dox Case Against Utility Functions

abramdemski7 Apr 2020 19:18 UTC
126 points
53 comments8 min readLW link2 reviews

“How con­ser­va­tive” should the par­tial max­imisers be?

Stuart_Armstrong13 Apr 2020 15:50 UTC
30 points
8 comments2 min readLW link

[AN #95]: A frame­work for think­ing about how to make AI go well

Rohin Shah15 Apr 2020 17:10 UTC
20 points
2 comments10 min readLW link
(mailchi.mp)

AI Align­ment Pod­cast: An Overview of Tech­ni­cal AI Align­ment in 2018 and 2019 with Buck Sh­legeris and Ro­hin Shah

Palus Astra16 Apr 2020 0:50 UTC
58 points
27 comments89 min readLW link

Open ques­tion: are min­i­mal cir­cuits dae­mon-free?

paulfchristiano5 May 2018 22:40 UTC
80 points
70 comments2 min readLW link1 review

Disen­tan­gling ar­gu­ments for the im­por­tance of AI safety

Richard_Ngo21 Jan 2019 12:41 UTC
128 points
23 comments8 min readLW link

In­te­grat­ing Hid­den Vari­ables Im­proves Approximation

johnswentworth16 Apr 2020 21:43 UTC
15 points
4 comments1 min readLW link

AI Ser­vices as a Re­search Paradigm

VojtaKovarik20 Apr 2020 13:00 UTC
30 points
12 comments4 min readLW link
(docs.google.com)

Databases of hu­man be­havi­our and prefer­ences?

Stuart_Armstrong21 Apr 2020 18:06 UTC
10 points
9 comments1 min readLW link

Critch on ca­reer ad­vice for ju­nior AI-x-risk-con­cerned researchers

Rob Bensinger12 May 2018 2:13 UTC
116 points
25 comments4 min readLW link

Refram­ing Impact

TurnTrout20 Sep 2019 19:03 UTC
85 points
15 comments3 min readLW link1 review

De­scrip­tion vs simu­lated prediction

Richard Korzekwa 22 Apr 2020 16:40 UTC
26 points
0 comments5 min readLW link
(aiimpacts.org)

Deep­Mind team on speci­fi­ca­tion gaming

JoshuaFox23 Apr 2020 8:01 UTC
30 points
2 comments1 min readLW link
(deepmind.com)

[Question] Does Agent-like Be­hav­ior Im­ply Agent-like Ar­chi­tec­ture?

Scott Garrabrant23 Aug 2019 2:01 UTC
46 points
7 comments1 min readLW link

Risks from Learned Op­ti­miza­tion: Con­clu­sion and Re­lated Work

7 Jun 2019 19:53 UTC
73 points
4 comments6 min readLW link

De­cep­tive Alignment

5 Jun 2019 20:16 UTC
84 points
11 comments17 min readLW link

The In­ner Align­ment Problem

4 Jun 2019 1:20 UTC
90 points
17 comments13 min readLW link

How the MtG Color Wheel Ex­plains AI Safety

Scott Garrabrant15 Feb 2019 23:42 UTC
57 points
4 comments6 min readLW link

[Question] How does Gra­di­ent Des­cent In­ter­act with Good­hart?

Scott Garrabrant2 Feb 2019 0:14 UTC
68 points
19 comments4 min readLW link

For­mal Open Prob­lem in De­ci­sion Theory

Scott Garrabrant29 Nov 2018 3:25 UTC
35 points
11 comments4 min readLW link

The Ubiquitous Con­verse Law­vere Problem

Scott Garrabrant29 Nov 2018 3:16 UTC
21 points
0 comments2 min readLW link

Embed­ded Curiosities

8 Nov 2018 14:19 UTC
86 points
1 comment2 min readLW link

Sub­sys­tem Alignment

6 Nov 2018 16:16 UTC
100 points
12 comments1 min readLW link

Ro­bust Delegation

4 Nov 2018 16:38 UTC
110 points
10 comments1 min readLW link

Embed­ded World-Models

2 Nov 2018 16:07 UTC
87 points
16 comments1 min readLW link

De­ci­sion Theory

31 Oct 2018 18:41 UTC
111 points
46 comments1 min readLW link

(A → B) → A

Scott Garrabrant11 Sep 2018 22:38 UTC
60 points
11 comments2 min readLW link

His­tory of the Devel­op­ment of Log­i­cal Induction

Scott Garrabrant29 Aug 2018 3:15 UTC
89 points
4 comments5 min readLW link

Op­ti­miza­tion Amplifies

Scott Garrabrant27 Jun 2018 1:51 UTC
96 points
12 comments4 min readLW link

What makes coun­ter­fac­tu­als com­pa­rable?

Chris_Leong24 Apr 2020 22:47 UTC
11 points
6 comments3 min readLW link

New Paper Ex­pand­ing on the Good­hart Taxonomy

Scott Garrabrant14 Mar 2018 9:01 UTC
17 points
4 comments1 min readLW link
(arxiv.org)

Sources of in­tu­itions and data on AGI

Scott Garrabrant31 Jan 2018 23:30 UTC
84 points
26 comments3 min readLW link

Corrigibility

paulfchristiano27 Nov 2018 21:50 UTC
50 points
5 comments6 min readLW link

AI pre­dic­tion case study 5: Omo­hun­dro’s AI drives

Stuart_Armstrong15 Mar 2013 9:09 UTC
10 points
5 comments8 min readLW link

Toy model: con­ver­gent in­stru­men­tal goals

Stuart_Armstrong25 Feb 2016 14:03 UTC
15 points
2 comments4 min readLW link

AI-cre­ated pseudo-deontology

Stuart_Armstrong12 Feb 2015 21:11 UTC
10 points
35 comments1 min readLW link

Eth­i­cal Injunctions

Eliezer Yudkowsky20 Oct 2008 23:00 UTC
59 points
76 comments9 min readLW link

Mo­ti­vat­ing Ab­strac­tion-First De­ci­sion Theory

johnswentworth29 Apr 2020 17:47 UTC
40 points
16 comments5 min readLW link

[AN #97]: Are there his­tor­i­cal ex­am­ples of large, ro­bust dis­con­ti­nu­ities?

Rohin Shah29 Apr 2020 17:30 UTC
15 points
0 comments10 min readLW link
(mailchi.mp)

My Up­dat­ing Thoughts on AI policy

Ben Pace1 Mar 2020 7:06 UTC
20 points
1 comment9 min readLW link

Use­ful Does Not Mean Secure

Ben Pace30 Nov 2019 2:05 UTC
46 points
12 comments11 min readLW link

[Question] What is the al­ter­na­tive to in­tent al­ign­ment called?

Richard_Ngo30 Apr 2020 2:16 UTC
12 points
6 comments1 min readLW link

Op­ti­mis­ing So­ciety to Con­strain Risk of War from an Ar­tifi­cial Su­per­in­tel­li­gence

JohnCDraper30 Apr 2020 10:47 UTC
3 points
1 comment51 min readLW link

Stan­ford En­cy­clo­pe­dia of Philos­o­phy on AI ethics and superintelligence

Kaj_Sotala2 May 2020 7:35 UTC
43 points
19 comments7 min readLW link
(plato.stanford.edu)

[Question] How does iter­ated am­plifi­ca­tion ex­ceed hu­man abil­ities?

riceissa2 May 2020 23:44 UTC
19 points
9 comments2 min readLW link

How uniform is the neo­cor­tex?

zhukeepa4 May 2020 2:16 UTC
76 points
23 comments11 min readLW link1 review

Scott Garrabrant’s prob­lem on re­cov­er­ing Brouwer as a corol­lary of Lawvere

Rupert4 May 2020 10:01 UTC
26 points
2 comments2 min readLW link

“AI and Effi­ciency”, OA (44✕ im­prove­ment in CNNs since 2012)

gwern5 May 2020 16:32 UTC
47 points
0 comments1 min readLW link
(openai.com)

Com­pet­i­tive safety via gra­dated curricula

Richard_Ngo5 May 2020 18:11 UTC
37 points
5 comments5 min readLW link

Model­ing nat­u­ral­ized de­ci­sion prob­lems in lin­ear logic

jessicata6 May 2020 0:15 UTC
14 points
2 comments6 min readLW link
(unstableontology.com)

[AN #98]: Un­der­stand­ing neu­ral net train­ing by see­ing which gra­di­ents were helpful

Rohin Shah6 May 2020 17:10 UTC
22 points
3 comments9 min readLW link
(mailchi.mp)

[Question] Is AI safety re­search less par­alleliz­able than AI re­search?

Mati_Roy10 May 2020 20:43 UTC
9 points
5 comments1 min readLW link

Thoughts on im­ple­ment­ing cor­rigible ro­bust alignment

Steven Byrnes26 Nov 2019 14:06 UTC
26 points
2 comments6 min readLW link

Wire­head­ing is in the eye of the beholder

Stuart_Armstrong30 Jan 2019 18:23 UTC
26 points
10 comments1 min readLW link

Wire­head­ing as a po­ten­tial prob­lem with the new im­pact measure

Stuart_Armstrong25 Sep 2018 14:15 UTC
25 points
20 comments4 min readLW link

Wire­head­ing and discontinuity

Michele Campolo18 Feb 2020 10:49 UTC
21 points
4 comments3 min readLW link

[AN #99]: Dou­bling times for the effi­ciency of AI algorithms

Rohin Shah13 May 2020 17:20 UTC
29 points
0 comments10 min readLW link
(mailchi.mp)

How should AIs up­date a prior over hu­man prefer­ences?

Stuart_Armstrong15 May 2020 13:14 UTC
17 points
9 comments2 min readLW link

Con­jec­ture Workshop

johnswentworth15 May 2020 22:41 UTC
34 points
2 comments2 min readLW link

Multi-agent safety

Richard_Ngo16 May 2020 1:59 UTC
30 points
8 comments5 min readLW link

The Mechanis­tic and Nor­ma­tive Struc­ture of Agency

G Gordon Worley III18 May 2020 16:03 UTC
15 points
4 comments1 min readLW link
(philpapers.org)

“Star­wink” by Alicorn

Zack_M_Davis18 May 2020 8:17 UTC
44 points
1 comment1 min readLW link
(alicorn.elcenia.com)

[AN #100]: What might go wrong if you learn a re­ward func­tion while acting

Rohin Shah20 May 2020 17:30 UTC
33 points
2 comments12 min readLW link
(mailchi.mp)

Prob­a­bil­ities, weights, sums: pretty much the same for re­ward functions

Stuart_Armstrong20 May 2020 15:19 UTC
11 points
1 comment2 min readLW link

[Question] Source code size vs learned model size in ML and in hu­mans?

riceissa20 May 2020 8:47 UTC
11 points
6 comments1 min readLW link

Com­par­ing re­ward learn­ing/​re­ward tam­per­ing formalisms

Stuart_Armstrong21 May 2020 12:03 UTC
9 points
3 comments3 min readLW link

AGIs as collectives

Richard_Ngo22 May 2020 20:36 UTC
22 points
23 comments4 min readLW link

[AN #101]: Why we should rigor­ously mea­sure and fore­cast AI progress

Rohin Shah27 May 2020 17:20 UTC
15 points
0 comments10 min readLW link
(mailchi.mp)

AI Safety Dis­cus­sion Days

Linda Linsefors27 May 2020 16:54 UTC
12 points
1 comment3 min readLW link

Build­ing brain-in­spired AGI is in­finitely eas­ier than un­der­stand­ing the brain

Steven Byrnes2 Jun 2020 14:13 UTC
49 points
14 comments7 min readLW link

Spar­sity and in­ter­pretabil­ity?

1 Jun 2020 13:25 UTC
41 points
3 comments7 min readLW link

GPT-3: A Summary

leogao2 Jun 2020 18:14 UTC
19 points
0 comments1 min readLW link
(leogao.dev)

Inac­cessible information

paulfchristiano3 Jun 2020 5:10 UTC
84 points
17 comments14 min readLW link2 reviews
(ai-alignment.com)

[AN #102]: Meta learn­ing by GPT-3, and a list of full pro­pos­als for AI alignment

Rohin Shah3 Jun 2020 17:20 UTC
38 points
6 comments10 min readLW link
(mailchi.mp)

Feed­back is cen­tral to agency

Alex Flint1 Jun 2020 12:56 UTC
28 points
1 comment3 min readLW link

Think­ing About Su­per-Hu­man AI: An Ex­am­i­na­tion of Likely Paths and Ul­ti­mate Constitution

meanderingmoose4 Jun 2020 23:22 UTC
−3 points
0 comments7 min readLW link

Emer­gence and Con­trol: An ex­am­i­na­tion of our abil­ity to gov­ern the be­hav­ior of in­tel­li­gent systems

meanderingmoose5 Jun 2020 17:10 UTC
1 point
0 comments6 min readLW link

GAN Discrim­i­na­tors Don’t Gen­er­al­ize?

tryactions8 Jun 2020 20:36 UTC
18 points
7 comments2 min readLW link

More on dis­am­biguat­ing “dis­con­ti­nu­ity”

Aryeh Englander9 Jun 2020 15:16 UTC
16 points
1 comment3 min readLW link

[AN #103]: ARCHES: an agenda for ex­is­ten­tial safety, and com­bin­ing nat­u­ral lan­guage with deep RL

Rohin Shah10 Jun 2020 17:20 UTC
27 points
1 comment10 min readLW link
(mailchi.mp)

Dutch-Book­ing CDT: Re­vised Argument

abramdemski27 Oct 2020 4:31 UTC
50 points
22 comments16 min readLW link

[Question] List of pub­lic pre­dic­tions of what GPT-X can or can’t do?

Daniel Kokotajlo14 Jun 2020 14:25 UTC
20 points
9 comments1 min readLW link

Achiev­ing AI al­ign­ment through de­liber­ate un­cer­tainty in mul­ti­a­gent systems

Florian Dietz15 Jun 2020 12:19 UTC
3 points
10 comments7 min readLW link

Su­per­ex­po­nen­tial His­toric Growth, by David Roodman

Ben Pace15 Jun 2020 21:49 UTC
43 points
6 comments5 min readLW link
(www.openphilanthropy.org)

Re­lat­ing HCH and Log­i­cal Induction

abramdemski16 Jun 2020 22:08 UTC
47 points
4 comments5 min readLW link

Image GPT

Daniel Kokotajlo18 Jun 2020 11:41 UTC
29 points
27 comments1 min readLW link
(openai.com)

[AN #104]: The per­ils of in­ac­cessible in­for­ma­tion, and what we can learn about AI al­ign­ment from COVID

Rohin Shah18 Jun 2020 17:10 UTC
19 points
5 comments8 min readLW link
(mailchi.mp)

[Question] If AI is based on GPT, how to en­sure its safety?

avturchin18 Jun 2020 20:33 UTC
20 points
11 comments1 min readLW link

What’s Your Cog­ni­tive Al­gorithm?

Raemon18 Jun 2020 22:16 UTC
69 points
23 comments13 min readLW link

Rele­vant pre-AGI possibilities

Daniel Kokotajlo20 Jun 2020 10:52 UTC
38 points
7 comments19 min readLW link
(aiimpacts.org)

Plau­si­ble cases for HRAD work, and lo­cat­ing the crux in the “re­al­ism about ra­tio­nal­ity” debate

riceissa22 Jun 2020 1:10 UTC
85 points
15 comments10 min readLW link

The In­dex­ing Problem

johnswentworth22 Jun 2020 19:11 UTC
35 points
2 comments4 min readLW link

[Question] Re­quest­ing feed­back/​ad­vice: what Type The­ory to study for AI safety?

rvnnt23 Jun 2020 17:03 UTC
7 points
4 comments3 min readLW link

Lo­cal­ity of goals

adamShimi22 Jun 2020 21:56 UTC
16 points
8 comments6 min readLW link

[Question] What is “In­stru­men­tal Cor­rigi­bil­ity”?

joebernstein23 Jun 2020 20:24 UTC
4 points
1 comment1 min readLW link

Models, myths, dreams, and Cheshire cat grins

Stuart_Armstrong24 Jun 2020 10:50 UTC
21 points
7 comments2 min readLW link

[AN #105]: The eco­nomic tra­jec­tory of hu­man­ity, and what we might mean by optimization

Rohin Shah24 Jun 2020 17:30 UTC
24 points
3 comments11 min readLW link
(mailchi.mp)

There’s an Awe­some AI Ethics List and it’s a lit­tle thin

AABoyles25 Jun 2020 13:43 UTC
13 points
1 comment1 min readLW link
(github.com)

GPT-3 Fic­tion Samples

gwern25 Jun 2020 16:12 UTC
62 points
18 comments1 min readLW link
(www.gwern.net)

Walk­through: The Trans­former Ar­chi­tec­ture [Part 1/​2]

Matthew Barnett30 Jul 2019 13:54 UTC
34 points
0 comments6 min readLW link

Ro­bust­ness as a Path to AI Alignment

abramdemski10 Oct 2017 8:14 UTC
45 points
9 comments9 min readLW link

Rad­i­cal Prob­a­bil­ism [Tran­script]

26 Jun 2020 22:14 UTC
46 points
12 comments6 min readLW link

AI safety via mar­ket making

evhub26 Jun 2020 23:07 UTC
52 points
45 comments11 min readLW link

[Question] Have gen­eral de­com­posers been for­mal­ized?

Quinn27 Jun 2020 18:09 UTC
8 points
5 comments1 min readLW link

Gary Mar­cus vs Cor­ti­cal Uniformity

Steven Byrnes28 Jun 2020 18:18 UTC
25 points
0 comments8 min readLW link

Web AI dis­cus­sion Groups

Donald Hobson30 Jun 2020 11:22 UTC
11 points
0 comments2 min readLW link

Com­par­ing AI Align­ment Ap­proaches to Min­i­mize False Pos­i­tive Risk

G Gordon Worley III30 Jun 2020 19:34 UTC
5 points
0 comments9 min readLW link

AvE: As­sis­tance via Empowerment

FactorialCode30 Jun 2020 22:07 UTC
12 points
1 comment1 min readLW link
(arxiv.org)

Evan Hub­inger on In­ner Align­ment, Outer Align­ment, and Pro­pos­als for Build­ing Safe Ad­vanced AI

Palus Astra1 Jul 2020 17:30 UTC
35 points
4 comments67 min readLW link

[AN #106]: Eval­u­at­ing gen­er­al­iza­tion abil­ity of learned re­ward models

Rohin Shah1 Jul 2020 17:20 UTC
14 points
2 comments11 min readLW link
(mailchi.mp)

The “AI De­bate” Debate

michaelcohen2 Jul 2020 10:16 UTC
20 points
20 comments3 min readLW link

Idea: Imi­ta­tion/​Value Learn­ing AIXI

Zachary Robertson3 Jul 2020 17:10 UTC
3 points
6 comments1 min readLW link

Split­ting De­bate up into Two Subsystems

Nandi3 Jul 2020 20:11 UTC
13 points
5 comments4 min readLW link

AI Un­safety via Non-Zero-Sum Debate

VojtaKovarik3 Jul 2020 22:03 UTC
25 points
10 comments5 min readLW link

Clas­sify­ing games like the Pri­soner’s Dilemma

philh4 Jul 2020 17:10 UTC
98 points
27 comments6 min readLW link1 review
(reasonableapproximation.net)

AI-Feyn­man as a bench­mark for what we should be aiming for

Faustus24 Jul 2020 9:24 UTC
8 points
1 comment2 min readLW link

Learn­ing the prior

paulfchristiano5 Jul 2020 21:00 UTC
78 points
27 comments8 min readLW link
(ai-alignment.com)

Bet­ter pri­ors as a safety problem

paulfchristiano5 Jul 2020 21:20 UTC
63 points
7 comments5 min readLW link
(ai-alignment.com)

[Question] How far is AGI?

Roko Jelavić5 Jul 2020 17:58 UTC
6 points
5 comments1 min readLW link

Clas­sify­ing speci­fi­ca­tion prob­lems as var­i­ants of Good­hart’s Law

Vika19 Aug 2019 20:40 UTC
67 points
5 comments5 min readLW link1 review

New safety re­search agenda: scal­able agent al­ign­ment via re­ward modeling

Vika20 Nov 2018 17:29 UTC
34 points
13 comments1 min readLW link
(medium.com)

De­sign­ing agent in­cen­tives to avoid side effects

11 Mar 2019 20:55 UTC
29 points
0 comments2 min readLW link
(medium.com)

Dis­cus­sion on the ma­chine learn­ing ap­proach to AI safety

Vika1 Nov 2018 20:54 UTC
26 points
3 comments4 min readLW link

Speci­fi­ca­tion gam­ing ex­am­ples in AI

Vika3 Apr 2018 12:30 UTC
39 points
9 comments1 min readLW link2 reviews

[Question] (an­swered: yes) Has any­one writ­ten up a con­sid­er­a­tion of Downs’s “Para­dox of Vot­ing” from the per­spec­tive of MIRI-ish de­ci­sion the­o­ries (UDT, FDT, or even just EDT)?

Jameson Quinn6 Jul 2020 18:26 UTC
10 points
24 comments1 min readLW link

New Deep­Mind AI Safety Re­search Blog

Vika27 Sep 2018 16:28 UTC
43 points
0 comments1 min readLW link
(medium.com)

Con­test: $1,000 for good ques­tions to ask to an Or­a­cle AI

Stuart_Armstrong31 Jul 2019 18:48 UTC
57 points
156 comments3 min readLW link

De­con­fus­ing Hu­man Values Re­search Agenda v1

G Gordon Worley III23 Mar 2020 16:25 UTC
24 points
12 comments4 min readLW link

[Question] How “hon­est” is GPT-3?

abramdemski8 Jul 2020 19:38 UTC
72 points
18 comments5 min readLW link

What does it mean to ap­ply de­ci­sion the­ory?

abramdemski8 Jul 2020 20:31 UTC
49 points
5 comments8 min readLW link

AI Re­search Con­sid­er­a­tions for Hu­man Ex­is­ten­tial Safety (ARCHES)

habryka9 Jul 2020 2:49 UTC
60 points
8 comments1 min readLW link
(arxiv.org)

The Un­rea­son­able Effec­tive­ness of Deep Learning

Richard_Ngo30 Sep 2018 15:48 UTC
85 points
5 comments13 min readLW link
(thinkingcomplete.blogspot.com)

mAIry’s room: AI rea­son­ing to solve philo­soph­i­cal problems

Stuart_Armstrong5 Mar 2019 20:24 UTC
91 points
41 comments6 min readLW link2 reviews

Failures of an em­bod­ied AIXI

So8res15 Jun 2014 18:29 UTC
46 points
46 comments12 min readLW link

The Prob­lem with AIXI

Rob Bensinger18 Mar 2014 1:55 UTC
43 points
78 comments23 min readLW link

Ver­sions of AIXI can be ar­bi­trar­ily stupid

Stuart_Armstrong10 Aug 2015 13:23 UTC
29 points
59 comments1 min readLW link

Reflec­tive AIXI and Anthropics

Diffractor24 Sep 2018 2:15 UTC
17 points
13 comments8 min readLW link

AIXI and Ex­is­ten­tial Despair

paulfchristiano8 Dec 2011 20:03 UTC
23 points
38 comments6 min readLW link

How to make AIXI-tl in­ca­pable of learning

itaibn027 Jan 2014 0:05 UTC
7 points
5 comments2 min readLW link

Help re­quest: What is the Kol­mogorov com­plex­ity of com­putable ap­prox­i­ma­tions to AIXI?

AnnaSalamon5 Dec 2010 10:23 UTC
7 points
9 comments1 min readLW link

“AIXIjs: A Soft­ware Demo for Gen­eral Re­in­force­ment Learn­ing”, As­lanides 2017

gwern29 May 2017 21:09 UTC
7 points
1 comment1 min readLW link
(arxiv.org)

Can AIXI be trained to do any­thing a hu­man can?

Stuart_Armstrong20 Oct 2014 13:12 UTC
5 points
9 comments2 min readLW link

Shap­ing eco­nomic in­cen­tives for col­lab­o­ra­tive AGI

Kaj_Sotala29 Jun 2018 16:26 UTC
45 points
15 comments4 min readLW link

Is the Star Trek Fed­er­a­tion re­ally in­ca­pable of build­ing AI?

Kaj_Sotala18 Mar 2018 10:30 UTC
19 points
4 comments2 min readLW link
(kajsotala.fi)

Some con­cep­tual high­lights from “Disjunc­tive Sce­nar­ios of Catas­trophic AI Risk”

Kaj_Sotala12 Feb 2018 12:30 UTC
32 points
4 comments6 min readLW link
(kajsotala.fi)

Mis­con­cep­tions about con­tin­u­ous takeoff

Matthew Barnett8 Oct 2019 21:31 UTC
79 points
38 comments4 min readLW link

Dist­in­guish­ing defi­ni­tions of takeoff

Matthew Barnett14 Feb 2020 0:16 UTC
57 points
6 comments6 min readLW link

Book re­view: Ar­tifi­cial In­tel­li­gence Safety and Security

PeterMcCluskey8 Dec 2018 3:47 UTC
27 points
3 comments8 min readLW link
(www.bayesianinvestor.com)

Why AI may not foom

John_Maxwell24 Mar 2013 8:11 UTC
29 points
81 comments12 min readLW link

Hu­mans Who Are Not Con­cen­trat­ing Are Not Gen­eral Intelligences

sarahconstantin25 Feb 2019 20:40 UTC
172 points
35 comments6 min readLW link1 review
(srconstantin.wordpress.com)

The Hacker Learns to Trust

Ben Pace22 Jun 2019 0:27 UTC
80 points
18 comments8 min readLW link
(medium.com)

Book Re­view: Hu­man Compatible

Scott Alexander31 Jan 2020 5:20 UTC
76 points
6 comments16 min readLW link
(slatestarcodex.com)

SSC Jour­nal Club: AI Timelines

Scott Alexander8 Jun 2017 19:00 UTC
11 points
15 comments8 min readLW link

Ar­gu­ments against my­opic training

Richard_Ngo9 Jul 2020 16:07 UTC
53 points
39 comments12 min readLW link

On mo­ti­va­tions for MIRI’s highly re­li­able agent de­sign research

jessicata29 Jan 2017 19:34 UTC
27 points
1 comment5 min readLW link

Why is the im­pact penalty time-in­con­sis­tent?

Stuart_Armstrong9 Jul 2020 17:26 UTC
16 points
1 comment2 min readLW link

My cur­rent take on the Paul-MIRI dis­agree­ment on al­ignabil­ity of messy AI

jessicata29 Jan 2017 20:52 UTC
21 points
0 comments10 min readLW link

Ben Go­ertzel: The Sin­gu­lar­ity In­sti­tute’s Scary Idea (and Why I Don’t Buy It)

Paul Crowley30 Oct 2010 9:31 UTC
42 points
442 comments1 min readLW link

An An­a­lytic Per­spec­tive on AI Alignment

DanielFilan1 Mar 2020 4:10 UTC
53 points
45 comments8 min readLW link
(danielfilan.com)

Mechanis­tic Trans­parency for Ma­chine Learning

DanielFilan11 Jul 2018 0:34 UTC
54 points
9 comments4 min readLW link

A model I use when mak­ing plans to re­duce AI x-risk

Ben Pace19 Jan 2018 0:21 UTC
69 points
41 comments6 min readLW link

AI Re­searchers On AI Risk

Scott Alexander22 May 2015 11:16 UTC
17 points
0 comments16 min readLW link

Mini ad­vent cal­en­dar of Xrisks: Ar­tifi­cial Intelligence

Stuart_Armstrong7 Dec 2012 11:26 UTC
5 points
5 comments1 min readLW link

For FAI: Is “Molec­u­lar Nan­otech­nol­ogy” putting our best foot for­ward?

leplen22 Jun 2013 4:44 UTC
79 points
118 comments3 min readLW link

UFAI can­not be the Great Filter

Thrasymachus22 Dec 2012 11:26 UTC
59 points
92 comments3 min readLW link

Don’t Fear The Filter

Scott Alexander29 May 2014 0:45 UTC
10 points
18 comments6 min readLW link

The Great Filter is early, or AI is hard

Stuart_Armstrong29 Aug 2014 16:17 UTC
32 points
76 comments1 min readLW link

Talk: Key Is­sues In Near-Term AI Safety Research

Aryeh Englander10 Jul 2020 18:36 UTC
22 points
1 comment1 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

Steven Byrnes10 Jul 2020 16:49 UTC
43 points
5 comments8 min readLW link

AlphaS­tar: Im­pres­sive for RL progress, not for AGI progress

orthonormal2 Nov 2019 1:50 UTC
113 points
58 comments2 min readLW link1 review

The Catas­trophic Con­ver­gence Conjecture

TurnTrout14 Feb 2020 21:16 UTC
40 points
15 comments8 min readLW link

[Question] How well can the GPT ar­chi­tec­ture solve the par­ity task?

FactorialCode11 Jul 2020 19:02 UTC
19 points
3 comments1 min readLW link

Sun­day July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stu­art_Armstrong

8 Jul 2020 0:27 UTC
19 points
2 comments1 min readLW link

[Link] Word-vec­tor based DL sys­tem achieves hu­man par­ity in ver­bal IQ tests

jacob_cannell13 Jun 2015 23:38 UTC
17 points
8 comments1 min readLW link

The Power of Intelligence

Eliezer Yudkowsky1 Jan 2007 20:00 UTC
53 points
4 comments4 min readLW link

Com­ments on CAIS

Richard_Ngo12 Jan 2019 15:20 UTC
74 points
14 comments7 min readLW link

[Question] What are CAIS’ bold­est near/​medium-term pre­dic­tions?

jacobjacob28 Mar 2019 13:14 UTC
31 points
17 comments1 min readLW link

Drexler on AI Risk

PeterMcCluskey1 Feb 2019 5:11 UTC
34 points
10 comments9 min readLW link
(www.bayesianinvestor.com)

Six AI Risk/​Strat­egy Ideas

Wei_Dai27 Aug 2019 0:40 UTC
63 points
18 comments4 min readLW link1 review

New re­port: In­tel­li­gence Ex­plo­sion Microeconomics

Eliezer Yudkowsky29 Apr 2013 23:14 UTC
72 points
251 comments3 min readLW link

Book re­view: Hu­man Compatible

PeterMcCluskey19 Jan 2020 3:32 UTC
37 points
2 comments5 min readLW link
(www.bayesianinvestor.com)

Thoughts on “Hu­man-Com­pat­i­ble”

TurnTrout10 Oct 2019 5:24 UTC
60 points
35 comments5 min readLW link

Book Re­view: The AI Does Not Hate You

PeterMcCluskey28 Oct 2019 17:45 UTC
26 points
0 comments5 min readLW link
(www.bayesianinvestor.com)

[Link] Book Re­view: ‘The AI Does Not Hate You’ by Tom Chivers (Scott Aaron­son)

eigen7 Oct 2019 18:16 UTC
19 points
0 comments1 min readLW link

Book Re­view: Life 3.0: Be­ing Hu­man in the Age of Ar­tifi­cial Intelligence

J Thomas Moros18 Jan 2018 17:18 UTC
8 points
0 comments1 min readLW link
(ferocioustruth.com)

Book Re­view: Weapons of Math Destruction

Zvi4 Jun 2017 21:20 UTC
1 point
0 comments16 min readLW link

DARPA Digi­tal Tu­tor: Four Months to To­tal Tech­ni­cal Ex­per­tise?

JohnBuridan6 Jul 2020 23:34 UTC
197 points
19 comments7 min readLW link

Paper: Su­per­in­tel­li­gence as a Cause or Cure for Risks of Astro­nom­i­cal Suffering

Kaj_Sotala3 Jan 2018 14:39 UTC
1 point
6 comments1 min readLW link
(www.informatica.si)

Prevent­ing s-risks via in­dex­i­cal un­cer­tainty, acausal trade and dom­i­na­tion in the multiverse

avturchin27 Sep 2018 10:09 UTC
7 points
6 comments4 min readLW link

Pre­face to CLR’s Re­search Agenda on Co­op­er­a­tion, Con­flict, and TAI

JesseClifton13 Dec 2019 21:02 UTC
56 points
10 comments2 min readLW link

Sec­tions 1 & 2: In­tro­duc­tion, Strat­egy and Governance

JesseClifton17 Dec 2019 21:27 UTC
34 points
5 comments14 min readLW link

Sec­tions 3 & 4: Cred­i­bil­ity, Peace­ful Bar­gain­ing Mechanisms

JesseClifton17 Dec 2019 21:46 UTC
19 points
2 comments12 min readLW link

Sec­tions 5 & 6: Con­tem­po­rary Ar­chi­tec­tures, Hu­mans in the Loop

JesseClifton20 Dec 2019 3:52 UTC
27 points
4 comments10 min readLW link

Sec­tion 7: Foun­da­tions of Ra­tional Agency

JesseClifton22 Dec 2019 2:05 UTC
14 points
4 comments8 min readLW link

What counts as defec­tion?

TurnTrout12 Jul 2020 22:03 UTC
81 points
21 comments5 min readLW link1 review

The Com­mit­ment Races problem

Daniel Kokotajlo23 Aug 2019 1:58 UTC
108 points
39 comments5 min readLW link

Align­ment Newslet­ter #36

Rohin Shah12 Dec 2018 1:10 UTC
21 points
0 comments11 min readLW link
(mailchi.mp)

Align­ment Newslet­ter #47

Rohin Shah4 Mar 2019 4:30 UTC
18 points
0 comments8 min readLW link
(mailchi.mp)

Un­der­stand­ing “Deep Dou­ble Des­cent”

evhub6 Dec 2019 0:00 UTC
134 points
51 comments5 min readLW link4 reviews

[LINK] Strong AI Startup Raises $15M

olalonde21 Aug 2012 20:47 UTC
24 points
13 comments1 min readLW link

An­nounc­ing the AI Align­ment Prize

cousin_it3 Nov 2017 15:47 UTC
95 points
78 comments1 min readLW link

I’m leav­ing AI al­ign­ment – you bet­ter stay

rmoehn12 Mar 2020 5:58 UTC
148 points
19 comments5 min readLW link

New pa­per: AGI Agent Safety by Iter­a­tively Im­prov­ing the Utility Function

Koen.Holtman15 Jul 2020 14:05 UTC
21 points
2 comments6 min readLW link

[Question] How should AI de­bate be judged?

abramdemski15 Jul 2020 22:20 UTC
48 points
27 comments6 min readLW link

Align­ment pro­pos­als and com­plex­ity classes

evhub16 Jul 2020 0:27 UTC
33 points
26 comments13 min readLW link

[AN #107]: The con­ver­gent in­stru­men­tal sub­goals of goal-di­rected agents

Rohin Shah16 Jul 2020 6:47 UTC
13 points
1 comment8 min readLW link
(mailchi.mp)

[AN #108]: Why we should scru­ti­nize ar­gu­ments for AI risk

Rohin Shah16 Jul 2020 6:47 UTC
19 points
6 comments12 min readLW link
(mailchi.mp)

En­vi­ron­ments as a bot­tle­neck in AGI development

Richard_Ngo17 Jul 2020 5:02 UTC
36 points
19 comments6 min readLW link

[Question] Can an agent use interactive proofs to check the alignment of successors?

PabloAMC17 Jul 2020 19:07 UTC
7 points
2 comments1 min readLW link

Les­sons on AI Takeover from the conquistadors

17 Jul 2020 22:35 UTC
56 points
30 comments5 min readLW link

What Would I Do? Self-pre­dic­tion in Sim­ple Algorithms

Scott Garrabrant20 Jul 2020 4:27 UTC
54 points
13 comments5 min readLW link

Wri­teup: Progress on AI Safety via Debate

5 Feb 2020 21:04 UTC
91 points
17 comments33 min readLW link

Oper­a­tional­iz­ing Interpretability

lifelonglearner20 Jul 2020 5:22 UTC
20 points
0 comments4 min readLW link

Learn­ing Values in Practice

Stuart_Armstrong20 Jul 2020 18:38 UTC
24 points
0 comments5 min readLW link

Par­allels Between AI Safety by De­bate and Ev­i­dence Law

Cullen_OKeefe20 Jul 2020 22:52 UTC
10 points
1 comment2 min readLW link
(cullenokeefe.com)

The Redis­cov­ery of In­te­ri­or­ity in Ma­chine Learning

DanB21 Jul 2020 5:02 UTC
5 points
4 comments1 min readLW link
(danburfoot.net)

The “AI Dun­geons” Dragon Model is heav­ily path de­pen­dent (test­ing GPT-3 on ethics)

Rafael Harth21 Jul 2020 12:14 UTC
44 points
9 comments6 min readLW link

How good is hu­man­ity at co­or­di­na­tion?

Buck21 Jul 2020 20:01 UTC
77 points
43 comments3 min readLW link

Align­ment As A Bot­tle­neck To Use­ful­ness Of GPT-3

johnswentworth21 Jul 2020 20:02 UTC
108 points
57 comments3 min readLW link

$1000 bounty for OpenAI to show whether GPT3 was “de­liber­ately” pre­tend­ing to be stupi­der than it is

jacobjacob21 Jul 2020 18:42 UTC
54 points
40 comments2 min readLW link
(twitter.com)

[Preprint] The Com­pu­ta­tional Limits of Deep Learning

G Gordon Worley III21 Jul 2020 21:25 UTC
8 points
1 comment1 min readLW link
(arxiv.org)

[AN #109]: Teach­ing neu­ral nets to gen­er­al­ize the way hu­mans would

Rohin Shah22 Jul 2020 17:10 UTC
17 points
3 comments9 min readLW link
(mailchi.mp)

Re­search agenda for AI safety and a bet­ter civilization

agilecaveman22 Jul 2020 6:35 UTC
12 points
2 comments16 min readLW link

Weak HCH ac­cesses EXP

evhub22 Jul 2020 22:36 UTC
14 points
0 comments3 min readLW link

GPT-3 Gems

TurnTrout23 Jul 2020 0:46 UTC
33 points
10 comments48 min readLW link

Op­ti­miz­ing ar­bi­trary ex­pres­sions with a lin­ear num­ber of queries to a Log­i­cal In­duc­tion Or­a­cle (Car­toon Guide)

Donald Hobson23 Jul 2020 21:37 UTC
3 points
2 comments2 min readLW link

[Question] Con­struct a port­fo­lio to profit from AI progress.

sapphire25 Jul 2020 8:18 UTC
29 points
13 comments1 min readLW link

Think­ing soberly about the con­text and con­se­quences of Friendly AI

Mitchell_Porter16 Oct 2012 4:33 UTC
20 points
39 comments1 min readLW link

Goal re­ten­tion dis­cus­sion with Eliezer

MaxTegmark4 Sep 2014 22:23 UTC
92 points
26 comments6 min readLW link

[Question] Where do peo­ple dis­cuss do­ing things with GPT-3?

skybrian26 Jul 2020 14:31 UTC
2 points
7 comments1 min readLW link

You Can Prob­a­bly Am­plify GPT3 Directly

Zachary Robertson26 Jul 2020 21:58 UTC
34 points
14 comments6 min readLW link

[updated] how does gpt2's training corpus capture internet discussion? not well

nostalgebraist27 Jul 2020 22:30 UTC
25 points
3 comments2 min readLW link
(nostalgebraist.tumblr.com)

Agen­tic Lan­guage Model Memes

FactorialCode1 Aug 2020 18:03 UTC
16 points
1 comment2 min readLW link

A com­mu­nity-cu­rated repos­i­tory of in­ter­est­ing GPT-3 stuff

Rudi C28 Jul 2020 14:16 UTC
8 points
0 comments1 min readLW link
(github.com)

[Question] Does the lot­tery ticket hy­poth­e­sis sug­gest the scal­ing hy­poth­e­sis?

Daniel Kokotajlo28 Jul 2020 19:52 UTC
14 points
17 comments1 min readLW link

[Question] To what ex­tent are the scal­ing prop­er­ties of Trans­former net­works ex­cep­tional?

abramdemski28 Jul 2020 20:06 UTC
30 points
1 comment1 min readLW link

[Question] What hap­pens to var­i­ance as neu­ral net­work train­ing is scaled? What does it im­ply about “lot­tery tick­ets”?

abramdemski28 Jul 2020 20:22 UTC
25 points
4 comments1 min readLW link

[Question] How will in­ter­net fo­rums like LW be able to defend against GPT-style spam?

ChristianKl28 Jul 2020 20:12 UTC
14 points
18 comments1 min readLW link

Pre­dic­tions for GPT-N

hippke29 Jul 2020 1:16 UTC
34 points
31 comments1 min readLW link

An­nounce­ment: AI al­ign­ment prize win­ners and next round

cousin_it15 Jan 2018 14:33 UTC
80 points
68 comments2 min readLW link

Jeff Hawk­ins on neu­ro­mor­phic AGI within 20 years

Steven Byrnes15 Jul 2019 19:16 UTC
164 points
24 comments12 min readLW link

Cas­cades, Cy­cles, In­sight...

Eliezer Yudkowsky24 Nov 2008 9:33 UTC
24 points
31 comments8 min readLW link

...Re­cur­sion, Magic

Eliezer Yudkowsky25 Nov 2008 9:10 UTC
23 points
28 comments5 min readLW link

Refer­ences & Re­sources for LessWrong

XiXiDu10 Oct 2010 14:54 UTC
150 points
106 comments20 min readLW link

[Question] A game de­signed to beat AI?

Long try17 Mar 2020 3:51 UTC
13 points
29 comments1 min readLW link

Truly Part Of You

Eliezer Yudkowsky21 Nov 2007 2:18 UTC
135 points
59 comments4 min readLW link

[AN #110]: Learn­ing fea­tures from hu­man feed­back to en­able re­ward learning

Rohin Shah29 Jul 2020 17:20 UTC
13 points
2 comments10 min readLW link
(mailchi.mp)

Struc­tured Tasks for Lan­guage Models

Zachary Robertson29 Jul 2020 14:17 UTC
5 points
0 comments1 min readLW link

En­gag­ing Se­ri­ously with Short Timelines

sapphire29 Jul 2020 19:21 UTC
43 points
23 comments3 min readLW link

What Failure Looks Like: Distill­ing the Discussion

Ben Pace29 Jul 2020 21:49 UTC
74 points
12 comments7 min readLW link

Learn­ing the prior and generalization

evhub29 Jul 2020 22:49 UTC
34 points
16 comments4 min readLW link

[Question] Is the work on AI al­ign­ment rele­vant to GPT?

Richard_Kennaway30 Jul 2020 12:23 UTC
12 points
5 comments1 min readLW link

Ver­ifi­ca­tion and Transparency

DanielFilan8 Aug 2019 1:50 UTC
34 points
6 comments2 min readLW link
(danielfilan.com)

Robin Han­son on Lump­iness of AI Services

DanielFilan17 Feb 2019 23:08 UTC
15 points
2 comments2 min readLW link
(www.overcomingbias.com)

One Way to Think About ML Transparency

Matthew Barnett2 Sep 2019 23:27 UTC
26 points
28 comments5 min readLW link

What is In­ter­pretabil­ity?

17 Mar 2020 20:23 UTC
34 points
0 comments11 min readLW link

Re­laxed ad­ver­sar­ial train­ing for in­ner alignment

evhub10 Sep 2019 23:03 UTC
57 points
20 comments27 min readLW link

Con­clu­sion to ‘Refram­ing Im­pact’

TurnTrout28 Feb 2020 16:05 UTC
39 points
17 comments2 min readLW link

Bayesian Evolv­ing-to-Extinction

abramdemski14 Feb 2020 23:55 UTC
38 points
13 comments5 min readLW link

Do Suffi­ciently Ad­vanced Agents Use Logic?

abramdemski13 Sep 2019 19:53 UTC
39 points
11 comments9 min readLW link

World State is the Wrong Ab­strac­tion for Impact

TurnTrout1 Oct 2019 21:03 UTC
62 points
19 comments2 min readLW link

At­tain­able Utility Preser­va­tion: Concepts

TurnTrout17 Feb 2020 5:20 UTC
38 points
20 comments1 min readLW link

At­tain­able Utility Preser­va­tion: Em­piri­cal Results

22 Feb 2020 0:38 UTC
55 points
8 comments10 min readLW link1 review

How Low Should Fruit Hang Be­fore We Pick It?

TurnTrout25 Feb 2020 2:08 UTC
28 points
9 comments12 min readLW link

At­tain­able Utility Preser­va­tion: Scal­ing to Superhuman

TurnTrout27 Feb 2020 0:52 UTC
27 points
21 comments8 min readLW link

Rea­sons for Ex­cite­ment about Im­pact of Im­pact Mea­sure Research

TurnTrout27 Feb 2020 21:42 UTC
33 points
8 comments4 min readLW link

Power as Easily Ex­ploitable Opportunities

TurnTrout1 Aug 2020 2:14 UTC
26 points
5 comments6 min readLW link

[Question] Would AGIs par­ent young AGIs?

Vishrut Arya2 Aug 2020 0:57 UTC
3 points
6 comments1 min readLW link

If I were a well-in­ten­tioned AI… I: Image classifier

Stuart_Armstrong26 Feb 2020 12:39 UTC
35 points
4 comments5 min readLW link

Non-Con­se­quen­tial­ist Co­op­er­a­tion?

abramdemski11 Jan 2019 9:15 UTC
48 points
15 comments7 min readLW link

Cu­ri­os­ity Killed the Cat and the Asymp­tot­i­cally Op­ti­mal Agent

michaelcohen20 Feb 2020 17:28 UTC
27 points
15 comments1 min readLW link

If I were a well-in­ten­tioned AI… IV: Mesa-optimising

Stuart_Armstrong2 Mar 2020 12:16 UTC
26 points
2 comments6 min readLW link

Re­sponse to Oren Etz­ioni’s “How to know if ar­tifi­cial in­tel­li­gence is about to de­stroy civ­i­liza­tion”

Daniel Kokotajlo27 Feb 2020 18:10 UTC
27 points
5 comments8 min readLW link

Clar­ify­ing Power-Seek­ing and In­stru­men­tal Convergence

TurnTrout20 Dec 2019 19:59 UTC
42 points
7 comments3 min readLW link

How im­por­tant are MDPs for AGI (Safety)?

michaelcohen26 Mar 2020 20:32 UTC
14 points
8 comments2 min readLW link

Syn­the­siz­ing am­plifi­ca­tion and debate

evhub5 Feb 2020 22:53 UTC
33 points
10 comments4 min readLW link

is gpt-3 few-shot ready for real ap­pli­ca­tions?

nostalgebraist3 Aug 2020 19:50 UTC
31 points
5 comments9 min readLW link
(nostalgebraist.tumblr.com)

In­ter­pretabil­ity in ML: A Broad Overview

lifelonglearner4 Aug 2020 19:03 UTC
49 points
5 comments15 min readLW link

In­finite Data/​Com­pute Ar­gu­ments in Alignment

johnswentworth4 Aug 2020 20:21 UTC
49 points
6 comments2 min readLW link

Four Ways An Im­pact Mea­sure Could Help Alignment

Matthew Barnett8 Aug 2019 0:10 UTC
21 points
1 comment8 min readLW link

Un­der­stand­ing Re­cent Im­pact Measures

Matthew Barnett7 Aug 2019 4:57 UTC
16 points
6 comments7 min readLW link

A Sur­vey of Early Im­pact Measures

Matthew Barnett6 Aug 2019 1:22 UTC
23 points
0 comments8 min readLW link

Op­ti­miza­tion Reg­u­lariza­tion through Time Penalty

Linda Linsefors1 Jan 2019 13:05 UTC
11 points
4 comments3 min readLW link

Stable Poin­t­ers to Value III: Re­cur­sive Quantilization

abramdemski21 Jul 2018 8:06 UTC
19 points
4 comments4 min readLW link

Thoughts on Quantilizers

Stuart_Armstrong2 Jun 2017 16:24 UTC
2 points
0 comments2 min readLW link

Quan­tiliz­ers max­i­mize ex­pected util­ity sub­ject to a con­ser­va­tive cost constraint

jessicata28 Sep 2015 2:17 UTC
15 points
0 comments5 min readLW link

Quan­tilal con­trol for finite MDPs

Vanessa Kosoy12 Apr 2018 9:21 UTC
14 points
0 comments13 min readLW link

The limits of corrigibility

Stuart_Armstrong10 Apr 2018 10:49 UTC
27 points
9 comments4 min readLW link

Align­ment Newslet­ter #16: 07/​23/​18

Rohin Shah23 Jul 2018 16:20 UTC
42 points
0 comments12 min readLW link
(mailchi.mp)

Mea­sur­ing hard­ware overhang

hippke5 Aug 2020 19:59 UTC
105 points
14 comments4 min readLW link

[AN #111]: The Cir­cuits hy­pothe­ses for deep learning

Rohin Shah5 Aug 2020 17:40 UTC
23 points
0 comments9 min readLW link
(mailchi.mp)

Self-Fulfilling Prophe­cies Aren’t Always About Self-Awareness

John_Maxwell18 Nov 2019 23:11 UTC
14 points
7 comments4 min readLW link

The Good­hart Game

John_Maxwell18 Nov 2019 23:22 UTC
13 points
5 comments5 min readLW link

Why don’t sin­gu­lar­i­tar­i­ans bet on the cre­ation of AGI by buy­ing stocks?

John_Maxwell11 Mar 2020 16:27 UTC
43 points
20 comments4 min readLW link

The Dual­ist Pre­dict-O-Matic ($100 prize)

John_Maxwell17 Oct 2019 6:45 UTC
16 points
35 comments5 min readLW link

[Question] What AI safety prob­lems need solv­ing for safe AI re­search as­sis­tants?

John_Maxwell5 Nov 2019 2:09 UTC
14 points
13 comments1 min readLW link

Refin­ing the Evolu­tion­ary Anal­ogy to AI

berglund7 Aug 2020 23:13 UTC
9 points
2 comments4 min readLW link

The Fu­sion Power Gen­er­a­tor Scenario

johnswentworth8 Aug 2020 18:31 UTC
132 points
29 comments3 min readLW link

[Question] How much is known about the “in­fer­ence rules” of log­i­cal in­duc­tion?

Eigil Rischel8 Aug 2020 10:45 UTC
11 points
7 comments1 min readLW link

If I were a well-in­ten­tioned AI… II: Act­ing in a world

Stuart_Armstrong27 Feb 2020 11:58 UTC
20 points
0 comments3 min readLW link

If I were a well-in­ten­tioned AI… III: Ex­tremal Goodhart

Stuart_Armstrong28 Feb 2020 11:24 UTC
21 points
0 comments5 min readLW link

Towards a For­mal­i­sa­tion of Log­i­cal Counterfactuals

Bunthut8 Aug 2020 22:14 UTC
6 points
2 comments2 min readLW link

[Question] 10/​50/​90% chance of GPT-N Trans­for­ma­tive AI?

human_generated_text9 Aug 2020 0:10 UTC
24 points
8 comments1 min readLW link

[Question] Can we ex­pect more value from AI al­ign­ment than from an ASI with the goal of run­ning al­ter­nate tra­jec­to­ries of our uni­verse?

Maxime Riché9 Aug 2020 17:17 UTC
2 points
5 comments1 min readLW link

In defense of Or­a­cle (“Tool”) AI research

Steven Byrnes7 Aug 2019 19:14 UTC
20 points
11 comments4 min readLW link

How GPT-N will es­cape from its AI-box

hippke12 Aug 2020 19:34 UTC
7 points
9 comments1 min readLW link

Strong im­pli­ca­tion of prefer­ence uncertainty

Stuart_Armstrong12 Aug 2020 19:02 UTC
20 points
3 comments2 min readLW link

[AN #112]: Eng­ineer­ing a Safer World

Rohin Shah13 Aug 2020 17:20 UTC
25 points
1 comment12 min readLW link
(mailchi.mp)

Room and Board for Peo­ple Self-Learn­ing ML or Do­ing In­de­pen­dent ML Research

SamuelKnoche14 Aug 2020 17:19 UTC
7 points
1 comment1 min readLW link

Talk and Q&A—Dan Hendrycks—Paper: Align­ing AI With Shared Hu­man Values. On Dis­cord at Aug 28, 2020 8:00-10:00 AM GMT+8.

wassname14 Aug 2020 23:57 UTC
1 point
0 comments1 min readLW link

Search ver­sus design

Alex Flint16 Aug 2020 16:53 UTC
88 points
40 comments36 min readLW link1 review

Work on Se­cu­rity In­stead of Friendli­ness?

Wei_Dai21 Jul 2012 18:28 UTC
51 points
107 comments2 min readLW link

Goal-Direct­ed­ness: What Suc­cess Looks Like

adamShimi16 Aug 2020 18:33 UTC
9 points
0 comments2 min readLW link

[Question] A way to beat su­per­ra­tional/​EDT agents?

Abhimanyu Pallavi Sudhir17 Aug 2020 14:33 UTC
5 points
13 comments1 min readLW link

Learn­ing hu­man prefer­ences: op­ti­mistic and pes­simistic scenarios

Stuart_Armstrong18 Aug 2020 13:05 UTC
27 points
6 comments6 min readLW link

Mesa-Search vs Mesa-Control

abramdemski18 Aug 2020 18:51 UTC
54 points
45 comments7 min readLW link

Why we want un­bi­ased learn­ing processes

Stuart_Armstrong20 Feb 2018 14:48 UTC
13 points
3 comments3 min readLW link

In­tu­itive ex­am­ples of re­ward func­tion learn­ing?

Stuart_Armstrong6 Mar 2018 16:54 UTC
7 points
3 comments2 min readLW link

Open-Cat­e­gory Classification

TurnTrout28 Mar 2018 14:49 UTC
13 points
6 comments10 min readLW link

Look­ing for ad­ver­sar­ial col­lab­o­ra­tors to test our De­bate protocol

Beth Barnes19 Aug 2020 3:15 UTC
52 points
5 comments1 min readLW link

Walk­through of ‘For­mal­iz­ing Con­ver­gent In­stru­men­tal Goals’

TurnTrout26 Feb 2018 2:20 UTC
10 points
2 comments10 min readLW link

Am­bi­guity Detection

TurnTrout1 Mar 2018 4:23 UTC
11 points
9 comments4 min readLW link

Pe­nal­iz­ing Im­pact via At­tain­able Utility Preservation

TurnTrout28 Dec 2018 21:46 UTC
24 points
0 comments3 min readLW link
(arxiv.org)

What You See Isn’t Always What You Want

TurnTrout13 Sep 2019 4:17 UTC
30 points
12 comments3 min readLW link

[Question] In­stru­men­tal Oc­cam?

abramdemski31 Jan 2020 19:27 UTC
30 points
15 comments1 min readLW link

Com­pact vs. Wide Models

Vaniver16 Jul 2018 4:09 UTC
30 points
5 comments3 min readLW link

Alex Ir­pan: “My AI Timelines Have Sped Up”

Vaniver19 Aug 2020 16:23 UTC
43 points
20 comments1 min readLW link
(www.alexirpan.com)

[AN #113]: Check­ing the eth­i­cal in­tu­itions of large lan­guage models

Rohin Shah19 Aug 2020 17:10 UTC
23 points
0 comments9 min readLW link
(mailchi.mp)

AI safety as feather­less bipeds *with broad flat nails*

Stuart_Armstrong19 Aug 2020 10:22 UTC
35 points
1 comment1 min readLW link

Time Magaz­ine has an ar­ti­cle about the Sin­gu­lar­ity...

Raemon11 Feb 2011 2:20 UTC
40 points
13 comments1 min readLW link

How rapidly are GPUs im­prov­ing in price perfor­mance?

gallabytes25 Nov 2018 19:54 UTC
31 points
9 comments1 min readLW link
(mediangroup.org)

Our val­ues are un­der­defined, change­able, and manipulable

Stuart_Armstrong2 Nov 2017 11:09 UTC
25 points
6 comments3 min readLW link

[Question] What fund­ing sources ex­ist for tech­ni­cal AI safety re­search?

johnswentworth1 Oct 2019 15:30 UTC
26 points
5 comments1 min readLW link

Hu­mans can drive cars

Apprentice30 Jan 2014 11:55 UTC
53 points
89 comments2 min readLW link

A Less Wrong sin­gu­lar­ity ar­ti­cle?

Kaj_Sotala17 Nov 2009 14:15 UTC
31 points
215 comments1 min readLW link

The Bayesian Tyrant

abramdemski20 Aug 2020 0:08 UTC
130 points
17 comments6 min readLW link1 review

Con­cept Safety: Pro­duc­ing similar AI-hu­man con­cept spaces

Kaj_Sotala14 Apr 2015 20:39 UTC
50 points
45 comments8 min readLW link

[LINK] What should a rea­son­able per­son be­lieve about the Sin­gu­lar­ity?

Kaj_Sotala13 Jan 2011 9:32 UTC
38 points
14 comments2 min readLW link

The many ways AIs be­have badly

Stuart_Armstrong24 Apr 2018 11:40 UTC
10 points
3 comments2 min readLW link

July 2020 gw­ern.net newsletter

gwern20 Aug 2020 16:39 UTC
29 points
0 comments1 min readLW link
(www.gwern.net)

Do what we mean vs. do what we say

Rohin Shah30 Aug 2018 22:03 UTC
34 points
14 comments1 min readLW link

[Question] What’s a Decomposable Alignment Topic?

Logan Riggs21 Aug 2020 22:57 UTC
26 points
16 comments1 min readLW link

Tools versus agents

Stuart_Armstrong16 May 2012 13:00 UTC
47 points
39 comments5 min readLW link

An unaligned benchmark

paulfchristiano17 Nov 2018 15:51 UTC
31 points
0 comments9 min readLW link

Following human norms

Rohin Shah20 Jan 2019 23:59 UTC
28 points
10 comments5 min readLW link

nostalgebraist: Recursive Goodhart’s Law

Kaj_Sotala26 Aug 2020 11:07 UTC
53 points
27 comments1 min readLW link
(nostalgebraist.tumblr.com)

[AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents

Rohin Shah26 Aug 2020 17:20 UTC
21 points
3 comments8 min readLW link
(mailchi.mp)

[Question] How hard would it be to change GPT-3 in a way that allows audio?

ChristianKl28 Aug 2020 14:42 UTC
8 points
5 comments1 min readLW link

Safe Scrambling?

Hoagy29 Aug 2020 14:31 UTC
3 points
1 comment2 min readLW link

(Humor) AI Alignment Critical Failure Table

Kaj_Sotala31 Aug 2020 19:51 UTC
24 points
2 comments1 min readLW link
(sl4.org)

What is ambitious value learning?

Rohin Shah1 Nov 2018 16:20 UTC
46 points
28 comments2 min readLW link

The easy goal inference problem is still hard

paulfchristiano3 Nov 2018 14:41 UTC
45 points
17 comments4 min readLW link

[AN #115]: AI safety research problems in the AI-GA framework

Rohin Shah2 Sep 2020 17:10 UTC
19 points
16 comments6 min readLW link
(mailchi.mp)

Emotional valence vs RL reward: a video game analogy

Steven Byrnes3 Sep 2020 15:28 UTC
12 points
6 comments4 min readLW link

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

3 Sep 2020 18:27 UTC
64 points
12 comments2 min readLW link

“Learning to Summarize with Human Feedback”—OpenAI

Rekrul7 Sep 2020 17:59 UTC
57 points
3 comments1 min readLW link

[AN #116]: How to make explanations of neurons compositional

Rohin Shah9 Sep 2020 17:20 UTC
21 points
2 comments9 min readLW link
(mailchi.mp)

Safer sandboxing via collective separation

Richard_Ngo9 Sep 2020 19:49 UTC
23 points
6 comments4 min readLW link

[Question] Do mesa-optimizer risk arguments rely on the train-test paradigm?

Ben Cottier10 Sep 2020 15:36 UTC
12 points
7 comments1 min readLW link

Safety via selection for obedience

Richard_Ngo10 Sep 2020 10:04 UTC
31 points
1 comment5 min readLW link

How Much Computational Power Does It Take to Match the Human Brain?

habryka12 Sep 2020 6:38 UTC
44 points
1 comment1 min readLW link
(www.openphilanthropy.org)

Decision Theory is multifaceted

Michele Campolo13 Sep 2020 22:30 UTC
6 points
12 comments8 min readLW link

AI Safety Discussion Day

Linda Linsefors15 Sep 2020 14:40 UTC
20 points
0 comments1 min readLW link

[AN #117]: How neural nets would fare under the TEVV framework

Rohin Shah16 Sep 2020 17:20 UTC
27 points
0 comments7 min readLW link
(mailchi.mp)

Applying the Counterfactual Prisoner’s Dilemma to Logical Uncertainty

Chris_Leong16 Sep 2020 10:34 UTC
9 points
5 comments2 min readLW link

Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem

Zack_M_Davis17 Sep 2020 2:23 UTC
72 points
12 comments5 min readLW link
(aima.cs.berkeley.edu)

The “Backchaining to Local Search” Technique in AI Alignment

adamShimi18 Sep 2020 15:05 UTC
25 points
1 comment2 min readLW link

Draft report on AI timelines

Ajeya Cotra18 Sep 2020 23:47 UTC
179 points
55 comments1 min readLW link1 review

Why GPT wants to mesa-optimize & how we might change this

John_Maxwell19 Sep 2020 13:48 UTC
54 points
32 comments9 min readLW link

My (Mis)Adventures With Algorithmic Machine Learning

AHartNtkn20 Sep 2020 5:31 UTC
16 points
4 comments41 min readLW link

[Question] What AI companies would be most likely to have a positive long-term impact on the world as a result of investing in them?

MikkW21 Sep 2020 23:41 UTC
8 points
2 comments2 min readLW link

Anthropomorphisation vs value learning: type 1 vs type 2 errors

Stuart_Armstrong22 Sep 2020 10:46 UTC
16 points
10 comments1 min readLW link

AI Advantages [Gems from the Wiki]

22 Sep 2020 22:44 UTC
22 points
7 comments2 min readLW link
(www.lesswrong.com)

A long reply to Ben Garfinkel on Scrutinizing Classic AI Risk Arguments

Søren Elverlin27 Sep 2020 17:51 UTC
17 points
6 comments1 min readLW link

Dehumanisation *errors*

Stuart_Armstrong23 Sep 2020 9:51 UTC
13 points
0 comments1 min readLW link

[AN #118]: Risks, solutions, and prioritization in a world with many AI systems

Rohin Shah23 Sep 2020 18:20 UTC
15 points
6 comments10 min readLW link
(mailchi.mp)

[Question] David Deutsch on Universal Explainers and AI

alanf24 Sep 2020 7:50 UTC
2 points
8 comments2 min readLW link

KL Divergence as Code Patching Efficiency

Zachary Robertson27 Sep 2020 16:06 UTC
17 points
0 comments8 min readLW link

[Question] What to do with imitation humans, other than asking them what the right thing to do is?

Charlie Steiner27 Sep 2020 21:51 UTC
10 points
6 comments1 min readLW link

[Question] What Decision Theory is Implied By Predictive Processing?

johnswentworth28 Sep 2020 17:20 UTC
55 points
17 comments1 min readLW link

AGI safety from first principles: Superintelligence

Richard_Ngo28 Sep 2020 19:53 UTC
72 points
5 comments9 min readLW link

AGI safety from first principles: Introduction

Richard_Ngo28 Sep 2020 19:53 UTC
104 points
17 comments2 min readLW link1 review

[Question] Examples of self-governance to reduce technology risk?

Jia29 Sep 2020 19:31 UTC
10 points
4 comments1 min readLW link

AGI safety from first principles: Goals and Agency

Richard_Ngo29 Sep 2020 19:06 UTC
57 points
15 comments15 min readLW link

“Unsupervised” translation as an (intent) alignment problem

paulfchristiano30 Sep 2020 0:50 UTC
61 points
15 comments4 min readLW link
(ai-alignment.com)

[AN #119]: AI safety when agents are shaped by environments, not rewards

Rohin Shah30 Sep 2020 17:10 UTC
11 points
0 comments11 min readLW link
(mailchi.mp)

AGI safety from first principles: Alignment

Richard_Ngo1 Oct 2020 3:13 UTC
50 points
2 comments13 min readLW link

AGI safety from first principles: Control

Richard_Ngo2 Oct 2020 21:51 UTC
55 points
4 comments9 min readLW link

AI race considerations in a report by the U.S. House Committee on Armed Services

NunoSempere4 Oct 2020 12:11 UTC
42 points
4 comments13 min readLW link

[Question] Is there any work on incorporating aleatoric uncertainty and/or inherent randomness into AIXI?

capybaralet4 Oct 2020 8:10 UTC
9 points
7 comments1 min readLW link

AGI safety from first principles: Conclusion

Richard_Ngo4 Oct 2020 23:06 UTC
56 points
4 comments3 min readLW link

Universal Eudaimonia

hg005 Oct 2020 13:45 UTC
19 points
6 comments2 min readLW link

The Alignment Problem: Machine Learning and Human Values

Rohin Shah6 Oct 2020 17:41 UTC
119 points
6 comments6 min readLW link1 review
(www.amazon.com)

[AN #120]: Tracing the intellectual roots of AI and AI alignment

Rohin Shah7 Oct 2020 17:10 UTC
13 points
4 comments10 min readLW link
(mailchi.mp)

[Question] Brainstorming positive visions of AI

jungofthewon7 Oct 2020 16:09 UTC
52 points
25 comments1 min readLW link

[Question] How can an AI demonstrate purely through chat that it is an AI, and not a human?

hugh.mann7 Oct 2020 17:53 UTC
3 points
4 comments1 min readLW link

[Question] Why isn’t JS a popular language for deep learning?

Will Clark8 Oct 2020 14:36 UTC
12 points
21 comments1 min readLW link

[Question] If GPT-6 is human-level AGI but costs $200 per page of output, what would happen?

Daniel Kokotajlo9 Oct 2020 12:00 UTC
28 points
30 comments1 min readLW link

[Question] Shouldn’t there be a Chinese translation of Human Compatible?

MakoYass9 Oct 2020 8:47 UTC
18 points
13 comments1 min readLW link

Idealized Factored Cognition

Rafael Harth30 Nov 2020 18:49 UTC
34 points
6 comments11 min readLW link

[Question] Reviews of the book ‘The Alignment Problem’

Mati_Roy11 Oct 2020 7:41 UTC
8 points
3 comments1 min readLW link

[Question] Reviews of TV show NeXt (about AI safety)

Mati_Roy11 Oct 2020 4:31 UTC
25 points
4 comments1 min readLW link

The Achilles Heel Hypothesis for AI

scasper