AI

Core Tag. Last edit: 25 Dec 2022 21:11 UTC by plex

Artificial Intelligence is the study of creating intelligence in algorithms. AI Alignment is the task of ensuring that powerful AI systems are aligned with human values and interests. The central concern is that a sufficiently powerful AI, if not designed and implemented with sufficient understanding, would optimize for something its creators did not intend and pose an existential threat to the future of humanity. This is known as the AI alignment problem.

Common terms in this space are superintelligence, AI Alignment, AI Safety, Friendly AI, Transformative AI, human-level intelligence, AI Governance, and Beneficial AI. This entry and the associated tag roughly encompass all of these topics: anything that is part of the broad cluster of understanding AI and its future impacts on our civilization deserves this tag.

AI Alignment

There are narrow conceptions of alignment, where you’re trying to get an AI to do something like cure Alzheimer’s disease without destroying the rest of the world. And there are much more ambitious notions of alignment, where you’re trying to get it to do the right thing and achieve a happy intergalactic civilization.

But narrow and ambitious alignment have in common that you’re trying to have the AI do that thing rather than make a lot of paperclips.

See also General Intelligence.

Basic Alignment Theory

AIXI
Coherent Extrapolated Volition
Complexity of Value
Corrigibility
Deceptive Alignment
Decision Theory
Embedded Agency
Fixed Point Theorems
Goodhart’s Law
Goal-Directedness
Gradient Hacking
Infra-Bayesianism
Inner Alignment
Instrumental Convergence
Intelligence Explosion
Logical Induction
Logical Uncertainty
Mesa-Optimization
Multipolar Scenarios
Myopia
Newcomb’s Problem
Optimization
Orthogonality Thesis
Outer Alignment
Paperclip Maximizer
Power Seeking (AI)
Recursive Self-Improvement
Simulator Theory
Sharp Left Turn
Solomonoff Induction
Superintelligence
Symbol Grounding
Transformative AI
Treacherous Turn
Utility Functions
Whole Brain Emulation

Engineering Alignment

Agent Foundations
AI-assisted Alignment
AI Boxing (Containment)
Conservatism (AI)
Debate (AI safety technique)
Eliciting Latent Knowledge (ELK)
Factored Cognition
Humans Consulting HCH
Impact Measures
Inverse Reinforcement Learning
Iterated Amplification
Mild Optimization
Oracle AI
Reward Functions
RLHF
Shard Theory
Tool AI
Transparency / Interpretability
Tripwire
Value Learning

Organizations

AI Safety Camp
Alignment Research Center
Anthropic
Apart Research
AXRP
CHAI (UC Berkeley)
Conjecture (org)
DeepMind
FHI (Oxford)
Future of Life Institute
MIRI
OpenAI
Ought
SERI MATS

Strategy

AI Alignment Fieldbuilding
AI Governance
AI Persuasion
AI Risk
AI Risk Concrete Stories
AI Safety Public Materials
AI Services (CAIS)
AI Success Models
AI Takeoff
AI Timelines
Computing Overhang
Regulation and AI Risk
Restrain AI Development

Other

AI Alignment Intro Materials
AI Capabilities
AI Questions Open Thread
Compute
DALL-E
GPT
Language Models
Machine Learning
Narrow AI
Neuromorphic AI
Prompt Engineering
Reinforcement Learning
Research Agendas

An overview of 11 proposals for building safe advanced AI

evhub29 May 2020 20:38 UTC
194 points
36 comments38 min readLW link2 reviews

There’s No Fire Alarm for Artificial General Intelligence

Eliezer Yudkowsky13 Oct 2017 21:38 UTC
124 points
71 comments25 min readLW link

Superintelligence FAQ

Scott Alexander20 Sep 2016 19:00 UTC
92 points
16 comments27 min readLW link

Risks from Learned Optimization: Introduction

31 May 2019 23:44 UTC
166 points
42 comments12 min readLW link3 reviews

Embedded Agents

29 Oct 2018 19:53 UTC
198 points
41 comments1 min readLW link2 reviews

What failure looks like

paulfchristiano17 Mar 2019 20:18 UTC
319 points
49 comments8 min readLW link2 reviews

The Rocket Alignment Problem

Eliezer Yudkowsky4 Oct 2018 0:38 UTC
198 points
42 comments15 min readLW link2 reviews

Challenges to Christiano’s capability amplification proposal

Eliezer Yudkowsky19 May 2018 18:18 UTC
115 points
54 comments23 min readLW link1 review

Embedded Agency (full-text version)

15 Nov 2018 19:49 UTC
143 points
15 comments54 min readLW link

A space of proposals for building safe advanced AI

Richard_Ngo10 Jul 2020 16:58 UTC
55 points
4 comments4 min readLW link

Biology-Inspired AGI Timelines: The Trick That Never Works

Eliezer Yudkowsky1 Dec 2021 22:35 UTC
181 points
143 comments65 min readLW link

PreDCA: vanessa kosoy’s alignment protocol

carado20 Aug 2022 10:03 UTC
46 points
8 comments7 min readLW link
(carado.moe)

larger language models may disappoint you [or, an eternally unfinished draft]

nostalgebraist26 Nov 2021 23:08 UTC
237 points
29 comments31 min readLW link1 review

Deep­mind’s Go­pher—more pow­er­ful than GPT-3

hath8 Dec 2021 17:06 UTC
86 points
27 comments1 min readLW link
(deepmind.com)

Project proposal: Testing the IBP definition of agent

9 Aug 2022 1:09 UTC
21 points
4 comments2 min readLW link

Goodhart Taxonomy

Scott Garrabrant30 Dec 2017 16:38 UTC
180 points
33 comments10 min readLW link

AI Alignment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC
125 points
6 comments35 min readLW link

Some AI research areas and their relevance to existential safety

Andrew_Critch19 Nov 2020 3:18 UTC
199 points
40 comments50 min readLW link2 reviews

Moravec’s Paradox Comes From The Availability Heuristic

james.lucassen20 Oct 2021 6:23 UTC
32 points
2 comments2 min readLW link
(jlucassen.com)

Inference cost limits the impact of ever larger models

SoerenMind23 Oct 2021 10:51 UTC
36 points
28 comments2 min readLW link

[Linkpost] Chinese government’s guidelines on AI

RomanS10 Dec 2021 21:10 UTC
61 points
14 comments1 min readLW link

That Alien Message

Eliezer Yudkowsky22 May 2008 5:55 UTC
304 points
173 comments10 min readLW link

Epistemological Framing for AI Alignment Research

adamShimi8 Mar 2021 22:05 UTC
53 points
7 comments9 min readLW link

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

gwern2 Nov 2021 2:32 UTC
134 points
52 comments1 min readLW link
(arxiv.org)

Discussion with Eliezer Yudkowsky on AGI interventions

11 Nov 2021 3:01 UTC
325 points
257 comments34 min readLW link

Shulman and Yudkowsky on AI progress

3 Dec 2021 20:05 UTC
90 points
16 comments20 min readLW link

Future ML Systems Will Be Qualitatively Different

jsteinhardt11 Jan 2022 19:50 UTC
113 points
10 comments5 min readLW link
(bounded-regret.ghost.io)

[Linkpost] TrojanNet: Embedding Hidden Trojan Horse Models in Neural Networks

Gunnar_Zarncke11 Feb 2022 1:17 UTC
13 points
1 comment1 min readLW link

Briefly thinking through some analogs of debate

Eli Tyre11 Sep 2022 12:02 UTC
20 points
3 comments4 min readLW link

Robustness to Scale

Scott Garrabrant21 Feb 2018 22:55 UTC
109 points
22 comments2 min readLW link1 review

Chris Olah’s views on AGI safety

evhub1 Nov 2019 20:13 UTC
197 points
38 comments12 min readLW link2 reviews

[AN #96]: Buck and I discuss/argue about AI Alignment

Rohin Shah22 Apr 2020 17:20 UTC
17 points
4 comments10 min readLW link
(mailchi.mp)

Matt Botvinick on the spontaneous emergence of learning algorithms

Adam Scholl12 Aug 2020 7:47 UTC
147 points
87 comments5 min readLW link

A descriptive, not prescriptive, overview of current AI Alignment Research

6 Jun 2022 21:59 UTC
126 points
21 comments7 min readLW link

Coherence arguments do not entail goal-directed behavior

Rohin Shah3 Dec 2018 3:26 UTC
101 points
69 comments7 min readLW link3 reviews

Alignment By Default

johnswentworth12 Aug 2020 18:54 UTC
153 points
92 comments11 min readLW link2 reviews

Book review: “A Thousand Brains” by Jeff Hawkins

Steven Byrnes4 Mar 2021 5:10 UTC
110 points
18 comments19 min readLW link

Modelling Transformative AI Risks (MTAIR) Project: Introduction

16 Aug 2021 7:12 UTC
89 points
0 comments9 min readLW link

Infra-Bayesian physicalism: a formal theory of naturalized induction

Vanessa Kosoy30 Nov 2021 22:25 UTC
98 points
20 comments42 min readLW link1 review

What an actually pessimistic containment strategy looks like

lc5 Apr 2022 0:19 UTC
554 points
136 comments6 min readLW link

Why I think strong general AI is coming soon

porby28 Sep 2022 5:40 UTC
269 points
126 comments34 min readLW link

AlphaGo Zero and the Foom Debate

Eliezer Yudkowsky21 Oct 2017 2:18 UTC
89 points
17 comments3 min readLW link

Tradeoff between desirable properties for baseline choices in impact measures

Vika4 Jul 2020 11:56 UTC
37 points
24 comments5 min readLW link

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

stuhlmueller21 Jul 2020 20:06 UTC
80 points
40 comments3 min readLW link

the scaling “inconsistency”: openAI’s new insight

nostalgebraist7 Nov 2020 7:40 UTC
146 points
14 comments9 min readLW link
(nostalgebraist.tumblr.com)

2019 Review Rewrite: Seeking Power is Often Robustly Instrumental in MDPs

TurnTrout23 Dec 2020 17:16 UTC
35 points
0 comments4 min readLW link
(www.lesswrong.com)

Bootstrapped Alignment

Gordon Seidoh Worley27 Feb 2021 15:46 UTC
19 points
12 comments2 min readLW link

Multimodal Neurons in Artificial Neural Networks

Kaj_Sotala5 Mar 2021 9:01 UTC
57 points
2 comments2 min readLW link
(distill.pub)

Review of “Fun with +12 OOMs of Compute”

28 Mar 2021 14:55 UTC
60 points
20 comments8 min readLW link

Draft report on existential risk from power-seeking AI

Joe Carlsmith28 Apr 2021 21:41 UTC
80 points
23 comments1 min readLW link

Rogue AGI Embodies Valuable Intellectual Property

3 Jun 2021 20:37 UTC
70 points
9 comments3 min readLW link

DeepMind: Generally capable agents emerge from open-ended play

Daniel Kokotajlo27 Jul 2021 14:19 UTC
247 points
53 comments2 min readLW link
(deepmind.com)

Analogies and General Priors on Intelligence

20 Aug 2021 21:03 UTC
57 points
12 comments14 min readLW link

We’re already in AI takeoff

Valentine8 Mar 2022 23:09 UTC
120 points
115 comments7 min readLW link

It Looks Like You’re Trying To Take Over The World

gwern9 Mar 2022 16:35 UTC
386 points
125 comments1 min readLW link
(www.gwern.net)

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
45 points
0 comments59 min readLW link

Why all the fuss about recursive self-improvement?

So8res12 Jun 2022 20:53 UTC
150 points
62 comments7 min readLW link

AI Safety bounty for practical homomorphic encryption

acylhalide19 Aug 2022 12:27 UTC
29 points
9 comments4 min readLW link

Paper: Discovering novel algorithms with AlphaTensor [Deepmind]

LawrenceC5 Oct 2022 16:20 UTC
80 points
18 comments1 min readLW link
(www.deepmind.com)

The Teacup Test

lsusr8 Oct 2022 4:25 UTC
71 points
28 comments2 min readLW link

Discontinuous progress in history: an update

KatjaGrace14 Apr 2020 0:00 UTC
179 points
25 comments31 min readLW link1 review
(aiimpacts.org)

Replication Dynamics Bridge to RL in Thermodynamic Limit

Zachary Robertson18 May 2020 1:02 UTC
6 points
1 comment2 min readLW link

The ground of optimization

Alex Flint20 Jun 2020 0:38 UTC
218 points
74 comments27 min readLW link1 review

Modelling Continuous Progress

Sammy Martin23 Jun 2020 18:06 UTC
29 points
3 comments7 min readLW link

Reframing Superintelligence: Comprehensive AI Services as General Intelligence

Rohin Shah8 Jan 2019 7:12 UTC
118 points
75 comments5 min readLW link2 reviews
(www.fhi.ox.ac.uk)

Classification of AI alignment research: deconfusion, “good enough” non-superintelligent AI alignment, superintelligent AI alignment

philip_b14 Jul 2020 22:48 UTC
35 points
25 comments3 min readLW link

Collection of GPT-3 results

Kaj_Sotala18 Jul 2020 20:04 UTC
89 points
24 comments1 min readLW link
(twitter.com)

Hiring engineers and researchers to help align GPT-3

paulfchristiano1 Oct 2020 18:54 UTC
206 points
14 comments3 min readLW link

The date of AI Takeover is not the day the AI takes over

Daniel Kokotajlo22 Oct 2020 10:41 UTC
116 points
32 comments2 min readLW link1 review

[Question] What could one do with truly unlimited computational power?

Yitz11 Nov 2020 10:03 UTC
30 points
22 comments2 min readLW link

AGI Predictions

21 Nov 2020 3:46 UTC
110 points
36 comments4 min readLW link

[Question] What are the best precedents for industries failing to invest in valuable AI research?

Daniel Kokotajlo14 Dec 2020 23:57 UTC
18 points
17 comments1 min readLW link

Extrapolating GPT-N performance

Lanrian18 Dec 2020 21:41 UTC
103 points
31 comments25 min readLW link1 review

Debate update: Obfuscated arguments problem

Beth Barnes23 Dec 2020 3:24 UTC
125 points
21 comments16 min readLW link

Literature Review on Goal-Directedness

18 Jan 2021 11:15 UTC
69 points
21 comments31 min readLW link

[Question] How will OpenAI + GitHub’s Copilot affect programming?

smountjoy29 Jun 2021 16:42 UTC
55 points
23 comments1 min readLW link

Modeling Risks From Learned Optimization

Ben Cottier12 Oct 2021 20:54 UTC
44 points
0 comments12 min readLW link

Truthful AI: Developing and governing AI that does not lie

18 Oct 2021 18:37 UTC
81 points
9 comments10 min readLW link

EfficientZero: How It Works

1a3orn26 Nov 2021 15:17 UTC
273 points
42 comments29 min readLW link

Theoretical Neuroscience For Alignment Theory

Cameron Berg7 Dec 2021 21:50 UTC
62 points
19 comments23 min readLW link

Magna Alta Doctrina

jacob_cannell11 Dec 2021 21:54 UTC
37 points
7 comments28 min readLW link

DL towards the unaligned Recursive Self-Optimization attractor

jacob_cannell18 Dec 2021 2:15 UTC
32 points
22 comments4 min readLW link

Regularization Causes Modularity Causes Generalization

dkirmani1 Jan 2022 23:34 UTC
49 points
7 comments3 min readLW link

Is General Intelligence “Compact”?

DragonGod4 Jul 2022 13:27 UTC
21 points
6 comments22 min readLW link

The Tree of Life: Stanford AI Alignment Theory of Change

Gabriel Mukobi2 Jul 2022 18:36 UTC
22 points
0 comments14 min readLW link

Shard Theory: An Overview

David Udell11 Aug 2022 5:44 UTC
135 points
34 comments10 min readLW link

How evolution succeeds and fails at value alignment

Ocracoke21 Aug 2022 7:14 UTC
21 points
2 comments4 min readLW link

An Untrollable Mathematician Illustrated

abramdemski20 Mar 2018 0:00 UTC
155 points
38 comments1 min readLW link1 review

Conditions for Mesa-Optimization

1 Jun 2019 20:52 UTC
75 points
48 comments12 min readLW link

Thoughts on Human Models

21 Feb 2019 9:10 UTC
124 points
32 comments10 min readLW link1 review

Inner alignment in the brain

Steven Byrnes22 Apr 2020 13:14 UTC
76 points
16 comments16 min readLW link

Problem relaxation as a tactic

TurnTrout22 Apr 2020 23:44 UTC
113 points
8 comments7 min readLW link

[Question] How should potential AI alignment researchers gauge whether the field is right for them?

TurnTrout6 May 2020 12:24 UTC
20 points
5 comments1 min readLW link

Specification gaming: the flip side of AI ingenuity

6 May 2020 23:51 UTC
46 points
8 comments6 min readLW link

Lessons from Isaac: Pitfalls of Reason

adamShimi8 May 2020 20:44 UTC
9 points
0 comments8 min readLW link

Corrigibility as outside view

TurnTrout8 May 2020 21:56 UTC
36 points
11 comments4 min readLW link

[Question] How to choose a PhD with AI Safety in mind

Ariel Kwiatkowski15 May 2020 22:19 UTC
9 points
1 comment1 min readLW link

Reward functions and updating assumptions can hide a multitude of sins

Stuart_Armstrong18 May 2020 15:18 UTC
16 points
2 comments9 min readLW link

Possible takeaways from the coronavirus pandemic for slow AI takeoff

Vika31 May 2020 17:51 UTC
135 points
36 comments3 min readLW link1 review

Focus: you are allowed to be bad at accomplishing your goals

adamShimi3 Jun 2020 21:04 UTC
19 points
17 comments3 min readLW link

Reply to Paul Christiano on Inaccessible Information

Alex Flint5 Jun 2020 9:10 UTC
77 points
15 comments6 min readLW link

Our take on CHAI’s research agenda in under 1500 words

Alex Flint17 Jun 2020 12:24 UTC
112 points
19 comments5 min readLW link

[Question] Question on GPT-3 Excel Demo

Zhitao Hou22 Jun 2020 20:31 UTC
0 points
2 comments1 min readLW link

Dynamic inconsistency of the inaction and initial state baseline

Stuart_Armstrong7 Jul 2020 12:02 UTC
30 points
8 comments2 min readLW link

Cortés, Pizarro, and Afonso as Precedents for Takeover

Daniel Kokotajlo1 Mar 2020 3:49 UTC
145 points
75 comments11 min readLW link1 review

[Question] What problem would you like to see Reinforcement Learning applied to?

Julian Schrittwieser8 Jul 2020 2:40 UTC
43 points
4 comments1 min readLW link

My current framework for thinking about AGI timelines

zhukeepa30 Mar 2020 1:23 UTC
107 points
5 comments3 min readLW link

[Question] To what extent is GPT-3 capable of reasoning?

TurnTrout20 Jul 2020 17:10 UTC
70 points
74 comments16 min readLW link

Replicating the replication crisis with GPT-3?

skybrian22 Jul 2020 21:20 UTC
29 points
10 comments1 min readLW link

Can you get AGI from a Transformer?

Steven Byrnes23 Jul 2020 15:27 UTC
114 points
39 comments12 min readLW link

Writing with GPT-3

Jacob Falkovich24 Jul 2020 15:22 UTC
42 points
0 comments4 min readLW link

Inner Alignment: Explain like I’m 12 Edition

Rafael Harth1 Aug 2020 15:24 UTC
175 points
46 comments13 min readLW link2 reviews

Developmental Stages of GPTs

orthonormal26 Jul 2020 22:03 UTC
140 points
74 comments7 min readLW link1 review

Generalizing the Power-Seeking Theorems

TurnTrout27 Jul 2020 0:28 UTC
40 points
6 comments4 min readLW link

Are we in an AI overhang?

Andy Jones27 Jul 2020 12:48 UTC
255 points
109 comments4 min readLW link

[Question] What specific dangers arise when asking GPT-N to write an Alignment Forum post?

Matthew Barnett28 Jul 2020 2:56 UTC
44 points
14 comments1 min readLW link

[Question] Probability that other architectures will scale as well as Transformers?

Daniel Kokotajlo28 Jul 2020 19:36 UTC
22 points
4 comments1 min readLW link

What a 20-year-lead in military tech might look like

Daniel Kokotajlo29 Jul 2020 20:10 UTC
68 points
44 comments16 min readLW link

[Question] What if memes are common in highly capable minds?

Daniel Kokotajlo30 Jul 2020 20:45 UTC
36 points
15 comments2 min readLW link

Three mental images from thinking about AGI debate & corrigibility

Steven Byrnes3 Aug 2020 14:29 UTC
55 points
35 comments4 min readLW link

Solving Key Alignment Problems Group

Logan Riggs3 Aug 2020 19:30 UTC
19 points
7 comments2 min readLW link

How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe?

Anirandis10 Sep 2020 0:40 UTC
19 points
20 comments2 min readLW link

My computational framework for the brain

Steven Byrnes14 Sep 2020 14:19 UTC
144 points
26 comments13 min readLW link1 review

[Question] Where is human level on text prediction? (GPTs task)

Daniel Kokotajlo20 Sep 2020 9:00 UTC
27 points
19 comments1 min readLW link

Needed: AI infohazard policy

Vanessa Kosoy21 Sep 2020 15:26 UTC
61 points
17 comments2 min readLW link

The Colliding Exponentials of AI

Vermillion14 Oct 2020 23:31 UTC
27 points
16 comments5 min readLW link

“Little glimpses of empathy” as the foundation for social emotions

Steven Byrnes22 Oct 2020 11:02 UTC
31 points
1 comment5 min readLW link

Introduction to Cartesian Frames

Scott Garrabrant22 Oct 2020 13:00 UTC
145 points
29 comments22 min readLW link1 review

“Cartesian Frames” Talk #2 this Sunday at 2pm (PT)

Rob Bensinger28 Oct 2020 13:59 UTC
30 points
0 comments1 min readLW link

Does SGD Produce Deceptive Alignment?

Mark Xu6 Nov 2020 23:48 UTC
85 points
6 comments16 min readLW link

[Question] How can I bet on short timelines?

Daniel Kokotajlo7 Nov 2020 12:44 UTC
43 points
16 comments2 min readLW link

Non-Obstruction: A Simple Concept Motivating Corrigibility

TurnTrout21 Nov 2020 19:35 UTC
67 points
19 comments19 min readLW link

Cartesian Frames Definitions

Rob Bensinger8 Nov 2020 12:44 UTC
25 points
0 comments4 min readLW link

Communication Prior as Alignment Strategy

johnswentworth12 Nov 2020 22:06 UTC
40 points
8 comments6 min readLW link

How Roodman’s GWP model translates to TAI timelines

Daniel Kokotajlo16 Nov 2020 14:05 UTC
22 points
5 comments3 min readLW link

Normativity

abramdemski18 Nov 2020 16:52 UTC
46 points
11 comments9 min readLW link

Inner Alignment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC
136 points
39 comments11 min readLW link2 reviews

Continuing the takeoffs debate

Richard_Ngo23 Nov 2020 15:58 UTC
67 points
13 comments9 min readLW link

The next AI winter will be due to energy costs

hippke24 Nov 2020 16:53 UTC
57 points
7 comments2 min readLW link

Recursive Quantilizers II

abramdemski2 Dec 2020 15:26 UTC
30 points
15 comments13 min readLW link

Supervised learning in the brain, part 4: compression / filtering

Steven Byrnes5 Dec 2020 17:06 UTC
12 points
0 comments5 min readLW link

Conservatism in neocortex-like AGIs

Steven Byrnes8 Dec 2020 16:37 UTC
22 points
5 comments8 min readLW link

Avoiding Side Effects in Complex Environments

12 Dec 2020 0:34 UTC
62 points
9 comments2 min readLW link
(avoiding-side-effects.github.io)

The Power of Annealing

meanderingmoose14 Dec 2020 11:02 UTC
25 points
6 comments5 min readLW link

[link] The AI Girlfriend Seducing China’s Lonely Men

Kaj_Sotala14 Dec 2020 20:18 UTC
34 points
11 comments1 min readLW link
(www.sixthtone.com)

Operationalizing compatibility with strategy-stealing

evhub24 Dec 2020 22:36 UTC
41 points
6 comments4 min readLW link

Defusing AGI Danger

Mark Xu24 Dec 2020 22:58 UTC
48 points
9 comments9 min readLW link

Multi-dimensional rewards for AGI interpretability and control

Steven Byrnes4 Jan 2021 3:08 UTC
19 points
8 comments10 min readLW link

DALL-E by OpenAI

Daniel Kokotajlo5 Jan 2021 20:05 UTC
97 points
22 comments1 min readLW link

Review of ‘But exactly how complex and fragile?’

TurnTrout6 Jan 2021 18:39 UTC
55 points
0 comments8 min readLW link

The Case for a Journal of AI Alignment

adamShimi9 Jan 2021 18:13 UTC
45 points
32 comments4 min readLW link

Transparency and AGI safety

jylin0411 Jan 2021 18:51 UTC
52 points
12 comments30 min readLW link

Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

Daniel Kokotajlo18 Jan 2021 12:08 UTC
184 points
85 comments14 min readLW link1 review

Infra-Bayesianism Unwrapped

adamShimi20 Jan 2021 13:35 UTC
41 points
0 comments24 min readLW link

Optimal play in human-judged Debate usually won’t answer your question

Joe_Collman27 Jan 2021 7:34 UTC
33 points
12 comments12 min readLW link

Creating AGI Safety Interlocks

Koen.Holtman5 Feb 2021 12:01 UTC
7 points
4 comments8 min readLW link

Timeline of AI safety

riceissa7 Feb 2021 22:29 UTC
63 points
6 comments2 min readLW link
(timelines.issarice.com)

Tournesol, YouTube and AI Risk

adamShimi12 Feb 2021 18:56 UTC
36 points
13 comments4 min readLW link

Internet Encyclopedia of Philosophy on Ethics of Artificial Intelligence

Kaj_Sotala20 Feb 2021 13:54 UTC
15 points
1 comment4 min readLW link
(iep.utm.edu)

Behavioral Sufficient Statistics for Goal-Directedness

adamShimi11 Mar 2021 15:01 UTC
21 points
12 comments9 min readLW link

A simple way to make GPT-3 follow instructions

Quintin Pope8 Mar 2021 2:57 UTC
11 points
5 comments4 min readLW link

Towards a Mechanistic Understanding of Goal-Directedness

Mark Xu9 Mar 2021 20:17 UTC
45 points
1 comment5 min readLW link

AXRP Episode 5 - Infra-Bayesianism with Vanessa Kosoy

DanielFilan10 Mar 2021 4:30 UTC
33 points
12 comments35 min readLW link

Comments on “The Singularity is Nowhere Near”

Steven Byrnes16 Mar 2021 23:59 UTC
50 points
6 comments8 min readLW link

Is RL involved in sensory processing?

Steven Byrnes18 Mar 2021 13:57 UTC
21 points
21 comments5 min readLW link

Against evolution as an analogy for how humans will create AGI

Steven Byrnes23 Mar 2021 12:29 UTC
44 points
25 comments25 min readLW link

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC
66 points
40 comments16 min readLW link

Coherence arguments imply a force for goal-directed behavior

KatjaGrace26 Mar 2021 16:10 UTC
88 points
27 comments14 min readLW link
(aiimpacts.org)

Transparency Trichotomy

Mark Xu28 Mar 2021 20:26 UTC
25 points
2 comments7 min readLW link

Hardware is already ready for the singularity. Algorithm knowledge is the only barrier.

Andrew Vlahos30 Mar 2021 22:48 UTC
16 points
3 comments3 min readLW link

Ben Goertzel’s “Kinds of Minds”

JoshuaFox11 Apr 2021 12:41 UTC
12 points
4 comments1 min readLW link

Updating the Lottery Ticket Hypothesis

johnswentworth18 Apr 2021 21:45 UTC
73 points
41 comments2 min readLW link

Three reasons to expect long AI timelines

Matthew Barnett22 Apr 2021 18:44 UTC
68 points
29 comments11 min readLW link
(matthewbarnett.substack.com)

Beware over-use of the agent model

Alex Flint25 Apr 2021 22:19 UTC
28 points
10 comments5 min readLW link1 review

Agents Over Cartesian World Models

27 Apr 2021 2:06 UTC
62 points
3 comments27 min readLW link

Less Realistic Tales of Doom

Mark Xu6 May 2021 23:01 UTC
110 points
13 comments4 min readLW link

Challenge: know everything that the best go bot knows about go

DanielFilan11 May 2021 5:10 UTC
48 points
93 comments2 min readLW link
(danielfilan.com)

Formal Inner Alignment, Prospectus

abramdemski12 May 2021 19:57 UTC
91 points
57 comments16 min readLW link

Agency in Conway’s Game of Life

Alex Flint13 May 2021 1:07 UTC
97 points
81 comments9 min readLW link1 review

Knowledge Neurons in Pretrained Transformers

evhub17 May 2021 22:54 UTC
98 points
7 comments2 min readLW link
(arxiv.org)

Decoupling deliberation from competition

paulfchristiano25 May 2021 18:50 UTC
72 points
16 comments9 min readLW link
(ai-alignment.com)

Power dynamics as a blind spot or blurry spot in our collective world-modeling, especially around AI

Andrew_Critch1 Jun 2021 18:45 UTC
176 points
26 comments6 min readLW link

Game-theoretic Alignment in terms of Attainable Utility

8 Jun 2021 12:36 UTC
20 points
2 comments9 min readLW link

Beijing Academy of Artificial Intelligence announces 1,75 trillion parameters model, Wu Dao 2.0

Ozyrus3 Jun 2021 12:07 UTC
23 points
9 comments1 min readLW link
(www.engadget.com)

An Intuitive Guide to Garrabrant Induction

Mark Xu3 Jun 2021 22:21 UTC
115 points
18 comments24 min readLW link

Conservative Agency with Multiple Stakeholders

TurnTrout8 Jun 2021 0:30 UTC
31 points
0 comments3 min readLW link

Supplement to “Big picture of phasic dopamine”

Steven Byrnes8 Jun 2021 13:08 UTC
13 points
2 comments9 min readLW link

Looking Deeper at Deconfusion

adamShimi13 Jun 2021 21:29 UTC
57 points
13 comments15 min readLW link

[Question] Open problem: how can we quantify player alignment in 2x2 normal-form games?

TurnTrout16 Jun 2021 2:09 UTC
23 points
59 comments1 min readLW link

Reward Is Not Enough

Steven Byrnes16 Jun 2021 13:52 UTC
105 points
18 comments10 min readLW link

Environmental Structure Can Cause Instrumental Convergence

TurnTrout22 Jun 2021 22:26 UTC
71 points
44 comments16 min readLW link
(arxiv.org)

AXRP Episode 9 - Finite Factored Sets with Scott Garrabrant

DanielFilan24 Jun 2021 22:10 UTC
56 points
2 comments58 min readLW link

Musings on general systems alignment

Alex Flint30 Jun 2021 18:16 UTC
31 points
11 comments3 min readLW link

Thoughts on safety in predictive learning

Steven Byrnes30 Jun 2021 19:17 UTC
18 points
17 comments19 min readLW link

The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies

TurnTrout11 Jul 2021 17:36 UTC
45 points
7 comments6 min readLW link

A world in which the alignment problem seems lower-stakes

TurnTrout8 Jul 2021 2:31 UTC
19 points
17 comments2 min readLW link

Fractional progress estimates for AI timelines and implied resource requirements

15 Jul 2021 18:43 UTC
55 points
6 comments7 min readLW link

Experimentation with AI-generated images (VQGAN+CLIP) | Solarpunk airships fleeing a dragon

Kaj_Sotala15 Jul 2021 11:00 UTC
44 points
4 comments2 min readLW link
(kajsotala.fi)

Seeking Power is Convergently Instrumental in a Broad Class of Environments

TurnTrout8 Aug 2021 2:02 UTC
41 points
15 comments8 min readLW link

LCDT, A Myopic Decision Theory

3 Aug 2021 22:41 UTC
50 points
51 comments15 min readLW link

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

TurnTrout9 Aug 2021 17:22 UTC
52 points
4 comments5 min readLW link

Two AI-risk-related game design ideas

Daniel Kokotajlo5 Aug 2021 13:36 UTC
47 points
9 comments5 min readLW link

Research agenda update

Steven Byrnes6 Aug 2021 19:24 UTC
54 points
40 comments7 min readLW link

What 2026 looks like

Daniel Kokotajlo6 Aug 2021 16:14 UTC
371 points
109 comments16 min readLW link1 review

Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

TurnTrout18 Nov 2021 1:54 UTC
69 points
8 comments17 min readLW link
(www.overleaf.com)

Dopamine-supervised learning in mammals & fruit flies

Steven Byrnes10 Aug 2021 16:13 UTC
16 points
6 comments8 min readLW link

Free course re­view — Reli­able and In­ter­pretable Ar­tifi­cial In­tel­li­gence (ETH Zurich)

Jan Czechowski10 Aug 2021 16:36 UTC
7 points
0 comments3 min readLW link

Technical Predictions Related to AI Safety

lsusr13 Aug 2021 0:29 UTC
28 points
12 comments8 min readLW link

Provide feedback on Open Philanthropy’s AI alignment RFP

20 Aug 2021 19:52 UTC
56 points
6 comments1 min readLW link

AI Safety Papers: An App for the TAI Safety Database

ozziegooen21 Aug 2021 2:02 UTC
74 points
13 comments2 min readLW link

Randal Koene on brain understanding before whole brain emulation

Steven Byrnes23 Aug 2021 20:59 UTC
36 points
12 comments3 min readLW link

MIRI/OP exchange about decision theory

Rob Bensinger25 Aug 2021 22:44 UTC
47 points
7 comments10 min readLW link

Goodhart Ethology

Charlie Steiner17 Sep 2021 17:31 UTC
18 points
4 comments14 min readLW link

[Question] What are good alignment conference papers?

adamShimi28 Aug 2021 13:35 UTC
12 points
2 comments1 min readLW link

Brain-Computer Interfaces and AI Alignment

niplav28 Aug 2021 19:48 UTC
31 points
6 comments11 min readLW link

Superintelligent Introspection: A Counter-argument to the Orthogonality Thesis

DirectedEvolution29 Aug 2021 4:53 UTC
3 points
18 comments4 min readLW link

Alignment Research = Conceptual Alignment Research + Applied Alignment Research

adamShimi30 Aug 2021 21:13 UTC
37 points
14 comments5 min readLW link

AXRP Episode 11 - Attainable Utility and Power with Alex Turner

DanielFilan25 Sep 2021 21:10 UTC
19 points
5 comments52 min readLW link

Is progress in ML-assisted theorem-proving beneficial?

mako yass28 Sep 2021 1:54 UTC
10 points
3 comments1 min readLW link

Takeoff Speeds and Discontinuities

30 Sep 2021 13:50 UTC
62 points
1 comment15 min readLW link

My take on Vanessa Kosoy’s take on AGI safety

Steven Byrnes30 Sep 2021 12:23 UTC
84 points
10 comments31 min readLW link

[Prediction] We are in an Algorithmic Overhang

lsusr29 Sep 2021 23:40 UTC
31 points
14 comments1 min readLW link

Interview with Skynet

lsusr30 Sep 2021 2:20 UTC
49 points
1 comment2 min readLW link

AI learns be­trayal and how to avoid it

Stuart_Armstrong30 Sep 2021 9:39 UTC
30 points
4 comments2 min readLW link

The Dark Side of Cog­ni­tion Hypothesis

Cameron Berg3 Oct 2021 20:10 UTC
19 points
1 comment16 min readLW link

[Question] How to think about and deal with OpenAI

Rafael Harth9 Oct 2021 13:10 UTC
107 points
71 comments1 min readLW link

NVIDIA and Microsoft re­leases 530B pa­ram­e­ter trans­former model, Me­ga­tron-Tur­ing NLG

Ozyrus11 Oct 2021 15:28 UTC
51 points
36 comments1 min readLW link
(developer.nvidia.com)

Post­mod­ern Warfare

lsusr25 Oct 2021 9:02 UTC
61 points
25 comments2 min readLW link

A very crude de­cep­tion eval is already passed

Beth Barnes29 Oct 2021 17:57 UTC
105 points
8 comments2 min readLW link

Study Guide

johnswentworth6 Nov 2021 1:23 UTC
220 points
41 comments16 min readLW link

Re: At­tempted Gears Anal­y­sis of AGI In­ter­ven­tion Dis­cus­sion With Eliezer

lsusr15 Nov 2021 10:02 UTC
20 points
8 comments15 min readLW link

Ngo and Yud­kowsky on al­ign­ment difficulty

15 Nov 2021 20:31 UTC
235 points
143 comments99 min readLW link

Cor­rigi­bil­ity Can Be VNM-Incoherent

TurnTrout20 Nov 2021 0:30 UTC
64 points
24 comments7 min readLW link

Visi­ble Thoughts Pro­ject and Bounty Announcement

So8res30 Nov 2021 0:19 UTC
245 points
104 comments13 min readLW link

In­ter­pret­ing Yud­kowsky on Deep vs Shal­low Knowledge

adamShimi5 Dec 2021 17:32 UTC
100 points
32 comments24 min readLW link

Are there al­ter­na­tive to solv­ing value trans­fer and ex­trap­o­la­tion?

Stuart_Armstrong6 Dec 2021 18:53 UTC
19 points
7 comments5 min readLW link

Con­sid­er­a­tions on in­ter­ac­tion be­tween AI and ex­pected value of the fu­ture

Beth Barnes7 Dec 2021 2:46 UTC
64 points
28 comments4 min readLW link

Some thoughts on why ad­ver­sar­ial train­ing might be useful

Beth Barnes8 Dec 2021 1:28 UTC
9 points
5 comments3 min readLW link

The Plan

johnswentworth10 Dec 2021 23:41 UTC
235 points
77 comments14 min readLW link

Moore’s Law, AI, and the pace of progress

Veedrac11 Dec 2021 3:02 UTC
120 points
39 comments24 min readLW link

Sum­mary of the Acausal At­tack Is­sue for AIXI

Diffractor13 Dec 2021 8:16 UTC
14 points
6 comments4 min readLW link

Con­se­quen­tial­ism & corrigibility

Steven Byrnes14 Dec 2021 13:23 UTC
60 points
27 comments7 min readLW link

Should we rely on the speed prior for safety?

Marc-Everin Carauleanu14 Dec 2021 20:45 UTC
14 points
6 comments5 min readLW link

The Case for Rad­i­cal Op­ti­mism about Interpretability

Quintin Pope16 Dec 2021 23:38 UTC
57 points
16 comments8 min readLW link1 review

Re­searcher in­cen­tives cause smoother progress on bench­marks

ryan_greenblatt21 Dec 2021 4:13 UTC
20 points
4 comments1 min readLW link

Self-Or­ganised Neu­ral Net­works: A sim­ple, nat­u­ral and effi­cient way to intelligence

D𝜋1 Jan 2022 23:24 UTC
41 points
51 comments44 min readLW link

Prizes for ELK proposals

paulfchristiano3 Jan 2022 20:23 UTC
141 points
156 comments7 min readLW link

D𝜋’s Spiking Network

lsusr4 Jan 2022 4:08 UTC
50 points
37 comments4 min readLW link

More Is Differ­ent for AI

jsteinhardt4 Jan 2022 19:30 UTC
137 points
22 comments3 min readLW link
(bounded-regret.ghost.io)

In­stru­men­tal Con­ver­gence For Real­is­tic Agent Objectives

TurnTrout22 Jan 2022 0:41 UTC
35 points
9 comments9 min readLW link

What’s Up With Con­fus­ingly Per­va­sive Con­se­quen­tial­ism?

Raemon20 Jan 2022 19:22 UTC
169 points
88 comments4 min readLW link

[In­tro to brain-like-AGI safety] 1. What’s the prob­lem & Why work on it now?

Steven Byrnes26 Jan 2022 15:23 UTC
119 points
19 comments23 min readLW link

Ar­gu­ments about Highly Reli­able Agent De­signs as a Use­ful Path to Ar­tifi­cial In­tel­li­gence Safety

27 Jan 2022 13:13 UTC
27 points
0 comments1 min readLW link
(arxiv.org)

Com­pet­i­tive pro­gram­ming with AlphaCode

Algon2 Feb 2022 16:49 UTC
58 points
37 comments15 min readLW link
(deepmind.com)

Thoughts on AGI safety from the top

jylin042 Feb 2022 20:06 UTC
35 points
3 comments32 min readLW link

Paradigm-build­ing from first prin­ci­ples: Effec­tive al­tru­ism, AGI, and alignment

Cameron Berg8 Feb 2022 16:12 UTC
24 points
5 comments14 min readLW link

[In­tro to brain-like-AGI safety] 3. Two sub­sys­tems: Learn­ing & Steering

Steven Byrnes9 Feb 2022 13:09 UTC
59 points
3 comments24 min readLW link

[In­tro to brain-like-AGI safety] 4. The “short-term pre­dic­tor”

Steven Byrnes16 Feb 2022 13:12 UTC
51 points
11 comments13 min readLW link

ELK Pro­posal: Think­ing Via A Hu­man Imitator

TurnTrout22 Feb 2022 1:52 UTC
28 points
6 comments11 min readLW link

Why I’m co-found­ing Aligned AI

Stuart_Armstrong17 Feb 2022 19:55 UTC
93 points
54 comments3 min readLW link

Im­pli­ca­tions of au­to­mated on­tol­ogy identification

18 Feb 2022 3:30 UTC
67 points
29 comments23 min readLW link

Align­ment re­search exercises

Richard_Ngo21 Feb 2022 20:24 UTC
146 points
17 comments8 min readLW link

[In­tro to brain-like-AGI safety] 5. The “long-term pre­dic­tor”, and TD learning

Steven Byrnes23 Feb 2022 14:44 UTC
41 points
25 comments21 min readLW link

How do new mod­els from OpenAI, Deep­Mind and An­thropic perform on Truth­fulQA?

Owain_Evans26 Feb 2022 12:46 UTC
42 points
3 comments11 min readLW link

Es­ti­mat­ing Brain-Equiv­a­lent Com­pute from Image Recog­ni­tion Al­gorithms

Gunnar_Zarncke27 Feb 2022 2:45 UTC
14 points
4 comments2 min readLW link

[Link] Aligned AI AMA

Stuart_Armstrong1 Mar 2022 12:01 UTC
18 points
0 comments1 min readLW link

[In­tro to brain-like-AGI safety] 6. Big pic­ture of mo­ti­va­tion, de­ci­sion-mak­ing, and RL

Steven Byrnes2 Mar 2022 15:26 UTC
41 points
13 comments16 min readLW link

[Question] Would (my­opic) gen­eral pub­lic good pro­duc­ers sig­nifi­cantly ac­cel­er­ate the de­vel­op­ment of AGI?

mako yass2 Mar 2022 23:47 UTC
25 points
10 comments1 min readLW link

[In­tro to brain-like-AGI safety] 7. From hard­coded drives to fore­sighted plans: A worked example

Steven Byrnes9 Mar 2022 14:28 UTC
56 points
0 comments9 min readLW link

[In­tro to brain-like-AGI safety] 9. Take­aways from neuro 2/​2: On AGI motivation

Steven Byrnes23 Mar 2022 12:48 UTC
31 points
6 comments23 min readLW link

Hu­mans pre­tend­ing to be robots pre­tend­ing to be human

Richard_Kennaway28 Mar 2022 15:13 UTC
27 points
15 comments1 min readLW link

[In­tro to brain-like-AGI safety] 10. The al­ign­ment problem

Steven Byrnes30 Mar 2022 13:24 UTC
34 points
4 comments21 min readLW link

AXRP Epi­sode 13 - First Prin­ci­ples of AGI Safety with Richard Ngo

DanielFilan31 Mar 2022 5:20 UTC
24 points
1 comment48 min readLW link

Un­con­trol­lable Su­per-Pow­er­ful Explosives

Sammy Martin2 Apr 2022 20:13 UTC
53 points
12 comments5 min readLW link

The case for Do­ing Some­thing Else (if Align­ment is doomed)

Rafael Harth5 Apr 2022 17:52 UTC
81 points
14 comments2 min readLW link

[In­tro to brain-like-AGI safety] 11. Safety ≠ al­ign­ment (but they’re close!)

Steven Byrnes6 Apr 2022 13:39 UTC
25 points
1 comment10 min readLW link

Strate­gic Con­sid­er­a­tions Re­gard­ing Autis­tic/​Literal AI

Chris_Leong6 Apr 2022 14:57 UTC
−1 points
2 comments2 min readLW link

DALL·E 2 by OpenAI

P.6 Apr 2022 14:17 UTC
44 points
51 comments1 min readLW link
(openai.com)

How to train your trans­former

p.b.7 Apr 2022 9:34 UTC
6 points
0 comments8 min readLW link

AMA Con­jec­ture, A New Align­ment Startup

adamShimi9 Apr 2022 9:43 UTC
46 points
42 comments1 min readLW link

Worse than an un­al­igned AGI

shminux10 Apr 2022 3:35 UTC
−1 points
12 comments1 min readLW link

[Question] Did OpenAI let GPT out of the box?

ChristianKl16 Apr 2022 14:56 UTC
4 points
12 comments1 min readLW link

In­stru­men­tal Con­ver­gence To Offer Hope?

michael_mjd22 Apr 2022 1:56 UTC
12 points
7 comments3 min readLW link

[In­tro to brain-like-AGI safety] 13. Sym­bol ground­ing & hu­man so­cial instincts

Steven Byrnes27 Apr 2022 13:30 UTC
54 points
13 comments14 min readLW link

[In­tro to brain-like-AGI safety] 14. Con­trol­led AGI

Steven Byrnes11 May 2022 13:17 UTC
26 points
25 comments18 min readLW link

[Question] What’s keep­ing con­cerned ca­pa­bil­ities gain re­searchers from leav­ing the field?

sovran12 May 2022 12:16 UTC
19 points
4 comments1 min readLW link

Read­ing the ethi­cists: A re­view of ar­ti­cles on AI in the jour­nal Science and Eng­ineer­ing Ethics

Charlie Steiner18 May 2022 20:52 UTC
50 points
8 comments14 min readLW link

Con­fused why a “ca­pa­bil­ities re­search is good for al­ign­ment progress” po­si­tion isn’t dis­cussed more

Kaj_Sotala2 Jun 2022 21:41 UTC
132 points
26 comments4 min readLW link

I’m try­ing out “as­ter­oid mind­set”

Alex_Altair3 Jun 2022 13:35 UTC
85 points
5 comments4 min readLW link

An­nounc­ing the Align­ment of Com­plex Sys­tems Re­search Group

4 Jun 2022 4:10 UTC
79 points
18 comments5 min readLW link

AGI Ruin: A List of Lethalities

Eliezer Yudkowsky5 Jun 2022 22:05 UTC
725 points
653 comments30 min readLW link

Yes, AI re­search will be sub­stan­tially cur­tailed if a lab causes a ma­jor disaster

lc14 Jun 2022 22:17 UTC
96 points
35 comments2 min readLW link

Lamda is not an LLM

Kevin19 Jun 2022 11:13 UTC
7 points
10 comments1 min readLW link
(www.wired.com)

Google’s new text-to-image model—Parti, a demon­stra­tion of scal­ing benefits

Kayden22 Jun 2022 20:00 UTC
32 points
4 comments1 min readLW link

[Link] OpenAI: Learn­ing to Play Minecraft with Video PreTrain­ing (VPT)

Aryeh Englander23 Jun 2022 16:29 UTC
53 points
3 comments1 min readLW link

An­nounc­ing Epoch: A re­search or­ga­ni­za­tion in­ves­ti­gat­ing the road to Trans­for­ma­tive AI

27 Jun 2022 13:55 UTC
95 points
2 comments2 min readLW link
(epochai.org)

Paper: Fore­cast­ing world events with neu­ral nets

1 Jul 2022 19:40 UTC
39 points
3 comments4 min readLW link

Naive Hy­pothe­ses on AI Alignment

Shoshannah Tekofsky2 Jul 2022 19:03 UTC
89 points
29 comments5 min readLW link

Hu­mans provide an un­tapped wealth of ev­i­dence about alignment

14 Jul 2022 2:31 UTC
175 points
92 comments10 min readLW link

Ex­am­ples of AI In­creas­ing AI Progress

ThomasW17 Jul 2022 20:06 UTC
104 points
14 comments1 min readLW link

Fore­cast­ing ML Bench­marks in 2023

jsteinhardt18 Jul 2022 2:50 UTC
36 points
19 comments12 min readLW link
(bounded-regret.ghost.io)

Ro­bust­ness to Scal­ing Down: More Im­por­tant Than I Thought

adamShimi23 Jul 2022 11:40 UTC
37 points
5 comments3 min readLW link

Com­par­ing Four Ap­proaches to In­ner Alignment

Lucas Teixeira29 Jul 2022 21:06 UTC
33 points
1 comment9 min readLW link

Where are the red lines for AI?

Karl von Wendt5 Aug 2022 9:34 UTC
23 points
8 comments6 min readLW link

Jack Clark on the re­al­ities of AI policy

Kaj_Sotala7 Aug 2022 8:44 UTC
66 points
3 comments3 min readLW link
(threadreaderapp.com)

GD’s Im­plicit Bias on Separable Data

Xander Davies17 Oct 2022 4:13 UTC
23 points
0 comments7 min readLW link

AI Trans­parency: Why it’s crit­i­cal and how to ob­tain it.

Zohar Jackson14 Aug 2022 10:31 UTC
6 points
1 comment5 min readLW link

Brain-like AGI pro­ject “ain­telope”

Gunnar_Zarncke14 Aug 2022 16:33 UTC
48 points
2 comments1 min readLW link

A Mechanis­tic In­ter­pretabil­ity Anal­y­sis of Grokking

15 Aug 2022 2:41 UTC
338 points
39 comments42 min readLW link
(colab.research.google.com)

What if we ap­proach AI safety like a tech­ni­cal en­g­ineer­ing safety problem

zeshen20 Aug 2022 10:29 UTC
30 points
5 comments7 min readLW link

AI art isn’t “about to shake things up”. It’s already here.

Davis_Kingsley22 Aug 2022 11:17 UTC
65 points
19 comments3 min readLW link

Some con­cep­tual al­ign­ment re­search projects

Richard_Ngo25 Aug 2022 22:51 UTC
168 points
14 comments3 min readLW link

Lev­el­ling Up in AI Safety Re­search Engineering

Gabriel Mukobi2 Sep 2022 4:59 UTC
40 points
7 comments17 min readLW link

The shard the­ory of hu­man values

4 Sep 2022 4:28 UTC
202 points
57 comments24 min readLW link

Quintin’s al­ign­ment pa­pers roundup—week 1

Quintin Pope10 Sep 2022 6:39 UTC
119 points
5 comments9 min readLW link

LOVE in a sim­box is all you need

jacob_cannell28 Sep 2022 18:25 UTC
59 points
69 comments44 min readLW link

A shot at the di­a­mond-al­ign­ment problem

TurnTrout6 Oct 2022 18:29 UTC
77 points
53 comments15 min readLW link

More ex­am­ples of goal misgeneralization

7 Oct 2022 14:38 UTC
51 points
8 comments2 min readLW link
(deepmindsafetyresearch.medium.com)

[Cross­post] AlphaTen­sor, Taste, and the Scal­a­bil­ity of AI

jamierumbelow9 Oct 2022 19:42 UTC
16 points
4 comments1 min readLW link
(jamieonsoftware.com)

QAPR 4: In­duc­tive biases

Quintin Pope10 Oct 2022 22:08 UTC
63 points
2 comments18 min readLW link

In­finite Pos­si­bil­ity Space and the Shut­down Problem

magfrump18 Oct 2022 5:37 UTC
6 points
0 comments2 min readLW link
(www.magfrump.net)

Cruxes in Katja Grace’s Counterarguments

azsantosk16 Oct 2022 8:44 UTC
16 points
0 comments7 min readLW link

Deep­Mind on Strat­ego, an im­perfect in­for­ma­tion game

sanxiyn24 Oct 2022 5:57 UTC
15 points
9 comments1 min readLW link
(arxiv.org)

An­nounc­ing: What Fu­ture World? - Grow­ing the AI Gover­nance Community

DavidCorfield2 Nov 2022 1:24 UTC
1 point
0 comments1 min readLW link

Poster Ses­sion on AI Safety

Neil Crawford12 Nov 2022 3:50 UTC
7 points
6 comments1 min readLW link

AI will change the world, but won’t take it over by play­ing “3-di­men­sional chess”.

22 Nov 2022 18:57 UTC
103 points
86 comments24 min readLW link

A challenge for AGI or­ga­ni­za­tions, and a challenge for readers

1 Dec 2022 23:11 UTC
265 points
30 comments2 min readLW link

Towards Hodge-podge Alignment

Cleo Nardo19 Dec 2022 20:12 UTC
65 points
26 comments9 min readLW link

[AN #94]: AI al­ign­ment as trans­la­tion be­tween hu­mans and machines

Rohin Shah8 Apr 2020 17:10 UTC
11 points
0 comments7 min readLW link
(mailchi.mp)

[Question] What are the rel­a­tive speeds of AI ca­pa­bil­ities and AI safety?

NunoSempere24 Apr 2020 18:21 UTC
8 points
2 comments1 min readLW link

Seek­ing Power is Often Con­ver­gently In­stru­men­tal in MDPs

5 Dec 2019 2:33 UTC
153 points
38 comments16 min readLW link2 reviews
(arxiv.org)

“Don’t even think about hell”

emmab2 May 2020 8:06 UTC
6 points
2 comments1 min readLW link

[Question] AI Box­ing for Hard­ware-bound agents (aka the China al­ign­ment prob­lem)

Logan Zoellner8 May 2020 15:50 UTC
11 points
27 comments10 min readLW link

Point­ing to a Flower

johnswentworth18 May 2020 18:54 UTC
59 points
18 comments9 min readLW link

Learn­ing and ma­nipu­lat­ing learning

Stuart_Armstrong19 May 2020 13:02 UTC
39 points
5 comments10 min readLW link

[Question] Why aren’t we test­ing gen­eral in­tel­li­gence dis­tri­bu­tion?

B Jacobs26 May 2020 16:07 UTC
25 points
7 comments1 min readLW link

OpenAI an­nounces GPT-3

gwern29 May 2020 1:49 UTC
67 points
23 comments1 min readLW link
(arxiv.org)

GPT-3: a dis­ap­point­ing paper

nostalgebraist29 May 2020 19:06 UTC
65 points
44 comments8 min readLW link1 review

In­tro­duc­tion to Ex­is­ten­tial Risks from Ar­tifi­cial In­tel­li­gence, for an EA audience

JoshuaFox2 Jun 2020 8:30 UTC
10 points
1 comment1 min readLW link

Prepar­ing for “The Talk” with AI projects

Daniel Kokotajlo13 Jun 2020 23:01 UTC
64 points
16 comments3 min readLW link

[Question] What are the high-level ap­proaches to AI al­ign­ment?

Gordon Seidoh Worley16 Jun 2020 17:10 UTC
12 points
13 comments1 min readLW link

Re­sults of $1,000 Or­a­cle con­test!

Stuart_Armstrong17 Jun 2020 17:44 UTC
58 points
2 comments1 min readLW link

[Question] Like­li­hood of hy­per­ex­is­ten­tial catas­tro­phe from a bug?

Anirandis18 Jun 2020 16:23 UTC
13 points
27 comments1 min readLW link

AI Benefits Post 1: In­tro­duc­ing “AI Benefits”

Cullen_OKeefe22 Jun 2020 16:59 UTC
11 points
3 comments3 min readLW link

Goals and short descriptions

Michele Campolo2 Jul 2020 17:41 UTC
14 points
8 comments5 min readLW link

Re­search ideas to study hu­mans with AI Safety in mind

Riccardo Volpato3 Jul 2020 16:01 UTC
23 points
2 comments5 min readLW link

AI Benefits Post 3: Direct and Indi­rect Ap­proaches to AI Benefits

Cullen_OKeefe6 Jul 2020 18:48 UTC
8 points
0 comments2 min readLW link

An­titrust-Com­pli­ant AI In­dus­try Self-Regulation

Cullen_OKeefe7 Jul 2020 20:53 UTC
9 points
3 comments1 min readLW link
(cullenokeefe.com)

Should AI Be Open?

Scott Alexander17 Dec 2015 8:25 UTC
20 points
3 comments13 min readLW link

Meta Pro­gram­ming GPT: A route to Su­per­in­tel­li­gence?

dmtea11 Jul 2020 14:51 UTC
10 points
7 comments4 min readLW link

The Dilemma of Worse Than Death Scenarios

arkaeik10 Jul 2018 9:18 UTC
5 points
18 comments4 min readLW link

[Question] What are the mostly likely ways AGI will emerge?

Craig Quiter14 Jul 2020 0:58 UTC
3 points
7 comments1 min readLW link

AI Benefits Post 4: Out­stand­ing Ques­tions on Select­ing Benefits

Cullen_OKeefe14 Jul 2020 17:26 UTC
4 points
4 comments5 min readLW link

Solv­ing Math Prob­lems by Relay

17 Jul 2020 15:32 UTC
98 points
26 comments7 min readLW link

AI Benefits Post 5: Out­stand­ing Ques­tions on Govern­ing Benefits

Cullen_OKeefe21 Jul 2020 16:46 UTC
4 points
0 comments4 min readLW link

[Question] Why is pseudo-al­ign­ment “worse” than other ways ML can fail to gen­er­al­ize?

nostalgebraist18 Jul 2020 22:54 UTC
45 points
10 comments2 min readLW link

[Question] “Do Noth­ing” util­ity func­tion, 3½ years later?

niplav20 Jul 2020 11:09 UTC
5 points
3 comments1 min readLW link

[AN #80]: Why AI risk might be solved with­out ad­di­tional in­ter­ven­tion from longtermists

Rohin Shah2 Jan 2020 18:20 UTC
34 points
94 comments10 min readLW link
(mailchi.mp)

Ac­cess to AI: a hu­man right?

dmtea25 Jul 2020 9:38 UTC
5 points
3 comments2 min readLW link

The Rise of Com­mon­sense Reasoning

DragonGod27 Jul 2020 19:01 UTC
8 points
0 comments1 min readLW link
(www.reddit.com)

AI and Efficiency

DragonGod27 Jul 2020 20:58 UTC
9 points
1 comment1 min readLW link
(openai.com)

FHI Re­port: How Will Na­tional Se­cu­rity Con­sid­er­a­tions Affect An­titrust De­ci­sions in AI? An Ex­am­i­na­tion of His­tor­i­cal Precedents

Cullen_OKeefe28 Jul 2020 18:34 UTC
2 points
0 comments1 min readLW link
(www.fhi.ox.ac.uk)

The “best pre­dic­tor is mal­i­cious op­ti­miser” problem

Donald Hobson29 Jul 2020 11:49 UTC
14 points
10 comments2 min readLW link

Suffi­ciently Ad­vanced Lan­guage Models Can Do Re­in­force­ment Learning

Zachary Robertson2 Aug 2020 15:32 UTC
21 points
7 comments7 min readLW link

[Question] What are the most im­por­tant pa­pers/​post/​re­sources to read to un­der­stand more of GPT-3?

adamShimi2 Aug 2020 20:53 UTC
22 points
4 comments1 min readLW link

[Question] What should an Ein­stein-like figure in Ma­chine Learn­ing do?

Razied5 Aug 2020 23:52 UTC
3 points
3 comments1 min readLW link

Book re­view: Ar­chi­tects of In­tel­li­gence by Martin Ford (2018)

Ofer11 Aug 2020 17:30 UTC
15 points
0 comments2 min readLW link

[Question] Will OpenAI’s work un­in­ten­tion­ally in­crease ex­is­ten­tial risks re­lated to AI?

adamShimi11 Aug 2020 18:16 UTC
50 points
56 comments1 min readLW link

Blog post: A tale of two re­search communities

Aryeh Englander12 Aug 2020 20:41 UTC
14 points
0 comments4 min readLW link

Map­ping Out Alignment

15 Aug 2020 1:02 UTC
42 points
0 comments5 min readLW link

My Un­der­stand­ing of Paul Chris­ti­ano’s Iter­ated Am­plifi­ca­tion AI Safety Re­search Agenda

Chi Nguyen15 Aug 2020 20:02 UTC
119 points
21 comments39 min readLW link

GPT-3, be­lief, and consistency

skybrian16 Aug 2020 23:12 UTC
18 points
7 comments2 min readLW link

[Question] What pre­cisely do we mean by AI al­ign­ment?

Gordon Seidoh Worley9 Dec 2018 2:23 UTC
27 points
8 comments1 min readLW link

Thoughts on the Fea­si­bil­ity of Pro­saic AGI Align­ment?

iamthouthouarti21 Aug 2020 23:25 UTC
8 points
10 comments1 min readLW link

[Question] Fore­cast­ing Thread: AI Timelines

22 Aug 2020 2:33 UTC
133 points
95 comments2 min readLW link

Learn­ing hu­man prefer­ences: black-box, white-box, and struc­tured white-box access

Stuart_Armstrong24 Aug 2020 11:42 UTC
25 points
9 comments6 min readLW link

Proofs Sec­tion 2.3 (Up­dates, De­ci­sion The­ory)

Diffractor27 Aug 2020 7:49 UTC
7 points
0 comments31 min readLW link

Proofs Sec­tion 2.2 (Iso­mor­phism to Ex­pec­ta­tions)

Diffractor27 Aug 2020 7:52 UTC
7 points
0 comments46 min readLW link

Proofs Sec­tion 2.1 (The­o­rem 1, Lem­mas)

Diffractor27 Aug 2020 7:54 UTC
7 points
0 comments36 min readLW link

Proofs Sec­tion 1.1 (Ini­tial re­sults to LF-du­al­ity)

Diffractor27 Aug 2020 7:59 UTC
7 points
0 comments20 min readLW link

Proofs Sec­tion 1.2 (Mix­tures, Up­dates, Push­for­wards)

Diffractor27 Aug 2020 7:57 UTC
7 points
0 comments14 min readLW link

Ba­sic In­framea­sure Theory

Diffractor27 Aug 2020 8:02 UTC
35 points
16 comments25 min readLW link

Belief Func­tions And De­ci­sion Theory

Diffractor27 Aug 2020 8:00 UTC
15 points
8 comments39 min readLW link

Tech­ni­cal model re­fine­ment formalism

Stuart_Armstrong27 Aug 2020 11:54 UTC
19 points
0 comments6 min readLW link

Pong from pix­els with­out read­ing “Pong from Pix­els”

Ian McKenzie29 Aug 2020 17:26 UTC
15 points
1 comment7 min readLW link

Reflec­tions on AI Timelines Fore­cast­ing Thread

Amandango1 Sep 2020 1:42 UTC
53 points
7 comments5 min readLW link

on “learn­ing to sum­ma­rize”

nostalgebraist12 Sep 2020 3:20 UTC
25 points
13 comments8 min readLW link
(nostalgebraist.tumblr.com)

[Question] The uni­ver­sal­ity of com­pu­ta­tion and mind de­sign space

alanf12 Sep 2020 14:58 UTC
1 point
7 comments1 min readLW link

Clar­ify­ing “What failure looks like”

Sam Clarke20 Sep 2020 20:40 UTC
95 points
14 comments17 min readLW link

Hu­man Bi­ases that Ob­scure AI Progress

Phylliida Dev25 Sep 2020 0:24 UTC
42 points
2 comments4 min readLW link

[Question] Com­pe­tence vs Alignment

Ariel Kwiatkowski30 Sep 2020 21:03 UTC
6 points
4 comments1 min readLW link

AGI safety from first prin­ci­ples: Alignment

Richard_Ngo1 Oct 2020 3:13 UTC
56 points
2 comments13 min readLW link

[Question] GPT-3 + GAN

stick10917 Oct 2020 7:58 UTC
4 points
4 comments1 min readLW link

Book Re­view: Re­in­force­ment Learn­ing by Sut­ton and Barto

billmei20 Oct 2020 19:40 UTC
52 points
3 comments10 min readLW link

GPT-X, Paper­clip Max­i­mizer? An­a­lyz­ing AGI and Fi­nal Goals

meanderingmoose22 Oct 2020 14:33 UTC
8 points
1 comment6 min readLW link

Con­tain­ing the AI… In­side a Si­mu­lated Reality

HumaneAutomation31 Oct 2020 16:16 UTC
1 point
9 comments2 min readLW link

Why those who care about catas­trophic and ex­is­ten­tial risk should care about au­tonomous weapons

aaguirre11 Nov 2020 15:22 UTC
60 points
20 comments19 min readLW link

Euro­pean Master’s Pro­grams in Ma­chine Learn­ing, Ar­tifi­cial In­tel­li­gence, and re­lated fields

Master Programs ML/AI14 Nov 2020 15:51 UTC
32 points
8 comments1 min readLW link

Should we post­pone AGI un­til we reach safety?

otto.barten18 Nov 2020 15:43 UTC
27 points
36 comments3 min readLW link

Com­mit­ment and cred­i­bil­ity in mul­ti­po­lar AI scenarios

anni_leskela4 Dec 2020 18:48 UTC
25 points
3 comments18 min readLW link

[Question] AI Win­ter Is Com­ing—How to profit from it?

maximkazhenkov5 Dec 2020 20:23 UTC
10 points
7 comments1 min readLW link

An­nounc­ing the Tech­ni­cal AI Safety Podcast

Quinn7 Dec 2020 18:51 UTC
42 points
6 comments2 min readLW link
(technical-ai-safety.libsyn.com)

All GPT skills are translation

p.b.13 Dec 2020 20:06 UTC
4 points
0 comments2 min readLW link

[Question] Judg­ing AGI Output

meredev14 Dec 2020 12:43 UTC
3 points
0 comments2 min readLW link

Risk Map of AI Systems

15 Dec 2020 9:16 UTC
25 points
3 comments8 min readLW link

AI Align­ment, Philo­soph­i­cal Plu­ral­ism, and the Rele­vance of Non-Western Philosophy

xuan1 Jan 2021 0:08 UTC
30 points
21 comments20 min readLW link

Are we all mis­al­igned?

Mateusz Mazurkiewicz3 Jan 2021 2:42 UTC
11 points
0 comments5 min readLW link

[Question] What do we *re­ally* ex­pect from a well-al­igned AI?

jan betley4 Jan 2021 20:57 UTC
8 points
10 comments1 min readLW link

Eight claims about multi-agent AGI safety

Richard_Ngo7 Jan 2021 13:34 UTC
73 points
18 comments5 min readLW link

Imi­ta­tive Gen­er­al­i­sa­tion (AKA ‘Learn­ing the Prior’)

Beth Barnes10 Jan 2021 0:30 UTC
92 points
14 comments12 min readLW link

Pre­dic­tion can be Outer Aligned at Optimum

Lanrian10 Jan 2021 18:48 UTC
15 points
12 comments11 min readLW link

[Question] Poll: Which vari­ables are most strate­gi­cally rele­vant?

22 Jan 2021 17:17 UTC
32 points
34 comments1 min readLW link

AISU 2021

Linda Linsefors30 Jan 2021 17:40 UTC
28 points
2 comments1 min readLW link

Deep­mind has made a gen­eral in­duc­tor (“Mak­ing sense of sen­sory in­put”)

mako yass2 Feb 2021 2:54 UTC
48 points
10 comments1 min readLW link
(www.sciencedirect.com)

Coun­ter­fac­tual Plan­ning in AGI Systems

Koen.Holtman3 Feb 2021 13:54 UTC
7 points
0 comments5 min readLW link

[AN #136]: How well will GPT-N perform on down­stream tasks?

Rohin Shah3 Feb 2021 18:10 UTC
21 points
2 comments9 min readLW link
(mailchi.mp)

For­mal Solu­tion to the In­ner Align­ment Problem

michaelcohen18 Feb 2021 14:51 UTC
47 points
123 comments2 min readLW link

TASP Ep 3 - Op­ti­mal Poli­cies Tend to Seek Power

Quinn11 Mar 2021 1:44 UTC
24 points
0 comments1 min readLW link
(technical-ai-safety.libsyn.com)

Phy­lac­tery De­ci­sion Theory

Bunthut2 Apr 2021 20:55 UTC
14 points
6 comments2 min readLW link

Pre­dic­tive Cod­ing has been Unified with Backpropagation

lsusr2 Apr 2021 21:42 UTC
166 points
44 comments2 min readLW link

[Question] What if we could use the the­ory of Mechanism De­sign from Game The­ory as a medium achieve AI Align­ment?

farari74 Apr 2021 12:56 UTC
4 points
0 comments1 min readLW link

Another (outer) al­ign­ment failure story

paulfchristiano7 Apr 2021 20:12 UTC
210 points
38 comments12 min readLW link

A Sys­tem For Evolv­ing In­creas­ingly Gen­eral Ar­tifi­cial In­tel­li­gence From Cur­rent Technologies

Tsang Chung Shu8 Apr 2021 21:37 UTC
1 point
3 comments11 min readLW link

April 2021 Deep Dive: Trans­form­ers and GPT-3

adamShimi1 May 2021 11:18 UTC
30 points
6 comments7 min readLW link

[Question] [time­boxed ex­er­cise] write me your model of AI hu­man-ex­is­ten­tial safety and the al­ign­ment prob­lems in 15 minutes

Quinn4 May 2021 19:10 UTC
6 points
2 comments1 min readLW link

Mostly ques­tions about Dumb AI Kernels

HorizonHeld12 May 2021 22:00 UTC
1 point
1 comment9 min readLW link

Thoughts on Iter­ated Distil­la­tion and Amplification

Waddington11 May 2021 21:32 UTC
9 points
2 comments20 min readLW link

How do we build or­gani­sa­tions that want to build safe AI?

sxae12 May 2021 15:08 UTC
4 points
4 comments9 min readLW link

[Question] Who has ar­gued in de­tail that a cur­rent AI sys­tem is phe­nom­e­nally con­scious?

Robbo14 May 2021 22:03 UTC
3 points
2 comments1 min readLW link

How I Learned to Stop Wor­ry­ing and Love MUM

Waddington20 May 2021 7:57 UTC
2 points
0 comments3 min readLW link

AI Safety Re­search Pro­ject Ideas

Owain_Evans21 May 2021 13:39 UTC
58 points
2 comments3 min readLW link

[Question] How one uses set the­ory for al­ign­ment prob­lem?

Just Learning29 May 2021 0:28 UTC
8 points
6 comments1 min readLW link

Reflec­tion of Hier­ar­chi­cal Re­la­tion­ship via Nuanced Con­di­tion­ing of Game The­ory Ap­proach for AI Devel­op­ment and Utilization

Kyoung-cheol Kim4 Jun 2021 7:20 UTC
2 points
2 comments9 min readLW link

Re­view of “Learn­ing Nor­ma­tivity: A Re­search Agenda”

6 Jun 2021 13:33 UTC
34 points
0 comments6 min readLW link

Hard­ware for Trans­for­ma­tive AI

ViktorThink22 Jun 2021 18:13 UTC
17 points
7 comments2 min readLW link

Alex Turner’s Re­search, Com­pre­hen­sive In­for­ma­tion Gathering

adamShimi23 Jun 2021 9:44 UTC
15 points
3 comments3 min readLW link

Dis­cus­sion: Ob­jec­tive Ro­bust­ness and In­ner Align­ment Terminology

23 Jun 2021 23:25 UTC
70 points
7 comments9 min readLW link

The Lan­guage of Bird

johnswentworth27 Jun 2021 4:44 UTC
44 points
9 comments2 min readLW link

[Question] What are some claims or opinions about multi-multi del­e­ga­tion you’ve seen in the meme­plex that you think de­serve scrutiny?

Quinn27 Jun 2021 17:44 UTC
17 points
6 comments2 min readLW link

An ex­am­i­na­tion of Me­tac­u­lus’ re­solved AI pre­dic­tions and their im­pli­ca­tions for AI timelines

CharlesD20 Jul 2021 9:08 UTC
28 points
0 comments7 min readLW link

[Question] How should my timelines in­fluence my ca­reer choice?

Tom Lieberum3 Aug 2021 10:14 UTC
13 points
10 comments1 min readLW link

What is the prob­lem?

Carlos Ramirez11 Aug 2021 22:33 UTC
7 points
0 comments6 min readLW link

OpenAI Codex: First Impressions

specbug13 Aug 2021 16:52 UTC
49 points
8 comments4 min readLW link
(sixeleven.in)

[Question] 1h-vol­un­teers needed for a small AI Safety-re­lated re­search pro­ject

PabloAMC16 Aug 2021 17:53 UTC
2 points
0 comments1 min readLW link

Ex­trac­tion of hu­man prefer­ences 👨→🤖

arunraja-hub24 Aug 2021 16:34 UTC
18 points
2 comments5 min readLW link

Call for re­search on eval­u­at­ing al­ign­ment (fund­ing + ad­vice available)

Beth Barnes31 Aug 2021 23:28 UTC
105 points
11 comments5 min readLW link

Ob­sta­cles to gra­di­ent hacking

leogao5 Sep 2021 22:42 UTC
21 points
11 comments4 min readLW link

[Question] Con­di­tional on the first AGI be­ing al­igned cor­rectly, is a good out­come even still likely?

iamthouthouarti6 Sep 2021 17:30 UTC
2 points
1 comment1 min readLW link

Dist­in­guish­ing AI takeover scenarios

8 Sep 2021 16:19 UTC
67 points
11 comments14 min readLW link

Paths To High-Level Ma­chine Intelligence

Daniel_Eth10 Sep 2021 13:21 UTC
67 points
8 comments33 min readLW link

How truth­ful is GPT-3? A bench­mark for lan­guage models

Owain_Evans16 Sep 2021 10:09 UTC
56 points
24 comments6 min readLW link

In­ves­ti­gat­ing AI Takeover Scenarios

Sammy Martin17 Sep 2021 18:47 UTC
27 points
1 comment27 min readLW link

A suffi­ciently para­noid non-Friendly AGI might self-mod­ify it­self to be­come Friendly

RomanS22 Sep 2021 6:29 UTC
5 points
2 comments1 min readLW link

Towards De­con­fus­ing Gra­di­ent Hacking

leogao24 Oct 2021 0:43 UTC
25 points
1 comment12 min readLW link

A brief re­view of the rea­sons multi-ob­jec­tive RL could be im­por­tant in AI Safety Research

Ben Smith29 Sep 2021 17:09 UTC
27 points
8 comments10 min readLW link

Meta learn­ing to gra­di­ent hack

Quintin Pope1 Oct 2021 19:25 UTC
54 points
11 comments3 min readLW link

Pro­posal: Scal­ing laws for RL generalization

axioman1 Oct 2021 21:32 UTC
14 points
10 comments11 min readLW link

A Frame­work of Pre­dic­tion Technologies

isaduan3 Oct 2021 10:26 UTC
8 points
2 comments9 min readLW link

AI Pre­dic­tion Ser­vices and Risks of War

isaduan3 Oct 2021 10:26 UTC
3 points
2 comments10 min readLW link

Pos­si­ble Wor­lds af­ter Pre­dic­tion Take-off

isaduan3 Oct 2021 10:26 UTC
5 points
0 comments4 min readLW link

[Pro­posal] Method of lo­cat­ing use­ful sub­nets in large models

Quintin Pope13 Oct 2021 20:52 UTC
9 points
0 comments2 min readLW link

Com­men­tary on “AGI Safety From First Prin­ci­ples by Richard Ngo, Septem­ber 2020”

Robert Kralisch14 Oct 2021 15:11 UTC
3 points
0 comments20 min readLW link

The AGI needs to be honest

rokosbasilisk16 Oct 2021 19:24 UTC
2 points
12 comments2 min readLW link

“Re­dun­dant” AI Alignment

Mckay Jensen16 Oct 2021 21:32 UTC
12 points
3 comments1 min readLW link
(quevivasbien.github.io)

[MLSN #1]: ICLR Safety Paper Roundup

Dan_H18 Oct 2021 15:19 UTC
59 points
1 comment2 min readLW link

AMA on Truth­ful AI: Owen Cot­ton-Bar­ratt, Owain Evans & co-authors

Owain_Evans22 Oct 2021 16:23 UTC
31 points
15 comments1 min readLW link

Hegel vs. GPT-3

Bezzi27 Oct 2021 5:55 UTC
9 points
21 comments2 min readLW link

Google an­nounces Path­ways: new gen­er­a­tion mul­ti­task AI Architecture

Ozyrus29 Oct 2021 11:55 UTC
6 points
1 comment1 min readLW link
(blog.google)

What is the most evil AI that we could build, to­day?

ThomasJ1 Nov 2021 19:58 UTC
−2 points
14 comments1 min readLW link

Why we need proso­cial agents

Akbir Khan2 Nov 2021 15:19 UTC
6 points
0 comments2 min readLW link

Pos­si­ble re­search di­rec­tions to im­prove the mechanis­tic ex­pla­na­tion of neu­ral networks

delton137 9 Nov 2021 2:36 UTC
29 points
8 comments9 min readLW link

What are red flags for Neu­ral Net­work suffer­ing?

Marius Hobbhahn8 Nov 2021 12:51 UTC
26 points
15 comments12 min readLW link

Us­ing Brain-Com­puter In­ter­faces to get more data for AI alignment

Robbo7 Nov 2021 0:00 UTC
35 points
10 comments7 min readLW link

Hard­code the AGI to need our ap­proval in­definitely?

MichaelStJules11 Nov 2021 7:04 UTC
2 points
2 comments1 min readLW link

Stop but­ton: to­wards a causal solution

tailcalled12 Nov 2021 19:09 UTC
23 points
37 comments9 min readLW link

A FLI post­doc­toral grant ap­pli­ca­tion: AI al­ign­ment via causal anal­y­sis and de­sign of agents

PabloAMC13 Nov 2021 1:44 UTC
4 points
0 comments7 min readLW link

What would we do if al­ign­ment were fu­tile?

Grant Demaree14 Nov 2021 8:09 UTC
73 points
43 comments3 min readLW link

At­tempted Gears Anal­y­sis of AGI In­ter­ven­tion Dis­cus­sion With Eliezer

Zvi15 Nov 2021 3:50 UTC
204 points
48 comments16 min readLW link
(thezvi.wordpress.com)

A pos­i­tive case for how we might suc­ceed at pro­saic AI alignment

evhub16 Nov 2021 1:49 UTC
78 points
47 comments6 min readLW link

Su­per in­tel­li­gent AIs that don’t re­quire alignment

Yair Halberstadt16 Nov 2021 19:55 UTC
10 points
2 comments6 min readLW link

Some real ex­am­ples of gra­di­ent hacking

Oliver Sourbut22 Nov 2021 0:11 UTC
15 points
8 comments2 min readLW link

[linkpost] Ac­qui­si­tion of Chess Knowl­edge in AlphaZero

Quintin Pope23 Nov 2021 7:55 UTC
8 points
1 comment1 min readLW link

AI Tracker: mon­i­tor­ing cur­rent and near-fu­ture risks from su­per­scale models

23 Nov 2021 19:16 UTC
64 points
13 comments3 min readLW link
(aitracker.org)

AI Safety Needs Great Engineers

Andy Jones23 Nov 2021 15:40 UTC
78 points
45 comments4 min readLW link

HIRING: In­form and shape a new pro­ject on AI safety at Part­ner­ship on AI

Madhulika Srikumar24 Nov 2021 8:27 UTC
6 points
0 comments1 min readLW link

How to mea­sure FLOP/​s for Neu­ral Net­works em­piri­cally?

Marius Hobbhahn29 Nov 2021 15:18 UTC
16 points
5 comments7 min readLW link

AI Gover­nance Fun­da­men­tals—Cur­ricu­lum and Application

Mauricio30 Nov 2021 2:19 UTC
17 points
0 comments16 min readLW link

Be­hav­ior Clon­ing is Miscalibrated

leogao5 Dec 2021 1:36 UTC
53 points
3 comments3 min readLW link

ML Align­ment The­ory Pro­gram un­der Evan Hubinger

6 Dec 2021 0:03 UTC
82 points
3 comments2 min readLW link

In­for­ma­tion bot­tle­neck for coun­ter­fac­tual corrigibility

tailcalled6 Dec 2021 17:11 UTC
8 points
1 comment7 min readLW link

Model­ing Failure Modes of High-Level Ma­chine Intelligence

6 Dec 2021 13:54 UTC
54 points
1 comment12 min readLW link

Find­ing the mul­ti­ple ground truths of CoinRun and image classification

Stuart_Armstrong8 Dec 2021 18:13 UTC
15 points
3 comments2 min readLW link

[Question] What al­ign­ment-re­lated con­cepts should be bet­ter known in the broader ML com­mu­nity?

Lauro Langosco9 Dec 2021 20:44 UTC
6 points
4 comments1 min readLW link

Un­der­stand­ing Gra­di­ent Hacking

peterbarnett10 Dec 2021 15:58 UTC
30 points
5 comments30 min readLW link

What’s the back­ward-for­ward FLOP ra­tio for Neu­ral Net­works?

13 Dec 2021 8:54 UTC
17 points
8 comments10 min readLW link

My Overview of the AI Align­ment Land­scape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC
111 points
9 comments15 min readLW link

Disen­tan­gling Per­spec­tives On Strat­egy-Steal­ing in AI Safety

shawnghu18 Dec 2021 20:13 UTC
20 points
1 comment11 min readLW link

De­mand­ing and De­sign­ing Aligned Cog­ni­tive Architectures

Koen.Holtman21 Dec 2021 17:32 UTC
8 points
5 comments5 min readLW link

Po­ten­tial gears level ex­pla­na­tions of smooth progress

ryan_greenblatt22 Dec 2021 18:05 UTC
4 points
2 comments2 min readLW link

Trans­former Circuits

evhub22 Dec 2021 21:09 UTC
142 points
4 comments3 min readLW link
(transformer-circuits.pub)

Gra­di­ent Hack­ing via Schel­ling Goals

Adam Scherlis28 Dec 2021 20:38 UTC
33 points
4 comments4 min readLW link

Reader-gen­er­ated Essays

Henrik Karlsson3 Jan 2022 8:56 UTC
17 points
0 comments6 min readLW link
(escapingflatland.substack.com)

Brain Effi­ciency: Much More than You Wanted to Know

jacob_cannell6 Jan 2022 3:38 UTC
195 points
87 comments28 min readLW link

Un­der­stand­ing the two-head strat­egy for teach­ing ML to an­swer ques­tions honestly

Adam Scherlis11 Jan 2022 23:24 UTC
28 points
1 comment10 min readLW link

Plan B in AI Safety approach

avturchin13 Jan 2022 12:03 UTC
33 points
9 comments2 min readLW link

Truth­ful LMs as a warm-up for al­igned AGI

Jacob_Hilton17 Jan 2022 16:49 UTC
65 points
14 comments13 min readLW link

How I’m think­ing about GPT-N

delton137 17 Jan 2022 17:11 UTC
46 points
21 comments18 min readLW link

Align­ment Prob­lems All the Way Down

peterbarnett22 Jan 2022 0:19 UTC
26 points
7 comments10 min readLW link

[Question] How fea­si­ble/​costly would it be to train a very large AI model on dis­tributed clusters of GPUs?

Anonymous25 Jan 2022 19:20 UTC
7 points
4 comments1 min readLW link

Causal­ity, Trans­for­ma­tive AI and al­ign­ment—part I

Marius Hobbhahn27 Jan 2022 16:18 UTC
13 points
11 comments8 min readLW link

2+2: On­tolog­i­cal Framework

Lyrialtus1 Feb 2022 1:07 UTC
−15 points
2 comments12 min readLW link

QNR prospects are im­por­tant for AI al­ign­ment research

Eric Drexler3 Feb 2022 15:20 UTC
82 points
10 comments11 min readLW link

Paradigm-build­ing: Introduction

Cameron Berg8 Feb 2022 0:06 UTC
25 points
0 comments2 min readLW link

Paradigm-build­ing: The hi­er­ar­chi­cal ques­tion framework

Cameron Berg9 Feb 2022 16:47 UTC
11 points
16 comments3 min readLW link

Ques­tion 1: Pre­dicted ar­chi­tec­ture of AGI learn­ing al­gorithm(s)

Cameron Berg10 Feb 2022 17:22 UTC
12 points
1 comment7 min readLW link

Ques­tion 2: Pre­dicted bad out­comes of AGI learn­ing architecture

Cameron Berg11 Feb 2022 22:23 UTC
5 points
1 comment10 min readLW link

Ques­tion 3: Con­trol pro­pos­als for min­i­miz­ing bad outcomes

Cameron Berg12 Feb 2022 19:13 UTC
5 points
1 comment7 min readLW link

Ques­tion 4: Im­ple­ment­ing the con­trol proposals

Cameron Berg13 Feb 2022 17:12 UTC
6 points
2 comments5 min readLW link

Ques­tion 5: The timeline hyperparameter

Cameron Berg14 Feb 2022 16:38 UTC
5 points
3 comments7 min readLW link

Paradigm-build­ing: Con­clu­sion and prac­ti­cal takeaways

Cameron Berg15 Feb 2022 16:11 UTC
2 points
1 comment2 min readLW link

How com­plex are my­opic imi­ta­tors?

Vivek Hebbar8 Feb 2022 12:00 UTC
23 points
1 comment15 min readLW link

Me­tac­u­lus launches con­test for es­says with quan­ti­ta­tive pre­dic­tions about AI

8 Feb 2022 16:07 UTC
25 points
2 comments1 min readLW link
(www.metaculus.com)

Hy­poth­e­sis: gra­di­ent de­scent prefers gen­eral circuits

Quintin Pope8 Feb 2022 21:12 UTC
40 points
26 comments11 min readLW link

Com­pute Trends Across Three eras of Ma­chine Learning

16 Feb 2022 14:18 UTC
91 points
13 comments2 min readLW link

[Question] Is the com­pe­ti­tion/​co­op­er­a­tion be­tween sym­bolic AI and statis­ti­cal AI (ML) about his­tor­i­cal ap­proach to re­search /​ en­g­ineer­ing, or is it more fun­da­men­tally about what in­tel­li­gent agents “are”?

Edward Hammond17 Feb 2022 23:11 UTC
1 point
1 comment2 min readLW link

HCH and Ad­ver­sar­ial Questions

David Udell19 Feb 2022 0:52 UTC
15 points
7 comments26 min readLW link

Thoughts on Danger­ous Learned Optimization

peterbarnett19 Feb 2022 10:46 UTC
4 points
2 comments4 min readLW link

Rel­a­tivized Defi­ni­tions as a Method to Sidestep the Löbian Obstacle

homotowat27 Feb 2022 6:37 UTC
27 points
4 comments7 min readLW link

What we know about ma­chine learn­ing’s repli­ca­tion crisis

Younes Kamel5 Mar 2022 23:55 UTC
35 points
4 comments6 min readLW link
(youneskamel.substack.com)

Pro­ject­ing com­pute trends in Ma­chine Learning

7 Mar 2022 15:32 UTC
59 points
5 comments6 min readLW link

[Sur­vey] Ex­pec­ta­tions of a Post-ASI Order

Lone Pine9 Mar 2022 19:17 UTC
5 points
0 comments1 min readLW link

A Longlist of The­o­ries of Im­pact for Interpretability

Neel Nanda11 Mar 2022 14:55 UTC
106 points
29 comments5 min readLW link

New GPT3 Im­pres­sive Ca­pa­bil­ities—In­struc­tGPT3 [1/​2]

simeon_c13 Mar 2022 10:58 UTC
71 points
10 comments7 min readLW link

Phase tran­si­tions and AGI

17 Mar 2022 17:22 UTC
44 points
19 comments9 min readLW link
(www.metaculus.com)

Can we simu­late hu­man evolu­tion to cre­ate a some­what al­igned AGI?

Thomas Kwa28 Mar 2022 22:55 UTC
21 points
7 comments7 min readLW link

Pro­ject In­tro: Selec­tion The­o­rems for Modularity

4 Apr 2022 12:59 UTC
69 points
20 comments16 min readLW link

My agenda for re­search into trans­former ca­pa­bil­ities—Introduction

p.b.5 Apr 2022 21:23 UTC
11 points
1 comment3 min readLW link

Re­search agenda: Can trans­form­ers do sys­tem 2 think­ing?

p.b.6 Apr 2022 13:31 UTC
20 points
0 comments2 min readLW link

PaLM in “Ex­trap­o­lat­ing GPT-N perfor­mance”

Lanrian6 Apr 2022 13:05 UTC
80 points
19 comments2 min readLW link

Re­search agenda—Build­ing a multi-modal chess-lan­guage model

p.b.7 Apr 2022 12:25 UTC
8 points
2 comments2 min readLW link

Is GPT3 a Good Ra­tion­al­ist? - In­struc­tGPT3 [2/​2]

simeon_c7 Apr 2022 13:46 UTC
11 points
0 comments7 min readLW link

Play­ing with DALL·E 2

Dave Orr7 Apr 2022 18:49 UTC
165 points
116 comments6 min readLW link

Progress Re­port 4: logit lens redux

Nathan Helm-Burger8 Apr 2022 18:35 UTC
3 points
0 comments2 min readLW link

Hyper­bolic takeoff

Ege Erdil9 Apr 2022 15:57 UTC
17 points
8 comments10 min readLW link
(www.metaculus.com)

Elicit: Lan­guage Models as Re­search Assistants

9 Apr 2022 14:56 UTC
70 points
7 comments13 min readLW link

Is it time to start think­ing about what AI Friendli­ness means?

ZT5 11 Apr 2022 9:32 UTC
18 points
6 comments3 min readLW link

What more com­pute does for brain-like mod­els: re­sponse to Rohin

Nathan Helm-Burger13 Apr 2022 3:40 UTC
22 points
14 comments11 min readLW link

Align­ment and Deep Learning

Aiyen17 Apr 2022 0:02 UTC
44 points
35 comments8 min readLW link

[$20K in Prizes] AI Safety Ar­gu­ments Competition

26 Apr 2022 16:13 UTC
74 points
543 comments3 min readLW link

SERI ML Align­ment The­ory Schol­ars Pro­gram 2022

27 Apr 2022 0:43 UTC
56 points
6 comments3 min readLW link

[Question] What is a train­ing “step” vs. “epi­sode” in ma­chine learn­ing?

Evan R. Murphy28 Apr 2022 21:53 UTC
9 points
4 comments1 min readLW link

Prize for Align­ment Re­search Tasks

29 Apr 2022 8:57 UTC
63 points
36 comments10 min readLW link

Quick Thoughts on A.I. Governance

NicholasKross30 Apr 2022 14:49 UTC
66 points
8 comments2 min readLW link
(www.thinkingmuchbetter.com)

What DALL-E 2 can and can­not do

Swimmer963 1 May 2022 23:51 UTC
351 points
305 comments9 min readLW link

Open Prob­lems in Nega­tive Side Effect Minimization

6 May 2022 9:37 UTC
12 points
7 comments17 min readLW link

[Linkpost] diffu­sion mag­ne­tizes man­i­folds (DALL-E 2 in­tu­ition build­ing)

Paul Bricman7 May 2022 11:01 UTC
1 point
0 comments1 min readLW link
(paulbricman.com)

Up­dat­ing Utility Functions

9 May 2022 9:44 UTC
36 points
7 comments8 min readLW link

Con­di­tions for math­e­mat­i­cal equiv­alence of Stochas­tic Gra­di­ent Des­cent and Nat­u­ral Selection

Oliver Sourbut9 May 2022 21:38 UTC
54 points
12 comments10 min readLW link

AI safety should be made more ac­cessible us­ing non text-based media

Massimog10 May 2022 3:14 UTC
2 points
4 comments4 min readLW link

The limits of AI safety via debate

Marius Hobbhahn10 May 2022 13:33 UTC
28 points
7 comments10 min readLW link

In­tro­duc­tion to the se­quence: In­ter­pretabil­ity Re­search for the Most Im­por­tant Century

Evan R. Murphy12 May 2022 19:59 UTC
16 points
0 comments8 min readLW link

Gato as the Dawn of Early AGI

David Udell15 May 2022 6:52 UTC
84 points
29 comments12 min readLW link

Is AI Progress Im­pos­si­ble To Pre­dict?

alyssavance15 May 2022 18:30 UTC
276 points
38 comments2 min readLW link

Deep­Mind’s gen­er­al­ist AI, Gato: A non-tech­ni­cal explainer

16 May 2022 21:21 UTC
57 points
6 comments6 min readLW link

Gato’s Gen­er­al­i­sa­tion: Pre­dic­tions and Ex­per­i­ments I’d Like to See

Oliver Sourbut18 May 2022 7:15 UTC
43 points
3 comments10 min readLW link

Un­der­stand­ing Gato’s Su­per­vised Re­in­force­ment Learning

Lorenzo Rex18 May 2022 11:08 UTC
3 points
5 comments1 min readLW link
(lorenzopieri.com)

A Story of AI Risk: In­struc­tGPT-N

peterbarnett26 May 2022 23:22 UTC
24 points
0 comments8 min readLW link

[Linkpost] A Chi­nese AI op­ti­mized for killing

RomanS3 Jun 2022 9:17 UTC
−2 points
4 comments1 min readLW link

Give the AI safe tools

Adam Jermyn3 Jun 2022 17:04 UTC
3 points
0 comments4 min readLW link

Towards a For­mal­i­sa­tion of Re­turns on Cog­ni­tive Rein­vest­ment (Part 1)

DragonGod4 Jun 2022 18:42 UTC
17 points
8 comments13 min readLW link

Give the model a model-builder

Adam Jermyn6 Jun 2022 12:21 UTC
3 points
0 comments5 min readLW link

AGI Safety FAQ /​ all-dumb-ques­tions-al­lowed thread

Aryeh Englander7 Jun 2022 5:47 UTC
221 points
515 comments4 min readLW link

Em­bod­i­ment is Indis­pens­able for AGI

P. G. Keerthana Gopalakrishnan7 Jun 2022 21:31 UTC
6 points
1 comment6 min readLW link
(keerthanapg.com)

You Only Get One Shot: an In­tu­ition Pump for Embed­ded Agency

Oliver Sourbut9 Jun 2022 21:38 UTC
22 points
4 comments2 min readLW link

Sum­mary of “AGI Ruin: A List of Lethal­ities”

Stephen McAleese10 Jun 2022 22:35 UTC
32 points
2 comments8 min readLW link

Poorly-Aimed Death Rays

Thane Ruthenis11 Jun 2022 18:29 UTC
43 points
5 comments4 min readLW link

ELK Pro­posal—Make the Re­porter care about the Pre­dic­tor’s beliefs

11 Jun 2022 22:53 UTC
8 points
0 comments6 min readLW link

Grokking “Semi-in­for­ma­tive pri­ors over AI timelines”

anson.ho12 Jun 2022 22:17 UTC
15 points
7 comments14 min readLW link

[Question] Favourite new AI pro­duc­tivity tools?

Gabriel Mukobi15 Jun 2022 1:08 UTC
14 points
5 comments1 min readLW link

Con­tra Hofs­tadter on GPT-3 Nonsense

rictic15 Jun 2022 21:53 UTC
235 points
22 comments2 min readLW link

[Question] What if LaMDA is in­deed sen­tient /​ self-aware /​ worth hav­ing rights?

RomanS16 Jun 2022 9:10 UTC
22 points
13 comments1 min readLW link

Ten ex­per­i­ments in mod­u­lar­ity, which we’d like you to run!

16 Jun 2022 9:17 UTC
59 points
2 comments9 min readLW link

Align­ment re­search for “meta” purposes

acylhalide16 Jun 2022 14:03 UTC
15 points
0 comments1 min readLW link

[Question] AI mis­al­ign­ment risk from GPT-like sys­tems?

fiso64 19 Jun 2022 17:35 UTC
10 points
8 comments1 min readLW link

Half-baked al­ign­ment idea: train­ing to generalize

Aaron Bergman19 Jun 2022 20:16 UTC
7 points
2 comments4 min readLW link

Get­ting from an un­al­igned AGI to an al­igned AGI?

Tor Økland Barstad21 Jun 2022 12:36 UTC
9 points
7 comments9 min readLW link

Miti­gat­ing the dam­age from un­al­igned ASI by co­op­er­at­ing with aliens that don’t ex­ist yet

MSRayne21 Jun 2022 16:12 UTC
−8 points
7 comments6 min readLW link

AI Train­ing Should Allow Opt-Out

alyssavance23 Jun 2022 1:33 UTC
76 points
13 comments6 min readLW link

Up­dated Defer­ence is not a strong ar­gu­ment against the util­ity un­cer­tainty ap­proach to alignment

Ivan Vendrov24 Jun 2022 19:32 UTC
20 points
8 comments4 min readLW link

SunPJ in Alenia

FlorianH25 Jun 2022 19:39 UTC
7 points
19 comments8 min readLW link
(plausiblestuff.com)

Con­di­tion­ing Gen­er­a­tive Models

Adam Jermyn25 Jun 2022 22:15 UTC
22 points
18 comments10 min readLW link

Train­ing Trace Pri­ors and Speed Priors

Adam Jermyn26 Jun 2022 18:07 UTC
17 points
0 comments3 min readLW link

De­liber­a­tion Every­where: Sim­ple Examples

Oliver Sourbut27 Jun 2022 17:26 UTC
14 points
0 comments15 min readLW link

De­liber­a­tion, Re­ac­tions, and Con­trol: Ten­ta­tive Defi­ni­tions and a Res­tate­ment of In­stru­men­tal Convergence

Oliver Sourbut27 Jun 2022 17:25 UTC
10 points
0 comments11 min readLW link

For­mal Philos­o­phy and Align­ment Pos­si­ble Projects

Whispermute30 Jun 2022 10:42 UTC
33 points
5 comments8 min readLW link

Refram­ing the AI Risk

Thane Ruthenis1 Jul 2022 18:44 UTC
26 points
7 comments6 min readLW link

Trends in GPU price-performance

1 Jul 2022 15:51 UTC
85 points
10 comments1 min readLW link
(epochai.org)

Fol­low along with Columbia EA’s Ad­vanced AI Safety Fel­low­ship!

RohanS2 Jul 2022 17:45 UTC
3 points
0 comments2 min readLW link
(forum.effectivealtruism.org)

Can we achieve AGI Align­ment by bal­anc­ing mul­ti­ple hu­man ob­jec­tives?

Ben Smith3 Jul 2022 2:51 UTC
11 points
1 comment4 min readLW link

We Need a Con­soli­dated List of Bad AI Align­ment Solutions

Double4 Jul 2022 6:54 UTC
9 points
14 comments1 min readLW link

A com­pressed take on re­cent disagreements

kman4 Jul 2022 4:39 UTC
33 points
9 comments1 min readLW link

My Most Likely Rea­son to Die Young is AI X-Risk

AISafetyIsNotLongtermist4 Jul 2022 17:08 UTC
61 points
24 comments4 min readLW link
(forum.effectivealtruism.org)

The cu­ri­ous case of Pretty Good hu­man in­ner/​outer alignment

PavleMiha5 Jul 2022 19:04 UTC
41 points
45 comments4 min readLW link

In­tro­duc­ing the Fund for Align­ment Re­search (We’re Hiring!)

6 Jul 2022 2:07 UTC
59 points
0 comments4 min readLW link

Outer vs in­ner mis­al­ign­ment: three framings

Richard_Ngo6 Jul 2022 19:46 UTC
43 points
4 comments9 min readLW link

Re­sponse to Blake Richards: AGI, gen­er­al­ity, al­ign­ment, & loss functions

Steven Byrnes12 Jul 2022 13:56 UTC
59 points
9 comments15 min readLW link

Goal Align­ment Is Ro­bust To the Sharp Left Turn

Thane Ruthenis13 Jul 2022 20:23 UTC
45 points
15 comments4 min readLW link

De­cep­tion?! I ain’t got time for that!

Paul Colognese18 Jul 2022 0:06 UTC
50 points
5 comments13 min readLW link

Four ques­tions I ask AI safety researchers

Akash17 Jul 2022 17:25 UTC
17 points
0 comments1 min readLW link

A dis­til­la­tion of Evan Hub­inger’s train­ing sto­ries (for SERI MATS)

Daphne_W18 Jul 2022 3:38 UTC
15 points
1 comment10 min readLW link

Con­di­tion­ing Gen­er­a­tive Models for Alignment

Jozdien18 Jul 2022 7:11 UTC
40 points
8 comments22 min readLW link

In­for­ma­tion the­o­retic model anal­y­sis may not lend much in­sight, but we may have been do­ing them wrong!

Garrett Baker24 Jul 2022 0:42 UTC
7 points
0 comments10 min readLW link

How to Diver­sify Con­cep­tual Align­ment: the Model Be­hind Refine

adamShimi20 Jul 2022 10:44 UTC
78 points
11 comments8 min readLW link

Our Ex­ist­ing Solu­tions to AGI Align­ment (semi-safe)

Michael Soareverix21 Jul 2022 19:00 UTC
12 points
1 comment3 min readLW link

Re­ward is not the op­ti­miza­tion target

TurnTrout25 Jul 2022 0:03 UTC
252 points
97 comments10 min readLW link

What En­vi­ron­ment Prop­er­ties Select Agents For World-Model­ing?

Thane Ruthenis23 Jul 2022 19:27 UTC
24 points
1 comment12 min readLW link

AGI Safety Needs Peo­ple With All Skil­lsets!

Severin T. Seehrich25 Jul 2022 13:32 UTC
28 points
0 comments2 min readLW link

Con­jec­ture: In­ter­nal In­fo­haz­ard Policy

29 Jul 2022 19:07 UTC
119 points
6 comments19 min readLW link

Hu­mans Reflect­ing on HRH

leogao29 Jul 2022 21:56 UTC
20 points
4 comments2 min readLW link

[Question] Would “Man­hat­tan Pro­ject” style be benefi­cial or dele­te­ri­ous for AI Align­ment?

Just Learning4 Aug 2022 19:12 UTC
5 points
1 comment1 min readLW link

Con­ver­gence Towards World-Models: A Gears-Level Model

Thane Ruthenis4 Aug 2022 23:31 UTC
37 points
1 comment13 min readLW link

How To Go From In­ter­pretabil­ity To Align­ment: Just Re­tar­get The Search

johnswentworth10 Aug 2022 16:08 UTC
143 points
30 comments3 min readLW link

For­mal­iz­ing Alignment

Marv K10 Aug 2022 18:50 UTC
3 points
0 comments2 min readLW link

My sum­mary of the al­ign­ment problem

Peter Hroššo11 Aug 2022 19:42 UTC
16 points
3 comments2 min readLW link
(threadreaderapp.com)

Ar­tifi­cial in­tel­li­gence wireheading

Big Tony12 Aug 2022 3:06 UTC
3 points
2 comments1 min readLW link

In­fant AI Scenario

Nathan1123 12 Aug 2022 21:20 UTC
1 point
0 comments3 min readLW link

Gra­di­ent de­scent doesn’t se­lect for in­ner search

Ivan Vendrov13 Aug 2022 4:15 UTC
36 points
23 comments4 min readLW link

No short­cuts to knowl­edge: Why AI needs to ease up on scal­ing and learn how to code

Yldedly15 Aug 2022 8:42 UTC
4 points
0 comments1 min readLW link
(deoxyribose.github.io)

Mesa-op­ti­miza­tion for goals defined only within a train­ing en­vi­ron­ment is dangerous

Rubi J. Hudson17 Aug 2022 3:56 UTC
6 points
2 comments4 min readLW link

The longest train­ing run

17 Aug 2022 17:18 UTC
68 points
11 comments9 min readLW link
(epochai.org)

Matt Ygle­sias on AI Policy

Grant Demaree17 Aug 2022 23:57 UTC
25 points
1 comment1 min readLW link
(www.slowboring.com)

Epistemic Arte­facts of (con­cep­tual) AI al­ign­ment research

19 Aug 2022 17:18 UTC
30 points
1 comment5 min readLW link

A Bite Sized In­tro­duc­tion to ELK

Luk27182 17 Sep 2022 0:28 UTC
5 points
0 comments6 min readLW link

Bench­mark­ing Pro­pos­als on Risk Scenarios

Paul Bricman20 Aug 2022 10:01 UTC
25 points
2 comments14 min readLW link

The ‘Bit­ter Les­son’ is Wrong

deepthoughtlife20 Aug 2022 16:15 UTC
−9 points
14 comments2 min readLW link

My Plan to Build Aligned Superintelligence

apollonianblues21 Aug 2022 13:16 UTC
18 points
7 comments8 min readLW link

Beliefs and Disagree­ments about Au­tomat­ing Align­ment Research

Ian McKenzie24 Aug 2022 18:37 UTC
92 points
4 comments7 min readLW link

Google AI in­te­grates PaLM with robotics: SayCan up­date [Linkpost]

Evan R. Murphy24 Aug 2022 20:54 UTC
25 points
0 comments1 min readLW link
(sites.research.google)

The Shard The­ory Align­ment Scheme

David Udell25 Aug 2022 4:52 UTC
47 points
33 comments2 min readLW link

[Question] What would you ex­pect a mas­sive mul­ti­modal on­line fed­er­ated learner to be ca­pa­ble of?

Aryeh Englander27 Aug 2022 17:31 UTC
13 points
4 comments1 min readLW link

(My un­der­stand­ing of) What Every­one in Tech­ni­cal Align­ment is Do­ing and Why

29 Aug 2022 1:23 UTC
345 points
83 comments38 min readLW link

Break­ing down the train­ing/​de­ploy­ment dichotomy

Erik Jenner28 Aug 2022 21:45 UTC
29 points
4 comments3 min readLW link

Strat­egy For Con­di­tion­ing Gen­er­a­tive Models

1 Sep 2022 4:34 UTC
28 points
4 comments18 min readLW link

Gra­di­ent Hacker De­sign Prin­ci­ples From Biology

johnswentworth1 Sep 2022 19:03 UTC
52 points
13 comments3 min readLW link

No, hu­man brains are not (much) more effi­cient than computers

jhoogland6 Sep 2022 13:53 UTC
19 points
16 comments4 min readLW link
(www.jessehoogland.com)

Can “Re­ward Eco­nomics” solve AI Align­ment?

Q Home7 Sep 2022 7:58 UTC
3 points
15 comments18 min readLW link

Gen­er­a­tors Of Disagree­ment With AI Alignment

George3d6 7 Sep 2022 18:15 UTC
26 points
9 comments9 min readLW link
(www.epistem.ink)

Search­ing for Mo­du­lar­ity in Large Lan­guage Models

8 Sep 2022 2:25 UTC
43 points
3 comments14 min readLW link

We may be able to see sharp left turns coming

3 Sep 2022 2:55 UTC
50 points
26 comments1 min readLW link

Gate­keeper Vic­tory: AI Box Reflection

9 Sep 2022 21:38 UTC
4 points
5 comments9 min readLW link

Can you force a neu­ral net­work to keep gen­er­al­iz­ing?

Q Home12 Sep 2022 10:14 UTC
2 points
10 comments5 min readLW link

Align­ment via proso­cial brain algorithms

Cameron Berg12 Sep 2022 13:48 UTC
42 points
28 comments6 min readLW link

[Linkpost] A sur­vey on over 300 works about in­ter­pretabil­ity in deep networks

scasper12 Sep 2022 19:07 UTC
96 points
7 comments2 min readLW link
(arxiv.org)

Try­ing to find the un­der­ly­ing struc­ture of com­pu­ta­tional systems

Matthias G. Mayer13 Sep 2022 21:16 UTC
17 points
9 comments4 min readLW link

[Question] Are Speed Su­per­in­tel­li­gences Fea­si­ble for Modern ML Tech­niques?

DragonGod14 Sep 2022 12:59 UTC
8 points
5 comments1 min readLW link

The Defen­der’s Ad­van­tage of Interpretability

Marius Hobbhahn14 Sep 2022 14:05 UTC
41 points
4 comments6 min readLW link

When does tech­ni­cal work to re­duce AGI con­flict make a differ­ence?: Introduction

14 Sep 2022 19:38 UTC
42 points
3 comments6 min readLW link

ACT-1: Trans­former for Actions

Daniel Kokotajlo14 Sep 2022 19:09 UTC
52 points
4 comments1 min readLW link
(www.adept.ai)

[Question] Fore­cast­ing thread: How does AI risk level vary based on timelines?

elifland14 Sep 2022 23:56 UTC
33 points
7 comments1 min readLW link

Gen­eral ad­vice for tran­si­tion­ing into The­o­ret­i­cal AI Safety

Martín Soto15 Sep 2022 5:23 UTC
9 points
0 comments10 min readLW link

Why de­cep­tive al­ign­ment mat­ters for AGI safety

Marius Hobbhahn15 Sep 2022 13:38 UTC
48 points
12 comments13 min readLW link

Un­der­stand­ing Con­jec­ture: Notes from Con­nor Leahy interview

Akash15 Sep 2022 18:37 UTC
103 points
24 comments15 min readLW link

or­der­ing ca­pa­bil­ity thresholds

carado16 Sep 2022 16:36 UTC
27 points
0 comments4 min readLW link
(carado.moe)

Levels of goals and alignment

zeshen16 Sep 2022 16:44 UTC
27 points
4 comments6 min readLW link

Katja Grace on Slow­ing Down AI, AI Ex­pert Sur­veys And Es­ti­mat­ing AI Risk

Michaël Trazzi16 Sep 2022 17:45 UTC
40 points
2 comments3 min readLW link
(theinsideview.ai)

Sum­maries: Align­ment Fun­da­men­tals Curriculum

Leon Lang18 Sep 2022 13:08 UTC
43 points
3 comments1 min readLW link
(docs.google.com)

Lev­er­ag­ing Le­gal In­for­mat­ics to Align AI

John Nay18 Sep 2022 20:39 UTC
11 points
0 comments3 min readLW link
(forum.effectivealtruism.org)

Align­ment Org Cheat Sheet

20 Sep 2022 17:36 UTC
63 points
6 comments4 min readLW link

Public-fac­ing Cen­sor­ship Is Safety Theater, Caus­ing Rep­u­ta­tional Da­m­age

Yitz23 Sep 2022 5:08 UTC
144 points
42 comments6 min readLW link

Nearcast-based “de­ploy­ment prob­lem” analysis

HoldenKarnofsky21 Sep 2022 18:52 UTC
78 points
2 comments26 min readLW link

Math­e­mat­i­cal Cir­cuits in Neu­ral Networks

Sean Osier22 Sep 2022 3:48 UTC
34 points
4 comments1 min readLW link
(www.youtube.com)

Un­der­stand­ing In­fra-Bayesi­anism: A Begin­ner-Friendly Video Series

22 Sep 2022 13:25 UTC
114 points
6 comments2 min readLW link

In­ter­lude: But Who Op­ti­mizes The Op­ti­mizer?

Paul Bricman23 Sep 2022 15:30 UTC
15 points
0 comments10 min readLW link

[Question] What Do AI Safety Pitches Not Get About Your Field?

Aris22 Sep 2022 21:27 UTC
28 points
3 comments1 min readLW link

Let’s Com­pare Notes

Shoshannah Tekofsky22 Sep 2022 20:47 UTC
17 points
3 comments6 min readLW link

Brain-over-body bi­ases, and the em­bod­ied value prob­lem in AI alignment

geoffreymiller24 Sep 2022 22:24 UTC
10 points
6 comments25 min readLW link

Brief Notes on Transformers

Adam Jermyn26 Sep 2022 14:46 UTC
32 points
2 comments2 min readLW link

You are Un­der­es­ti­mat­ing The Like­li­hood That Con­ver­gent In­stru­men­tal Sub­goals Lead to Aligned AGI

Mark Neyer26 Sep 2022 14:22 UTC
3 points
6 comments3 min readLW link

7 traps that (we think) new al­ign­ment re­searchers of­ten fall into

27 Sep 2022 23:13 UTC
157 points
10 comments4 min readLW link

Threat-Re­sis­tant Bar­gain­ing Me­ga­post: In­tro­duc­ing the ROSE Value

Diffractor28 Sep 2022 1:20 UTC
89 points
11 comments53 min readLW link

Failure modes in a shard the­ory al­ign­ment plan

Thomas Kwa27 Sep 2022 22:34 UTC
24 points
2 comments7 min readLW link

QAPR 3: in­ter­pretabil­ity-guided train­ing of neu­ral nets

Quintin Pope28 Sep 2022 16:02 UTC
47 points
2 comments10 min readLW link

[Question] What’s the ac­tual ev­i­dence that AI mar­ket­ing tools are chang­ing prefer­ences in a way that makes them eas­ier to pre­dict?

Emrik1 Oct 2022 15:21 UTC
10 points
7 comments1 min readLW link

[Question] Any fur­ther work on AI Safety Suc­cess Sto­ries?

Krieger2 Oct 2022 9:53 UTC
7 points
6 comments1 min readLW link

AI Timelines via Cu­mu­la­tive Op­ti­miza­tion Power: Less Long, More Short

jacob_cannell6 Oct 2022 0:21 UTC
111 points
32 comments6 min readLW link

con­fu­sion about al­ign­ment requirements

carado6 Oct 2022 10:32 UTC
28 points
10 comments3 min readLW link
(carado.moe)

Good on­tolo­gies in­duce com­mu­ta­tive diagrams

Erik Jenner9 Oct 2022 0:06 UTC
40 points
5 comments14 min readLW link

Un­con­trol­lable AI as an Ex­is­ten­tial Risk

Karl von Wendt9 Oct 2022 10:36 UTC
19 points
0 comments20 min readLW link

Ob­jects in Mir­ror Are Closer Than They Ap­pear...

Damien Lasseur11 Oct 2022 4:34 UTC
2 points
7 comments9 min readLW link

Misal­ign­ment Harms Can Be Caused by Low In­tel­li­gence Systems

DialecticEel11 Oct 2022 13:39 UTC
11 points
3 comments1 min readLW link

Build­ing a trans­former from scratch—AI safety up-skil­ling challenge

Marius Hobbhahn12 Oct 2022 15:40 UTC
42 points
1 comment5 min readLW link

Help out Red­wood Re­search’s in­ter­pretabil­ity team by find­ing heuris­tics im­ple­mented by GPT-2 small

12 Oct 2022 21:25 UTC
49 points
11 comments4 min readLW link

Science of Deep Learn­ing—a tech­ni­cal agenda

Marius Hobbhahn18 Oct 2022 14:54 UTC
35 points
7 comments4 min readLW link

Re­sponse to Katja Grace’s AI x-risk counterarguments

19 Oct 2022 1:17 UTC
75 points
18 comments15 min readLW link

[Question] What Does AI Align­ment Suc­cess Look Like?

shminux20 Oct 2022 0:32 UTC
23 points
7 comments1 min readLW link

AI Re­search Pro­gram Pre­dic­tion Markets

tailcalled20 Oct 2022 13:42 UTC
38 points
10 comments1 min readLW link

Learn­ing so­cietal val­ues from law as part of an AGI al­ign­ment strategy

John Nay21 Oct 2022 2:03 UTC
3 points
18 comments54 min readLW link

Im­proved Se­cu­rity to Prevent Hacker-AI and Digi­tal Ghosts

Erland Wittkotter21 Oct 2022 10:11 UTC
4 points
3 comments12 min readLW link

What will the scaled up GATO look like? (Up­dated with ques­tions)

Amal 25 Oct 2022 12:44 UTC
33 points
20 comments1 min readLW link

In­tent al­ign­ment should not be the goal for AGI x-risk reduction

John Nay26 Oct 2022 1:24 UTC
−6 points
10 comments3 min readLW link

Re­sources that (I think) new al­ign­ment re­searchers should know about

Akash28 Oct 2022 22:13 UTC
69 points
8 comments4 min readLW link

Boundaries vs Frames

Scott Garrabrant31 Oct 2022 15:14 UTC
47 points
7 comments7 min readLW link

Ad­ver­sar­ial Poli­cies Beat Pro­fes­sional-Level Go AIs

sanxiyn3 Nov 2022 13:27 UTC
31 points
35 comments1 min readLW link
(goattack.alignmentfund.org)

The Sin­gu­lar Value De­com­po­si­tions of Trans­former Weight Ma­tri­ces are Highly Interpretable

28 Nov 2022 12:54 UTC
159 points
27 comments31 min readLW link

Sim­ple Way to Prevent Power-Seek­ing AI

research_prime_space7 Dec 2022 0:26 UTC
7 points
1 comment1 min readLW link

You can still fetch the coffee to­day if you’re dead tomorrow

davidad9 Dec 2022 14:06 UTC
58 points
15 comments5 min readLW link

Ex­tract­ing and Eval­u­at­ing Causal Direc­tion in LLMs’ Activations

14 Dec 2022 14:33 UTC
22 points
2 comments11 min readLW link

Real­ism about rationality

Richard_Ngo16 Sep 2018 10:46 UTC
180 points
145 comments4 min readLW link3 reviews
(thinkingcomplete.blogspot.com)

De­bate on In­stru­men­tal Con­ver­gence be­tween LeCun, Rus­sell, Ben­gio, Zador, and More

Ben Pace4 Oct 2019 4:08 UTC
205 points
60 comments15 min readLW link2 reviews

The Parable of Predict-O-Matic

abramdemski 15 Oct 2019 0:49 UTC
291 points
42 comments 14 min read LW link 2 reviews

2018 AI Alignment Literature Review and Charity Comparison

Larks 18 Dec 2018 4:46 UTC
190 points
26 comments 62 min read LW link 1 review

An Orthodox Case Against Utility Functions

abramdemski 7 Apr 2020 19:18 UTC
128 points
53 comments 8 min read LW link 2 reviews

“How conservative” should the partial maximisers be?

Stuart_Armstrong 13 Apr 2020 15:50 UTC
30 points
8 comments 2 min read LW link

[AN #95]: A framework for thinking about how to make AI go well

Rohin Shah 15 Apr 2020 17:10 UTC
20 points
2 comments 10 min read LW link
(mailchi.mp)

AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

Palus Astra 16 Apr 2020 0:50 UTC
58 points
27 comments 89 min read LW link

Open question: are minimal circuits daemon-free?

paulfchristiano 5 May 2018 22:40 UTC
81 points
70 comments 2 min read LW link 1 review

Disentangling arguments for the importance of AI safety

Richard_Ngo 21 Jan 2019 12:41 UTC
129 points
23 comments 8 min read LW link

Integrating Hidden Variables Improves Approximation

johnswentworth 16 Apr 2020 21:43 UTC
15 points
4 comments 1 min read LW link

AI Services as a Research Paradigm

VojtaKovarik 20 Apr 2020 13:00 UTC
30 points
12 comments 4 min read LW link
(docs.google.com)

Databases of human behaviour and preferences?

Stuart_Armstrong 21 Apr 2020 18:06 UTC
10 points
9 comments 1 min read LW link

Critch on career advice for junior AI-x-risk-concerned researchers

Rob Bensinger 12 May 2018 2:13 UTC
117 points
25 comments 4 min read LW link

Reframing Impact

TurnTrout 20 Sep 2019 19:03 UTC
90 points
15 comments 3 min read LW link 1 review

Description vs simulated prediction

Richard Korzekwa 22 Apr 2020 16:40 UTC
26 points
0 comments 5 min read LW link
(aiimpacts.org)

DeepMind team on specification gaming

JoshuaFox 23 Apr 2020 8:01 UTC
30 points
2 comments 1 min read LW link
(deepmind.com)

[Question] Does Agent-like Behavior Imply Agent-like Architecture?

Scott Garrabrant 23 Aug 2019 2:01 UTC
54 points
7 comments 1 min read LW link

Risks from Learned Optimization: Conclusion and Related Work

7 Jun 2019 19:53 UTC
78 points
4 comments 6 min read LW link

Deceptive Alignment

5 Jun 2019 20:16 UTC
97 points
11 comments 17 min read LW link

The Inner Alignment Problem

4 Jun 2019 1:20 UTC
99 points
17 comments 13 min read LW link

How the MtG Color Wheel Explains AI Safety

Scott Garrabrant 15 Feb 2019 23:42 UTC
57 points
4 comments 6 min read LW link

[Question] How does Gradient Descent Interact with Goodhart?

Scott Garrabrant 2 Feb 2019 0:14 UTC
68 points
19 comments 4 min read LW link

Formal Open Problem in Decision Theory

Scott Garrabrant 29 Nov 2018 3:25 UTC
35 points
11 comments 4 min read LW link

The Ubiquitous Converse Lawvere Problem

Scott Garrabrant 29 Nov 2018 3:16 UTC
21 points
0 comments 2 min read LW link

Embedded Curiosities

8 Nov 2018 14:19 UTC
88 points
1 comment 2 min read LW link

Subsystem Alignment

6 Nov 2018 16:16 UTC
100 points
12 comments 1 min read LW link

Robust Delegation

4 Nov 2018 16:38 UTC
110 points
10 comments 1 min read LW link

Embedded World-Models

2 Nov 2018 16:07 UTC
87 points
16 comments 1 min read LW link

Decision Theory

31 Oct 2018 18:41 UTC
114 points
46 comments 1 min read LW link

(A → B) → A

Scott Garrabrant 11 Sep 2018 22:38 UTC
62 points
11 comments 2 min read LW link

History of the Development of Logical Induction

Scott Garrabrant 29 Aug 2018 3:15 UTC
89 points
4 comments 5 min read LW link

Optimization Amplifies

Scott Garrabrant 27 Jun 2018 1:51 UTC
98 points
12 comments 4 min read LW link

What makes counterfactuals comparable?

Chris_Leong 24 Apr 2020 22:47 UTC
11 points
6 comments 3 min read LW link

New Paper Expanding on the Goodhart Taxonomy

Scott Garrabrant 14 Mar 2018 9:01 UTC
17 points
4 comments 1 min read LW link
(arxiv.org)

Sources of intuitions and data on AGI

Scott Garrabrant 31 Jan 2018 23:30 UTC
84 points
26 comments 3 min read LW link

Corrigibility

paulfchristiano 27 Nov 2018 21:50 UTC
52 points
7 comments 6 min read LW link

AI prediction case study 5: Omohundro’s AI drives

Stuart_Armstrong 15 Mar 2013 9:09 UTC
10 points
5 comments 8 min read LW link

Toy model: convergent instrumental goals

Stuart_Armstrong 25 Feb 2016 14:03 UTC
15 points
2 comments 4 min read LW link

AI-created pseudo-deontology

Stuart_Armstrong 12 Feb 2015 21:11 UTC
10 points
35 comments 1 min read LW link

Ethical Injunctions

Eliezer Yudkowsky 20 Oct 2008 23:00 UTC
66 points
76 comments 9 min read LW link

Motivating Abstraction-First Decision Theory

johnswentworth 29 Apr 2020 17:47 UTC
42 points
16 comments 5 min read LW link

[AN #97]: Are there historical examples of large, robust discontinuities?

Rohin Shah 29 Apr 2020 17:30 UTC
15 points
0 comments 10 min read LW link
(mailchi.mp)

My Updating Thoughts on AI policy

Ben Pace 1 Mar 2020 7:06 UTC
20 points
1 comment 9 min read LW link

Useful Does Not Mean Secure

Ben Pace 30 Nov 2019 2:05 UTC
46 points
12 comments 11 min read LW link

[Question] What is the alternative to intent alignment called?

Richard_Ngo 30 Apr 2020 2:16 UTC
12 points
6 comments 1 min read LW link

Optimising Society to Constrain Risk of War from an Artificial Superintelligence

JohnCDraper 30 Apr 2020 10:47 UTC
3 points
1 comment 51 min read LW link

Stanford Encyclopedia of Philosophy on AI ethics and superintelligence

Kaj_Sotala 2 May 2020 7:35 UTC
43 points
19 comments 7 min read LW link
(plato.stanford.edu)

[Question] How does iterated amplification exceed human abilities?

riceissa 2 May 2020 23:44 UTC
19 points
9 comments 2 min read LW link

How uniform is the neocortex?

zhukeepa 4 May 2020 2:16 UTC
78 points
23 comments 11 min read LW link 1 review

Scott Garrabrant’s problem on recovering Brouwer as a corollary of Lawvere

Rupert 4 May 2020 10:01 UTC
26 points
2 comments 2 min read LW link

“AI and Efficiency”, OA (44✕ improvement in CNNs since 2012)

gwern 5 May 2020 16:32 UTC
47 points
0 comments 1 min read LW link
(openai.com)

Competitive safety via gradated curricula

Richard_Ngo 5 May 2020 18:11 UTC
38 points
5 comments 5 min read LW link

Modeling naturalized decision problems in linear logic

jessicata 6 May 2020 0:15 UTC
14 points
2 comments 6 min read LW link
(unstableontology.com)

[AN #98]: Understanding neural net training by seeing which gradients were helpful

Rohin Shah 6 May 2020 17:10 UTC
22 points
3 comments 9 min read LW link
(mailchi.mp)

[Question] Is AI safety research less parallelizable than AI research?

Mati_Roy 10 May 2020 20:43 UTC
9 points
5 comments 1 min read LW link

Thoughts on implementing corrigible robust alignment

Steven Byrnes 26 Nov 2019 14:06 UTC
26 points
2 comments 6 min read LW link

Wireheading is in the eye of the beholder

Stuart_Armstrong 30 Jan 2019 18:23 UTC
26 points
10 comments 1 min read LW link

Wireheading as a potential problem with the new impact measure

Stuart_Armstrong 25 Sep 2018 14:15 UTC
25 points
20 comments 4 min read LW link

Wireheading and discontinuity

Michele Campolo 18 Feb 2020 10:49 UTC
21 points
4 comments 3 min read LW link

[AN #99]: Doubling times for the efficiency of AI algorithms

Rohin Shah 13 May 2020 17:20 UTC
29 points
0 comments 10 min read LW link
(mailchi.mp)

How should AIs update a prior over human preferences?

Stuart_Armstrong 15 May 2020 13:14 UTC
17 points
9 comments 2 min read LW link

Conjecture Workshop

johnswentworth 15 May 2020 22:41 UTC
34 points
2 comments 2 min read LW link

Multi-agent safety

Richard_Ngo 16 May 2020 1:59 UTC
31 points
8 comments 5 min read LW link

The Mechanistic and Normative Structure of Agency

Gordon Seidoh Worley 18 May 2020 16:03 UTC
15 points
4 comments 1 min read LW link
(philpapers.org)

“Starwink” by Alicorn

Zack_M_Davis 18 May 2020 8:17 UTC
44 points
1 comment 1 min read LW link
(alicorn.elcenia.com)

[AN #100]: What might go wrong if you learn a reward function while acting

Rohin Shah 20 May 2020 17:30 UTC
33 points
2 comments 12 min read LW link
(mailchi.mp)

Probabilities, weights, sums: pretty much the same for reward functions

Stuart_Armstrong 20 May 2020 15:19 UTC
11 points
1 comment 2 min read LW link

[Question] Source code size vs learned model size in ML and in humans?

riceissa 20 May 2020 8:47 UTC
11 points
6 comments 1 min read LW link

Comparing reward learning/reward tampering formalisms

Stuart_Armstrong 21 May 2020 12:03 UTC
9 points
3 comments 3 min read LW link

AGIs as collectives

Richard_Ngo 22 May 2020 20:36 UTC
22 points
23 comments 4 min read LW link

[AN #101]: Why we should rigorously measure and forecast AI progress

Rohin Shah 27 May 2020 17:20 UTC
15 points
0 comments 10 min read LW link
(mailchi.mp)

AI Safety Discussion Days

Linda Linsefors 27 May 2020 16:54 UTC
13 points
1 comment 3 min read LW link

Building brain-inspired AGI is infinitely easier than understanding the brain

Steven Byrnes 2 Jun 2020 14:13 UTC
51 points
14 comments 7 min read LW link

Sparsity and interpretability?

1 Jun 2020 13:25 UTC
41 points
3 comments 7 min read LW link

GPT-3: A Summary

leogao 2 Jun 2020 18:14 UTC
20 points
0 comments 1 min read LW link
(leogao.dev)

Inaccessible information

paulfchristiano 3 Jun 2020 5:10 UTC
84 points
17 comments 14 min read LW link 2 reviews
(ai-alignment.com)

[AN #102]: Meta learning by GPT-3, and a list of full proposals for AI alignment

Rohin Shah 3 Jun 2020 17:20 UTC
38 points
6 comments 10 min read LW link
(mailchi.mp)

Feedback is central to agency

Alex Flint 1 Jun 2020 12:56 UTC
28 points
1 comment 3 min read LW link

Thinking About Super-Human AI: An Examination of Likely Paths and Ultimate Constitution

meanderingmoose 4 Jun 2020 23:22 UTC
−3 points
0 comments 7 min read LW link

Emergence and Control: An examination of our ability to govern the behavior of intelligent systems

meanderingmoose 5 Jun 2020 17:10 UTC
1 point
0 comments 6 min read LW link

GAN Discriminators Don’t Generalize?

tryactions 8 Jun 2020 20:36 UTC
18 points
7 comments 2 min read LW link

More on disambiguating “discontinuity”

Aryeh Englander 9 Jun 2020 15:16 UTC
16 points
1 comment 3 min read LW link

[AN #103]: ARCHES: an agenda for existential safety, and combining natural language with deep RL

Rohin Shah 10 Jun 2020 17:20 UTC
27 points
1 comment 10 min read LW link
(mailchi.mp)

Dutch-Booking CDT: Revised Argument

abramdemski 27 Oct 2020 4:31 UTC
50 points
22 comments 16 min read LW link

[Question] List of public predictions of what GPT-X can or can’t do?

Daniel Kokotajlo 14 Jun 2020 14:25 UTC
20 points
9 comments 1 min read LW link

Achieving AI alignment through deliberate uncertainty in multiagent systems

Florian Dietz 15 Jun 2020 12:19 UTC
3 points
10 comments 7 min read LW link

Superexponential Historic Growth, by David Roodman

Ben Pace 15 Jun 2020 21:49 UTC
43 points
6 comments 5 min read LW link
(www.openphilanthropy.org)

Relating HCH and Logical Induction

abramdemski 16 Jun 2020 22:08 UTC
47 points
4 comments 5 min read LW link

Image GPT

Daniel Kokotajlo 18 Jun 2020 11:41 UTC
29 points
27 comments 1 min read LW link
(openai.com)

[AN #104]: The perils of inaccessible information, and what we can learn about AI alignment from COVID

Rohin Shah 18 Jun 2020 17:10 UTC
19 points
5 comments 8 min read LW link
(mailchi.mp)

[Question] If AI is based on GPT, how to ensure its safety?

avturchin 18 Jun 2020 20:33 UTC
20 points
11 comments 1 min read LW link

What’s Your Cognitive Algorithm?

Raemon 18 Jun 2020 22:16 UTC
71 points
23 comments 13 min read LW link

Relevant pre-AGI possibilities

Daniel Kokotajlo 20 Jun 2020 10:52 UTC
38 points
7 comments 19 min read LW link
(aiimpacts.org)

Plausible cases for HRAD work, and locating the crux in the “realism about rationality” debate

riceissa 22 Jun 2020 1:10 UTC
85 points
15 comments 10 min read LW link

The Indexing Problem

johnswentworth 22 Jun 2020 19:11 UTC
35 points
2 comments 4 min read LW link

[Question] Requesting feedback/advice: what Type Theory to study for AI safety?

rvnnt 23 Jun 2020 17:03 UTC
7 points
4 comments 3 min read LW link

Locality of goals

adamShimi 22 Jun 2020 21:56 UTC
16 points
8 comments 6 min read LW link

[Question] What is “Instrumental Corrigibility”?

joebernstein 23 Jun 2020 20:24 UTC
4 points
1 comment 1 min read LW link

Models, myths, dreams, and Cheshire cat grins

Stuart_Armstrong 24 Jun 2020 10:50 UTC
21 points
7 comments 2 min read LW link

[AN #105]: The economic trajectory of humanity, and what we might mean by optimization

Rohin Shah 24 Jun 2020 17:30 UTC
24 points
3 comments 11 min read LW link
(mailchi.mp)

There’s an Awesome AI Ethics List and it’s a little thin

AABoyles 25 Jun 2020 13:43 UTC
13 points
1 comment 1 min read LW link
(github.com)

GPT-3 Fiction Samples

gwern 25 Jun 2020 16:12 UTC
63 points
18 comments 1 min read LW link
(www.gwern.net)

Walkthrough: The Transformer Architecture [Part 1/2]

Matthew Barnett 30 Jul 2019 13:54 UTC
35 points
0 comments 6 min read LW link

Robustness as a Path to AI Alignment

abramdemski 10 Oct 2017 8:14 UTC
45 points
9 comments 9 min read LW link

Radical Probabilism [Transcript]

26 Jun 2020 22:14 UTC
46 points
12 comments 6 min read LW link

AI safety via market making

evhub 26 Jun 2020 23:07 UTC
55 points
45 comments 11 min read LW link

[Question] Have general decomposers been formalized?

Quinn 27 Jun 2020 18:09 UTC
8 points
5 comments 1 min read LW link

Gary Marcus vs Cortical Uniformity

Steven Byrnes 28 Jun 2020 18:18 UTC
18 points
0 comments 8 min read LW link

Web AI discussion Groups

Donald Hobson 30 Jun 2020 11:22 UTC
11 points
0 comments 2 min read LW link

Comparing AI Alignment Approaches to Minimize False Positive Risk

Gordon Seidoh Worley 30 Jun 2020 19:34 UTC
5 points
0 comments 9 min read LW link

AvE: Assistance via Empowerment

FactorialCode 30 Jun 2020 22:07 UTC
12 points
1 comment 1 min read LW link
(arxiv.org)

Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

Palus Astra 1 Jul 2020 17:30 UTC
35 points
4 comments 67 min read LW link

[AN #106]: Evaluating generalization ability of learned reward models

Rohin Shah 1 Jul 2020 17:20 UTC
14 points
2 comments 11 min read LW link
(mailchi.mp)

The “AI Debate” Debate

michaelcohen 2 Jul 2020 10:16 UTC
20 points
20 comments 3 min read LW link

Idea: Imitation/Value Learning AIXI

Zachary Robertson 3 Jul 2020 17:10 UTC
3 points
6 comments 1 min read LW link

Splitting Debate up into Two Subsystems

Nandi 3 Jul 2020 20:11 UTC
13 points
5 comments 4 min read LW link

AI Unsafety via Non-Zero-Sum Debate

VojtaKovarik 3 Jul 2020 22:03 UTC
25 points
10 comments 5 min read LW link

Classifying games like the Prisoner’s Dilemma

philh 4 Jul 2020 17:10 UTC
100 points
28 comments 6 min read LW link 1 review
(reasonableapproximation.net)

AI-Feynman as a benchmark for what we should be aiming for

Faustus24 Jul 2020 9:24 UTC
8 points
1 comment 2 min read LW link

Learning the prior

paulfchristiano 5 Jul 2020 21:00 UTC
79 points
29 comments 8 min read LW link
(ai-alignment.com)

Better priors as a safety problem

paulfchristiano 5 Jul 2020 21:20 UTC
64 points
7 comments 5 min read LW link
(ai-alignment.com)

[Question] How far is AGI?

Roko Jelavić 5 Jul 2020 17:58 UTC
6 points
5 comments 1 min read LW link

Classifying specification problems as variants of Goodhart’s Law

Vika 19 Aug 2019 20:40 UTC
70 points
5 comments 5 min read LW link 1 review

New safety research agenda: scalable agent alignment via reward modeling

Vika 20 Nov 2018 17:29 UTC
34 points
13 comments 1 min read LW link
(medium.com)

Designing agent incentives to avoid side effects

11 Mar 2019 20:55 UTC
29 points
0 comments 2 min read LW link
(medium.com)

Discussion on the machine learning approach to AI safety

Vika 1 Nov 2018 20:54 UTC
26 points
3 comments 4 min read LW link

Specification gaming examples in AI

Vika 3 Apr 2018 12:30 UTC
43 points
9 comments 1 min read LW link 2 reviews

[Question] (answered: yes) Has anyone written up a consideration of Downs’s “Paradox of Voting” from the perspective of MIRI-ish decision theories (UDT, FDT, or even just EDT)?

Jameson Quinn 6 Jul 2020 18:26 UTC
10 points
24 comments 1 min read LW link

New DeepMind AI Safety Research Blog

Vika 27 Sep 2018 16:28 UTC
43 points
0 comments 1 min read LW link
(medium.com)

Contest: $1,000 for good questions to ask to an Oracle AI

Stuart_Armstrong 31 Jul 2019 18:48 UTC
57 points
156 comments 3 min read LW link

Deconfusing Human Values Research Agenda v1

Gordon Seidoh Worley 23 Mar 2020 16:25 UTC
27 points
12 comments 4 min read LW link

[Question] How “honest” is GPT-3?

abramdemski 8 Jul 2020 19:38 UTC
72 points
18 comments 5 min read LW link

What does it mean to apply decision theory?

abramdemski 8 Jul 2020 20:31 UTC
51 points
5 comments 8 min read LW link

AI Research Considerations for Human Existential Safety (ARCHES)

habryka 9 Jul 2020 2:49 UTC
60 points
8 comments 1 min read LW link
(arxiv.org)

The Unreasonable Effectiveness of Deep Learning

Richard_Ngo 30 Sep 2018 15:48 UTC
85 points
5 comments 13 min read LW link
(thinkingcomplete.blogspot.com)

mAIry’s room: AI reasoning to solve philosophical problems

Stuart_Armstrong 5 Mar 2019 20:24 UTC
92 points
41 comments 6 min read LW link 2 reviews

Failures of an embodied AIXI

So8res 15 Jun 2014 18:29 UTC
48 points
46 comments 12 min read LW link

The Problem with AIXI

Rob Bensinger 18 Mar 2014 1:55 UTC
43 points
78 comments 23 min read LW link

Versions of AIXI can be arbitrarily stupid

Stuart_Armstrong 10 Aug 2015 13:23 UTC
29 points
59 comments 1 min read LW link

Reflective AIXI and Anthropics

Diffractor 24 Sep 2018 2:15 UTC
17 points
13 comments 8 min read LW link

AIXI and Existential Despair

paulfchristiano 8 Dec 2011 20:03 UTC
23 points
38 comments 6 min read LW link

How to make AIXI-tl incapable of learning

itaibn0 27 Jan 2014 0:05 UTC
7 points
5 comments 2 min read LW link

Help request: What is the Kolmogorov complexity of computable approximations to AIXI?

AnnaSalamon 5 Dec 2010 10:23 UTC
9 points
9 comments 1 min read LW link

“AIXIjs: A Software Demo for General Reinforcement Learning”, Aslanides 2017

gwern 29 May 2017 21:09 UTC
7 points
1 comment 1 min read LW link
(arxiv.org)

Can AIXI be trained to do anything a human can?

Stuart_Armstrong 20 Oct 2014 13:12 UTC
5 points
9 comments 2 min read LW link

Shaping economic incentives for collaborative AGI

Kaj_Sotala 29 Jun 2018 16:26 UTC
45 points
15 comments 4 min read LW link

Is the Star Trek Federation really incapable of building AI?

Kaj_Sotala 18 Mar 2018 10:30 UTC
19 points
4 comments 2 min read LW link
(kajsotala.fi)

Some conceptual highlights from “Disjunctive Scenarios of Catastrophic AI Risk”

Kaj_Sotala 12 Feb 2018 12:30 UTC
33 points
4 comments 6 min read LW link
(kajsotala.fi)

Misconceptions about continuous takeoff

Matthew Barnett 8 Oct 2019 21:31 UTC
79 points
38 comments 4 min read LW link

Distinguishing definitions of takeoff

Matthew Barnett 14 Feb 2020 0:16 UTC
60 points
6 comments 6 min read LW link

Book review: Artificial Intelligence Safety and Security

PeterMcCluskey 8 Dec 2018 3:47 UTC
27 points
3 comments 8 min read LW link
(www.bayesianinvestor.com)

Why AI may not foom

John_Maxwell 24 Mar 2013 8:11 UTC
29 points
81 comments 12 min read LW link

Humans Who Are Not Concentrating Are Not General Intelligences

sarahconstantin 25 Feb 2019 20:40 UTC
181 points
35 comments 6 min read LW link 1 review
(srconstantin.wordpress.com)

The Hacker Learns to Trust

Ben Pace 22 Jun 2019 0:27 UTC
80 points
18 comments 8 min read LW link
(medium.com)

Book Review: Human Compatible

Scott Alexander 31 Jan 2020 5:20 UTC
77 points
6 comments 16 min read LW link
(slatestarcodex.com)

SSC Journal Club: AI Timelines

Scott Alexander 8 Jun 2017 19:00 UTC
12 points
15 comments 8 min read LW link

Arguments against myopic training

Richard_Ngo 9 Jul 2020 16:07 UTC
56 points
39 comments 12 min read LW link

On motivations for MIRI’s highly reliable agent design research

jessicata 29 Jan 2017 19:34 UTC
27 points
1 comment 5 min read LW link

Why is the impact penalty time-inconsistent?

Stuart_Armstrong 9 Jul 2020 17:26 UTC
16 points
1 comment 2 min read LW link

My current take on the Paul-MIRI disagreement on alignability of messy AI

jessicata 29 Jan 2017 20:52 UTC
21 points
0 comments 10 min read LW link

Ben Goertzel: The Singularity Institute’s Scary Idea (and Why I Don’t Buy It)

Paul Crowley 30 Oct 2010 9:31 UTC
42 points
442 comments 1 min read LW link

An Analytic Perspective on AI Alignment

DanielFilan 1 Mar 2020 4:10 UTC
54 points
45 comments 8 min read LW link
(danielfilan.com)

Mechanistic Transparency for Machine Learning

DanielFilan 11 Jul 2018 0:34 UTC
54 points
9 comments 4 min read LW link

A model I use when making plans to reduce AI x-risk

Ben Pace 19 Jan 2018 0:21 UTC
69 points
41 comments 6 min read LW link

AI Researchers On AI Risk

Scott Alexander 22 May 2015 11:16 UTC
18 points
0 comments 16 min read LW link

Mini advent calendar of Xrisks: Artificial Intelligence

Stuart_Armstrong 7 Dec 2012 11:26 UTC
5 points
5 comments 1 min read LW link

For FAI: Is “Molecular Nanotechnology” putting our best foot forward?

leplen 22 Jun 2013 4:44 UTC
79 points
118 comments 3 min read LW link

UFAI cannot be the Great Filter

Thrasymachus 22 Dec 2012 11:26 UTC
59 points
92 comments 3 min read LW link

Don’t Fear The Filter

Scott Alexander 29 May 2014 0:45 UTC
11 points
18 comments 6 min read LW link

The Great Filter is early, or AI is hard

Stuart_Armstrong 29 Aug 2014 16:17 UTC
32 points
76 comments 1 min read LW link

Talk: Key Issues In Near-Term AI Safety Research

Aryeh Englander 10 Jul 2020 18:36 UTC
22 points
1 comment 1 min read LW link

Mesa-Optimizers vs “Steered Optimizers”

Steven Byrnes 10 Jul 2020 16:49 UTC
45 points
7 comments 8 min read LW link

AlphaStar: Impressive for RL progress, not for AGI progress

orthonormal 2 Nov 2019 1:50 UTC
113 points
58 comments 2 min read LW link 1 review

The Catastrophic Convergence Conjecture

TurnTrout 14 Feb 2020 21:16 UTC
44 points
15 comments 8 min read LW link

[Question] How well can the GPT architecture solve the parity task?

FactorialCode 11 Jul 2020 19:02 UTC
19 points
3 comments 1 min read LW link

Sunday July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stuart_Armstrong

8 Jul 2020 0:27 UTC
19 points
2 comments 1 min read LW link

[Link] Word-vector based DL system achieves human parity in verbal IQ tests

jacob_cannell 13 Jun 2015 23:38 UTC
17 points
8 comments 1 min read LW link

The Power of Intelligence

Eliezer Yudkowsky 1 Jan 2007 20:00 UTC
66 points
4 comments 4 min read LW link

Comments on CAIS

Richard_Ngo 12 Jan 2019 15:20 UTC
76 points
14 comments 7 min read LW link

[Question] What are CAIS’ boldest near/medium-term predictions?

jacobjacob 28 Mar 2019 13:14 UTC
31 points
17 comments 1 min read LW link

Drexler on AI Risk

PeterMcCluskey 1 Feb 2019 5:11 UTC
34 points
10 comments 9 min read LW link
(www.bayesianinvestor.com)

Six AI Risk/Strategy Ideas

Wei_Dai 27 Aug 2019 0:40 UTC
64 points
18 comments 4 min read LW link 1 review

New report: Intelligence Explosion Microeconomics

Eliezer Yudkowsky 29 Apr 2013 23:14 UTC
72 points
251 comments 3 min read LW link

Book review: Human Compatible

PeterMcCluskey 19 Jan 2020 3:32 UTC
37 points
2 comments 5 min read LW link
(www.bayesianinvestor.com)

Thoughts on “Human-Compatible”

TurnTrout 10 Oct 2019 5:24 UTC
63 points
35 comments 5 min read LW link

Book Review: The AI Does Not Hate You

PeterMcCluskey 28 Oct 2019 17:45 UTC
26 points
0 comments 5 min read LW link
(www.bayesianinvestor.com)

[Link] Book Review: ‘The AI Does Not Hate You’ by Tom Chivers (Scott Aaronson)

eigen 7 Oct 2019 18:16 UTC
19 points
0 comments 1 min read LW link

Book Review: Life 3.0: Being Human in the Age of Artificial Intelligence