AI

Core Tag · Last edit: 11 Mar 2021 12:14 UTC by plex

Artificial Intelligence is the study of creating intelligence in algorithms. On LessWrong, the primary focus of AI discussion is to ensure that as humanity builds increasingly powerful AI systems, the outcome will be good. The central concern is that a powerful enough AI, if not designed and implemented with sufficient understanding, would optimize something unintended by its creators and pose an existential threat to the future of humanity. This is known as the AI alignment problem.

Common terms in this space are superintelligence, AI Alignment, AI Safety, Friendly AI, Transformative AI, human-level intelligence, AI Governance, and Beneficial AI. This entry and the associated tag roughly encompass all of these topics: anything in the broad cluster of work on understanding AI and its future impact on our civilization deserves this tag.

AI Alignment

There are narrow conceptions of alignment, where you’re trying to get the AI to do something like cure Alzheimer’s disease without destroying the rest of the world. And there are much more ambitious notions of alignment, where you’re trying to get it to do the right thing and achieve a happy intergalactic civilization.

But both the narrow and the ambitious notions of alignment have in common that you’re trying to have the AI do that thing rather than, say, make a lot of paperclips.

See also General Intelligence.

Basic Alignment Theory

AIXI
Coherent Extrapolated Volition
Complexity of Value
Corrigibility
Decision Theory
Embedded Agency
Fixed Point Theorems
Goodhart’s Law
Infra-Bayesianism
Inner Alignment
Instrumental Convergence
Logical Induction
Mesa-Optimization
Myopia
Newcomb’s Problem
Optimization
Orthogonality Thesis
Outer Alignment
Paperclip Maximizer
Solomonoff Induction
Utility Functions

Engineering Alignment

AI Boxing (Containment)
Debate (AI safety technique)
Factored Cognition
Humans Consulting HCH
Impact Measures
Inverse Reinforcement Learning
Iterated Amplification
Mild Optimization
Tool AI
Transparency / Interpretability
Value Learning

Strategy

AI Governance
AI Risk
AI Services (CAIS)
AI Takeoff
AI Timelines

Organizations

Centre for Human-Compatible AI
DeepMind
Future of Humanity Institute
Machine Intelligence Research Institute
OpenAI
Ought

Other
GPT
Research Agendas

There’s No Fire Alarm for Artificial General Intelligence

Eliezer Yudkowsky · 13 Oct 2017 21:38 UTC
97 points
67 comments · 25 min read · LW link

An overview of 11 proposals for building safe advanced AI

evhub · 29 May 2020 20:38 UTC
147 points
30 comments · 38 min read · LW link

Risks from Learned Optimization: Introduction

31 May 2019 23:44 UTC
140 points
40 comments · 12 min read · LW link · 3 nominations · 3 reviews

Embedded Agents

29 Oct 2018 19:53 UTC
185 points
41 comments · 1 min read · LW link

Superintelligence FAQ

Scott Alexander · 20 Sep 2016 19:00 UTC
45 points
7 comments · 27 min read · LW link

What failure looks like

paulfchristiano · 17 Mar 2019 20:18 UTC
240 points
48 comments · 8 min read · LW link · 2 nominations · 2 reviews

Challenges to Christiano’s capability amplification proposal

Eliezer Yudkowsky · 19 May 2018 18:18 UTC
97 points
53 comments · 23 min read · LW link

The Rocket Alignment Problem

Eliezer Yudkowsky · 4 Oct 2018 0:38 UTC
165 points
41 comments · 15 min read · LW link

Embedded Agency (full-text version)

15 Nov 2018 19:49 UTC
114 points
11 comments · 54 min read · LW link

A space of proposals for building safe advanced AI

Richard_Ngo · 10 Jul 2020 16:58 UTC
44 points
3 comments · 4 min read · LW link

Goodhart Taxonomy

Scott Garrabrant · 30 Dec 2017 16:38 UTC
154 points
33 comments · 10 min read · LW link

AI Alignment 2018-19 Review

rohinmshah · 28 Jan 2020 2:19 UTC
115 points
6 comments · 35 min read · LW link

Some AI research areas and their relevance to existential safety

Andrew_Critch · 19 Nov 2020 3:18 UTC
167 points
37 comments · 50 min read · LW link

That Alien Message

Eliezer Yudkowsky · 22 May 2008 5:55 UTC
217 points
171 comments · 10 min read · LW link

Epistemological Framing for AI Alignment Research

adamShimi · 8 Mar 2021 22:05 UTC
50 points
6 comments · 9 min read · LW link

Robustness to Scale

Scott Garrabrant · 21 Feb 2018 22:55 UTC
99 points
21 comments · 2 min read · LW link

Chris Olah’s views on AGI safety

evhub · 1 Nov 2019 20:13 UTC
140 points
38 comments · 12 min read · LW link · 2 nominations · 2 reviews

[AN #96]: Buck and I discuss/argue about AI Alignment

rohinmshah · 22 Apr 2020 17:20 UTC
17 points
4 comments · 10 min read · LW link
(mailchi.mp)

Matt Botvinick on the spontaneous emergence of learning algorithms

Adam Scholl · 12 Aug 2020 7:47 UTC
138 points
90 comments · 5 min read · LW link

Coherence arguments do not imply goal-directed behavior

rohinmshah · 3 Dec 2018 3:26 UTC
75 points
65 comments · 7 min read · LW link

Alignment By Default

johnswentworth · 12 Aug 2020 18:54 UTC
97 points
84 comments · 11 min read · LW link

Book review: “A Thousand Brains” by Jeff Hawkins

Steven Byrnes · 4 Mar 2021 5:10 UTC
99 points
14 comments · 19 min read · LW link

AlphaGo Zero and the Foom Debate

Eliezer Yudkowsky · 21 Oct 2017 2:18 UTC
78 points
16 comments · 3 min read · LW link

Tradeoff between desirable properties for baseline choices in impact measures

Vika · 4 Jul 2020 11:56 UTC
37 points
24 comments · 5 min read · LW link

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

stuhlmueller · 21 Jul 2020 20:06 UTC
80 points
40 comments · 3 min read · LW link

the scaling “inconsistency”: openAI’s new insight

nostalgebraist · 7 Nov 2020 7:40 UTC
126 points
11 comments · 9 min read · LW link
(nostalgebraist.tumblr.com)

2019 Review Rewrite: Seeking Power is Often Robustly Instrumental in MDPs

TurnTrout · 23 Dec 2020 17:16 UTC
35 points
0 comments · 4 min read · LW link
(www.lesswrong.com)

Bootstrapped Alignment

G Gordon Worley III · 27 Feb 2021 15:46 UTC
14 points
12 comments · 2 min read · LW link

Multimodal Neurons in Artificial Neural Networks

Kaj_Sotala · 5 Mar 2021 9:01 UTC
56 points
2 comments · 2 min read · LW link
(distill.pub)

Review of “Fun with +12 OOMs of Compute”

28 Mar 2021 14:55 UTC
52 points
18 comments · 8 min read · LW link

Dis­con­tin­u­ous progress in his­tory: an update

KatjaGrace14 Apr 2020 0:00 UTC
163 points
23 comments31 min readLW link
(aiimpacts.org)

Repli­ca­tion Dy­nam­ics Bridge to RL in Ther­mo­dy­namic Limit

Zachary Robertson18 May 2020 1:02 UTC
6 points
1 comment2 min readLW link

The ground of optimization

alexflint20 Jun 2020 0:38 UTC
158 points
64 comments27 min readLW link

Model­ling Con­tin­u­ous Progress

SDM23 Jun 2020 18:06 UTC
29 points
3 comments7 min readLW link

Clas­sifi­ca­tion of AI al­ign­ment re­search: de­con­fu­sion, “good enough” non-su­per­in­tel­li­gent AI al­ign­ment, su­per­in­tel­li­gent AI alignment

philip_b14 Jul 2020 22:48 UTC
35 points
25 comments3 min readLW link

Col­lec­tion of GPT-3 results

Kaj_Sotala18 Jul 2020 20:04 UTC
83 points
24 comments1 min readLW link
(twitter.com)

Hiring en­g­ineers and re­searchers to help al­ign GPT-3

paulfchristiano1 Oct 2020 18:54 UTC
205 points
13 comments3 min readLW link

The date of AI Takeover is not the day the AI takes over

Daniel Kokotajlo22 Oct 2020 10:41 UTC
96 points
23 comments2 min readLW link

[Question] What could one do with truly un­limited com­pu­ta­tional power?

Yitz11 Nov 2020 10:03 UTC
29 points
22 comments2 min readLW link

AGI Predictions

21 Nov 2020 3:46 UTC
103 points
35 comments4 min readLW link

[Question] What are the best prece­dents for in­dus­tries failing to in­vest in valuable AI re­search?

Daniel Kokotajlo14 Dec 2020 23:57 UTC
18 points
17 comments1 min readLW link

Ex­trap­o­lat­ing GPT-N performance

Lanrian18 Dec 2020 21:41 UTC
75 points
29 comments25 min readLW link

De­bate up­date: Obfus­cated ar­gu­ments problem

Beth Barnes23 Dec 2020 3:24 UTC
105 points
20 comments16 min readLW link

Liter­a­ture Re­view on Goal-Directedness

18 Jan 2021 11:15 UTC
58 points
21 comments31 min readLW link

An Un­trol­lable Math­e­mat­i­cian Illustrated

abramdemski20 Mar 2018 0:00 UTC
150 points
38 comments1 min readLW link

Con­di­tions for Mesa-Optimization

1 Jun 2019 20:52 UTC
62 points
47 comments12 min readLW link

Thoughts on Hu­man Models

21 Feb 2019 9:10 UTC
111 points
31 comments10 min readLW link2 nominations1 review

In­ner al­ign­ment in the brain

Steven Byrnes22 Apr 2020 13:14 UTC
74 points
16 comments15 min readLW link

Prob­lem re­lax­ation as a tactic

TurnTrout22 Apr 2020 23:44 UTC
97 points
8 comments7 min readLW link

[Question] How should po­ten­tial AI al­ign­ment re­searchers gauge whether the field is right for them?

TurnTrout6 May 2020 12:24 UTC
20 points
5 comments1 min readLW link

Speci­fi­ca­tion gam­ing: the flip side of AI ingenuity

6 May 2020 23:51 UTC
45 points
8 comments6 min readLW link

Les­sons from Isaac: Pit­falls of Reason

adamShimi8 May 2020 20:44 UTC
9 points
0 comments8 min readLW link

Cor­rigi­bil­ity as out­side view

TurnTrout8 May 2020 21:56 UTC
36 points
11 comments4 min readLW link

[Question] How to choose a PhD with AI Safety in mind

Ariel Kwiatkowski15 May 2020 22:19 UTC
9 points
1 comment1 min readLW link

Re­ward func­tions and up­dat­ing as­sump­tions can hide a mul­ti­tude of sins

Stuart_Armstrong18 May 2020 15:18 UTC
16 points
2 comments9 min readLW link

Pos­si­ble take­aways from the coro­n­avirus pan­demic for slow AI takeoff

Vika31 May 2020 17:51 UTC
128 points
35 comments3 min readLW link

Fo­cus: you are al­lowed to be bad at ac­com­plish­ing your goals

adamShimi3 Jun 2020 21:04 UTC
19 points
17 comments3 min readLW link

Re­ply to Paul Chris­ti­ano on Inac­cessible Information

alexflint5 Jun 2020 9:10 UTC
76 points
15 comments6 min readLW link

Our take on CHAI’s re­search agenda in un­der 1500 words

alexflint17 Jun 2020 12:24 UTC
95 points
19 comments5 min readLW link

[Question] Ques­tion on GPT-3 Ex­cel Demo

Zhitao Hou22 Jun 2020 20:31 UTC
0 points
2 comments1 min readLW link

Dy­namic in­con­sis­tency of the in­ac­tion and ini­tial state baseline

Stuart_Armstrong7 Jul 2020 12:02 UTC
30 points
8 comments2 min readLW link

Cortés, Pizarro, and Afonso as Prece­dents for Takeover

Daniel Kokotajlo1 Mar 2020 3:49 UTC
114 points
70 comments11 min readLW link

[Question] What prob­lem would you like to see Re­in­force­ment Learn­ing ap­plied to?

Julian Schrittwieser8 Jul 2020 2:40 UTC
45 points
4 comments1 min readLW link

Refram­ing Su­per­in­tel­li­gence: Com­pre­hen­sive AI Ser­vices as Gen­eral Intelligence

rohinmshah8 Jan 2019 7:12 UTC
93 points
74 comments5 min readLW link2 nominations2 reviews
(www.fhi.ox.ac.uk)

My cur­rent frame­work for think­ing about AGI timelines

zhukeepa30 Mar 2020 1:23 UTC
101 points
5 comments3 min readLW link

[Question] To what ex­tent is GPT-3 ca­pa­ble of rea­son­ing?

TurnTrout20 Jul 2020 17:10 UTC
70 points
74 comments16 min readLW link

Repli­cat­ing the repli­ca­tion crisis with GPT-3?

skybrian22 Jul 2020 21:20 UTC
29 points
10 comments1 min readLW link

Can you get AGI from a Trans­former?

Steven Byrnes23 Jul 2020 15:27 UTC
86 points
28 comments11 min readLW link

Writ­ing with GPT-3

Jacob Falkovich24 Jul 2020 15:22 UTC
41 points
0 comments4 min readLW link

In­ner Align­ment: Ex­plain like I’m 12 Edition

Rafael Harth1 Aug 2020 15:24 UTC
122 points
13 comments12 min readLW link

Devel­op­men­tal Stages of GPTs

orthonormal26 Jul 2020 22:03 UTC
127 points
73 comments7 min readLW link

Gen­er­al­iz­ing the Power-Seek­ing Theorems

TurnTrout27 Jul 2020 0:28 UTC
40 points
6 comments4 min readLW link

Are we in an AI over­hang?

Andy Jones27 Jul 2020 12:48 UTC
229 points
92 comments4 min readLW link

[Question] What spe­cific dan­gers arise when ask­ing GPT-N to write an Align­ment Fo­rum post?

Matthew Barnett28 Jul 2020 2:56 UTC
43 points
14 comments1 min readLW link

[Question] Prob­a­bil­ity that other ar­chi­tec­tures will scale as well as Trans­form­ers?

Daniel Kokotajlo28 Jul 2020 19:36 UTC
22 points
4 comments1 min readLW link

What a 20-year-lead in mil­i­tary tech might look like

Daniel Kokotajlo29 Jul 2020 20:10 UTC
62 points
44 comments16 min readLW link

[Question] What if memes are com­mon in highly ca­pa­ble minds?

Daniel Kokotajlo30 Jul 2020 20:45 UTC
32 points
8 comments2 min readLW link

Three men­tal images from think­ing about AGI de­bate & corrigibility

Steven Byrnes3 Aug 2020 14:29 UTC
50 points
35 comments4 min readLW link

Solv­ing Key Align­ment Prob­lems Group

elriggs3 Aug 2020 19:30 UTC
19 points
7 comments2 min readLW link

How eas­ily can we sep­a­rate a friendly AI in de­sign space from one which would bring about a hy­per­ex­is­ten­tial catas­tro­phe?

Anirandis10 Sep 2020 0:40 UTC
18 points
20 comments2 min readLW link

My com­pu­ta­tional frame­work for the brain

Steven Byrnes14 Sep 2020 14:19 UTC
124 points
25 comments12 min readLW link

[Question] Where is hu­man level on text pre­dic­tion? (GPTs task)

Daniel Kokotajlo20 Sep 2020 9:00 UTC
24 points
18 comments1 min readLW link

Needed: AI in­fo­haz­ard policy

Vanessa Kosoy21 Sep 2020 15:26 UTC
49 points
17 comments2 min readLW link

The Col­lid­ing Ex­po­nen­tials of AI

VermillionStuka14 Oct 2020 23:31 UTC
27 points
14 comments5 min readLW link

“Lit­tle glimpses of em­pa­thy” as the foun­da­tion for so­cial emotions

Steven Byrnes22 Oct 2020 11:02 UTC
25 points
0 comments5 min readLW link

In­tro­duc­tion to Carte­sian Frames

Scott Garrabrant22 Oct 2020 13:00 UTC
139 points
26 comments22 min readLW link

“Carte­sian Frames” Talk #2 this Sun­day at 2pm (PT)

Rob Bensinger28 Oct 2020 13:59 UTC
30 points
0 comments1 min readLW link

Does SGD Pro­duce De­cep­tive Align­ment?

Mark Xu6 Nov 2020 23:48 UTC
54 points
2 comments16 min readLW link

[Question] How can I bet on short timelines?

Daniel Kokotajlo7 Nov 2020 12:44 UTC
41 points
16 comments2 min readLW link

Non-Ob­struc­tion: A Sim­ple Con­cept Mo­ti­vat­ing Corrigibility

TurnTrout21 Nov 2020 19:35 UTC
63 points
19 comments19 min readLW link

Carte­sian Frames Definitions

Rob Bensinger8 Nov 2020 12:44 UTC
24 points
0 comments4 min readLW link

Com­mu­ni­ca­tion Prior as Align­ment Strategy

johnswentworth12 Nov 2020 22:06 UTC
36 points
7 comments6 min readLW link

How Rood­man’s GWP model trans­lates to TAI timelines

Daniel Kokotajlo16 Nov 2020 14:05 UTC
20 points
5 comments3 min readLW link

Normativity

abramdemski18 Nov 2020 16:52 UTC
46 points
11 comments9 min readLW link

In­ner Align­ment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC
111 points
31 comments11 min readLW link

Con­tin­u­ing the take­offs debate

Richard_Ngo23 Nov 2020 15:58 UTC
65 points
13 comments9 min readLW link

The next AI win­ter will be due to en­ergy costs

hippke24 Nov 2020 16:53 UTC
45 points
6 comments2 min readLW link

Re­cur­sive Quan­tiliz­ers II

abramdemski2 Dec 2020 15:26 UTC
25 points
15 comments13 min readLW link

Su­per­vised learn­ing in the brain, part 4: com­pres­sion /​ filtering

Steven Byrnes5 Dec 2020 17:06 UTC
12 points
0 comments5 min readLW link

Con­ser­vatism in neo­cor­tex-like AGIs

Steven Byrnes8 Dec 2020 16:37 UTC
21 points
4 comments8 min readLW link

Avoid­ing Side Effects in Com­plex Environments

12 Dec 2020 0:34 UTC
61 points
9 comments2 min readLW link
(avoiding-side-effects.github.io)

The Power of Annealing

meanderingmoose14 Dec 2020 11:02 UTC
20 points
6 comments5 min readLW link

[link] The AI Gir­lfriend Se­duc­ing China’s Lonely Men

Kaj_Sotala14 Dec 2020 20:18 UTC
32 points
11 comments1 min readLW link
(www.sixthtone.com)

Oper­a­tional­iz­ing com­pat­i­bil­ity with strat­egy-stealing

evhub24 Dec 2020 22:36 UTC
41 points
6 comments4 min readLW link

De­fus­ing AGI Danger

Mark Xu24 Dec 2020 22:58 UTC
45 points
9 comments9 min readLW link

Multi-di­men­sional re­wards for AGI in­ter­pretabil­ity and control

Steven Byrnes4 Jan 2021 3:08 UTC
10 points
5 comments10 min readLW link

DALL-E by OpenAI

Daniel Kokotajlo5 Jan 2021 20:05 UTC
96 points
22 comments1 min readLW link

Re­view of ‘But ex­actly how com­plex and frag­ile?’

TurnTrout6 Jan 2021 18:39 UTC
49 points
0 comments8 min readLW link

The Case for a Jour­nal of AI Alignment

adamShimi9 Jan 2021 18:13 UTC
42 points
29 comments4 min readLW link

Trans­parency and AGI safety

jylin0411 Jan 2021 18:51 UTC
50 points
12 comments30 min readLW link

Birds, Brains, Planes, and AI: Against Ap­peals to the Com­plex­ity/​Mys­te­ri­ous­ness/​Effi­ciency of the Brain

Daniel Kokotajlo18 Jan 2021 12:08 UTC
166 points
74 comments14 min readLW link

In­fra-Bayesi­anism Unwrapped

adamShimi20 Jan 2021 13:35 UTC
19 points
0 comments24 min readLW link

Op­ti­mal play in hu­man-judged De­bate usu­ally won’t an­swer your question

Joe_Collman27 Jan 2021 7:34 UTC
32 points
8 comments12 min readLW link

Creat­ing AGI Safety Interlocks

Koen.Holtman5 Feb 2021 12:01 UTC
7 points
4 comments8 min readLW link

Timeline of AI safety

riceissa7 Feb 2021 22:29 UTC
58 points
6 comments2 min readLW link
(timelines.issarice.com)

Tour­ne­sol, YouTube and AI Risk

adamShimi12 Feb 2021 18:56 UTC
35 points
13 comments4 min readLW link

In­ter­net En­cy­clo­pe­dia of Philos­o­phy on Ethics of Ar­tifi­cial Intelligence

Kaj_Sotala20 Feb 2021 13:54 UTC
15 points
1 comment4 min readLW link
(iep.utm.edu)

Be­hav­ioral Suffi­cient Statis­tics for Goal-Directedness

adamShimi11 Mar 2021 15:01 UTC
21 points
12 comments9 min readLW link

A sim­ple way to make GPT-3 fol­low instructions

Quintin Pope8 Mar 2021 2:57 UTC
6 points
5 comments4 min readLW link

Towards a Mechanis­tic Un­der­stand­ing of Goal-Directedness

Mark Xu9 Mar 2021 20:17 UTC
39 points
1 comment5 min readLW link

AXRP Epi­sode 5 - In­fra-Bayesi­anism with Vanessa Kosoy

DanielFilan10 Mar 2021 4:30 UTC
26 points
11 comments35 min readLW link

Com­ments on “The Sin­gu­lar­ity is Nowhere Near”

Steven Byrnes16 Mar 2021 23:59 UTC
47 points
5 comments8 min readLW link

Is RL in­volved in sen­sory pro­cess­ing?

Steven Byrnes18 Mar 2021 13:57 UTC
14 points
4 comments5 min readLW link

Against evolu­tion as an anal­ogy for how hu­mans will cre­ate AGI

Steven Byrnes23 Mar 2021 12:29 UTC
38 points
25 comments25 min readLW link

My AGI Threat Model: Misal­igned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC
62 points
29 comments16 min readLW link

Co­her­ence ar­gu­ments im­ply a force for goal-di­rected behavior

KatjaGrace26 Mar 2021 16:10 UTC
66 points
13 comments14 min readLW link
(aiimpacts.org)

Trans­parency Trichotomy

Mark Xu28 Mar 2021 20:26 UTC
20 points
2 comments7 min readLW link

Hard­ware is already ready for the sin­gu­lar­ity. Al­gorithm knowl­edge is the only bar­rier.

Andrew Vlahos30 Mar 2021 22:48 UTC
14 points
3 comments3 min readLW link

Ben Go­ertzel’s “Kinds of Minds”

JoshuaFox11 Apr 2021 12:41 UTC
12 points
4 comments1 min readLW link

Up­dat­ing the Lot­tery Ticket Hypothesis

johnswentworth18 Apr 2021 21:45 UTC
40 points
5 comments2 min readLW link

[AN #94]: AI al­ign­ment as trans­la­tion be­tween hu­mans and machines

rohinmshah8 Apr 2020 17:10 UTC
11 points
0 comments7 min readLW link
(mailchi.mp)

[Question] What are the rel­a­tive speeds of AI ca­pa­bil­ities and AI safety?

NunoSempere24 Apr 2020 18:21 UTC
8 points
2 comments1 min readLW link

Seek­ing Power is Often Ro­bustly In­stru­men­tal in MDPs

5 Dec 2019 2:33 UTC
133 points
34 comments17 min readLW link2 nominations2 reviews
(arxiv.org)

“Don’t even think about hell”

emmab2 May 2020 8:06 UTC
6 points
2 comments1 min readLW link

[Question] AI Box­ing for Hard­ware-bound agents (aka the China al­ign­ment prob­lem)

Logan Zoellner8 May 2020 15:50 UTC
11 points
27 comments10 min readLW link

Could We Give an AI a Solu­tion?

Liam Goddard15 May 2020 21:38 UTC
3 points
2 comments2 min readLW link

Point­ing to a Flower

johnswentworth18 May 2020 18:54 UTC
54 points
18 comments9 min readLW link

Learn­ing and ma­nipu­lat­ing learning

Stuart_Armstrong19 May 2020 13:02 UTC
39 points
4 comments10 min readLW link

[Question] Why aren’t we test­ing gen­eral in­tel­li­gence dis­tri­bu­tion?

Bob Jacobs26 May 2020 16:07 UTC
25 points
7 comments1 min readLW link

OpenAI an­nounces GPT-3

gwern29 May 2020 1:49 UTC
67 points
23 comments1 min readLW link
(arxiv.org)

GPT-3: a dis­ap­point­ing paper

nostalgebraist29 May 2020 19:06 UTC
59 points
37 comments8 min readLW link

In­tro­duc­tion to Ex­is­ten­tial Risks from Ar­tifi­cial In­tel­li­gence, for an EA audience

JoshuaFox2 Jun 2020 8:30 UTC
10 points
1 comment1 min readLW link

Prepar­ing for “The Talk” with AI projects

Daniel Kokotajlo13 Jun 2020 23:01 UTC
62 points
16 comments3 min readLW link

[Question] What are the high-level ap­proaches to AI al­ign­ment?

G Gordon Worley III16 Jun 2020 17:10 UTC
12 points
13 comments1 min readLW link

Re­sults of $1,000 Or­a­cle con­test!

Stuart_Armstrong17 Jun 2020 17:44 UTC
55 points
2 comments1 min readLW link

[Question] Like­li­hood of hy­per­ex­is­ten­tial catas­tro­phe from a bug?

Anirandis18 Jun 2020 16:23 UTC
11 points
27 comments1 min readLW link

AI Benefits Post 1: In­tro­duc­ing “AI Benefits”

Cullen_OKeefe22 Jun 2020 16:59 UTC
11 points
3 comments3 min readLW link

Goals and short descriptions

Michele Campolo2 Jul 2020 17:41 UTC
14 points
8 comments5 min readLW link

Re­search ideas to study hu­mans with AI Safety in mind

Riccardo Volpato3 Jul 2020 16:01 UTC
21 points
2 comments5 min readLW link

AI Benefits Post 3: Direct and Indi­rect Ap­proaches to AI Benefits

Cullen_OKeefe6 Jul 2020 18:48 UTC
8 points
0 comments2 min readLW link

An­titrust-Com­pli­ant AI In­dus­try Self-Regulation

Cullen_OKeefe7 Jul 2020 20:53 UTC
9 points
3 comments1 min readLW link
(cullenokeefe.com)

Should AI Be Open?

Scott Alexander17 Dec 2015 8:25 UTC
16 points
2 comments13 min readLW link

Meta Pro­gram­ming GPT: A route to Su­per­in­tel­li­gence?

dmtea11 Jul 2020 14:51 UTC
10 points
7 comments4 min readLW link

The Dilemma of Worse Than Death Scenarios

arkaeik10 Jul 2018 9:18 UTC
6 points
17 comments4 min readLW link

[Question] What are the mostly likely ways AGI will emerge?

Craig Quiter14 Jul 2020 0:58 UTC
3 points
7 comments1 min readLW link

AI Benefits Post 4: Out­stand­ing Ques­tions on Select­ing Benefits

Cullen_OKeefe14 Jul 2020 17:26 UTC
4 points
4 comments5 min readLW link

Solv­ing Math Prob­lems by Relay

17 Jul 2020 15:32 UTC
88 points
26 comments7 min readLW link

AI Benefits Post 5: Out­stand­ing Ques­tions on Govern­ing Benefits

Cullen_OKeefe21 Jul 2020 16:46 UTC
4 points
0 comments4 min readLW link

[Question] Why is pseudo-al­ign­ment “worse” than other ways ML can fail to gen­er­al­ize?

nostalgebraist18 Jul 2020 22:54 UTC
43 points
9 comments2 min readLW link

[Question] “Do Noth­ing” util­ity func­tion, 3½ years later?

niplav20 Jul 2020 11:09 UTC
5 points
3 comments1 min readLW link

[AN #80]: Why AI risk might be solved with­out ad­di­tional in­ter­ven­tion from longtermists

rohinmshah2 Jan 2020 18:20 UTC
35 points
93 comments10 min readLW link
(mailchi.mp)

Ac­cess to AI: a hu­man right?

dmtea25 Jul 2020 9:38 UTC
5 points
3 comments2 min readLW link

The Rise of Com­mon­sense Reasoning

DragonGod27 Jul 2020 19:01 UTC
8 points
0 comments1 min readLW link
(www.reddit.com)

AI and Efficiency

DragonGod27 Jul 2020 20:58 UTC
9 points
1 comment1 min readLW link
(openai.com)

FHI Re­port: How Will Na­tional Se­cu­rity Con­sid­er­a­tions Affect An­titrust De­ci­sions in AI? An Ex­am­i­na­tion of His­tor­i­cal Precedents

Cullen_OKeefe28 Jul 2020 18:34 UTC
2 points
0 comments1 min readLW link
(www.fhi.ox.ac.uk)

The “best pre­dic­tor is mal­i­cious op­ti­miser” problem

Donald Hobson29 Jul 2020 11:49 UTC
14 points
10 comments2 min readLW link

Suffi­ciently Ad­vanced Lan­guage Models Can Do Re­in­force­ment Learning

Zachary Robertson2 Aug 2020 15:32 UTC
23 points
7 comments7 min readLW link

[Question] What are the most im­por­tant pa­pers/​post/​re­sources to read to un­der­stand more of GPT-3?

adamShimi2 Aug 2020 20:53 UTC
22 points
4 comments1 min readLW link

[Question] What should an Ein­stein-like figure in Ma­chine Learn­ing do?

Razied5 Aug 2020 23:52 UTC
3 points
3 comments1 min readLW link

Book re­view: Ar­chi­tects of In­tel­li­gence by Martin Ford (2018)

ofer11 Aug 2020 17:30 UTC
15 points
0 comments2 min readLW link

[Question] Will OpenAI’s work un­in­ten­tion­ally in­crease ex­is­ten­tial risks re­lated to AI?

adamShimi11 Aug 2020 18:16 UTC
48 points
54 comments1 min readLW link

Blog post: A tale of two re­search communities

alenglander12 Aug 2020 20:41 UTC
14 points
0 comments4 min readLW link

Map­ping Out Alignment

15 Aug 2020 1:02 UTC
42 points
0 comments5 min readLW link

My Un­der­stand­ing of Paul Chris­ti­ano’s Iter­ated Am­plifi­ca­tion AI Safety Re­search Agenda

Chi Nguyen15 Aug 2020 20:02 UTC
113 points
21 comments39 min readLW link

GPT-3, be­lief, and consistency

skybrian16 Aug 2020 23:12 UTC
18 points
7 comments2 min readLW link

[Question] What pre­cisely do we mean by AI al­ign­ment?

G Gordon Worley III9 Dec 2018 2:23 UTC
27 points
8 comments1 min readLW link

Thoughts on the Fea­si­bil­ity of Pro­saic AGI Align­ment?

iamthouthouarti21 Aug 2020 23:25 UTC
8 points
10 comments1 min readLW link

[Question] Fore­cast­ing Thread: AI Timelines

22 Aug 2020 2:33 UTC
114 points
87 comments2 min readLW link

Learn­ing hu­man prefer­ences: black-box, white-box, and struc­tured white-box access

Stuart_Armstrong24 Aug 2020 11:42 UTC
23 points
9 comments6 min readLW link

Proofs Sec­tion 2.3 (Up­dates, De­ci­sion The­ory)

Diffractor27 Aug 2020 7:49 UTC
7 points
0 comments31 min readLW link

Proofs Sec­tion 2.2 (Iso­mor­phism to Ex­pec­ta­tions)

Diffractor27 Aug 2020 7:52 UTC
7 points
0 comments46 min readLW link

Proofs Sec­tion 2.1 (The­o­rem 1, Lem­mas)

Diffractor27 Aug 2020 7:54 UTC
7 points
0 comments36 min readLW link

Proofs Sec­tion 1.1 (Ini­tial re­sults to LF-du­al­ity)

Diffractor27 Aug 2020 7:59 UTC
6 points
0 comments20 min readLW link

Proofs Sec­tion 1.2 (Mix­tures, Up­dates, Push­for­wards)

Diffractor27 Aug 2020 7:57 UTC
7 points
0 comments14 min readLW link

Ba­sic In­framea­sure Theory

Diffractor27 Aug 2020 8:02 UTC
20 points
10 comments25 min readLW link

Belief Func­tions And De­ci­sion Theory

Diffractor27 Aug 2020 8:00 UTC
12 points
8 comments39 min readLW link

Tech­ni­cal model re­fine­ment formalism

Stuart_Armstrong27 Aug 2020 11:54 UTC
9 points
0 comments6 min readLW link

Pong from pix­els with­out read­ing “Pong from Pix­els”

naimenz29 Aug 2020 17:26 UTC
15 points
1 comment7 min readLW link

Reflec­tions on AI Timelines Fore­cast­ing Thread

Amandango1 Sep 2020 1:42 UTC
53 points
7 comments5 min readLW link

on “learn­ing to sum­ma­rize”

nostalgebraist12 Sep 2020 3:20 UTC
22 points
13 comments8 min readLW link
(nostalgebraist.tumblr.com)

[Question] The uni­ver­sal­ity of com­pu­ta­tion and mind de­sign space

alanf12 Sep 2020 14:58 UTC
1 point
7 comments1 min readLW link

Clar­ify­ing “What failure looks like” (part 1)

Sam Clarke20 Sep 2020 20:40 UTC
69 points
13 comments17 min readLW link

Hu­man Bi­ases that Ob­scure AI Progress

Phylliida Dev25 Sep 2020 0:24 UTC
42 points
2 comments4 min readLW link

[Question] Com­pe­tence vs Alignment

Ariel Kwiatkowski30 Sep 2020 21:03 UTC
6 points
4 comments1 min readLW link

[Question] GPT-3 + GAN

stick10917 Oct 2020 7:58 UTC
4 points
2 comments1 min readLW link

Book Re­view: Re­in­force­ment Learn­ing by Sut­ton and Barto

billmei20 Oct 2020 19:40 UTC
47 points
3 comments10 min readLW link

GPT-X, Paper­clip Max­i­mizer? An­a­lyz­ing AGI and Fi­nal Goals

meanderingmoose22 Oct 2020 14:33 UTC
8 points
1 comment6 min readLW link

Con­tain­ing the AI… In­side a Si­mu­lated Reality

HumaneAutomation31 Oct 2020 16:16 UTC
1 point
5 comments2 min readLW link

Why those who care about catas­trophic and ex­is­ten­tial risk should care about au­tonomous weapons

aaguirre11 Nov 2020 15:22 UTC
48 points
20 comments19 min readLW link

Euro­pean Master’s Pro­grams in Ma­chine Learn­ing, Ar­tifi­cial In­tel­li­gence, and re­lated fields

Master Programs ML/AI14 Nov 2020 15:51 UTC
25 points
8 comments1 min readLW link

Should we post­pone AGI un­til we reach safety?

otto.barten18 Nov 2020 15:43 UTC
23 points
36 comments3 min readLW link

Com­mit­ment and cred­i­bil­ity in mul­ti­po­lar AI scenarios

anni_leskela4 Dec 2020 18:48 UTC
25 points
3 comments18 min readLW link

[Question] AI Win­ter Is Com­ing—How to profit from it?

maximkazhenkov5 Dec 2020 20:23 UTC
10 points
7 comments1 min readLW link

An­nounc­ing the Tech­ni­cal AI Safety Podcast

Quinn7 Dec 2020 18:51 UTC
42 points
4 comments2 min readLW link
(technical-ai-safety.libsyn.com)

All GPT skills are translation

p.b.13 Dec 2020 20:06 UTC
4 points
0 comments2 min readLW link

[Question] Judg­ing AGI Output

meredev14 Dec 2020 12:43 UTC
3 points
0 comments2 min readLW link

Risk Map of AI Systems

15 Dec 2020 9:16 UTC
24 points
3 comments8 min readLW link

AI Align­ment, Philo­soph­i­cal Plu­ral­ism, and the Rele­vance of Non-Western Philosophy

xuan1 Jan 2021 0:08 UTC
28 points
19 comments20 min readLW link

Are we all mis­al­igned?

Mateusz Mazurkiewicz3 Jan 2021 2:42 UTC
10 points
0 comments5 min readLW link

[Question] What do we *re­ally* ex­pect from a well-al­igned AI?

jan betley4 Jan 2021 20:57 UTC
8 points
10 comments1 min readLW link

Eight claims about multi-agent AGI safety

Richard_Ngo7 Jan 2021 13:34 UTC
69 points
18 comments4 min readLW link

Imi­ta­tive Gen­er­al­i­sa­tion (AKA ‘Learn­ing the Prior’)

Beth Barnes10 Jan 2021 0:30 UTC
74 points
12 comments12 min readLW link

Pre­dic­tion can be Outer Aligned at Optimum

Lanrian10 Jan 2021 18:48 UTC
13 points
11 comments11 min readLW link

[Question] Poll: Which vari­ables are most strate­gi­cally rele­vant?

22 Jan 2021 17:17 UTC
32 points
34 comments1 min readLW link

AISU 2021

Linda Linsefors30 Jan 2021 17:40 UTC
27 points
2 comments1 min readLW link

Deep­mind has made a gen­eral in­duc­tor (“Mak­ing sense of sen­sory in­put”)

MakoYass2 Feb 2021 2:54 UTC
46 points
10 comments1 min readLW link
(www.sciencedirect.com)

Coun­ter­fac­tual Plan­ning in AGI Systems

Koen.Holtman3 Feb 2021 13:54 UTC
5 points
0 comments5 min readLW link

[AN #136]: How well will GPT-N perform on down­stream tasks?

rohinmshah3 Feb 2021 18:10 UTC
21 points
2 comments9 min readLW link
(mailchi.mp)

For­mal Solu­tion to the In­ner Align­ment Problem

michaelcohen18 Feb 2021 14:51 UTC
46 points
122 comments2 min readLW link

TASP Ep 3 - Op­ti­mal Poli­cies Tend to Seek Power

Quinn11 Mar 2021 1:44 UTC
24 points
0 comments1 min readLW link
(technical-ai-safety.libsyn.com)

Phy­lac­tery De­ci­sion Theory

Bunthut2 Apr 2021 20:55 UTC
14 points
6 comments2 min readLW link

Pre­dic­tive Cod­ing has been Unified with Backpropagation

lsusr2 Apr 2021 21:42 UTC
142 points
42 comments2 min readLW link

[Question] What if we could use the the­ory of Mechanism De­sign from Game The­ory as a medium achieve AI Align­ment?

farari74 Apr 2021 12:56 UTC
4 points
0 comments1 min readLW link

A Sys­tem For Evolv­ing In­creas­ingly Gen­eral Ar­tifi­cial In­tel­li­gence From Cur­rent Technologies

Tsang Chung Shu8 Apr 2021 21:37 UTC
1 point
3 comments11 min readLW link

An Ortho­dox Case Against Utility Functions

abramdemski7 Apr 2020 19:18 UTC
113 points
49 comments8 min readLW link

2018 AI Align­ment Liter­a­ture Re­view and Char­ity Comparison

Larks18 Dec 2018 4:46 UTC
190 points
26 comments62 min readLW link

Real­ism about rationality

Richard_Ngo16 Sep 2018 10:46 UTC
172 points
139 comments4 min readLW link
(thinkingcomplete.blogspot.com)

De­bate on In­stru­men­tal Con­ver­gence be­tween LeCun, Rus­sell, Ben­gio, Zador, and More

Ben Pace4 Oct 2019 4:08 UTC
177 points
54 comments15 min readLW link2 nominations2 reviews

The Parable of Pre­dict-O-Matic

abramdemski15 Oct 2019 0:49 UTC
245 points
41 comments14 min readLW link5 nominations4 reviews

“How con­ser­va­tive” should the par­tial max­imisers be?

Stuart_Armstrong13 Apr 2020 15:50 UTC
21 points
8 comments2 min readLW link

[AN #95]: A frame­work for think­ing about how to make AI go well

rohinmshah15 Apr 2020 17:10 UTC
20 points
2 comments10 min readLW link
(mailchi.mp)

AI Align­ment Pod­cast: An Overview of Tech­ni­cal AI Align­ment in 2018 and 2019 with Buck Sh­legeris and Ro­hin Shah

Palus Astra16 Apr 2020 0:50 UTC
46 points
27 comments89 min readLW link

Open ques­tion: are min­i­mal cir­cuits dae­mon-free?

paulfchristiano5 May 2018 22:40 UTC
79 points
69 comments2 min readLW link

Disen­tan­gling ar­gu­ments for the im­por­tance of AI safety

Richard_Ngo21 Jan 2019 12:41 UTC
123 points
23 comments8 min readLW link

[AI Align­ment Fo­rum] Database Main­te­nance Today

habryka16 Apr 2020 19:11 UTC
8 points
0 comments1 min readLW link

In­te­grat­ing Hid­den Vari­ables Im­proves Approximation

johnswentworth16 Apr 2020 21:43 UTC
15 points
4 comments1 min readLW link

AI Ser­vices as a Re­search Paradigm

VojtaKovarik20 Apr 2020 13:00 UTC
30 points
12 comments4 min readLW link
(docs.google.com)

Databases of hu­man be­havi­our and prefer­ences?

Stuart_Armstrong21 Apr 2020 18:06 UTC
10 points
9 comments1 min readLW link

Critch on ca­reer ad­vice for ju­nior AI-x-risk-con­cerned researchers

Rob Bensinger12 May 2018 2:13 UTC
110 points
25 comments4 min readLW link

Refram­ing Impact

TurnTrout20 Sep 2019 19:03 UTC
84 points
14 comments3 min readLW link2 nominations1 review

De­scrip­tion vs simu­lated prediction

Richard Korzekwa 22 Apr 2020 16:40 UTC
26 points
0 comments5 min readLW link
(aiimpacts.org)

Deep­Mind team on speci­fi­ca­tion gaming

JoshuaFox23 Apr 2020 8:01 UTC
30 points
2 comments1 min readLW link
(deepmind.com)

[Question] Does Agent-like Be­hav­ior Im­ply Agent-like Ar­chi­tec­ture?

Scott Garrabrant23 Aug 2019 2:01 UTC
38 points
7 comments1 min readLW link

Risks from Learned Op­ti­miza­tion: Con­clu­sion and Re­lated Work

7 Jun 2019 19:53 UTC
70 points
4 comments6 min readLW link

De­cep­tive Alignment

5 Jun 2019 20:16 UTC
69 points
11 comments17 min readLW link

The In­ner Align­ment Problem

4 Jun 2019 1:20 UTC
76 points
17 comments13 min readLW link

How the MtG Color Wheel Ex­plains AI Safety

Scott Garrabrant15 Feb 2019 23:42 UTC
56 points
4 comments6 min readLW link

[Question] How does Gra­di­ent Des­cent In­ter­act with Good­hart?

Scott Garrabrant2 Feb 2019 0:14 UTC
68 points
19 comments4 min readLW link

For­mal Open Prob­lem in De­ci­sion Theory

Scott Garrabrant29 Nov 2018 3:25 UTC
34 points
11 comments4 min readLW link

The Ubiquitous Con­verse Law­vere Problem

Scott Garrabrant29 Nov 2018 3:16 UTC
21 points
0 comments2 min readLW link

Embed­ded Curiosities

8 Nov 2018 14:19 UTC
85 points
1 comment2 min readLW link

Sub­sys­tem Alignment

6 Nov 2018 16:16 UTC
99 points
12 comments1 min readLW link

Ro­bust Delegation

4 Nov 2018 16:38 UTC
108 points
10 comments1 min readLW link

Embed­ded World-Models

2 Nov 2018 16:07 UTC
85 points
16 comments1 min readLW link

De­ci­sion Theory

31 Oct 2018 18:41 UTC
105 points
38 comments1 min readLW link

(A → B) → A

Scott Garrabrant11 Sep 2018 22:38 UTC
45 points
10 comments2 min readLW link

His­tory of the Devel­op­ment of Log­i­cal Induction

Scott Garrabrant29 Aug 2018 3:15 UTC
87 points
4 comments5 min readLW link

Op­ti­miza­tion Amplifies

Scott Garrabrant27 Jun 2018 1:51 UTC
86 points
12 comments4 min readLW link

What makes coun­ter­fac­tu­als com­pa­rable?

Chris_Leong24 Apr 2020 22:47 UTC
11 points
6 comments3 min readLW link

New Paper Ex­pand­ing on the Good­hart Taxonomy

Scott Garrabrant14 Mar 2018 9:01 UTC
17 points
4 comments1 min readLW link
(arxiv.org)

Sources of in­tu­itions and data on AGI

Scott Garrabrant31 Jan 2018 23:30 UTC
80 points
26 comments3 min readLW link

Corrigibility

paulfchristiano27 Nov 2018 21:50 UTC
40 points
4 comments6 min readLW link

AI pre­dic­tion case study 5: Omo­hun­dro’s AI drives

Stuart_Armstrong15 Mar 2013 9:09 UTC
10 points
5 comments8 min readLW link

Toy model: con­ver­gent in­stru­men­tal goals

Stuart_Armstrong25 Feb 2016 14:03 UTC
15 points
2 comments4 min readLW link

AI-cre­ated pseudo-deontology

Stuart_Armstrong12 Feb 2015 21:11 UTC
10 points
35 comments1 min readLW link

Eth­i­cal Injunctions

Eliezer Yudkowsky20 Oct 2008 23:00 UTC
47 points
76 comments9 min readLW link

Mo­ti­vat­ing Ab­strac­tion-First De­ci­sion Theory

johnswentworth29 Apr 2020 17:47 UTC
39 points
16 comments5 min readLW link

[AN #97]: Are there his­tor­i­cal ex­am­ples of large, ro­bust dis­con­ti­nu­ities?

rohinmshah29 Apr 2020 17:30 UTC
15 points
0 comments10 min readLW link
(mailchi.mp)

My Up­dat­ing Thoughts on AI policy

Ben Pace1 Mar 2020 7:06 UTC
20 points
1 comment9 min readLW link

Use­ful Does Not Mean Secure

Ben Pace30 Nov 2019 2:05 UTC
44 points
12 comments11 min readLW link

[Question] What is the al­ter­na­tive to in­tent al­ign­ment called?

Richard_Ngo30 Apr 2020 2:16 UTC
10 points
6 comments1 min readLW link

Op­ti­mis­ing So­ciety to Con­strain Risk of War from an Ar­tifi­cial Su­per­in­tel­li­gence

JohnCDraper30 Apr 2020 10:47 UTC
3 points
0 comments51 min readLW link

[Question] Juke­box: how to up­date from AI imi­tat­ing hu­mans?

Michaël Trazzi30 Apr 2020 20:50 UTC
9 points
0 comments1 min readLW link

Stan­ford En­cy­clo­pe­dia of Philos­o­phy on AI ethics and superintelligence

Kaj_Sotala2 May 2020 7:35 UTC
41 points
19 comments7 min readLW link
(plato.stanford.edu)

[Question] How does iter­ated am­plifi­ca­tion ex­ceed hu­man abil­ities?

riceissa2 May 2020 23:44 UTC
19 points
9 comments2 min readLW link

How uniform is the neo­cor­tex?

zhukeepa4 May 2020 2:16 UTC
70 points
22 comments11 min readLW link

Scott Garrabrant’s prob­lem on re­cov­er­ing Brouwer as a corol­lary of Lawvere

Rupert4 May 2020 10:01 UTC
25 points
2 comments2 min readLW link

“AI and Effi­ciency”, OA (44✕ im­prove­ment in CNNs since 2012)

gwern5 May 2020 16:32 UTC
47 points
0 comments1 min readLW link
(openai.com)

Com­pet­i­tive safety via gra­dated curricula

Richard_Ngo5 May 2020 18:11 UTC
35 points
5 comments5 min readLW link

Model­ing nat­u­ral­ized de­ci­sion prob­lems in lin­ear logic

jessicata6 May 2020 0:15 UTC
14 points
2 comments6 min readLW link
(unstableontology.com)

[AN #98]: Un­der­stand­ing neu­ral net train­ing by see­ing which gra­di­ents were helpful

rohinmshah6 May 2020 17:10 UTC
22 points
3 comments9 min readLW link
(mailchi.mp)

[Question] Is AI safety re­search less par­alleliz­able than AI re­search?

Mati_Roy10 May 2020 20:43 UTC
9 points
5 comments1 min readLW link

Thoughts on im­ple­ment­ing cor­rigible ro­bust alignment

Steven Byrnes26 Nov 2019 14:06 UTC
26 points
2 comments6 min readLW link

Wire­head­ing is in the eye of the beholder

Stuart_Armstrong30 Jan 2019 18:23 UTC
26 points
10 comments1 min readLW link

Wire­head­ing as a po­ten­tial prob­lem with the new im­pact measure

Stuart_Armstrong25 Sep 2018 14:15 UTC
25 points
20 comments4 min readLW link

Wire­head­ing and discontinuity

Michele Campolo18 Feb 2020 10:49 UTC
21 points
4 comments3 min readLW link

[AN #99]: Dou­bling times for the effi­ciency of AI algorithms

rohinmshah13 May 2020 17:20 UTC
29 points
0 comments10 min readLW link
(mailchi.mp)

How should AIs up­date a prior over hu­man prefer­ences?

Stuart_Armstrong15 May 2020 13:14 UTC
17 points
9 comments2 min readLW link

Con­jec­ture Workshop

johnswentworth15 May 2020 22:41 UTC
34 points
2 comments2 min readLW link

Multi-agent safety

Richard_Ngo16 May 2020 1:59 UTC
24 points
8 comments5 min readLW link

The Mechanis­tic and Nor­ma­tive Struc­ture of Agency

G Gordon Worley III18 May 2020 16:03 UTC
14 points
4 comments1 min readLW link
(philpapers.org)

“Star­wink” by Alicorn

Zack_M_Davis18 May 2020 8:17 UTC
41 points
1 comment1 min readLW link
(alicorn.elcenia.com)

[AN #100]: What might go wrong if you learn a re­ward func­tion while acting

rohinmshah20 May 2020 17:30 UTC
33 points
2 comments12 min readLW link
(mailchi.mp)

Prob­a­bil­ities, weights, sums: pretty much the same for re­ward functions

Stuart_Armstrong20 May 2020 15:19 UTC
11 points
1 comment2 min readLW link

[Question] Source code size vs learned model size in ML and in hu­mans?

riceissa20 May 2020 8:47 UTC
11 points
6 comments1 min readLW link

Com­par­ing re­ward learn­ing/​re­ward tam­per­ing formalisms

Stuart_Armstrong21 May 2020 12:03 UTC
9 points
3 comments3 min readLW link

AGIs as collectives

Richard_Ngo22 May 2020 20:36 UTC
21 points
23 comments4 min readLW link

[AN #101]: Why we should rigor­ously mea­sure and fore­cast AI progress

rohinmshah27 May 2020 17:20 UTC
15 points
0 comments10 min readLW link
(mailchi.mp)

AI Safety Dis­cus­sion Days

Linda Linsefors27 May 2020 16:54 UTC
12 points
1 comment3 min readLW link

Build­ing brain-in­spired AGI is in­finitely eas­ier than un­der­stand­ing the brain

Steven Byrnes2 Jun 2020 14:13 UTC
42 points
7 comments7 min readLW link

Spar­sity and in­ter­pretabil­ity?

1 Jun 2020 13:25 UTC
40 points
3 comments7 min readLW link

GPT-3: A Summary

leogao2 Jun 2020 18:14 UTC
19 points
0 comments1 min readLW link
(leogao.dev)

Inac­cessible information

paulfchristiano3 Jun 2020 5:10 UTC
82 points
15 comments14 min readLW link
(ai-alignment.com)

[AN #102]: Meta learn­ing by GPT-3, and a list of full pro­pos­als for AI alignment

rohinmshah3 Jun 2020 17:20 UTC
38 points
6 comments10 min readLW link
(mailchi.mp)

Feed­back is cen­tral to agency

alexflint1 Jun 2020 12:56 UTC
28 points
0 comments3 min readLW link

Think­ing About Su­per-Hu­man AI: An Ex­am­i­na­tion of Likely Paths and Ul­ti­mate Constitution

meanderingmoose4 Jun 2020 23:22 UTC
−3 points
0 comments7 min readLW link

Emer­gence and Con­trol: An ex­am­i­na­tion of our abil­ity to gov­ern the be­hav­ior of in­tel­li­gent systems

meanderingmoose5 Jun 2020 17:10 UTC
1 point
0 comments6 min readLW link

GAN Discrim­i­na­tors Don’t Gen­er­al­ize?

tryactions8 Jun 2020 20:36 UTC
18 points
7 comments2 min readLW link

More on dis­am­biguat­ing “dis­con­ti­nu­ity”

alenglander9 Jun 2020 15:16 UTC
16 points
1 comment3 min readLW link

[AN #103]: ARCHES: an agenda for ex­is­ten­tial safety, and com­bin­ing nat­u­ral lan­guage with deep RL

rohinmshah10 Jun 2020 17:20 UTC
27 points
1 comment10 min readLW link
(mailchi.mp)

Dutch-Book­ing CDT: Re­vised Argument

abramdemski27 Oct 2020 4:31 UTC
47 points
20 comments16 min readLW link

[Question] List of pub­lic pre­dic­tions of what GPT-X can or can’t do?

Daniel Kokotajlo14 Jun 2020 14:25 UTC
20 points
9 comments1 min readLW link

Achiev­ing AI al­ign­ment through de­liber­ate un­cer­tainty in mul­ti­a­gent systems

Florian Dietz15 Jun 2020 12:19 UTC
3 points
10 comments7 min readLW link

Su­per­ex­po­nen­tial His­toric Growth, by David Roodman

Ben Pace15 Jun 2020 21:49 UTC
43 points
6 comments5 min readLW link
(www.openphilanthropy.org)

Re­lat­ing HCH and Log­i­cal Induction

abramdemski16 Jun 2020 22:08 UTC
49 points
4 comments5 min readLW link

Image GPT

Daniel Kokotajlo18 Jun 2020 11:41 UTC
29 points
27 comments1 min readLW link
(openai.com)

[AN #104]: The per­ils of in­ac­cessible in­for­ma­tion, and what we can learn about AI al­ign­ment from COVID

rohinmshah18 Jun 2020 17:10 UTC
19 points
5 comments8 min readLW link
(mailchi.mp)

[Question] If AI is based on GPT, how to en­sure its safety?

avturchin18 Jun 2020 20:33 UTC
20 points
11 comments1 min readLW link

What’s Your Cog­ni­tive Al­gorithm?

Raemon18 Jun 2020 22:16 UTC
69 points
23 comments13 min readLW link

Rele­vant pre-AGI possibilities

Daniel Kokotajlo20 Jun 2020 10:52 UTC
30 points
7 comments19 min readLW link
(aiimpacts.org)

Plau­si­ble cases for HRAD work, and lo­cat­ing the crux in the “re­al­ism about ra­tio­nal­ity” debate

riceissa22 Jun 2020 1:10 UTC
80 points
14 comments10 min readLW link

The In­dex­ing Problem

johnswentworth22 Jun 2020 19:11 UTC
34 points
2 comments4 min readLW link

[Question] Re­quest­ing feed­back/​ad­vice: what Type The­ory to study for AI safety?

rvnnt23 Jun 2020 17:03 UTC
7 points
4 comments3 min readLW link

Lo­cal­ity of goals

adamShimi22 Jun 2020 21:56 UTC
16 points
8 comments6 min readLW link

[Question] What is “In­stru­men­tal Cor­rigi­bil­ity”?

joebernstein23 Jun 2020 20:24 UTC
3 points
1 comment1 min readLW link

Models, myths, dreams, and Cheshire cat grins

Stuart_Armstrong24 Jun 2020 10:50 UTC
21 points
7 comments2 min readLW link

[AN #105]: The eco­nomic tra­jec­tory of hu­man­ity, and what we might mean by optimization

rohinmshah24 Jun 2020 17:30 UTC
24 points
3 comments11 min readLW link
(mailchi.mp)

There’s an Awe­some AI Ethics List and it’s a lit­tle thin

AABoyles25 Jun 2020 13:43 UTC
13 points
1 comment1 min readLW link
(github.com)

GPT-3 Fic­tion Samples

gwern25 Jun 2020 16:12 UTC
61 points
18 comments1 min readLW link
(www.gwern.net)

Walk­through: The Trans­former Ar­chi­tec­ture [Part 1/​2]

Matthew Barnett30 Jul 2019 13:54 UTC
34 points
0 comments6 min readLW link

Ro­bust­ness as a Path to AI Alignment

abramdemski10 Oct 2017 8:14 UTC
45 points
9 comments9 min readLW link

Rad­i­cal Prob­a­bil­ism [Tran­script]

26 Jun 2020 22:14 UTC
45 points
12 comments6 min readLW link

AI safety via mar­ket making

evhub26 Jun 2020 23:07 UTC
49 points
40 comments11 min readLW link

[Question] Have gen­eral de­com­posers been for­mal­ized?

Quinn27 Jun 2020 18:09 UTC
8 points
5 comments1 min readLW link

Gary Mar­cus vs Cor­ti­cal Uniformity

Steven Byrnes28 Jun 2020 18:18 UTC
22 points
0 comments8 min readLW link

Web AI dis­cus­sion Groups

Donald Hobson30 Jun 2020 11:22 UTC
10 points
0 comments2 min readLW link

Com­par­ing AI Align­ment Ap­proaches to Min­i­mize False Pos­i­tive Risk

G Gordon Worley III30 Jun 2020 19:34 UTC
5 points
0 comments9 min readLW link

AvE: As­sis­tance via Empowerment

FactorialCode30 Jun 2020 22:07 UTC
12 points
1 comment1 min readLW link
(arxiv.org)

Evan Hub­inger on In­ner Align­ment, Outer Align­ment, and Pro­pos­als for Build­ing Safe Ad­vanced AI

Palus Astra1 Jul 2020 17:30 UTC
34 points
4 comments67 min readLW link

[AN #106]: Eval­u­at­ing gen­er­al­iza­tion abil­ity of learned re­ward models

rohinmshah1 Jul 2020 17:20 UTC
14 points
2 comments11 min readLW link
(mailchi.mp)

The “AI De­bate” Debate

michaelcohen2 Jul 2020 10:16 UTC
20 points
20 comments3 min readLW link

Idea: Imi­ta­tion/​Value Learn­ing AIXI

Zachary Robertson3 Jul 2020 17:10 UTC
3 points
6 comments1 min readLW link

Split­ting De­bate up into Two Subsystems

Nandi3 Jul 2020 20:11 UTC
13 points
5 comments4 min readLW link

AI Un­safety via Non-Zero-Sum Debate

VojtaKovarik3 Jul 2020 22:03 UTC
25 points
10 comments5 min readLW link

Clas­sify­ing games like the Pri­soner’s Dilemma

philh4 Jul 2020 17:10 UTC
78 points
23 comments6 min readLW link
(reasonableapproximation.net)

AI-Feyn­man as a bench­mark for what we should be aiming for

Faustus24 Jul 2020 9:24 UTC
8 points
1 comment2 min readLW link

Learn­ing the prior

paulfchristiano5 Jul 2020 21:00 UTC
78 points
26 comments8 min readLW link
(ai-alignment.com)

Bet­ter pri­ors as a safety problem

paulfchristiano5 Jul 2020 21:20 UTC
63 points
7 comments5 min readLW link
(ai-alignment.com)

[Question] How far is AGI?

Roko Jelavić5 Jul 2020 17:58 UTC
6 points
5 comments1 min readLW link

Clas­sify­ing speci­fi­ca­tion prob­lems as var­i­ants of Good­hart’s Law

Vika19 Aug 2019 20:40 UTC
67 points
5 comments5 min readLW link2 nominations1 review

New safety re­search agenda: scal­able agent al­ign­ment via re­ward modeling

Vika20 Nov 2018 17:29 UTC
34 points
13 comments1 min readLW link
(medium.com)

De­sign­ing agent in­cen­tives to avoid side effects

11 Mar 2019 20:55 UTC
29 points
0 comments2 min readLW link
(medium.com)

Dis­cus­sion on the ma­chine learn­ing ap­proach to AI safety

Vika1 Nov 2018 20:54 UTC
25 points
3 comments4 min readLW link

Speci­fi­ca­tion gam­ing ex­am­ples in AI

Vika3 Apr 2018 12:30 UTC
39 points
9 comments1 min readLW link

[Question] (an­swered: yes) Has any­one writ­ten up a con­sid­er­a­tion of Downs’s “Para­dox of Vot­ing” from the per­spec­tive of MIRI-ish de­ci­sion the­o­ries (UDT, FDT, or even just EDT)?

Jameson Quinn6 Jul 2020 18:26 UTC
9 points
24 comments1 min readLW link

New Deep­Mind AI Safety Re­search Blog

Vika27 Sep 2018 16:28 UTC
43 points
0 comments1 min readLW link
(medium.com)

Con­test: $1,000 for good ques­tions to ask to an Or­a­cle AI

Stuart_Armstrong31 Jul 2019 18:48 UTC
56 points
156 comments3 min readLW link

De­con­fus­ing Hu­man Values Re­search Agenda v1

G Gordon Worley III23 Mar 2020 16:25 UTC
23 points
12 comments4 min readLW link

[Question] How “hon­est” is GPT-3?

abramdemski8 Jul 2020 19:38 UTC
72 points
18 comments5 min readLW link

What does it mean to ap­ply de­ci­sion the­ory?

abramdemski8 Jul 2020 20:31 UTC
40 points
5 comments8 min readLW link

AI Re­search Con­sid­er­a­tions for Hu­man Ex­is­ten­tial Safety (ARCHES)

habryka9 Jul 2020 2:49 UTC
60 points
8 comments1 min readLW link
(arxiv.org)

The Un­rea­son­able Effec­tive­ness of Deep Learning

Richard_Ngo30 Sep 2018 15:48 UTC
81 points
5 comments13 min readLW link
(thinkingcomplete.blogspot.com)

mAIry’s room: AI rea­son­ing to solve philo­soph­i­cal problems

Stuart_Armstrong5 Mar 2019 20:24 UTC
91 points
41 comments6 min readLW link2 nominations2 reviews

Failures of an em­bod­ied AIXI

So8res15 Jun 2014 18:29 UTC
46 points
46 comments12 min readLW link

The Prob­lem with AIXI

Rob Bensinger18 Mar 2014 1:55 UTC
43 points
78 comments23 min readLW link

Ver­sions of AIXI can be ar­bi­trar­ily stupid

Stuart_Armstrong10 Aug 2015 13:23 UTC
29 points
59 comments1 min readLW link

Reflec­tive AIXI and Anthropics

Diffractor24 Sep 2018 2:15 UTC
17 points
13 comments8 min readLW link

AIXI and Ex­is­ten­tial Despair

paulfchristiano8 Dec 2011 20:03 UTC
23 points
38 comments6 min readLW link

How to make AIXI-tl in­ca­pable of learning

itaibn027 Jan 2014 0:05 UTC
7 points
5 comments2 min readLW link

Help re­quest: What is the Kol­mogorov com­plex­ity of com­putable ap­prox­i­ma­tions to AIXI?

AnnaSalamon5 Dec 2010 10:23 UTC
7 points
9 comments1 min readLW link

“AIXIjs: A Soft­ware Demo for Gen­eral Re­in­force­ment Learn­ing”, As­lanides 2017

gwern29 May 2017 21:09 UTC
7 points
1 comment1 min readLW link
(arxiv.org)

Can AIXI be trained to do any­thing a hu­man can?

Stuart_Armstrong20 Oct 2014 13:12 UTC
5 points
9 comments2 min readLW link

Shap­ing eco­nomic in­cen­tives for col­lab­o­ra­tive AGI

Kaj_Sotala29 Jun 2018 16:26 UTC
45 points
15 comments4 min readLW link

Is the Star Trek Fed­er­a­tion re­ally in­ca­pable of build­ing AI?

Kaj_Sotala18 Mar 2018 10:30 UTC
10 points
4 comments2 min readLW link
(kajsotala.fi)

Some con­cep­tual high­lights from “Disjunc­tive Sce­nar­ios of Catas­trophic AI Risk”

Kaj_Sotala12 Feb 2018 12:30 UTC
29 points
4 comments6 min readLW link
(kajsotala.fi)

Mis­con­cep­tions about con­tin­u­ous takeoff

Matthew Barnett8 Oct 2019 21:31 UTC
71 points
38 comments4 min readLW link1 nomination

Dist­in­guish­ing defi­ni­tions of takeoff

Matthew Barnett14 Feb 2020 0:16 UTC
53 points
6 comments6 min readLW link

Book re­view: Ar­tifi­cial In­tel­li­gence Safety and Security

PeterMcCluskey8 Dec 2018 3:47 UTC
27 points
3 comments8 min readLW link
(www.bayesianinvestor.com)

Why AI may not foom

John_Maxwell24 Mar 2013 8:11 UTC
28 points
81 comments12 min readLW link

Hu­mans Who Are Not Con­cen­trat­ing Are Not Gen­eral Intelligences

sarahconstantin25 Feb 2019 20:40 UTC
156 points
34 comments6 min readLW link4 nominations1 review
(srconstantin.wordpress.com)

The Hacker Learns to Trust

Ben Pace22 Jun 2019 0:27 UTC
78 points
18 comments8 min readLW link
(medium.com)

Book Re­view: Hu­man Compatible

Scott Alexander31 Jan 2020 5:20 UTC
75 points
6 comments16 min readLW link
(slatestarcodex.com)

SSC Jour­nal Club: AI Timelines

Scott Alexander8 Jun 2017 19:00 UTC
9 points
2 comments8 min readLW link

Ar­gu­ments against my­opic training

Richard_Ngo9 Jul 2020 16:07 UTC
51 points
37 comments12 min readLW link

On mo­ti­va­tions for MIRI’s highly re­li­able agent de­sign research

jessicata29 Jan 2017 19:34 UTC
22 points
1 comment5 min readLW link

Why is the im­pact penalty time-in­con­sis­tent?

Stuart_Armstrong9 Jul 2020 17:26 UTC
16 points
1 comment2 min readLW link

My cur­rent take on the Paul-MIRI dis­agree­ment on al­ignabil­ity of messy AI

jessicata29 Jan 2017 20:52 UTC
20 points
0 comments10 min readLW link

Ben Go­ertzel: The Sin­gu­lar­ity In­sti­tute’s Scary Idea (and Why I Don’t Buy It)

Paul Crowley30 Oct 2010 9:31 UTC
42 points
442 comments1 min readLW link

An An­a­lytic Per­spec­tive on AI Alignment

DanielFilan1 Mar 2020 4:10 UTC
53 points
45 comments8 min readLW link
(danielfilan.com)

Mechanis­tic Trans­parency for Ma­chine Learning

DanielFilan11 Jul 2018 0:34 UTC
55 points
9 comments4 min readLW link

A model I use when mak­ing plans to re­duce AI x-risk

Ben Pace19 Jan 2018 0:21 UTC
66 points
41 comments6 min readLW link

AI Re­searchers On AI Risk

Scott Alexander22 May 2015 11:16 UTC
14 points
0 comments16 min readLW link

Mini ad­vent cal­en­dar of Xrisks: Ar­tifi­cial Intelligence

Stuart_Armstrong7 Dec 2012 11:26 UTC
5 points
5 comments1 min readLW link

For FAI: Is “Molec­u­lar Nan­otech­nol­ogy” putting our best foot for­ward?

leplen22 Jun 2013 4:44 UTC
78 points
118 comments3 min readLW link

UFAI can­not be the Great Filter

Thrasymachus22 Dec 2012 11:26 UTC
59 points
92 comments3 min readLW link

Don’t Fear The Filter

Scott Alexander29 May 2014 0:45 UTC
7 points
17 comments6 min readLW link

The Great Filter is early, or AI is hard

Stuart_Armstrong29 Aug 2014 16:17 UTC
32 points
76 comments1 min readLW link

Talk: Key Is­sues In Near-Term AI Safety Research

alenglander10 Jul 2020 18:36 UTC
22 points
1 comment1 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

Steven Byrnes10 Jul 2020 16:49 UTC
40 points
5 comments8 min readLW link

AlphaS­tar: Im­pres­sive for RL progress, not for AGI progress

orthonormal2 Nov 2019 1:50 UTC
111 points
58 comments2 min readLW link2 nominations1 review

The Catas­trophic Con­ver­gence Conjecture

TurnTrout14 Feb 2020 21:16 UTC
39 points
15 comments8 min readLW link

[Question] How well can the GPT ar­chi­tec­ture solve the par­ity task?

FactorialCode11 Jul 2020 19:02 UTC
18 points
3 comments1 min readLW link

Sun­day July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stu­art_Armstrong

8 Jul 2020 0:27 UTC
19 points
2 comments1 min readLW link

[Link] Word-vec­tor based DL sys­tem achieves hu­man par­ity in ver­bal IQ tests

jacob_cannell13 Jun 2015 23:38 UTC
17 points
8 comments1 min readLW link

The Power of Intelligence

Eliezer Yudkowsky1 Jan 2007 20:00 UTC
42 points
3 comments4 min readLW link

Com­ments on CAIS

Richard_Ngo12 Jan 2019 15:20 UTC
64 points
12 comments7 min readLW link

[Question] What are CAIS’ bold­est near/​medium-term pre­dic­tions?

jacobjacob28 Mar 2019 13:14 UTC
31 points
17 comments1 min readLW link

Drexler on AI Risk

PeterMcCluskey1 Feb 2019 5:11 UTC
34 points
10 comments9 min readLW link
(www.bayesianinvestor.com)

Six AI Risk/​Strat­egy Ideas

Wei_Dai27 Aug 2019 0:40 UTC
62 points
18 comments4 min readLW link2 nominations1 review

New re­port: In­tel­li­gence Ex­plo­sion Microeconomics

Eliezer Yudkowsky29 Apr 2013 23:14 UTC
72 points
251 comments3 min readLW link

Book re­view: Hu­man Compatible

PeterMcCluskey19 Jan 2020 3:32 UTC
37 points
2 comments5 min readLW link
(www.bayesianinvestor.com)

Thoughts on “Hu­man-Com­pat­i­ble”

TurnTrout10 Oct 2019 5:24 UTC
58 points
35 comments5 min readLW link

Book Re­view: The AI Does Not Hate You

PeterMcCluskey28 Oct 2019 17:45 UTC
25 points
0 comments5 min readLW link
(www.bayesianinvestor.com)

[Link] Book Re­view: ‘The AI Does Not Hate You’ by Tom Chivers (Scott Aaron­son)

eigen7 Oct 2019 18:16 UTC
18 points
0 comments1 min readLW link

Book Re­view: Life 3.0: Be­ing Hu­man in the Age of Ar­tifi­cial Intelligence

J_Thomas_Moros18 Jan 2018 17:18 UTC
6 points
0 comments1 min readLW link
(ferocioustruth.com)

Book Re­view: Weapons of Math Destruction

Zvi4 Jun 2017 21:20 UTC
1 point
0 comments16 min readLW link

DARPA Digi­tal Tu­tor: Four Months to To­tal Tech­ni­cal Ex­per­tise?

JohnBuridan6 Jul 2020 23:34 UTC
145 points
15 comments7 min readLW link

Paper: Su­per­in­tel­li­gence as a Cause or Cure for Risks of Astro­nom­i­cal Suffering

Kaj_Sotala3 Jan 2018 14:39 UTC
1 point
6 comments1 min readLW link
(www.informatica.si)

Prevent­ing s-risks via in­dex­i­cal un­cer­tainty, acausal trade and dom­i­na­tion in the multiverse

avturchin27 Sep 2018 10:09 UTC
7 points
6 comments4 min readLW link

Pre­face to CLR’s Re­search Agenda on Co­op­er­a­tion, Con­flict, and TAI

JesseClifton13 Dec 2019 21:02 UTC
54 points
8 comments2 min readLW link

Sec­tions 1 & 2: In­tro­duc­tion, Strat­egy and Governance

JesseClifton17 Dec 2019 21:27 UTC
33 points
5 comments14 min readLW link

Sections 3 & 4: Credibility, Peaceful Bargaining Mechanisms

JesseClifton17 Dec 2019 21:46 UTC
19 points
2 comments12 min readLW link

Sections 5 & 6: Contemporary Architectures, Humans in the Loop

JesseClifton20 Dec 2019 3:52 UTC
27 points
4 comments10 min readLW link

Section 7: Foundations of Rational Agency

JesseClifton22 Dec 2019 2:05 UTC
14 points
3 comments8 min readLW link

What counts as defection?

TurnTrout12 Jul 2020 22:03 UTC
80 points
20 comments5 min readLW link

The “Commitment Races” problem

Daniel Kokotajlo23 Aug 2019 1:58 UTC
91 points
34 comments5 min readLW link1 nomination

Alignment Newsletter #36

rohinmshah12 Dec 2018 1:10 UTC
21 points
0 comments11 min readLW link
(mailchi.mp)

Alignment Newsletter #47

rohinmshah4 Mar 2019 4:30 UTC
18 points
0 comments8 min readLW link
(mailchi.mp)

Understanding “Deep Double Descent”

evhub6 Dec 2019 0:00 UTC
130 points
40 comments5 min readLW link3 nominations4 reviews

[LINK] Strong AI Startup Raises $15M

olalonde21 Aug 2012 20:47 UTC
24 points
13 comments1 min readLW link

Announcing the AI Alignment Prize

cousin_it3 Nov 2017 15:47 UTC
89 points
78 comments1 min readLW link

I’m leaving AI alignment – you better stay

rmoehn12 Mar 2020 5:58 UTC
139 points
19 comments5 min readLW link

New paper: AGI Agent Safety by Iteratively Improving the Utility Function

Koen.Holtman15 Jul 2020 14:05 UTC
21 points
2 comments6 min readLW link

[Question] How should AI debate be judged?

abramdemski15 Jul 2020 22:20 UTC
48 points
27 comments6 min readLW link

Alignment proposals and complexity classes

evhub16 Jul 2020 0:27 UTC
31 points
26 comments13 min readLW link

[AN #107]: The convergent instrumental subgoals of goal-directed agents

rohinmshah16 Jul 2020 6:47 UTC
13 points
1 comment8 min readLW link
(mailchi.mp)

[AN #108]: Why we should scrutinize arguments for AI risk

rohinmshah16 Jul 2020 6:47 UTC
19 points
6 comments12 min readLW link
(mailchi.mp)

Environments as a bottleneck in AGI development

Richard_Ngo17 Jul 2020 5:02 UTC
25 points
19 comments6 min readLW link

[Question] Can an agent use interactive proofs to check the alignment of successors?

PabloAMC17 Jul 2020 19:07 UTC
7 points
2 comments1 min readLW link

Lessons on AI Takeover from the conquistadors

17 Jul 2020 22:35 UTC
56 points
30 comments5 min readLW link

What Would I Do? Self-prediction in Simple Algorithms

Scott Garrabrant20 Jul 2020 4:27 UTC
51 points
13 comments5 min readLW link

Writeup: Progress on AI Safety via Debate

5 Feb 2020 21:04 UTC
88 points
17 comments33 min readLW link

Operationalizing Interpretability

lifelonglearner20 Jul 2020 5:22 UTC
20 points
0 comments4 min readLW link

Learning Values in Practice

Stuart_Armstrong20 Jul 2020 18:38 UTC
23 points
0 comments5 min readLW link

Parallels Between AI Safety by Debate and Evidence Law

Cullen_OKeefe20 Jul 2020 22:52 UTC
10 points
1 comment2 min readLW link
(cullenokeefe.com)

The Rediscovery of Interiority in Machine Learning

DanB21 Jul 2020 5:02 UTC
5 points
4 comments1 min readLW link
(danburfoot.net)

The “AI Dungeons” Dragon Model is heavily path dependent (testing GPT-3 on ethics)

Rafael Harth21 Jul 2020 12:14 UTC
44 points
9 comments6 min readLW link

How good is humanity at coordination?

Buck21 Jul 2020 20:01 UTC
72 points
43 comments3 min readLW link

Alignment As A Bottleneck To Usefulness Of GPT-3

johnswentworth21 Jul 2020 20:02 UTC
97 points
57 comments3 min readLW link

$1000 bounty for OpenAI to show whether GPT3 was “deliberately” pretending to be stupider than it is

jacobjacob21 Jul 2020 18:42 UTC
52 points
40 comments2 min readLW link
(twitter.com)

[Preprint] The Computational Limits of Deep Learning

G Gordon Worley III21 Jul 2020 21:25 UTC
8 points
1 comment1 min readLW link
(arxiv.org)

[AN #109]: Teaching neural nets to generalize the way humans would

rohinmshah22 Jul 2020 17:10 UTC
17 points
3 comments9 min readLW link
(mailchi.mp)

Research agenda for AI safety and a better civilization

agilecaveman22 Jul 2020 6:35 UTC
12 points
2 comments16 min readLW link

Weak HCH accesses EXP

evhub22 Jul 2020 22:36 UTC
14 points
0 comments3 min readLW link

GPT-3 Gems

TurnTrout23 Jul 2020 0:46 UTC
30 points
7 comments41 min readLW link

Optimizing arbitrary expressions with a linear number of queries to a Logical Induction Oracle (Cartoon Guide)

Donald Hobson23 Jul 2020 21:37 UTC
3 points
2 comments2 min readLW link

[Question] Construct a portfolio to profit from AI progress.

deluks91725 Jul 2020 8:18 UTC
29 points
13 comments1 min readLW link

Thinking soberly about the context and consequences of Friendly AI

Mitchell_Porter16 Oct 2012 4:33 UTC
20 points
39 comments1 min readLW link

Goal retention discussion with Eliezer

MaxTegmark4 Sep 2014 22:23 UTC
92 points
26 comments6 min readLW link

[Question] Where do people discuss doing things with GPT-3?

skybrian26 Jul 2020 14:31 UTC
2 points
7 comments1 min readLW link

You Can Probably Amplify GPT3 Directly

Zachary Robertson26 Jul 2020 21:58 UTC
34 points
14 comments6 min readLW link

[updated] how does gpt2’s training corpus capture internet discussion? not well

nostalgebraist27 Jul 2020 22:30 UTC
24 points
3 comments2 min readLW link
(nostalgebraist.tumblr.com)

Agentic Language Model Memes

FactorialCode1 Aug 2020 18:03 UTC
16 points
1 comment2 min readLW link

A community-curated repository of interesting GPT-3 stuff

Rudi C28 Jul 2020 14:16 UTC
8 points
0 comments1 min readLW link
(github.com)

[Question] Does the lottery ticket hypothesis suggest the scaling hypothesis?

Daniel Kokotajlo28 Jul 2020 19:52 UTC
12 points
2 comments1 min readLW link

[Question] To what extent are the scaling properties of Transformer networks exceptional?

abramdemski28 Jul 2020 20:06 UTC
29 points
1 comment1 min readLW link

[Question] What happens to variance as neural network training is scaled? What does it imply about “lottery tickets”?

abramdemski28 Jul 2020 20:22 UTC
25 points
4 comments1 min readLW link

[Question] How will internet forums like LW be able to defend against GPT-style spam?

ChristianKl28 Jul 2020 20:12 UTC
14 points
18 comments1 min readLW link

Predictions for GPT-N

hippke29 Jul 2020 1:16 UTC
34 points
31 comments1 min readLW link

Announcement: AI alignment prize winners and next round

cousin_it15 Jan 2018 14:33 UTC
80 points
68 comments2 min readLW link

Jeff Hawkins on neuromorphic AGI within 20 years

Steven Byrnes15 Jul 2019 19:16 UTC
158 points
24 comments12 min readLW link1 nomination

Cascades, Cycles, Insight...

Eliezer Yudkowsky24 Nov 2008 9:33 UTC
23 points
31 comments8 min readLW link

...Recursion, Magic

Eliezer Yudkowsky25 Nov 2008 9:10 UTC
22 points
28 comments5 min readLW link

References & Resources for LessWrong

XiXiDu10 Oct 2010 14:54 UTC
146 points
106 comments20 min readLW link

[Question] A game designed to beat AI?

Long try17 Mar 2020 3:51 UTC
13 points
29 comments1 min readLW link

Truly Part Of You

Eliezer Yudkowsky21 Nov 2007 2:18 UTC
108 points
58 comments4 min readLW link

[AN #110]: Learning features from human feedback to enable reward learning

rohinmshah29 Jul 2020 17:20 UTC
13 points
2 comments10 min readLW link
(mailchi.mp)

Structured Tasks for Language Models

Zachary Robertson29 Jul 2020 14:17 UTC
5 points
0 comments1 min readLW link

Engaging Seriously with Short Timelines

deluks91729 Jul 2020 19:21 UTC
43 points
23 comments3 min readLW link

What Failure Looks Like: Distilling the Discussion

Ben Pace29 Jul 2020 21:49 UTC
71 points
11 comments7 min readLW link

Learning the prior and generalization

evhub29 Jul 2020 22:49 UTC
34 points
16 comments4 min readLW link

[Question] Is the work on AI alignment relevant to GPT?

Richard_Kennaway30 Jul 2020 12:23 UTC
12 points
5 comments1 min readLW link

Verification and Transparency

DanielFilan8 Aug 2019 1:50 UTC
34 points
6 comments2 min readLW link
(danielfilan.com)

Robin Hanson on Lumpiness of AI Services

DanielFilan17 Feb 2019 23:08 UTC
15 points
2 comments2 min readLW link
(www.overcomingbias.com)

One Way to Think About ML Transparency

Matthew Barnett2 Sep 2019 23:27 UTC
26 points
28 comments5 min readLW link

What is Interpretability?

17 Mar 2020 20:23 UTC
33 points
0 comments11 min readLW link

Relaxed adversarial training for inner alignment

evhub10 Sep 2019 23:03 UTC
54 points
10 comments27 min readLW link

Conclusion to ‘Reframing Impact’

TurnTrout28 Feb 2020 16:05 UTC
38 points
17 comments2 min readLW link

Bayesian Evolving-to-Extinction

abramdemski14 Feb 2020 23:55 UTC
37 points
13 comments5 min readLW link

Do Sufficiently Advanced Agents Use Logic?

abramdemski13 Sep 2019 19:53 UTC
38 points
11 comments9 min readLW link

World State is the Wrong Abstraction for Impact

TurnTrout1 Oct 2019 21:03 UTC
61 points
19 comments2 min readLW link

Attainable Utility Preservation: Concepts

TurnTrout17 Feb 2020 5:20 UTC
38 points
18 comments1 min readLW link

Attainable Utility Preservation: Empirical Results

22 Feb 2020 0:38 UTC
48 points
7 comments9 min readLW link

How Low Should Fruit Hang Before We Pick It?

TurnTrout25 Feb 2020 2:08 UTC
26 points
9 comments12 min readLW link

Attainable Utility Preservation: Scaling to Superhuman

TurnTrout27 Feb 2020 0:52 UTC
26 points
20 comments8 min readLW link

Reasons for Excitement about Impact of Impact Measure Research

TurnTrout27 Feb 2020 21:42 UTC
31 points
8 comments4 min readLW link

Power as Easily Exploitable Opportunities

TurnTrout1 Aug 2020 2:14 UTC
24 points
5 comments6 min readLW link

[Question] Would AGIs parent young AGIs?

Vishrut Arya2 Aug 2020 0:57 UTC
3 points
6 comments1 min readLW link

If I were a well-intentioned AI… I: Image classifier

Stuart_Armstrong26 Feb 2020 12:39 UTC
35 points
4 comments5 min readLW link

Non-Consequentialist Cooperation?

abramdemski11 Jan 2019 9:15 UTC
47 points
15 comments7 min readLW link

Curiosity Killed the Cat and the Asymptotically Optimal Agent

michaelcohen20 Feb 2020 17:28 UTC
27 points
15 comments1 min readLW link

If I were a well-intentioned AI… IV: Mesa-optimising

Stuart_Armstrong2 Mar 2020 12:16 UTC
26 points
2 comments6 min readLW link

Response to Oren Etzioni’s “How to know if artificial intelligence is about to destroy civilization”

Daniel Kokotajlo27 Feb 2020 18:10 UTC
27 points
5 comments8 min readLW link

Clarifying Power-Seeking and Instrumental Convergence

TurnTrout20 Dec 2019 19:59 UTC
41 points
7 comments3 min readLW link

How important are MDPs for AGI (Safety)?

michaelcohen26 Mar 2020 20:32 UTC
14 points
8 comments2 min readLW link

Synthesizing amplification and debate

evhub5 Feb 2020 22:53 UTC
32 points
10 comments4 min readLW link

is gpt-3 few-shot ready for real applications?

nostalgebraist3 Aug 2020 19:50 UTC
31 points
5 comments9 min readLW link
(nostalgebraist.tumblr.com)

Interpretability in ML: A Broad Overview

lifelonglearner4 Aug 2020 19:03 UTC
41 points
5 comments15 min readLW link

Infinite Data/Compute Arguments in Alignment

johnswentworth4 Aug 2020 20:21 UTC
42 points
6 comments2 min readLW link

Four Ways An Impact Measure Could Help Alignment

Matthew Barnett8 Aug 2019 0:10 UTC
21 points
1 comment8 min readLW link

Understanding Recent Impact Measures

Matthew Barnett7 Aug 2019 4:57 UTC
16 points
6 comments7 min readLW link

A Survey of Early Impact Measures

Matthew Barnett6 Aug 2019 1:22 UTC
23 points
0 comments8 min readLW link

Optimization Regularization through Time Penalty

Linda Linsefors1 Jan 2019 13:05 UTC
11 points
4 comments3 min readLW link

Stable Pointers to Value III: Recursive Quantilization

abramdemski21 Jul 2018 8:06 UTC
18 points
4 comments4 min readLW link

Thoughts on Quantilizers

Stuart_Armstrong2 Jun 2017 16:24 UTC
2 points
0 comments2 min readLW link

Quantilizers maximize expected utility subject to a conservative cost constraint

jessicata28 Sep 2015 2:17 UTC
12 points
0 comments5 min readLW link

Quantilal control for finite MDPs

Vanessa Kosoy12 Apr 2018 9:21 UTC
4 points
0 comments13 min readLW link

The limits of corrigibility

Stuart_Armstrong10 Apr 2018 10:49 UTC
25 points
9 comments4 min readLW link

Alignment Newsletter #16: 07/23/18

rohinmshah23 Jul 2018 16:20 UTC
42 points
0 comments12 min readLW link
(mailchi.mp)

Measuring hardware overhang

hippke5 Aug 2020 19:59 UTC
43 points
6 comments4 min readLW link

[AN #111]: The Circuits hypotheses for deep learning

rohinmshah5 Aug 2020 17:40 UTC
22 points
0 comments9 min readLW link
(mailchi.mp)

Self-Fulfilling Prophecies Aren’t Always About Self-Awareness

John_Maxwell18 Nov 2019 23:11 UTC
14 points
7 comments4 min readLW link

The Goodhart Game

John_Maxwell18 Nov 2019 23:22 UTC
13 points
5 comments5 min readLW link

Why don’t singularitarians bet on the creation of AGI by buying stocks?

John_Maxwell11 Mar 2020 16:27 UTC
36 points
19 comments4 min readLW link

The Dualist Predict-O-Matic ($100 prize)

John_Maxwell17 Oct 2019 6:45 UTC
16 points
35 comments5 min readLW link

[Question] What AI safety problems need solving for safe AI research assistants?

John_Maxwell5 Nov 2019 2:09 UTC
14 points
13 comments1 min readLW link

Refining the Evolutionary Analogy to AI

brglnd7 Aug 2020 23:13 UTC
9 points
2 comments4 min readLW link

The Fusion Power Generator Scenario

johnswentworth8 Aug 2020 18:31 UTC
104 points
25 comments3 min readLW link

[Question] How much is known about the “inference rules” of logical induction?

Eigil Rischel8 Aug 2020 10:45 UTC
11 points
7 comments1 min readLW link

If I were a well-intentioned AI… II: Acting in a world

Stuart_Armstrong27 Feb 2020 11:58 UTC
20 points
0 comments3 min readLW link

If I were a well-intentioned AI… III: Extremal Goodhart

Stuart_Armstrong28 Feb 2020 11:24 UTC
21 points
0 comments5 min readLW link

Towards a Formalisation of Logical Counterfactuals

Bunthut8 Aug 2020 22:14 UTC
6 points
2 comments2 min readLW link

[Question] 10/50/90% chance of GPT-N Transformative AI?

human_generated_text9 Aug 2020 0:10 UTC
24 points
8 comments1 min readLW link

[Question] Can we expect more value from AI alignment than from an ASI with the goal of running alternate trajectories of our universe?

Maxime Riché9 Aug 2020 17:17 UTC
2 points
5 comments1 min readLW link

In defense of Oracle (“Tool”) AI research

Steven Byrnes7 Aug 2019 19:14 UTC
20 points
11 comments4 min readLW link

How GPT-N will escape from its AI-box

hippke12 Aug 2020 19:34 UTC
7 points
9 comments1 min readLW link

Strong implication of preference uncertainty

Stuart_Armstrong12 Aug 2020 19:02 UTC
20 points
3 comments2 min readLW link

[AN #112]: Engineering a Safer World

rohinmshah13 Aug 2020 17:20 UTC
25 points
1 comment12 min readLW link
(mailchi.mp)

Room and Board for People Self-Learning ML or Doing Independent ML Research

SamuelKnoche14 Aug 2020 17:19 UTC
7 points
1 comment1 min readLW link

Talk and Q&A—Dan Hendrycks—Paper: Aligning AI With Shared Human Values. On Discord at Aug 28, 2020 8:00-10:00 AM GMT+8.

wassname14 Aug 2020 23:57 UTC
1 point
0 comments1 min readLW link

Search versus design

alexflint16 Aug 2020 16:53 UTC
83 points
39 comments36 min readLW link

Work on Security Instead of Friendliness?

Wei_Dai21 Jul 2012 18:28 UTC
49 points
107 comments2 min readLW link

Goal-Directedness: What Success Looks Like

adamShimi16 Aug 2020 18:33 UTC
9 points
0 comments2 min readLW link

[Question] A way to beat superrational/EDT agents?

Abhimanyu Pallavi Sudhir17 Aug 2020 14:33 UTC
5 points
13 comments1 min readLW link

Learning human preferences: optimistic and pessimistic scenarios

Stuart_Armstrong18 Aug 2020 13:05 UTC
27 points
6 comments6 min readLW link

Mesa-Search vs Mesa-Control

abramdemski18 Aug 2020 18:51 UTC
53 points
45 comments7 min readLW link

Why we want unbiased learning processes

Stuart_Armstrong20 Feb 2018 14:48 UTC
13 points
3 comments3 min readLW link

Intuitive examples of reward function learning?

Stuart_Armstrong6 Mar 2018 16:54 UTC
7 points
3 comments2 min readLW link

Open-Category Classification

TurnTrout28 Mar 2018 14:49 UTC
11 points
6 comments10 min readLW link

Looking for adversarial collaborators to test our Debate protocol

Beth Barnes19 Aug 2020 3:15 UTC
52 points
5 comments1 min readLW link

Walkthrough of ‘Formalizing Convergent Instrumental Goals’

TurnTrout26 Feb 2018 2:20 UTC
10 points
2 comments10 min readLW link

Ambiguity Detection

TurnTrout1 Mar 2018 4:23 UTC
11 points
9 comments4 min readLW link

Penalizing Impact via Attainable Utility Preservation

TurnTrout28 Dec 2018 21:46 UTC
24 points
0 comments3 min readLW link
(arxiv.org)

What You See Isn’t Always What You Want

TurnTrout13 Sep 2019 4:17 UTC
30 points
12 comments3 min readLW link

[Question] Instrumental Occam?

abramdemski31 Jan 2020 19:27 UTC
30 points
15 comments1 min readLW link

Compact vs. Wide Models

Vaniver16 Jul 2018 4:09 UTC
30 points
5 comments3 min readLW link

Alex Irpan: “My AI Timelines Have Sped Up”

Vaniver19 Aug 2020 16:23 UTC
43 points
20 comments1 min readLW link
(www.alexirpan.com)

[AN #113]: Checking the ethical intuitions of large language models

rohinmshah19 Aug 2020 17:10 UTC
23 points
0 comments9 min readLW link
(mailchi.mp)

AI safety as featherless bipeds *with broad flat nails*

Stuart_Armstrong19 Aug 2020 10:22 UTC
35 points
1 comment1 min readLW link

Time Magazine has an article about the Singularity...

Raemon11 Feb 2011 2:20 UTC
40 points
13 comments1 min readLW link

How rapidly are GPUs improving in price performance?

gallabytes25 Nov 2018 19:54 UTC
31 points
9 comments1 min readLW link
(mediangroup.org)

Our values are underdefined, changeable, and manipulable

Stuart_Armstrong2 Nov 2017 11:09 UTC
20 points
6 comments3 min readLW link

[Question] What funding sources exist for technical AI safety research?

johnswentworth1 Oct 2019 15:30 UTC
26 points
5 comments1 min readLW link

Humans can drive cars

Apprentice30 Jan 2014 11:55 UTC
52 points
89 comments2 min readLW link

A Less Wrong singularity article?

Kaj_Sotala17 Nov 2009 14:15 UTC
31 points
215 comments1 min readLW link

The Bayesian Tyrant

abramdemski20 Aug 2020 0:08 UTC
116 points
14 comments6 min readLW link

Concept Safety: Producing similar AI-human concept spaces

Kaj_Sotala14 Apr 2015 20:39 UTC
49 points
45 comments8 min readLW link

[LINK] What should a reasonable person believe about the Singularity?

Kaj_Sotala13 Jan 2011 9:32 UTC
38 points
14 comments2 min readLW link

The many ways AIs behave badly

Stuart_Armstrong24 Apr 2018 11:40 UTC
10 points
3 comments2 min readLW link

July 2020 gwern.net newsletter

gwern20 Aug 2020 16:39 UTC
29 points
0 comments1 min readLW link
(www.gwern.net)

Do what we mean vs. do what we say

rohinmshah30 Aug 2018 22:03 UTC
34 points
14 comments1 min readLW link

[Question] What’s a Decomposable Alignment Topic?

elriggs21 Aug 2020 22:57 UTC
26 points
16 comments1 min readLW link

Tools versus agents

Stuart_Armstrong16 May 2012 13:00 UTC
42 points
39 comments5 min readLW link

An unaligned benchmark

paulfchristiano17 Nov 2018 15:51 UTC
27 points
0 comments9 min readLW link

Following human norms

rohinmshah20 Jan 2019 23:59 UTC
27 points
10 comments5 min readLW link

nostalgebraist: Recursive Goodhart’s Law

Kaj_Sotala26 Aug 2020 11:07 UTC
52 points
27 comments1 min readLW link
(nostalgebraist.tumblr.com)

[AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents

rohinmshah26 Aug 2020 17:20 UTC
21 points
3 comments8 min readLW link
(mailchi.mp)

[Question] How hard would it be to change GPT-3 in a way that allows audio?

ChristianKl28 Aug 2020 14:42 UTC
8 points
5 comments1 min readLW link

Safe Scrambling?

Hoagy29 Aug 2020 14:31 UTC
3 points
1 comment2 min readLW link

(Humor) AI Alignment Critical Failure Table

Kaj_Sotala31 Aug 2020 19:51 UTC
24 points
2 comments1 min readLW link
(sl4.org)

What is ambitious value learning?

rohinmshah1 Nov 2018 16:20 UTC
42 points
28 comments2 min readLW link

The easy goal inference problem is still hard

paulfchristiano3 Nov 2018 14:41 UTC
41 points
17 comments4 min readLW link

[AN #115]: AI safety research problems in the AI-GA framework

rohinmshah2 Sep 2020 17:10 UTC
19 points
16 comments6 min readLW link
(mailchi.mp)

Emotional valence vs RL reward: a video game analogy

Steven Byrnes3 Sep 2020 15:28 UTC
11 points
6 comments4 min readLW link

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

3 Sep 2020 18:27 UTC
60 points
11 comments2 min readLW link

“Learning to Summarize with Human Feedback”—OpenAI

Rekrul7 Sep 2020 17:59 UTC
57 points
2 comments1 min readLW link

[AN #116]: How to make explanations of neurons compositional

rohinmshah9 Sep 2020 17:20 UTC
21 points
2 comments9 min readLW link
(mailchi.mp)

Safer sandboxing via collective separation

Richard_Ngo9 Sep 2020 19:49 UTC
21 points
6 comments4 min readLW link

[Question] Do mesa-optimizer risk arguments rely on the train-test paradigm?

Ben Cottier10 Sep 2020 15:36 UTC
12 points
7 comments1 min readLW link

Safety via selection for obedience

Richard_Ngo10 Sep 2020 10:04 UTC
29 points
1 comment5 min readLW link

How Much Computational Power Does It Take to Match the Human Brain?

habryka12 Sep 2020 6:38 UTC
41 points
1 comment1 min readLW link
(www.openphilanthropy.org)

Decision Theory is multifaceted

Michele Campolo13 Sep 2020 22:30 UTC
6 points
12 comments8 min readLW link

AI Safety Discussion Day

Linda Linsefors15 Sep 2020 14:40 UTC
20 points
0 comments1 min readLW link

[AN #117]: How neural nets would fare under the TEVV framework

rohinmshah16 Sep 2020 17:20 UTC
27 points
0 comments7 min readLW link
(mailchi.mp)

Applying the Counterfactual Prisoner’s Dilemma to Logical Uncertainty

Chris_Leong16 Sep 2020 10:34 UTC
9 points
5 comments2 min readLW link

Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem

Zack_M_Davis17 Sep 2020 2:23 UTC
72 points
12 comments5 min readLW link
(aima.cs.berkeley.edu)

The “Backchaining to Local Search” Technique in AI Alignment

adamShimi18 Sep 2020 15:05 UTC
24 points
1 comment2 min readLW link

Draft report on AI timelines

Ajeya Cotra18 Sep 2020 23:47 UTC
142 points
49 comments1 min readLW link

Why GPT wants to mesa-optimize & how we might change this

John_Maxwell19 Sep 2020 13:48 UTC
53 points
32 comments9 min readLW link

My (Mis)Adventures With Algorithmic Machine Learning

AHartNtkn20 Sep 2020 5:31 UTC
14 points
4 comments41 min readLW link

[Question] What AI companies would be most likely to have a positive long-term impact on the world as a result of investing in them?

MikkW21 Sep 2020 23:41 UTC
7 points
2 comments2 min readLW link

Anthropomorphisation vs value learning: type 1 vs type 2 errors

Stuart_Armstrong22 Sep 2020 10:46 UTC
16 points
10 comments1 min readLW link

AI Advantages [Gems from the Wiki]

22 Sep 2020 22:44 UTC
22 points
7 comments2 min readLW link
(www.lesswrong.com)

A long reply to Ben Garfinkel on Scrutinizing Classic AI Risk Arguments

Søren Elverlin27 Sep 2020 17:51 UTC
16 points
6 comments1 min readLW link

Dehumanisation *errors*

Stuart_Armstrong23 Sep 2020 9:51 UTC
13 points
0 comments1 min readLW link

[AN #118]: Risks, solutions, and prioritization in a world with many AI systems

rohinmshah23 Sep 2020 18:20 UTC
15 points
6 comments10 min readLW link
(mailchi.mp)

[Question] David Deutsch on Universal Explainers and AI

alanf24 Sep 2020 7:50 UTC
1 point
8 comments2 min readLW link

KL Divergence as Code Patching Efficiency

Zachary Robertson27 Sep 2020 16:06 UTC
15 points
0 comments8 min readLW link

[Question] What to do with imitation humans, other than asking them what the right thing to do is?

Charlie Steiner27 Sep 2020 21:51 UTC
10 points
6 comments1 min readLW link

[Question] What Decision Theory is Implied By Predictive Processing?

johnswentworth28 Sep 2020 17:20 UTC
52 points
17 comments1 min readLW link

AGI safety from first principles: Superintelligence

Richard_Ngo28 Sep 2020 19:53 UTC
64 points
2 comments9 min readLW link

AGI safety from first principles: Introduction

Richard_Ngo28 Sep 2020 19:53 UTC
91 points
14 comments2 min readLW link

[Question] Examples of self-governance to reduce technology risk?

Jia29 Sep 2020 19:31 UTC
10 points
4 comments1 min readLW link

AGI safety from first principles: Goals and Agency

Richard_Ngo29 Sep 2020 19:06 UTC
51 points
14 comments15 min readLW link

“Unsupervised” translation as an (intent) alignment problem

paulfchristiano30 Sep 2020 0:50 UTC
60 points
15 comments4 min readLW link
(ai-alignment.com)

[AN #119]: AI safety when agents are shaped by environments, not rewards

rohinmshah30 Sep 2020 17:10 UTC
11 points
0 comments11 min readLW link
(mailchi.mp)

AGI safety from first principles: Alignment

Richard_Ngo1 Oct 2020 3:13 UTC
48 points
2 comments13 min readLW link

AGI safety from first principles: Control

Richard_Ngo2 Oct 2020 21:51 UTC
48 points
3 comments9 min readLW link

AI race considerations in a report by the U.S. House Committee on Armed Services

NunoSempere4 Oct 2020 12:11 UTC
41 points
4 comments13 min readLW link

[Question] Is there any work on incorporating aleatoric uncertainty and/or inherent randomness into AIXI?

capybaralet4 Oct 2020 8:10 UTC
7 points
7 comments1 min readLW link

AGI safety from first principles: Conclusion

Richard_Ngo4 Oct 2020 23:06 UTC
51 points
2 comments3 min readLW link

Universal Eudaimonia

hg005 Oct 2020 13:45 UTC
17 points
6 comments2 min readLW link

The Alignment Problem: Machine Learning and Human Values

rohinmshah6 Oct 2020 17:41 UTC
109 points
5 comments6 min readLW link
(www.amazon.com)

[AN #120]: Tracing the intellectual roots of AI and AI alignment

rohinmshah7 Oct 2020 17:10 UTC
13 points
4 comments10 min readLW link
(mailchi.mp)

[Question] Brainstorming positive visions of AI

jungofthewon7 Oct 2020 16:09 UTC
48 points
25 comments1 min readLW link

[Question] How can an AI demonstrate purely through chat that it is an AI, and not a human?

hugh.mann7 Oct 2020 17:53 UTC
3 points
4 comments1 min readLW link

[Question] Why isn’t JS a popular language for deep learning?

Will Clark8 Oct 2020 14:36 UTC
12 points
21 comments1 min readLW link

[Question] If GPT-6 is human-level AGI but costs $200 per page of output, what would happen?

Daniel Kokotajlo9 Oct 2020 12:00 UTC
28 points
30 comments1 min readLW link

[Question] Shouldn’t there be a Chinese translation of Human Compatible?

MakoYass9 Oct 2020 8:47 UTC
18 points
13 comments1 min readLW link

Idealized Factored Cognition

Rafael Harth30 Nov 2020 18:49 UTC
33 points
6 comments11 min readLW link

[Question] Reviews of the book ‘The Alignment Problem’

Mati_Roy11 Oct 2020 7:41 UTC
8 points
3 comments1 min readLW link

[Question] Reviews of TV show NeXt (about AI safety)

Mati_Roy11 Oct 2020 4:31 UTC
25 points
4 comments1 min readLW link

The Achilles Heel Hypothesis for AI

scasper13 Oct 2020 14:35 UTC
20 points
6 comments1 min readLW link

Toy Problem: Detective Story Alignment

johnswentworth13 Oct 2020 21:02 UTC
34 points
4 comments2 min readLW link

[Question] Does anyone worry about A.I. forums like this where they reinforce each other’s biases/ are led by big tech?

misabella1613 Oct 2020 15:14 UTC
4 points
3 comments1 min readLW link

[AN #121]: Forecasting transformative AI timelines using biological anchors

rohinmshah14 Oct 2020 17:20 UTC
22 points
5 comments14 min readLW link
(mailchi.mp)

Gradient hacking

evhub16 Oct 2019 0:53 UTC
74 points
34 comments3 min readLW link2 nominations2 reviews

Impact measurement and value-neutrality verification

evhub15 Oct 2019 0:06 UTC
31 points
13 comments6 min readLW link

Outer alignment and imitative amplification

evhub10 Jan 2020 0:26 UTC
29 points
11 comments9 min readLW link

Safe exploration and corrigibility

evhub28 Dec 2019 23:12 UTC
17 points
4 comments4 min readLW link

[Question] What are some non-purely-sampling ways to do deep RL?

evhub5 Dec 2019 0:09 UTC
15 points
9 comments2 min readLW link

More variations on pseudo-alignment

evhub4 Nov 2019 23:24 UTC
25 points
8 comments3 min readLW link

Towards an empirical investigation of inner alignment

evhub23 Sep 2019 20:43 UTC
43 points
9 comments6 min readLW link

Are minimal circuits deceptive?

evhub7 Sep 2019 18:11 UTC
51 points
8 comments8 min readLW link

Concrete experiments in inner alignment

evhub6 Sep 2019 22:16 UTC
60 points
12 comments6 min readLW link

Towards a mechanistic understanding of corrigibility

evhub22 Aug 2019 23:20 UTC
39 points
26 comments6 min readLW link

A Concrete Proposal for Adversarial IDA

evhub26 Mar 2019 19:50 UTC
16 points
5 comments5 min readLW link

Nuances with ascription universality

evhub12 Feb 2019 23:38 UTC
20 points
1 comment2 min readLW link

Box inversion hypothesis

Jan Kulveit20 Oct 2020 16:20 UTC
50 points
4 comments3 min readLW link

[Question] Has anyone researched specification gaming with biological animals?

capybaralet21 Oct 2020 0:20 UTC
11 points
3 comments1 min readLW link

Sunday October 25, 12:00PM (PT) — Scott Garrabrant on “Cartesian Frames”

Ben Pace21 Oct 2020 3:27 UTC
48 points
3 comments2 min readLW link

[Question] Could we use recommender systems to figure out human values?

Olga Babeeva20 Oct 2020 21:35 UTC
7 points
0 comments1 min readLW link

[Question] When was the term “AI alignment” coined?

capybaralet21 Oct 2020 18:27 UTC
11 points
8 comments1 min readLW link

[AN #122]: Arguing for AGI-driven existential risk from first principles

rohinmshah21 Oct 2020 17:10 UTC
28 points
0 comments9 min readLW link
(mailchi.mp)

[Question] What’s the difference between GAI and a government?

AllAmericanBreakfast21 Oct 2020 23:04 UTC
11 points
5 comments1 min readLW link

Moral AI: Options

Manfred11 Jul 2015 21:46 UTC
14 points
6 comments4 min readLW link

Can few-shot learning teach AI right from wrong?

Charlie Steiner20 Jul 2018 7:45 UTC
13 points
3 comments6 min readLW link

Some Comments on Stuart Armstrong’s “Research Agenda v0.9”

Charlie Steiner8 Jul 2019 19:03 UTC
20 points
11 comments4 min readLW link

The Artificial Intentional Stance

Charlie Steiner27 Jul 2019 7:00 UTC
12 points
0 comments4 min readLW link

What’s the dream for giving natural language commands to AI?

Charlie Steiner8 Oct 2019 13:42 UTC
8 points
8 comments7 min readLW link

Supervised learning of outputs in the brain

Steven Byrnes26 Oct 2020 14:32 UTC
26 points
8 comments10 min readLW link

Humans are stunningly rational and stunningly irrational

Stuart_Armstrong23 Oct 2020 14:13 UTC
21 points
4 comments2 min readLW link

Reply to Jebari and Lundborg on Artificial Superintelligence

Richard_Ngo25 Oct 2020 13:50 UTC
31 points
4 comments5 min readLW link
(thinkingcomplete.blogspot.com)

Additive Operations on Cartesian Frames

Scott Garrabrant26 Oct 2020 15:12 UTC
60 points
6 comments11 min readLW link

Security Mindset and Takeoff Speeds

DanielFilan27 Oct 2020 3:20 UTC
53 points
23 comments8 min readLW link
(danielfilan.com)

Biextensional Equivalence

Scott Garrabrant28 Oct 2020 14:07 UTC
42 points
13 comments10 min readLW link

Draft papers for REALab and Decoupled Approval on tampering

Jonathan Uesato28 Oct 2020 16:01 UTC
46 points
2 comments1 min readLW link

[AN #123]: Inferring what is valuable in order to align recommender systems

rohinmshah28 Oct 2020 17:00 UTC
20 points
1 comment8 min readLW link
(mailchi.mp)

“Scaling Laws for Autoregressive Generative Modeling”, Henighan et al 2020 {OA}

gwern29 Oct 2020 1:45 UTC
25 points
11 comments1 min readLW link
(arxiv.org)

Controllables and Observables, Revisited

Scott Garrabrant29 Oct 2020 16:38 UTC
33 points
5 comments8 min readLW link

AI risk hub in Singapore?

Daniel Kokotajlo29 Oct 2020 11:45 UTC
50 points
18 comments4 min readLW link

Functors and Coarse Worlds

Scott Garrabrant30 Oct 2020 15:19 UTC
48 points
4 comments8 min readLW link

[Question] Responses to Christiano on takeoff speeds?

Richard_Ngo30 Oct 2020 15:16 UTC
28 points
7 comments1 min readLW link

/r/MLScaling: new subreddit for NN scaling research/discussion

gwern30 Oct 2020 20:50 UTC
19 points
0 comments1 min readLW link
(www.reddit.com)

“Inner Alignment Failures” Which Are Actually Outer Alignment Failures

johnswentworth31 Oct 2020 20:18 UTC
51 points
38 comments5 min readLW link

Automated intelligence is not AI

KatjaGrace1 Nov 2020 23:30 UTC
53 points
10 comments2 min readLW link
(meteuphoric.com)

Confucianism in AI Alignment

johnswentworth2 Nov 2020 21:16 UTC
33 points
28 comments6 min readLW link

[AN #124]: Provably safe exploration through shielding

rohinmshah4 Nov 2020 18:20 UTC
13 points
0 comments9 min readLW link
(mailchi.mp)

Defining capability and alignment in gradient descent

Edouard Harris5 Nov 2020 14:36 UTC
21 points
6 comments10 min readLW link

Sub-Sums and Sub-Tensors

Scott Garrabrant5 Nov 2020 18:06 UTC
33 points
4 comments8 min readLW link

Multiplicative Operations on Cartesian Frames

Scott Garrabrant3 Nov 2020 19:27 UTC
33 points
23 comments12 min readLW link

Subagents of Cartesian Frames

Scott Garrabrant2 Nov 2020 22:02 UTC
47 points
4 comments8 min readLW link

[Question] What considerations influence whether I have more influence over short or long timelines?

Daniel Kokotajlo5 Nov 2020 19:56 UTC
24 points
30 comments1 min readLW link

Additive and Multiplicative Subagents

Scott Garrabrant6 Nov 2020 14:26 UTC
19 points
7 comments12 min readLW link

Committing, Assuming, Externalizing, and Internalizing

Scott Garrabrant9 Nov 2020 16:59 UTC
30 points
25 comments10 min readLW link

Building AGI Using Language Models

leogao9 Nov 2020 16:33 UTC
11 points
1 comment1 min readLW link
(leogao.dev)

Why You Should Care About Goal-Directedness

adamShimi9 Nov 2020 12:48 UTC
31 points
15 comments9 min readLW link

Clarifying inner alignment terminology

evhub9 Nov 2020 20:40 UTC
69 points
15 comments3 min readLW link

Eight Definitions of Observability

Scott Garrabrant10 Nov 2020 23:37 UTC
33 points
26 comments12 min readLW link

[AN #125]: Neural network scaling laws across multiple modalities

rohinmshah11 Nov 2020 18:20 UTC
25 points
7 comments9 min readLW link
(mailchi.mp)

Time in Cartesian Frames

Scott Garrabrant11 Nov 2020 20:25 UTC
46 points
16 comments7 min readLW link

Learning Normativity: A Research Agenda

abramdemski11 Nov 2020 21:59 UTC
70 points
18 comments19 min readLW link

[Question] Any work on honeypots (to detect treacherous turn attempts)?

capybaralet12 Nov 2020 5:41 UTC
16 points
4 comments1 min readLW link

Misalignment and misuse: whose values are manifest?

KatjaGrace13 Nov 2020 10:10 UTC
37 points
7 comments2 min readLW link
(meteuphoric.com)

A Self-Embedded Probabilistic Model

johnswentworth13 Nov 2020 20:36 UTC
30 points
2 comments5 min readLW link

TU Darmstadt, Computer Science Master’s with a focus on Machine Learning

Master Programs ML/AI14 Nov 2020 15:50 UTC
6 points
0 comments8 min readLW link

EPF Lausanne, ML related MSc programs

Master Programs ML/AI14 Nov 2020 15:51 UTC
2 points
0 comments4 min readLW link

ETH Zurich, ML related MSc programs

Master Programs ML/AI14 Nov 2020 15:49 UTC
2 points
0 comments10 min readLW link

University of Oxford, Master’s Statistical Science

Master Programs ML/AI14 Nov 2020 15:51 UTC
2 points
0 comments3 min readLW link

University of Edinburgh, Master’s Artificial Intelligence

Master Programs ML/AI14 Nov 2020 15:49 UTC
3 points
0 comments12 min readLW link

University of Amsterdam (UvA), Master’s Artificial Intelligence

Master Programs ML/AI14 Nov 2020 15:49 UTC
8 points
4 comments21 min readLW link

University of Tübingen, Master’s Machine Learning

Master Programs ML/AI14 Nov 2020 15:50 UTC
9 points
0 comments7 min readLW link

A guide to Iterated Amplification & Debate

Rafael Harth15 Nov 2020 17:14 UTC
58 points
8 comments15 min readLW link

Solomonoff Induction and Sleeping Beauty

ike17 Nov 2020 2:28 UTC
7 points
0 comments2 min readLW link

The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables

johnswentworth18 Nov 2020 17:47 UTC
45 points
35 comments11 min readLW link

The ethics of AI for the Routledge Encyclopedia of Philosophy

Stuart_Armstrong18 Nov 2020 17:55 UTC
45 points
8 comments1 min readLW link

Persuasion Tools: AI takeover without AGI or agency?

Daniel Kokotajlo20 Nov 2020 16:54 UTC
49 points
14 comments11 min readLW link

UDT might not pay a Counterfactual Mugger

winwonce21 Nov 2020 23:27 UTC
5 points
18 comments2 min readLW link

Changing the AI race payoff matrix

Gurkenglas22 Nov 2020 22:25 UTC
7 points
2 comments1 min readLW link

Syntax, semantics, and symbol grounding, simplified

Stuart_Armstrong23 Nov 2020 16:12 UTC
25 points
4 comments9 min readLW link

Commentary on AGI Safety from First Principles

Richard_Ngo23 Nov 2020 21:37 UTC
74 points
3 comments54 min readLW link

[Question] Critiques of the Agent Foundations agenda?

Jsevillamol24 Nov 2020 16:11 UTC
15 points
3 comments1 min readLW link

[Question] How should OpenAI communicate about the commercial performances of the GPT-3 API?

Maxime Riché24 Nov 2020 8:34 UTC
2 points
0 comments1 min readLW link

[AN #126]: Avoiding wireheading by decoupling action feedback from action effects

rohinmshah26 Nov 2020 23:20 UTC
24 points
1 comment10 min readLW link
(mailchi.mp)

[Question] Is this a good way to bet on short timelines?

Daniel Kokotajlo28 Nov 2020 12:51 UTC
16 points
8 comments1 min readLW link

Preface to the Sequence on Factored Cognition

Rafael Harth30 Nov 2020 18:49 UTC
35 points
7 comments2 min readLW link

[Linkpost] AlphaFold: a solution to a 50-year-old grand challenge in biology

adamShimi30 Nov 2020 17:33 UTC
54 points
22 comments1 min readLW link
(deepmind.com)

What is “protein folding”? A brief explanation

jasoncrawford1 Dec 2020 2:46 UTC
63 points
9 comments4 min readLW link
(rootsofprogress.org)

[Question] In a multipolar scenario, how do people expect systems to be trained to interact with systems developed by other labs?

JesseClifton1 Dec 2020 20:04 UTC
11 points
6 comments1 min readLW link

[AN #127]: Rethinking agency: Cartesian frames as a formalization of ways to carve up the world into an agent and its environment

rohinmshah2 Dec 2020 18:20 UTC
46 points
0 comments13 min readLW link
(mailchi.mp)

Beyond 175 billion parameters: Can we anticipate future GPT-X Capabilities?

bakztfuture4 Dec 2020 23:42 UTC
1 point
1 comment2 min readLW link

Thoughts on Robin Hanson’s AI Impacts interview

Steven Byrnes24 Nov 2019 1:40 UTC
25 points
3 comments7 min readLW link

[RXN#7] Russian x-risks newsletter fall 2020

avturchin5 Dec 2020 16:28 UTC
12 points
0 comments3 min readLW link

The AI Safety Game (UPDATED)

Daniel Kokotajlo5 Dec 2020 10:27 UTC
38 points
5 comments3 min readLW link

Values Form a Shifting Landscape (and why you might care)

VojtaKovarik5 Dec 2020 23:56 UTC
24 points
5 comments4 min readLW link

AI Problems Shared by Non-AI Systems

VojtaKovarik5 Dec 2020 22:15 UTC
7 points
2 comments4 min readLW link

Chance that “AI safety basically [doesn’t need] to be solved, we’ll just solve it by default unless we’re completely completely careless”

8 Dec 2020 21:08 UTC
27 points
0 comments5 min readLW link

Minimal Maps, Semi-Decisions, and Neural Representations

Zachary Robertson6 Dec 2020 15:15 UTC
30 points
2 comments4 min readLW link

Launching the Forecasting AI Progress Tournament

Tamay7 Dec 2020 14:08 UTC
18 points
0 comments1 min readLW link
(www.metaculus.com)

[AN #128]: Prioritizing research on AI existential safety based on its application to governance demands

rohinmshah9 Dec 2020 18:20 UTC
16 points
2 comments10 min readLW link
(mailchi.mp)

Summary of AI Research Considerations for Human Existential Safety (ARCHES)

peterbarnett9 Dec 2020 23:28 UTC
3 points
0 comments13 min readLW link

Clarifying Factored Cognition

Rafael Harth13 Dec 2020 20:02 UTC
23 points
2 comments3 min readLW link

Homogeneity vs. heterogeneity in AI takeoff scenarios

evhub16 Dec 2020 1:37 UTC
82 points
48 comments4 min readLW link

LBIT Proofs 8: Propositions 53-58

Diffractor16 Dec 2020 3:29 UTC
7 points
0 comments18 min readLW link

LBIT Proofs 6: Propositions 39-47

Diffractor16 Dec 2020 3:33 UTC
7 points
0 comments23 min readLW link

LBIT Proofs 5: Propositions 29-38

Diffractor16 Dec 2020 3:35 UTC
7 points
0 comments21 min readLW link

LBIT Proofs 3: Propositions 19-22

Diffractor16 Dec 2020 3:40 UTC
7 points
0 comments17 min readLW link

LBIT Proofs 2: Propositions 10-18

Diffractor16 Dec 2020 3:45 UTC
7 points
0 comments20 min readLW link

LBIT Proofs 1: Propositions 1-9

Diffractor16 Dec 2020 3:48 UTC
7 points
0 comments25 min readLW link

LBIT Proofs 4: Propositions 22-28

Diffractor16 Dec 2020 3:38 UTC
7 points
0 comments17 min readLW link

LBIT Proofs 7: Propositions 48-52

Diffractor16 Dec 2020 3:31 UTC
7 points
0 comments20 min readLW link

Less Basic Inframeasure Theory

Diffractor16 Dec 2020 3:52 UTC
22 points
1 comment61 min readLW link

[AN #129]: Explaining double descent by measuring bias and variance

rohinmshah16 Dec 2020 18:10 UTC
14 points
1 comment7 min readLW link
(mailchi.mp)

Machine learning could be fundamentally unexplainable

George16 Dec 2020 13:32 UTC
25 points
15 comments15 min readLW link
(cerebralab.com)

Beta test GPT-3 based research assistant

jungofthewon16 Dec 2020 13:42 UTC
33 points
2 comments1 min readLW link

[Question] How long till Inverse AlphaFold?

Daniel Kokotajlo17 Dec 2020 19:56 UTC
41 points
18 comments1 min readLW link

Hierarchical planning: context agents

Charlie Steiner19 Dec 2020 11:24 UTC
13 points
6 comments9 min readLW link

[Question] Is there a community aligned with the idea of creating species of AGI systems for them to become our successors?

iamhefesto20 Dec 2020 19:06 UTC
−2 points
7 comments1 min readLW link

Intuition

Rafael Harth20 Dec 2020 21:49 UTC
26 points
1 comment6 min readLW link

2020 AI Alignment Literature Review and Charity Comparison

Larks21 Dec 2020 15:27 UTC
133 points
14 comments68 min readLW link

TAI Safety Bibliographic Database

JessRiedel22 Dec 2020 17:42 UTC
60 points
10 comments17 min readLW link

Announcing AXRP, the AI X-risk Research Podcast

DanielFilan23 Dec 2020 20:00 UTC
52 points
6 comments1 min readLW link
(danielfilan.com)

[AN #130]: A new AI x-risk podcast, and reviews of the field

rohinmshah24 Dec 2020 18:20 UTC
8 points
0 comments7 min readLW link
(mailchi.mp)

Can we model technological singularity as the phase transition?

Just Learning26 Dec 2020 3:20 UTC
3 points
0 comments4 min readLW link

AGI Alignment Should Solve Corporate Alignment

magfrump27 Dec 2020 2:23 UTC
19 points
6 comments6 min readLW link

Against GDP as a metric for timelines and takeoff speeds

Daniel Kokotajlo29 Dec 2020 17:42 UTC
112 points
13 comments14 min readLW link

AXRP Episode 3 - Negotiable Reinforcement Learning with Andrew Critch

DanielFilan29 Dec 2020 20:45 UTC
26 points
0 comments27 min readLW link

AXRP Episode 1 - Adversarial Policies with Adam Gleave

DanielFilan29 Dec 2020 20:41 UTC
10 points
5 comments33 min readLW link

AXRP Episode 2 - Learning Human Biases with Rohin Shah

DanielFilan29 Dec 2020 20:43 UTC
11 points
0 comments35 min readLW link

Dario Amodei leaves OpenAI

Daniel Kokotajlo29 Dec 2020 19:31 UTC
64 points
10 comments1 min readLW link

[Question] What Are Some Alternative Approaches to Understanding Agency/Intelligence?

interstice29 Dec 2020 23:21 UTC
15 points
12 comments1 min readLW link

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Joar Skalse29 Dec 2020 13:33 UTC
53 points
53 comments1 min readLW link

Debate Minus Factored Cognition

abramdemski29 Dec 2020 22:59 UTC
37 points
42 comments11 min readLW link

[AN #131]: Formalizing the argument of ignored attributes in a utility function

rohinmshah31 Dec 2020 18:20 UTC
9 points
2 comments19 min readLW link
(mailchi.mp)

Reflections on Larks’ 2020 AI alignment literature review

alexflint1 Jan 2021 22:53 UTC
77 points
8 comments6 min readLW link

Mental subagent implications for AI Safety

moridinamael3 Jan 2021 18:59 UTC
11 points
0 comments3 min readLW link

The National Defense Authorization Act Contains AI Provisions

ryan_b5 Jan 2021 15:51 UTC
24 points
24 comments1 min readLW link

The Pointers Problem: Clarifications/Variations

abramdemski5 Jan 2021 17:29 UTC
46 points
6 comments18 min readLW link

[AN #132]: Complex and subtly incorrect arguments as an obstacle to debate

rohinmshah6 Jan 2021 18:20 UTC
18 points
1 comment19 min readLW link
(mailchi.mp)

Out-of-body reasoning (OOBR)

Jon Zero9 Jan 2021 16:10 UTC
4 points
0 comments4 min readLW link

Review of Soft Takeoff Can Still Lead to DSA

Daniel Kokotajlo10 Jan 2021 18:10 UTC
72 points
13 comments6 min readLW link

Review of ‘Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More’

TurnTrout12 Jan 2021 3:57 UTC
37 points
1 comment2 min readLW link

[AN #133]: Building machines that can cooperate (with humans, institutions, or other machines)

rohinmshah13 Jan 2021 18:10 UTC
14 points
0 comments9 min readLW link
(mailchi.mp)

An Exploratory Toy AI Takeoff Model

niplav13 Jan 2021 18:13 UTC
8 points
3 comments12 min readLW link

Some recent survey papers on (mostly near-term) AI safety, security, and assurance

alenglander13 Jan 2021 21:50 UTC
11 points
0 comments3 min readLW link

A potential problem with reduced impact

Chantiel14 Jan 2021 0:59 UTC
1 point
0 comments2 min readLW link

Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment

alexflint14 Jan 2021 12:58 UTC
34 points
14 comments4 min readLW link

Why I’m excited about Debate

Richard_Ngo15 Jan 2021 23:37 UTC
66 points
12 comments7 min readLW link

Excerpt from Arbital Solomonoff induction dialogue

Richard_Ngo17 Jan 2021 3:49 UTC
36 points
6 comments5 min readLW link
(arbital.com)

Short summary of mAIry’s room

Stuart_Armstrong18 Jan 2021 18:11 UTC
26 points
2 comments4 min readLW link

DALL-E does symbol grounding

p.b.17 Jan 2021 21:20 UTC
5 points
0 comments1 min readLW link

Some thoughts on risks from narrow, non-agentic AI

Richard_Ngo19 Jan 2021 0:04 UTC
31 points
18 comments16 min readLW link

Against the Backward Approach to Goal-Directedness

adamShimi19 Jan 2021 18:46 UTC
19 points
6 comments4 min readLW link

[AN #134]: Underspecification as a cause of fragility to distribution shift

rohinmshah21 Jan 2021 18:10 UTC
13 points
0 comments7 min readLW link
(mailchi.mp)

Counterfactual control incentives

Stuart_Armstrong21 Jan 2021 16:54 UTC
20 points
10 comments9 min readLW link

Policy restrictions and Secret keeping AI

Donald Hobson24 Jan 2021 20:59 UTC
6 points
3 comments3 min readLW link

FC final: Can Factored Cognition schemes scale?

Rafael Harth24 Jan 2021 22:18 UTC
14 points
0 comments17 min readLW link

[AN #135]: Five properties of goal-directed systems

rohinmshah27 Jan 2021 18:10 UTC
33 points
0 comments8 min readLW link
(mailchi.mp)

AMA on EA Forum: Ajeya Cotra, researcher at Open Phil

Ajeya Cotra29 Jan 2021 23:05 UTC
16 points
0 comments1 min readLW link
(forum.effectivealtruism.org)

Play with neural net

KatjaGrace30 Jan 2021 10:50 UTC
15 points
0 comments1 min readLW link
(worldspiritsockpuppet.com)

A Critique of Non-Obstruction

Joe_Collman3 Feb 2021 8:45 UTC
13 points
10 comments4 min readLW link

Distinguishing claims about training vs deployment

Richard_Ngo3 Feb 2021 11:30 UTC
50 points
30 comments9 min readLW link

Graphical World Models, Counterfactuals, and Machine Learning Agents

Koen.Holtman17 Feb 2021 11:07 UTC
6 points
2 comments10 min readLW link

OpenAI: “Scaling Laws for Transfer”, Hernandez et al.

Lanrian4 Feb 2021 12:49 UTC
13 points
3 comments1 min readLW link
(arxiv.org)

Evolutions Building Evolutions: Layers of Generate and Test

plex5 Feb 2021 18:21 UTC
11 points
1 comment6 min readLW link

Epistemology of HCH

adamShimi9 Feb 2021 11:46 UTC
15 points
2 comments10 min readLW link

[Question] Mathematical Models of Progress?

abramdemski16 Feb 2021 0:21 UTC
28 points
8 comments2 min readLW link

[Question] Suggestions of posts on the AF to review

adamShimi16 Feb 2021 12:40 UTC
50 points
17 comments1 min readLW link

Disentangling Corrigibility: 2015-2021

Koen.Holtman16 Feb 2021 18:01 UTC
15 points
20 comments9 min readLW link

Cartesian frames as generalised models

Stuart_Armstrong16 Feb 2021 16:09 UTC
20 points
0 comments5 min readLW link

[AN #138]: Why AI governance should find problems rather than just solving them

rohinmshah17 Feb 2021 18:50 UTC
12 points
0 comments9 min readLW link
(mailchi.mp)

Safely controlling the AGI agent reward function

Koen.Holtman17 Feb 2021 14:47 UTC
7 points
0 comments5 min readLW link

AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger

DanielFilan18 Feb 2021 0:03 UTC
41 points
10 comments86 min readLW link

Utility Maximization = Description Length Minimization

johnswentworth18 Feb 2021 18:04 UTC
142 points
29 comments5 min readLW link

Google’s Ethical AI team and AI Safety

magfrump20 Feb 2021 9:42 UTC
12 points
15 comments7 min readLW link

AI Safety Beginners Meetup (European Time)

Linda Linsefors20 Feb 2021 13:20 UTC
8 points
2 comments1 min readLW link

Minimal Map Constraints

Zachary Robertson21 Feb 2021 17:49 UTC
6 points
0 comments3 min readLW link

[AN #139]: How the simplicity of reality explains the success of neural nets

rohinmshah24 Feb 2021 18:30 UTC
26 points
3 comments12 min readLW link
(mailchi.mp)

My Thoughts on the Apperception Engine

Jemist25 Feb 2021 19:43 UTC
3 points
1 comment3 min readLW link

The Case for Privacy Optimism

bmgarfinkel10 Mar 2020 20:30 UTC
43 points
1 comment32 min readLW link
(benmgarfinkel.wordpress.com)

[Question] How might cryptocurrencies affect AGI timelines?

Telofy28 Feb 2021 19:16 UTC
7 points
38 comments2 min readLW link

Fun with +12 OOMs of Compute

Daniel Kokotajlo1 Mar 2021 13:30 UTC
123 points
62 comments12 min readLW link

Links for Feb 2021

ike1 Mar 2021 5:13 UTC
6 points
0 comments6 min readLW link
(misinfounderload.substack.com)

Introduction to Reinforcement Learning

Dr. Birdbrain28 Feb 2021 23:03 UTC
4 points
1 comment3 min readLW link

Curiosity about Aligning Values

esweet3 Mar 2021 0:22 UTC
3 points
7 comments1 min readLW link

How does bee learning compare with machine learning?

guicosta4 Mar 2021 1:59 UTC
50 points
11 comments24 min readLW link

Some recent interviews with AI/math luminaries.

fowlertm4 Mar 2021 1:26 UTC
0 points
0 comments1 min readLW link

A Semitechnical Introductory Dialogue on Solomonoff Induction

Eliezer Yudkowsky4 Mar 2021 17:27 UTC
92 points
16 comments54 min readLW link

Connecting the good regulator theorem with semantics and symbol grounding

Stuart_Armstrong4 Mar 2021 14:35 UTC
11 points
0 comments2 min readLW link

[AN #140]: Theoretical models that predict scaling laws

rohinmshah4 Mar 2021 18:10 UTC
45 points
0 comments10 min readLW link
(mailchi.mp)

Takeaways from the Intelligence Rising RPG

5 Mar 2021 10:27 UTC
47 points
8 comments12 min readLW link

GPT-3 and the future of knowledge work

fowlertm5 Mar 2021 17:40 UTC
16 points
0 comments2 min readLW link

The case for aligning narrowly superhuman models

Ajeya Cotra5 Mar 2021 22:29 UTC
169 points
72 comments38 min readLW link

MIRI comments on Cotra’s “Case for Aligning Narrowly Superhuman Models”

Rob Bensinger5 Mar 2021 23:43 UTC
124 points
13 comments26 min readLW link

[Question] What are the biggest current impacts of AI?

Sam Clarke7 Mar 2021 21:44 UTC
15 points
4 comments1 min readLW link

CLR’s recent work on multi-agent systems

JesseClifton9 Mar 2021 2:28 UTC
50 points
0 comments13 min readLW link

De-confusing myself about Pascal’s Mugging and Newcomb’s Problem

AllAmericanBreakfast9 Mar 2021 20:45 UTC
7 points
1 comment3 min readLW link

Open Problems with Myopia

10 Mar 2021 18:38 UTC
42 points
13 comments8 min readLW link

[AN #141]: The case for practicing alignment work on GPT-3 and other large models

rohinmshah10 Mar 2021 18:30 UTC
26 points
4 comments8 min readLW link
(mailchi.mp)

[Link] Whittlestone et al., The Societal Implications of Deep Reinforcement Learning

alenglander10 Mar 2021 18:13 UTC
11 points
1 comment1 min readLW link
(jair.org)

Four Motivations for Learning Normativity

abramdemski11 Mar 2021 20:13 UTC
42 points
7 comments5 min readLW link

[Question] What’s a good way to test basic machine learning code?

Kenny11 Mar 2021 21:27 UTC
5 points
9 comments1 min readLW link

[Video] Intelligence and Stupidity: The Orthogonality Thesis

plex13 Mar 2021 0:32 UTC
5 points
1 comment1 min readLW link
(www.youtube.com)

AI x-risk reduction: why I chose academia over industry

capybaralet14 Mar 2021 17:25 UTC
51 points
13 comments3 min readLW link

[Question] Partial-Consciousness as semantic/symbolic representational language model trained on NN

Joe Kwon16 Mar 2021 18:51 UTC
2 points
3 comments1 min readLW link

[AN #142]: The quest to understand a network well enough to reimplement it by hand

rohinmshah17 Mar 2021 17:10 UTC
34 points
4 comments8 min readLW link
(mailchi.mp)

Intermittent Distillations #1

Mark Xu17 Mar 2021 5:15 UTC
25 points
1 comment10 min readLW link

HCH Speculation Post #2A

Charlie Steiner17 Mar 2021 13:26 UTC
39 points
7 comments9 min readLW link

The Age of Imaginative Machines

Yuli_Ban18 Mar 2021 0:35 UTC
10 points
1 comment11 min readLW link

Generalizing Power to multi-agent games

22 Mar 2021 2:41 UTC
43 points
17 comments7 min readLW link

My research methodology

paulfchristiano22 Mar 2021 21:20 UTC
135 points
35 comments16 min readLW link
(ai-alignment.com)

“Infra-Bayesianism with Vanessa Kosoy” – Watch/Discuss Party

Ben Pace22 Mar 2021 23:44 UTC
27 points
42 comments1 min readLW link

Preferences and biases, the information argument

Stuart_Armstrong23 Mar 2021 12:44 UTC
14 points
5 comments1 min readLW link

[AN #143]: How to make embedded agents that reason probabilistically about their environments

rohinmshah24 Mar 2021 17:20 UTC
13 points
3 comments8 min readLW link
(mailchi.mp)

Toy model of preference, bias, and extra information

Stuart_Armstrong24 Mar 2021 10:14 UTC
9 points
0 comments4 min readLW link

On language modeling and future abstract reasoning research

alexlyzhov25 Mar 2021 17:43 UTC
3 points
1 comment1 min readLW link
(docs.google.com)

Inframeasures and Domain Theory

Diffractor28 Mar 2021 9:19 UTC
26 points
3 comments33 min readLW link

Infra-Domain Proofs 2

Diffractor28 Mar 2021 9:15 UTC
13 points
0 comments21 min readLW link

Infra-Domain proofs 1

Diffractor28 Mar 2021 9:16 UTC
13 points
0 comments23 min readLW link

Scenarios and Warning Signs for Ajeya’s Aggressive, Conservative, and Best Guess AI Timelines

Kevin Liu29 Mar 2021 1:38 UTC
24 points
1 comment9 min readLW link
(kliu.io)

[Question] How do we prepare for final crunch time?

Eli Tyre30 Mar 2021 5:47 UTC
102 points
27 comments8 min readLW link

[Question] TAI?

Logan Zoellner30 Mar 2021 12:41 UTC
20 points
8 comments1 min readLW link

A use for Classical AI—Expert Systems

Glpusna31 Mar 2021 2:37 UTC
1 point
2 comments2 min readLW link

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Andrew_Critch31 Mar 2021 23:50 UTC
117 points
47 comments22 min readLW link

AI and the Probability of Conflict

tonyoconnor1 Apr 2021 7:00 UTC
8 points
10 comments8 min readLW link

“AI and Compute” trend isn’t predictive of what is happening

alexlyzhov2 Apr 2021 0:44 UTC
82 points
9 comments1 min readLW link

[AN #144]: How language models can also be finetuned for non-language tasks

rohinmshah2 Apr 2021 17:20 UTC
19 points
0 comments6 min readLW link
(mailchi.mp)

2012 Robin Hanson comment on “Intelligence Explosion: Evidence and Import”

Rob Bensinger2 Apr 2021 16:26 UTC
28 points
4 comments3 min readLW link

My take on Michael Littman on “The HCI of HAI”

alexflint2 Apr 2021 19:51 UTC
56 points
4 comments7 min readLW link

[Question] How do scaling laws work for fine-tuning?

Daniel Kokotajlo4 Apr 2021 12:18 UTC
24 points
10 comments1 min readLW link

Averting suffering with sentience throttlers (proposal)

Quinn5 Apr 2021 10:54 UTC
8 points
5 comments3 min readLW link

Reflective Bayesianism

abramdemski6 Apr 2021 19:48 UTC
48 points
27 comments13 min readLW link

[Question] What will GPT-4 be incapable of?

Michaël Trazzi6 Apr 2021 19:57 UTC
31 points
32 comments1 min readLW link

I Trained a Neural Network to Play Helltaker

lsusr7 Apr 2021 8:24 UTC
27 points
5 comments3 min readLW link

Another (outer) alignment failure story

paulfchristiano7 Apr 2021 20:12 UTC
127 points
20 comments12 min readLW link

[AN #145]: Our three year anniversary!

rohinmshah9 Apr 2021 17:48 UTC
19 points
0 comments8 min readLW link
(mailchi.mp)

Alignment Newsletter Three Year Retrospective

rohinmshah7 Apr 2021 14:39 UTC
54 points
0 comments5 min readLW link

Which counterfactuals should an AI follow?

Stuart_Armstrong7 Apr 2021 16:47 UTC
19 points
5 comments7 min readLW link

Solving the whole AGI control problem, version 0.0001

Steven Byrnes8 Apr 2021 15:14 UTC
41 points
4 comments26 min readLW link

The Japanese Quiz: a Thought Experiment of Statistical Epistemology

DanB8 Apr 2021 17:37 UTC
9 points
0 comments9 min readLW link

A possible preference algorithm

Stuart_Armstrong8 Apr 2021 18:25 UTC
22 points
0 comments4 min readLW link

If you don’t design for extrapolation, you’ll extrapolate poorly—possibly fatally

Stuart_Armstrong8 Apr 2021 18:10 UTC
17 points
0 comments4 min readLW link

AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

DanielFilan8 Apr 2021 21:20 UTC
23 points
3 comments59 min readLW link

My Current Take on Counterfactuals

abramdemski9 Apr 2021 17:51 UTC
49 points
13 comments24 min readLW link

Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

9 Apr 2021 19:19 UTC
109 points
10 comments102 min readLW link

Why unriggable *almost* implies uninfluenceable

Stuart_Armstrong9 Apr 2021 17:07 UTC
11 points
0 comments4 min readLW link

Intermittent Distillations #2

Mark Xu14 Apr 2021 6:47 UTC
23 points
4 comments9 min readLW link

Test Cases for Impact Regularisation Methods

DanielFilan6 Feb 2019 21:50 UTC
58 points
5 comments12 min readLW link
(danielfilan.com)

Superrational Agents Kelly Bet Influence!

abramdemski16 Apr 2021 22:08 UTC
36 points
4 comments5 min readLW link

Defining “optimizer”

Chantiel17 Apr 2021 15:38 UTC
6 points
4 comments1 min readLW link

Alex Flint on “A software engineer’s perspective on logical induction”

Raemon17 Apr 2021 6:56 UTC
21 points
5 comments1 min readLW link

The Human’s Hidden Utility Function (Maybe)

lukeprog23 Jan 2012 19:39 UTC
60 points
88 comments3 min readLW link

Using vector fields to visualise preferences and make them consistent

28 Jan 2020 19:44 UTC
39 points
32 comments11 min readLW link

[Article review] Artificial Intelligence, Values, and Alignment

MichaelA9 Mar 2020 12:42 UTC
13 points
5 comments10 min readLW link

Clarifying some key hypotheses in AI alignment

15 Aug 2019 21:29 UTC
75 points
11 comments9 min readLW link

Failures in technology forecasting? A reply to Ord and Yudkowsky

MichaelA8 May 2020 12:41 UTC
44 points
19 comments11 min readLW link

[Link and commentary] The Offense-Defense Balance of Scientific Knowledge: Does Publishing AI Research Reduce Misuse?

MichaelA16 Feb 2020 19:56 UTC
24 points
4 comments3 min readLW link

Allowing Exploitability in Game Theory

Liam Goddard17 May 2020 23:19 UTC
2 points
4 comments2 min readLW link

How can Interpretability help Alignment?

23 May 2020 16:16 UTC
33 points
3 comments9 min readLW link

A Problem With Patternism

Bob Jacobs19 May 2020 20:16 UTC
5 points
52 comments1 min readLW link

Goal-directedness is behavioral, not structural

adamShimi8 Jun 2020 23:05 UTC
6 points
12 comments3 min readLW link

Learning Deep Learning: Joining data science research as a mathematician

magfrump19 Oct 2017 19:14 UTC
10 points
4 comments3 min readLW link

Will AI undergo discontinuous progress?

SDM21 Feb 2020 22:16 UTC
25 points
20 comments20 min readLW link

The Value Definition Problem

SDM18 Nov 2019 19:56 UTC
14 points
6 comments11 min readLW link

Life at Three Tails of the Bell Curve

lsusr27 Jun 2020 8:49 UTC
56 points
8 comments4 min readLW link

How do takeoff speeds affect the probability of bad outcomes from AGI?

KR29 Jun 2020 22:06 UTC
15 points
2 comments8 min readLW link

AI Benefits Post 2: How AI Benefits Differs from AI Alignment & AI for Good

Cullen_OKeefe29 Jun 2020 17:00 UTC
8 points
7 comments2 min readLW link

Null-boxing Newcomb’s Problem

Yitz13 Jul 2020 16:32 UTC
27 points
10 comments4 min readLW link

No nonsense version of the “racial algorithm bias”

Yuxi_Liu13 Jul 2019 15:39 UTC
104 points
20 comments2 min readLW link2 nominations

Education 2.0 — A brand new education system

aryan15 Jul 2020 10:09 UTC
−8 points
3 comments6 min readLW link

What it means to optimise

Neel Nanda25 Jul 2020 9:40 UTC
3 points
0 comments8 min readLW link
(www.neelnanda.io)

[Question] Where are people thinking and talking about global coordination for AI safety?

Wei_Dai22 May 2019 6:24 UTC
95 points
22 comments1 min readLW link

The strategy-stealing assumption

paulfchristiano16 Sep 2019 15:23 UTC
68 points
46 comments12 min readLW link2 nominations3 reviews

Conversation with Paul Christiano

abergal11 Sep 2019 23:20 UTC
44 points
6 comments30 min readLW link
(aiimpacts.org)

Transcription of Eliezer’s January 2010 video Q&A

curiousepic14 Nov 2011 17:02 UTC
109 points
9 comments56 min readLW link

Resources for AI Alignment Cartography

Gyrodiot4 Apr 2020 14:20 UTC
40 points
8 comments9 min readLW link

Thoughts on Ben Garfinkel’s “How sure are we about this AI stuff?”

capybaralet6 Feb 2019 19:09 UTC
25 points
17 comments1 min readLW link

Announcement: AI alignment prize round 2 winners and next round

cousin_it16 Apr 2018 3:08 UTC
64 points
29 comments2 min readLW link

Announcement: AI alignment prize round 3 winners and next round

cousin_it15 Jul 2018 7:40 UTC
93 points
7 comments1 min readLW link

Security Mindset and the Logistic Success Curve

Eliezer Yudkowsky26 Nov 2017 15:58 UTC
67 points
45 comments20 min readLW link

Arbital scrape

emmab6 Jun 2019 23:11 UTC
89 points
23 comments1 min readLW link

The Strangest Thing An AI Could Tell You

Eliezer Yudkowsky15 Jul 2009 2:27 UTC
101 points
601 comments2 min readLW link

Self-fulfilling correlations

PhilGoetz26 Aug 2010 21:07 UTC
139 points
50 comments3 min readLW link

Zoom In: An Introduction to Circuits

evhub10 Mar 2020 19:36 UTC
81 points
11 comments2 min readLW link
(distill.pub)

Should ethicists be inside or outside a profession?

Eliezer Yudkowsky12 Dec 2018 1:40 UTC
77 points
6 comments9 min readLW link

Implicit extortion

paulfchristiano13 Apr 2018 16:33 UTC
29 points
16 comments6 min readLW link
(ai-alignment.com)

Bayesian Judo

Eliezer Yudkowsky31 Jul 2007 5:53 UTC
74 points
107 comments1 min readLW link

Announcing AlignmentForum.org Beta

Raemon10 Jul 2018 20:19 UTC
67 points
35 comments2 min readLW link

Announcing the Alignment Newsletter

rohinmshah9 Apr 2018 21:16 UTC
29 points
3 comments1 min readLW link

Helen Toner on China, CSET, and AI

Rob Bensinger21 Apr 2019 4:10 UTC
67 points
3 comments7 min readLW link
(rationallyspeakingpodcast.org)

A simple environment for showing mesa misalignment

Matthew Barnett26 Sep 2019 4:44 UTC
63 points
9 comments2 min readLW link

The E-Coli Test for AI Alignment

johnswentworth16 Dec 2018 8:10 UTC
66 points
24 comments1 min readLW link

Recent Progress in