AI

Core Tag

Last edit: Jan 23, 2025, 12:13 PM by Dakara

Artificial Intelligence is the study of creating intelligence in algorithms. AI Alignment is the task of ensuring that [powerful] AI systems are aligned with human values and interests. The central concern is that a powerful enough AI, if not designed and implemented with sufficient understanding, would optimize for something unintended by its creators and pose an existential threat to the future of humanity. This is known as the AI alignment problem.

Common terms in this space are superintelligence, AI Alignment, AI Safety, Friendly AI, Transformative AI, human-level intelligence, AI Governance, and Beneficial AI. This entry and the associated tag roughly encompass all of these topics: anything that is part of the broad cluster of understanding AI and its future impacts on our civilization deserves this tag.

AI Alignment

There are narrow conceptions of alignment, where you’re trying to get it to do something like cure Alzheimer’s disease without destroying the rest of the world. And there’s much more ambitious notions of alignment, where you’re trying to get it to do the right thing and achieve a happy intergalactic civilization.

But both the narrow and the ambitious alignment have in common that you’re trying to have the AI do that thing rather than making a lot of paperclips.

An overview of 11 proposals for building safe advanced AI

evhub | May 29, 2020, 8:38 PM | 194 points | 36 comments | 38 min read | 2 reviews

There’s No Fire Alarm for Artificial General Intelligence

Eliezer Yudkowsky | Oct 13, 2017, 9:38 PM | 124 points | 71 comments | 25 min read

Superintelligence FAQ

Scott Alexander | Sep 20, 2016, 7:00 PM | 92 points | 16 comments | 27 min read

Risks from Learned Optimization: Introduction

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant | May 31, 2019, 11:44 PM | 166 points | 42 comments | 12 min read | 3 reviews

Embedded Agents

abramdemski and Scott Garrabrant | Oct 29, 2018, 7:53 PM | 198 points | 41 comments | 1 min read | 2 reviews

What failure looks like

paulfchristiano | Mar 17, 2019, 8:18 PM | 319 points | 49 comments | 8 min read | 2 reviews

The Rocket Alignment Problem

Eliezer Yudkowsky | Oct 4, 2018, 12:38 AM | 198 points | 42 comments | 15 min read | 2 reviews

Challenges to Christiano’s capability amplification proposal

Eliezer Yudkowsky | May 19, 2018, 6:18 PM | 115 points | 54 comments | 23 min read | 1 review

Embedded Agency (full-text version)

Scott Garrabrant and abramdemski | Nov 15, 2018, 7:49 PM | 143 points | 15 comments | 54 min read

A space of proposals for building safe advanced AI

Richard_Ngo | Jul 10, 2020, 4:58 PM | 55 points | 4 comments | 4 min read

Biology-Inspired AGI Timelines: The Trick That Never Works

Eliezer Yudkowsky | Dec 1, 2021, 10:35 PM | 181 points | 143 comments | 65 min read

PreDCA: vanessa kosoy’s alignment protocol

Tamsin Leake | Aug 20, 2022, 10:03 AM | 46 points | 8 comments | 7 min read | (carado.moe)

larger language models may disappoint you [or, an eternally unfinished draft]

nostalgebraist | Nov 26, 2021, 11:08 PM | 237 points | 29 comments | 31 min read | 1 review

Deepmind’s Gopher—more powerful than GPT-3

hath | Dec 8, 2021, 5:06 PM | 86 points | 27 comments | 1 min read | (deepmind.com)

Project proposal: Testing the IBP definition of agent

Jeremy Gillen, Thomas Larsen and JamesH | Aug 9, 2022, 1:09 AM | 21 points | 4 comments | 2 min read

Goodhart Taxonomy

Scott Garrabrant | Dec 30, 2017, 4:38 PM | 180 points | 33 comments | 10 min read

AI Alignment 2018-19 Review

Rohin Shah | Jan 28, 2020, 2:19 AM | 125 points | 6 comments | 35 min read

Some AI research areas and their relevance to existential safety

Andrew_Critch | Nov 19, 2020, 3:18 AM | 199 points | 40 comments | 50 min read | 2 reviews

Moravec’s Paradox Comes From The Availability Heuristic

james.lucassen | Oct 20, 2021, 6:23 AM | 32 points | 2 comments | 2 min read | (jlucassen.com)

Inference cost limits the impact of ever larger models

SoerenMind | Oct 23, 2021, 10:51 AM | 36 points | 28 comments | 2 min read

[Linkpost] Chinese government’s guidelines on AI

RomanS | Dec 10, 2021, 9:10 PM | 61 points | 14 comments | 1 min read

That Alien Message

Eliezer Yudkowsky | May 22, 2008, 5:55 AM | 304 points | 173 comments | 10 min read

Epistemological Framing for AI Alignment Research

adamShimi | Mar 8, 2021, 10:05 PM | 53 points | 7 comments | 9 min read

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

gwern | Nov 2, 2021, 2:32 AM | 134 points | 52 comments | 1 min read | (arxiv.org)

Discussion with Eliezer Yudkowsky on AGI interventions

Rob Bensinger and Eliezer Yudkowsky | Nov 11, 2021, 3:01 AM | 325 points | 257 comments | 34 min read

Shulman and Yudkowsky on AI progress

Eliezer Yudkowsky and CarlShulman | Dec 3, 2021, 8:05 PM | 90 points | 16 comments | 20 min read

Future ML Systems Will Be Qualitatively Different

jsteinhardt | Jan 11, 2022, 7:50 PM | 113 points | 10 comments | 5 min read | (bounded-regret.ghost.io)

[Linkpost] TrojanNet: Embedding Hidden Trojan Horse Models in Neural Networks

Gunnar_Zarncke | Feb 11, 2022, 1:17 AM | 13 points | 1 comment | 1 min read

Briefly thinking through some analogs of debate

Eli Tyre | Sep 11, 2022, 12:02 PM | 20 points | 3 comments | 4 min read

Robustness to Scale

Scott Garrabrant | Feb 21, 2018, 10:55 PM | 109 points | 22 comments | 2 min read | 1 review

Chris Olah’s views on AGI safety

evhub | Nov 1, 2019, 8:13 PM | 197 points | 38 comments | 12 min read | 2 reviews

[AN #96]: Buck and I discuss/argue about AI Alignment

Rohin Shah | Apr 22, 2020, 5:20 PM | 17 points | 4 comments | 10 min read | (mailchi.mp)

Matt Botvinick on the spontaneous emergence of learning algorithms

Adam Scholl | Aug 12, 2020, 7:47 AM | 147 points | 87 comments | 5 min read

A descriptive, not prescriptive, overview of current AI Alignment Research

Jan, Logan Riggs, jacquesthibs and janus | Jun 6, 2022, 9:59 PM | 126 points | 21 comments | 7 min read

Coherence arguments do not entail goal-directed behavior

Rohin Shah | Dec 3, 2018, 3:26 AM | 101 points | 69 comments | 7 min read | 3 reviews

Alignment By Default

johnswentworth | Aug 12, 2020, 6:54 PM | 153 points | 92 comments | 11 min read | 2 reviews

Book review: “A Thousand Brains” by Jeff Hawkins

Steven Byrnes | Mar 4, 2021, 5:10 AM | 110 points | 18 comments | 19 min read

Modelling Transformative AI Risks (MTAIR) Project: Introduction

Davidmanheim and Aryeh Englander | Aug 16, 2021, 7:12 AM | 89 points | 0 comments | 9 min read

Infra-Bayesian physicalism: a formal theory of naturalized induction

Vanessa Kosoy | Nov 30, 2021, 10:25 PM | 98 points | 20 comments | 42 min read | 1 review

What an actually pessimistic containment strategy looks like

lc | Apr 5, 2022, 12:19 AM | 554 points | 136 comments | 6 min read

Why I think strong general AI is coming soon

porby | Sep 28, 2022, 5:40 AM | 269 points | 126 comments | 34 min read

AlphaGo Zero and the Foom Debate

Eliezer Yudkowsky | Oct 21, 2017, 2:18 AM | 89 points | 17 comments | 3 min read

Tradeoff between desirable properties for baseline choices in impact measures

Vika | Jul 4, 2020, 11:56 AM | 37 points | 24 comments | 5 min read

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

stuhlmueller | Jul 21, 2020, 8:06 PM | 80 points | 40 comments | 3 min read

the scaling “inconsistency”: openAI’s new insight

nostalgebraist | Nov 7, 2020, 7:40 AM | 146 points | 14 comments | 9 min read | (nostalgebraist.tumblr.com)

2019 Review Rewrite: Seeking Power is Often Robustly Instrumental in MDPs

TurnTrout | Dec 23, 2020, 5:16 PM | 35 points | 0 comments | 4 min read | (www.lesswrong.com)

Bootstrapped Alignment

Gordon Seidoh Worley | Feb 27, 2021, 3:46 PM | 19 points | 12 comments | 2 min read

Multimodal Neurons in Artificial Neural Networks

Kaj_Sotala | Mar 5, 2021, 9:01 AM | 57 points | 2 comments | 2 min read | (distill.pub)

Review of “Fun with +12 OOMs of Compute”

adamShimi, Joe Collman and Gyrodiot | Mar 28, 2021, 2:55 PM | 60 points | 20 comments | 8 min read

Draft report on existential risk from power-seeking AI

Joe Carlsmith | Apr 28, 2021, 9:41 PM | 80 points | 23 comments | 1 min read

Rogue AGI Embodies Valuable Intellectual Property

Mark Xu and CarlShulman | Jun 3, 2021, 8:37 PM | 70 points | 9 comments | 3 min read

DeepMind: Generally capable agents emerge from open-ended play

Daniel Kokotajlo | Jul 27, 2021, 2:19 PM | 247 points | 53 comments | 2 min read | (deepmind.com)

Analogies and General Priors on Intelligence

riceissa and Sammy Martin | Aug 20, 2021, 9:03 PM | 57 points | 12 comments | 14 min read

We’re already in AI takeoff

Valentine | Mar 8, 2022, 11:09 PM | 120 points | 115 comments | 7 min read

It Looks Like You’re Trying To Take Over The World

gwern | Mar 9, 2022, 4:35 PM | 386 points | 125 comments | 1 min read | (www.gwern.net)

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy | May 12, 2022, 8:01 PM | 45 points | 0 comments | 59 min read

Why all the fuss about recursive self-improvement?

So8res | Jun 12, 2022, 8:53 PM | 150 points | 62 comments | 7 min read

AI Safety bounty for practical homomorphic encryption

acylhalide | Aug 19, 2022, 12:27 PM | 29 points | 9 comments | 4 min read

Paper: Discovering novel algorithms with AlphaTensor [Deepmind]

LawrenceC | Oct 5, 2022, 4:20 PM | 80 points | 18 comments | 1 min read | (www.deepmind.com)

The Teacup Test

lsusr | Oct 8, 2022, 4:25 AM | 71 points | 28 comments | 2 min read

Discontinuous progress in history: an update

KatjaGrace | Apr 14, 2020, 12:00 AM | 179 points | 25 comments | 31 min read | 1 review | (aiimpacts.org)

Replication Dynamics Bridge to RL in Thermodynamic Limit

Past Account | May 18, 2020, 1:02 AM | 6 points | 1 comment | 2 min read

The ground of optimization

Alex Flint | Jun 20, 2020, 12:38 AM | 218 points | 74 comments | 27 min read | 1 review

Modelling Continuous Progress

Sammy Martin | Jun 23, 2020, 6:06 PM | 29 points | 3 comments | 7 min read

Reframing Superintelligence: Comprehensive AI Services as General Intelligence

Rohin Shah | Jan 8, 2019, 7:12 AM | 118 points | 75 comments | 5 min read | 2 reviews | (www.fhi.ox.ac.uk)

Classification of AI alignment research: deconfusion, “good enough” non-superintelligent AI alignment, superintelligent AI alignment

philip_b | Jul 14, 2020, 10:48 PM | 35 points | 25 comments | 3 min read

Collection of GPT-3 results

Kaj_Sotala | Jul 18, 2020, 8:04 PM | 89 points | 24 comments | 1 min read | (twitter.com)

Hiring engineers and researchers to help align GPT-3

paulfchristiano | Oct 1, 2020, 6:54 PM | 206 points | 14 comments | 3 min read

The date of AI Takeover is not the day the AI takes over

Daniel Kokotajlo | Oct 22, 2020, 10:41 AM | 116 points | 32 comments | 2 min read | 1 review

[Question] What could one do with truly unlimited computational power?

Yitz | Nov 11, 2020, 10:03 AM | 30 points | 22 comments | 2 min read

AGI Predictions

Amandango and Ben Pace | Nov 21, 2020, 3:46 AM | 110 points | 36 comments | 4 min read

[Question] What are the best precedents for industries failing to invest in valuable AI research?

Daniel Kokotajlo | Dec 14, 2020, 11:57 PM | 18 points | 17 comments | 1 min read

Extrapolating GPT-N performance

Lukas Finnveden | Dec 18, 2020, 9:41 PM | 103 points | 31 comments | 25 min read | 1 review

Debate update: Obfuscated arguments problem

Beth Barnes | Dec 23, 2020, 3:24 AM | 125 points | 21 comments | 16 min read

Literature Review on Goal-Directedness

adamShimi, Michele Campolo and Joe Collman | Jan 18, 2021, 11:15 AM | 69 points | 21 comments | 31 min read

[Question] How will OpenAI + GitHub’s Copilot affect programming?

smountjoy | Jun 29, 2021, 4:42 PM | 55 points | 23 comments | 1 min read

Modeling Risks From Learned Optimization

Ben Cottier | Oct 12, 2021, 8:54 PM | 44 points | 0 comments | 12 min read

Truthful AI: Developing and governing AI that does not lie

Owain_Evans, owencb and Lukas Finnveden | Oct 18, 2021, 6:37 PM | 81 points | 9 comments | 10 min read

EfficientZero: How It Works

1a3orn | Nov 26, 2021, 3:17 PM | 273 points | 42 comments | 29 min read

Theoretical Neuroscience For Alignment Theory

Cameron Berg | Dec 7, 2021, 9:50 PM | 62 points | 19 comments | 23 min read

Magna Alta Doctrina

jacob_cannell | Dec 11, 2021, 9:54 PM | 37 points | 7 comments | 28 min read

DL towards the unaligned Recursive Self-Optimization attractor

jacob_cannell | Dec 18, 2021, 2:15 AM | 32 points | 22 comments | 4 min read

Regularization Causes Modularity Causes Generalization

dkirmani | Jan 1, 2022, 11:34 PM | 49 points | 7 comments | 3 min read

Is General Intelligence “Compact”?

DragonGod | Jul 4, 2022, 1:27 PM | 21 points | 6 comments | 22 min read

The Tree of Life: Stanford AI Alignment Theory of Change

Gabe M | Jul 2, 2022, 6:36 PM | 22 points | 0 comments | 14 min read

Shard Theory: An Overview

David Udell | Aug 11, 2022, 5:44 AM | 135 points | 34 comments | 10 min read

How evolution succeeds and fails at value alignment

Ocracoke | Aug 21, 2022, 7:14 AM | 21 points | 2 comments | 4 min read

An Untrollable Mathematician Illustrated

abramdemski | Mar 20, 2018, 12:00 AM | 155 points | 38 comments | 1 min read | 1 review

Conditions for Mesa-Optimization

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant | Jun 1, 2019, 8:52 PM | 75 points | 48 comments | 12 min read

Thoughts on Human Models

Ramana Kumar and Scott Garrabrant | Feb 21, 2019, 9:10 AM | 124 points | 32 comments | 10 min read | 1 review

Inner alignment in the brain

Steven Byrnes | Apr 22, 2020, 1:14 PM | 76 points | 16 comments | 16 min read

Problem relaxation as a tactic

TurnTrout | Apr 22, 2020, 11:44 PM | 113 points | 8 comments | 7 min read

[Question] How should potential AI alignment researchers gauge whether the field is right for them?

TurnTrout | May 6, 2020, 12:24 PM | 20 points | 5 comments | 1 min read

Specification gaming: the flip side of AI ingenuity

Vika, Vlad Mikulik, Matthew Rahtz, tom4everitt, Zac Kenton and janleike | May 6, 2020, 11:51 PM | 46 points | 8 comments | 6 min read

Lessons from Isaac: Pitfalls of Reason

adamShimi | May 8, 2020, 8:44 PM | 9 points | 0 comments | 8 min read

Corrigibility as outside view

TurnTrout | May 8, 2020, 9:56 PM | 36 points | 11 comments | 4 min read

[Question] How to choose a PhD with AI Safety in mind

kwiat.dev | May 15, 2020, 10:19 PM | 9 points | 1 comment | 1 min read

Reward functions and updating assumptions can hide a multitude of sins

Stuart_Armstrong | May 18, 2020, 3:18 PM | 16 points | 2 comments | 9 min read

Possible takeaways from the coronavirus pandemic for slow AI takeoff

Vika | May 31, 2020, 5:51 PM | 135 points | 36 comments | 3 min read | 1 review

Focus: you are allowed to be bad at accomplishing your goals

adamShimi | Jun 3, 2020, 9:04 PM | 19 points | 17 comments | 3 min read

Reply to Paul Christiano on Inaccessible Information

Alex Flint | Jun 5, 2020, 9:10 AM | 77 points | 15 comments | 6 min read

Our take on CHAI’s research agenda in under 1500 words

Alex Flint | Jun 17, 2020, 12:24 PM | 112 points | 19 comments | 5 min read

[Question] Question on GPT-3 Excel Demo

Zhitao Hou | Jun 22, 2020, 8:31 PM | 0 points | 2 comments | 1 min read

Dynamic inconsistency of the inaction and initial state baseline

Stuart_Armstrong | Jul 7, 2020, 12:02 PM | 30 points | 8 comments | 2 min read

Cortés, Pizarro, and Afonso as Precedents for Takeover

Daniel Kokotajlo | Mar 1, 2020, 3:49 AM | 145 points | 75 comments | 11 min read | 1 review

[Question] What problem would you like to see Reinforcement Learning applied to?

Julian Schrittwieser | Jul 8, 2020, 2:40 AM | 43 points | 4 comments | 1 min read

My current framework for thinking about AGI timelines

zhukeepa | Mar 30, 2020, 1:23 AM | 107 points | 5 comments | 3 min read

[Question] To what extent is GPT-3 capable of reasoning?

TurnTrout | Jul 20, 2020, 5:10 PM | 70 points | 74 comments | 16 min read

Replicating the replication crisis with GPT-3?

skybrian | Jul 22, 2020, 9:20 PM | 29 points | 10 comments | 1 min read

Can you get AGI from a Transformer?

Steven Byrnes | Jul 23, 2020, 3:27 PM | 114 points | 39 comments | 12 min read

Writing with GPT-3

Jacob Falkovich | Jul 24, 2020, 3:22 PM | 42 points | 0 comments | 4 min read

Inner Alignment: Explain like I’m 12 Edition

Rafael Harth | Aug 1, 2020, 3:24 PM | 175 points | 46 comments | 13 min read | 2 reviews

Developmental Stages of GPTs

orthonormal | Jul 26, 2020, 10:03 PM | 140 points | 74 comments | 7 min read | 1 review

Generalizing the Power-Seeking Theorems

TurnTrout | Jul 27, 2020, 12:28 AM | 40 points | 6 comments | 4 min read

Are we in an AI overhang?

Andy Jones | Jul 27, 2020, 12:48 PM | 255 points | 109 comments | 4 min read

[Question] What specific dangers arise when asking GPT-N to write an Alignment Forum post?

Matthew Barnett | Jul 28, 2020, 2:56 AM | 44 points | 14 comments | 1 min read

[Question] Probability that other architectures will scale as well as Transformers?

Daniel Kokotajlo | Jul 28, 2020, 7:36 PM | 22 points | 4 comments | 1 min read

What a 20-year-lead in military tech might look like

Daniel Kokotajlo | Jul 29, 2020, 8:10 PM | 68 points | 44 comments | 16 min read

[Question] What if memes are common in highly capable minds?

Daniel Kokotajlo | Jul 30, 2020, 8:45 PM | 36 points | 15 comments | 2 min read

Three mental images from thinking about AGI debate & corrigibility

Steven Byrnes | Aug 3, 2020, 2:29 PM | 55 points | 35 comments | 4 min read

Solving Key Alignment Problems Group

Logan Riggs | Aug 3, 2020, 7:30 PM | 19 points | 7 comments | 2 min read

How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe?

Anirandis | Sep 10, 2020, 12:40 AM | 19 points | 20 comments | 2 min read

My computational framework for the brain

Steven Byrnes | Sep 14, 2020, 2:19 PM | 144 points | 26 comments | 13 min read | 1 review

[Question] Where is human level on text prediction? (GPTs task)

Daniel Kokotajlo | Sep 20, 2020, 9:00 AM | 27 points | 19 comments | 1 min read

Needed: AI infohazard policy

Vanessa Kosoy | Sep 21, 2020, 3:26 PM | 61 points | 17 comments | 2 min read

The Colliding Exponentials of AI

Vermillion | Oct 14, 2020, 11:31 PM | 27 points | 16 comments | 5 min read

“Little glimpses of empathy” as the foundation for social emotions

Steven Byrnes | Oct 22, 2020, 11:02 AM | 31 points | 1 comment | 5 min read

Introduction to Cartesian Frames

Scott Garrabrant | Oct 22, 2020, 1:00 PM | 145 points | 29 comments | 22 min read | 1 review

“Cartesian Frames” Talk #2 this Sunday at 2pm (PT)

Rob Bensinger | Oct 28, 2020, 1:59 PM | 30 points | 0 comments | 1 min read

Does SGD Produce Deceptive Alignment?

Mark Xu | Nov 6, 2020, 11:48 PM | 85 points | 6 comments | 16 min read

[Question] How can I bet on short timelines?

Daniel Kokotajlo | Nov 7, 2020, 12:44 PM | 43 points | 16 comments | 2 min read

Non-Obstruction: A Simple Concept Motivating Corrigibility

TurnTrout | Nov 21, 2020, 7:35 PM | 67 points | 19 comments | 19 min read

Cartesian Frames Definitions

Rob Bensinger | Nov 8, 2020, 12:44 PM | 25 points | 0 comments | 4 min read

Communication Prior as Alignment Strategy

johnswentworth | Nov 12, 2020, 10:06 PM | 40 points | 8 comments | 6 min read

How Roodman’s GWP model translates to TAI timelines

Daniel Kokotajlo | Nov 16, 2020, 2:05 PM | 22 points | 5 comments | 3 min read

Normativity

abramdemski | Nov 18, 2020, 4:52 PM | 46 points | 11 comments | 9 min read

Inner Alignment in Salt-Starved Rats

Steven Byrnes | Nov 19, 2020, 2:40 AM | 136 points | 39 comments | 11 min read | 2 reviews

Continuing the takeoffs debate

Richard_Ngo | Nov 23, 2020, 3:58 PM | 67 points | 13 comments | 9 min read

The next AI winter will be due to energy costs

hippke | Nov 24, 2020, 4:53 PM | 57 points | 7 comments | 2 min read

Recursive Quantilizers II

abramdemski | Dec 2, 2020, 3:26 PM | 30 points | 15 comments | 13 min read

Supervised learning in the brain, part 4: compression / filtering

Steven Byrnes | Dec 5, 2020, 5:06 PM | 12 points | 0 comments | 5 min read

Conservatism in neocortex-like AGIs

Steven Byrnes | Dec 8, 2020, 4:37 PM | 22 points | 5 comments | 8 min read

Avoiding Side Effects in Complex Environments

TurnTrout and nealeratzlaff | Dec 12, 2020, 12:34 AM | 62 points | 9 comments | 2 min read | (avoiding-side-effects.github.io)

The Power of Annealing

meanderingmoose | Dec 14, 2020, 11:02 AM | 25 points | 6 comments | 5 min read

[link] The AI Girlfriend Seducing China’s Lonely Men

Kaj_Sotala | Dec 14, 2020, 8:18 PM | 34 points | 11 comments | 1 min read | (www.sixthtone.com)

Operationalizing compatibility with strategy-stealing

evhub | Dec 24, 2020, 10:36 PM | 41 points | 6 comments | 4 min read

Defusing AGI Danger

Mark Xu | Dec 24, 2020, 10:58 PM | 48 points | 9 comments | 9 min read

Multi-dimensional rewards for AGI interpretability and control

Steven Byrnes | Jan 4, 2021, 3:08 AM | 19 points | 8 comments | 10 min read

DALL-E by OpenAI

Daniel Kokotajlo | Jan 5, 2021, 8:05 PM | 97 points | 22 comments | 1 min read

Review of ‘But exactly how complex and fragile?’

TurnTrout | Jan 6, 2021, 6:39 PM | 55 points | 0 comments | 8 min read

The Case for a Journal of AI Alignment

adamShimi | Jan 9, 2021, 6:13 PM | 45 points | 32 comments | 4 min read

Transparency and AGI safety

jylin04 | Jan 11, 2021, 6:51 PM | 52 points | 12 comments | 30 min read

Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

Daniel Kokotajlo | Jan 18, 2021, 12:08 PM | 184 points | 85 comments | 14 min read | 1 review

Infra-Bayesianism Unwrapped

adamShimi | Jan 20, 2021, 1:35 PM | 41 points | 0 comments | 24 min read

Optimal play in human-judged Debate usually won’t answer your question

Joe Collman | Jan 27, 2021, 7:34 AM | 33 points | 12 comments | 12 min read

Creating AGI Safety Interlocks

Koen.Holtman | Feb 5, 2021, 12:01 PM | 7 points | 4 comments | 8 min read

Timeline of AI safety

riceissa | Feb 7, 2021, 10:29 PM | 63 points | 6 comments | 2 min read | (timelines.issarice.com)

Tournesol, YouTube and AI Risk

adamShimi | Feb 12, 2021, 6:56 PM | 36 points | 13 comments | 4 min read

Internet Encyclopedia of Philosophy on Ethics of Artificial Intelligence

Kaj_Sotala | Feb 20, 2021, 1:54 PM | 15 points | 1 comment | 4 min read | (iep.utm.edu)

Behavioral Sufficient Statistics for Goal-Directedness

adamShimi | Mar 11, 2021, 3:01 PM | 21 points | 12 comments | 9 min read

A simple way to make GPT-3 follow instructions

Quintin Pope | Mar 8, 2021, 2:57 AM | 11 points | 5 comments | 4 min read

Towards a Mechanistic Understanding of Goal-Directedness

Mark Xu | Mar 9, 2021, 8:17 PM | 45 points | 1 comment | 5 min read

AXRP Episode 5 - Infra-Bayesianism with Vanessa Kosoy

DanielFilan | Mar 10, 2021, 4:30 AM | 33 points | 12 comments | 35 min read

Comments on “The Singularity is Nowhere Near”

Steven Byrnes | Mar 16, 2021, 11:59 PM | 50 points | 6 comments | 8 min read

Is RL involved in sensory processing?

Steven Byrnes | Mar 18, 2021, 1:57 PM | 21 points | 21 comments | 5 min read

Against evolution as an analogy for how humans will create AGI

Steven Byrnes | Mar 23, 2021, 12:29 PM | 44 points | 25 comments | 25 min read

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven Byrnes | Mar 25, 2021, 1:45 PM | 66 points | 40 comments | 16 min read

Coherence arguments imply a force for goal-directed behavior

KatjaGrace | Mar 26, 2021, 4:10 PM | 88 points | 27 comments | 14 min read | (aiimpacts.org)

Transparency Trichotomy

Mark Xu | Mar 28, 2021, 8:26 PM | 25 points | 2 comments | 7 min read

Hardware is already ready for the singularity. Algorithm knowledge is the only barrier.

Andrew Vlahos | Mar 30, 2021, 10:48 PM | 16 points | 3 comments | 3 min read

Ben Goertzel’s “Kinds of Minds”

JoshuaFox | Apr 11, 2021, 12:41 PM | 12 points | 4 comments | 1 min read

Updating the Lottery Ticket Hypothesis

johnswentworth | Apr 18, 2021, 9:45 PM | 73 points | 41 comments | 2 min read

Three reasons to expect long AI timelines

Matthew Barnett | Apr 22, 2021, 6:44 PM | 68 points | 29 comments | 11 min read | (matthewbarnett.substack.com)

Beware over-use of the agent model

Alex Flint | Apr 25, 2021, 10:19 PM | 28 points | 10 comments | 5 min read | 1 review

Agents Over Cartesian World Models

Mark Xu and evhub | Apr 27, 2021, 2:06 AM | 62 points | 3 comments | 27 min read

Less Realistic Tales of Doom

Mark Xu | May 6, 2021, 11:01 PM | 110 points | 13 comments | 4 min read

Challenge: know everything that the best go bot knows about go

DanielFilan | May 11, 2021, 5:10 AM | 48 points | 93 comments | 2 min read | (danielfilan.com)

Formal Inner Alignment, Prospectus

abramdemski | May 12, 2021, 7:57 PM | 91 points | 57 comments | 16 min read

Agency in Conway’s Game of Life

Alex Flint | May 13, 2021, 1:07 AM | 97 points | 81 comments | 9 min read | 1 review

Knowledge Neurons in Pretrained Transformers

evhub | May 17, 2021, 10:54 PM | 98 points | 7 comments | 2 min read | (arxiv.org)

Decoupling deliberation from competition

paulfchristiano | May 25, 2021, 6:50 PM | 72 points | 16 comments | 9 min read | (ai-alignment.com)

Power dynamics as a blind spot or blurry spot in our collective world-modeling, especially around AI

Andrew_Critch | Jun 1, 2021, 6:45 PM | 176 points | 26 comments | 6 min read

Game-theoretic Alignment in terms of Attainable Utility

midco and TurnTrout | Jun 8, 2021, 12:36 PM | 20 points | 2 comments | 9 min read

Beijing Academy of Artificial Intelligence announces 1,75 trillion parameters model, Wu Dao 2.0

Ozyrus | Jun 3, 2021, 12:07 PM | 23 points | 9 comments | 1 min read | (www.engadget.com)

An Intuitive Guide to Garrabrant Induction

Mark Xu | Jun 3, 2021, 10:21 PM | 115 points | 18 comments | 24 min read

Conservative Agency with Multiple Stakeholders

TurnTrout | Jun 8, 2021, 12:30 AM | 31 points | 0 comments | 3 min read

Supplement to “Big picture of phasic dopamine”

Steven Byrnes | Jun 8, 2021, 1:08 PM | 13 points | 2 comments | 9 min read

Looking Deeper at Deconfusion

adamShimi | Jun 13, 2021, 9:29 PM | 57 points | 13 comments | 15 min read

[Question] Open problem: how can we quantify player alignment in 2x2 normal-form games?

TurnTrout | Jun 16, 2021, 2:09 AM | 23 points | 59 comments | 1 min read

Reward Is Not Enough

Steven Byrnes | Jun 16, 2021, 1:52 PM | 105 points | 18 comments | 10 min read

Environmental Structure Can Cause Instrumental Convergence

TurnTrout | Jun 22, 2021, 10:26 PM | 71 points | 44 comments | 16 min read | (arxiv.org)

AXRP Episode 9 - Finite Factored Sets with Scott Garrabrant

DanielFilan | Jun 24, 2021, 10:10 PM | 56 points | 2 comments | 58 min read

Musings on general systems alignment

Alex Flint | Jun 30, 2021, 6:16 PM | 31 points | 11 comments | 3 min read

Thoughts on safety in predictive learning

Steven Byrnes | Jun 30, 2021, 7:17 PM | 18 points | 17 comments | 19 min read

The More Power At Stake, The Stronger Instrumental Convergence Gets For Optimal Policies

TurnTrout | Jul 11, 2021, 5:36 PM | 45 points | 7 comments | 6 min read

A world in which the alignment problem seems lower-stakes

TurnTrout | Jul 8, 2021, 2:31 AM | 19 points | 17 comments | 2 min read

Fractional progress estimates for AI timelines and implied resource requirements

Mark Xu and CarlShulman | Jul 15, 2021, 6:43 PM | 55 points | 6 comments | 7 min read

Experimentation with AI-generated images (VQGAN+CLIP) | Solarpunk airships fleeing a dragon

Kaj_Sotala | Jul 15, 2021, 11:00 AM | 44 points | 4 comments | 2 min read | (kajsotala.fi)

Seeking Power is Convergently Instrumental in a Broad Class of Environments

TurnTrout | Aug 8, 2021, 2:02 AM | 41 points | 15 comments | 8 min read

LCDT, A Myopic Decision Theory

adamShimi and evhub | Aug 3, 2021, 10:41 PM | 50 points | 51 comments | 15 min read

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

TurnTrout | Aug 9, 2021, 5:22 PM | 52 points | 4 comments | 5 min read

Two AI-risk-related game design ideas

Daniel Kokotajlo | Aug 5, 2021, 1:36 PM | 47 points | 9 comments | 5 min read

Research agenda update

Steven Byrnes | Aug 6, 2021, 7:24 PM | 54 points | 40 comments | 7 min read

What 2026 looks like

Daniel Kokotajlo | Aug 6, 2021, 4:14 PM | 371 points | 109 comments | 16 min read | 1 review

Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

TurnTrout | Nov 18, 2021, 1:54 AM | 69 points | 8 comments | 17 min read | (www.overleaf.com)

Dopamine-supervised learning in mammals & fruit flies

Steven Byrnes | Aug 10, 2021, 4:13 PM | 16 points | 6 comments | 8 min read

Free course review — Reliable and Interpretable Artificial Intelligence (ETH Zurich)

Jan Czechowski | Aug 10, 2021, 4:36 PM | 7 points | 0 comments | 3 min read

Technical Predictions Related to AI Safety

lsusr | Aug 13, 2021, 12:29 AM | 28 points | 12 comments | 8 min read

Provide feedback on Open Philanthropy’s AI alignment RFP

abergal and Nick_Beckstead | Aug 20, 2021, 7:52 PM | 56 points | 6 comments | 1 min read

AI Safety Papers: An App for the TAI Safety Database

ozziegooen | Aug 21, 2021, 2:02 AM | 74 points | 13 comments | 2 min read

Randal Koene on brain understanding before whole brain emulation

Steven Byrnes | Aug 23, 2021, 8:59 PM | 36 points | 12 comments | 3 min read

MIRI/OP exchange about decision theory

Rob Bensinger | Aug 25, 2021, 10:44 PM | 47 points | 7 comments | 10 min read

Goodhart Ethology

Charlie Steiner | Sep 17, 2021, 5:31 PM | 18 points | 4 comments | 14 min read

[Question] What are good alignment conference papers?

adamShimi | Aug 28, 2021, 1:35 PM | 12 points | 2 comments | 1 min read

Brain-Computer Interfaces and AI Alignment

niplav | Aug 28, 2021, 7:48 PM | 31 points | 6 comments | 11 min read

Superintelligent Introspection: A Counter-argument to the Orthogonality Thesis

DirectedEvolution | Aug 29, 2021, 4:53 AM | 3 points | 18 comments | 4 min read

Alignment Research = Conceptual Alignment Research + Applied Alignment Research

adamShimi | Aug 30, 2021, 9:13 PM | 37 points | 14 comments | 5 min read

AXRP Episode 11 - Attainable Utility and Power with Alex Turner

DanielFilan | Sep 25, 2021, 9:10 PM | 19 points | 5 comments | 52 min read

Is progress in ML-assisted theorem-proving beneficial?

mako yass | Sep 28, 2021, 1:54 AM | 10 points | 3 comments | 1 min read

Takeoff Speeds and Discontinuities

Sammy Martin and Daniel_Eth | Sep 30, 2021, 1:50 PM | 62 points | 1 comment | 15 min read

My take on Vanessa Kosoy’s take on AGI safety

Steven Byrnes | Sep 30, 2021, 12:23 PM | 84 points | 10 comments | 31 min read

[Prediction] We are in an Algorithmic Overhang

lsusr | Sep 29, 2021, 11:40 PM | 31 points | 14 comments | 1 min read

Interview with Skynet

lsusr | Sep 30, 2021, 2:20 AM | 49 points | 1 comment | 2 min read

AI learns betrayal and how to avoid it

Stuart_Armstrong | Sep 30, 2021, 9:39 AM | 30 points | 4 comments | 2 min read

The Dark Side of Cognition Hypothesis

Cameron Berg | Oct 3, 2021, 8:10 PM | 19 points | 1 comment | 16 min read

[Question] How to think about and deal with OpenAI

Rafael Harth | Oct 9, 2021, 1:10 PM | 107 points | 71 comments | 1 min read

NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG

Ozyrus | Oct 11, 2021, 3:28 PM | 51 points | 36 comments | 1 min read | (developer.nvidia.com)

Postmodern Warfare

lsusr | Oct 25, 2021, 9:02 AM | 61 points | 25 comments | 2 min read

A very crude deception eval is already passed

Beth Barnes | Oct 29, 2021, 5:57 PM | 105 points | 8 comments | 2 min read

Study Guide

johnswentworth | Nov 6, 2021, 1:23 AM | 220 points | 41 comments | 16 min read

Re: Attempted Gears Analysis of AGI Intervention Discussion With Eliezer

lsusr | Nov 15, 2021, 10:02 AM | 20 points | 8 comments | 15 min read

Ngo and Yudkowsky on alignment difficulty

Eliezer Yudkowsky and Richard_Ngo | Nov 15, 2021, 8:31 PM | 235 points | 143 comments | 99 min read

Corrigibility Can Be VNM-Incoherent

TurnTrout | Nov 20, 2021, 12:30 AM | 64 points | 24 comments | 7 min read

Visible Thoughts Project and Bounty Announcement

So8res | Nov 30, 2021, 12:19 AM | 245 points | 104 comments | 13 min read

Interpreting Yudkowsky on Deep vs Shallow Knowledge

adamShimi | Dec 5, 2021, 5:32 PM | 100 points | 32 comments | 24 min read

Are there alternative to solving value transfer and extrapolation?

Stuart_Armstrong | Dec 6, 2021, 6:53 PM | 19 points | 7 comments | 5 min read

Considerations on interaction between AI and expected value of the future

Beth Barnes | Dec 7, 2021, 2:46 AM | 64 points | 28 comments | 4 min read

Some thoughts on why adversarial training might be useful

Beth Barnes | Dec 8, 2021, 1:28 AM | 9 points | 5 comments | 3 min read

The Plan

johnswentworth | Dec 10, 2021, 11:41 PM | 235 points | 77 comments | 14 min read

Moore’s Law, AI, and the pace of progress

Veedrac | Dec 11, 2021, 3:02 AM | 120 points | 39 comments | 24 min read

Summary of the Acausal Attack Issue for AIXI

Diffractor | Dec 13, 2021, 8:16 AM | 14 points | 6 comments | 4 min read

Consequentialism & corrigibility

Steven Byrnes | Dec 14, 2021, 1:23 PM | 60 points | 27 comments | 7 min read

Should we rely on the speed prior for safety?

Marc Carauleanu | Dec 14, 2021, 8:45 PM | 14 points | 6 comments | 5 min read

The Case for Radical Optimism about Interpretability

Quintin Pope | Dec 16, 2021, 11:38 PM | 57 points | 16 comments | 8 min read | 1 review

Researcher incentives cause smoother progress on benchmarks

ryan_greenblatt | Dec 21, 2021, 4:13 AM | 20 points | 4 comments | 1 min read

Self-Organised Neural Networks: A simple, natural and efficient way to intelligence

D𝜋 | Jan 1, 2022, 11:24 PM | 41 points | 51 comments | 44 min read

Prizes for ELK proposals

paulfchristiano | Jan 3, 2022, 8:23 PM | 141 points | 156 comments | 7 min read

D𝜋′s Spiking Network

lsusr | Jan 4, 2022, 4:08 AM | 50 points | 37 comments | 4 min read

More Is Different for AI

jsteinhardtJan 4, 2022, 7:30 PM

137 points

22 comments3 min readLW link

(bounded-regret.ghost.io)

Instrumental Convergence For Realistic Agent Objectives

TurnTroutJan 22, 2022, 12:41 AM

35 points

9 comments9 min readLW link

What’s Up With Confusingly Pervasive Consequentialism?

RaemonJan 20, 2022, 7:22 PM

169 points

88 comments4 min readLW link

[Intro to brain-like-AGI safety] 1. What’s the problem & Why work on it now?

Steven ByrnesJan 26, 2022, 3:23 PM

119 points

19 comments23 min readLW link

Arguments about Highly Reliable Agent Designs as a Useful Path to Artificial Intelligence Safety

riceissa and Davidmanheim

Jan 27, 2022, 1:13 PM

27 points

0 comments1 min readLW link

(arxiv.org)

Competitive programming with AlphaCode

AlgonFeb 2, 2022, 4:49 PM

58 points

37 comments15 min readLW link

(deepmind.com)

Thoughts on AGI safety from the top

jylin04Feb 2, 2022, 8:06 PM

35 points

3 comments32 min readLW link

Paradigm-building from first principles: Effective altruism, AGI, and alignment

Cameron BergFeb 8, 2022, 4:12 PM

24 points

5 comments14 min readLW link

[Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering

Steven ByrnesFeb 9, 2022, 1:09 PM

59 points

3 comments24 min readLW link

[Intro to brain-like-AGI safety] 4. The “short-term predictor”

Steven ByrnesFeb 16, 2022, 1:12 PM

51 points

11 comments13 min readLW link

ELK Proposal: Thinking Via A Human Imitator

TurnTroutFeb 22, 2022, 1:52 AM

28 points

6 comments11 min readLW link

Why I’m co-founding Aligned AI

Stuart_ArmstrongFeb 17, 2022, 7:55 PM

93 points

54 comments3 min readLW link

Implications of automated ontology identification

Alex Flint, adamShimi and Robert Miles

Feb 18, 2022, 3:30 AM

67 points

29 comments23 min readLW link

Alignment research exercises

Richard_NgoFeb 21, 2022, 8:24 PM

146 points

17 comments8 min readLW link

[Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning

Steven ByrnesFeb 23, 2022, 2:44 PM

41 points

25 comments21 min readLW link

How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA?

Owain_EvansFeb 26, 2022, 12:46 PM

42 points

3 comments11 min readLW link

Estimating Brain-Equivalent Compute from Image Recognition Algorithms

Gunnar_ZarnckeFeb 27, 2022, 2:45 AM

14 points

4 comments2 min readLW link

[Link] Aligned AI AMA

Stuart_ArmstrongMar 1, 2022, 12:01 PM

18 points

0 comments1 min readLW link

[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

Steven ByrnesMar 2, 2022, 3:26 PM

41 points

13 comments16 min readLW link

[Question] Would (myopic) general public good producers significantly accelerate the development of AGI?

mako yassMar 2, 2022, 11:47 PM

25 points

10 comments1 min readLW link

[Intro to brain-like-AGI safety] 7. From hardcoded drives to foresighted plans: A worked example

Steven ByrnesMar 9, 2022, 2:28 PM

56 points

0 comments9 min readLW link

[Intro to brain-like-AGI safety] 9. Takeaways from neuro 2/2: On AGI motivation

Steven ByrnesMar 23, 2022, 12:48 PM

31 points

6 comments23 min readLW link

Humans pretending to be robots pretending to be human

Richard_KennawayMar 28, 2022, 3:13 PM

27 points

15 comments1 min readLW link

[Intro to brain-like-AGI safety] 10. The alignment problem

Steven ByrnesMar 30, 2022, 1:24 PM

34 points

4 comments21 min readLW link

AXRP Episode 13 - First Principles of AGI Safety with Richard Ngo

DanielFilanMar 31, 2022, 5:20 AM

24 points

1 comment48 min readLW link

Uncontrollable Super-Powerful Explosives

Sammy MartinApr 2, 2022, 8:13 PM

53 points

12 comments5 min readLW link

The case for Doing Something Else (if Alignment is doomed)

Rafael HarthApr 5, 2022, 5:52 PM

81 points

14 comments2 min readLW link

[Intro to brain-like-AGI safety] 11. Safety ≠ alignment (but they’re close!)

Steven ByrnesApr 6, 2022, 1:39 PM

25 points

1 comment10 min readLW link

Strategic Considerations Regarding Autistic/Literal AI

Chris_LeongApr 6, 2022, 2:57 PM

−1 points

2 comments2 min readLW link

DALL·E 2 by OpenAI

P.Apr 6, 2022, 2:17 PM

44 points

51 comments1 min readLW link

(openai.com)

How to train your transformer

p.b.Apr 7, 2022, 9:34 AM

6 points

0 comments8 min readLW link

AMA Conjecture, A New Alignment Startup

adamShimiApr 9, 2022, 9:43 AM

46 points

42 comments1 min readLW link

Worse than an unaligned AGI

ShmiApr 10, 2022, 3:35 AM

−1 points

12 comments1 min readLW link

[Question] Did OpenAI let GPT out of the box?

ChristianKlApr 16, 2022, 2:56 PM

4 points

12 comments1 min readLW link

Instrumental Convergence To Offer Hope?

michael_mjdApr 22, 2022, 1:56 AM

12 points

7 comments3 min readLW link

[Intro to brain-like-AGI safety] 13. Symbol grounding & human social instincts

Steven ByrnesApr 27, 2022, 1:30 PM

54 points

13 comments14 min readLW link

[Intro to brain-like-AGI safety] 14. Controlled AGI

Steven ByrnesMay 11, 2022, 1:17 PM

26 points

25 comments18 min readLW link

[Question] What’s keeping concerned capabilities gain researchers from leaving the field?

sovranMay 12, 2022, 12:16 PM

19 points

4 comments1 min readLW link

Reading the ethicists: A review of articles on AI in the journal Science and Engineering Ethics

Charlie SteinerMay 18, 2022, 8:52 PM

50 points

8 comments14 min readLW link

Confused why a “capabilities research is good for alignment progress” position isn’t discussed more

Kaj_SotalaJun 2, 2022, 9:41 PM

132 points

26 comments4 min readLW link

I’m trying out “asteroid mindset”

Alex_AltairJun 3, 2022, 1:35 PM

85 points

5 comments4 min readLW link

Announcing the Alignment of Complex Systems Research Group

Jan_Kulveit and technicalities

Jun 4, 2022, 4:10 AM

79 points

18 comments5 min readLW link

AGI Ruin: A List of Lethalities

Eliezer YudkowskyJun 5, 2022, 10:05 PM

725 points

653 comments30 min readLW link

Yes, AI research will be substantially curtailed if a lab causes a major disaster

lcJun 14, 2022, 10:17 PM

96 points

35 comments2 min readLW link

Lamda is not an LLM

KevinJun 19, 2022, 11:13 AM

7 points

10 comments1 min readLW link

(www.wired.com)

Google’s new text-to-image model—Parti, a demonstration of scaling benefits

KaydenJun 22, 2022, 8:00 PM

32 points

4 comments1 min readLW link

[Link] OpenAI: Learning to Play Minecraft with Video PreTraining (VPT)

Aryeh EnglanderJun 23, 2022, 4:29 PM

53 points

3 comments1 min readLW link

Announcing Epoch: A research organization investigating the road to Transformative AI

Jsevillamol, Pablo Villalobos, Tamay, lennart, Marius Hobbhahn and anson.ho

Jun 27, 2022, 1:55 PM

95 points

2 comments2 min readLW link

(epochai.org)

Paper: Forecasting world events with neural nets

Owain_Evans, Dan H and Joe Kwon

Jul 1, 2022, 7:40 PM

39 points

3 comments4 min readLW link

Naive Hypotheses on AI Alignment

Shoshannah TekofskyJul 2, 2022, 7:03 PM

89 points

29 comments5 min readLW link

Humans provide an untapped wealth of evidence about alignment

TurnTrout and Quintin Pope

Jul 14, 2022, 2:31 AM

175 points

92 comments10 min readLW link

Examples of AI Increasing AI Progress

TW123Jul 17, 2022, 8:06 PM

104 points

14 comments1 min readLW link

Forecasting ML Benchmarks in 2023

jsteinhardtJul 18, 2022, 2:50 AM

36 points

19 comments12 min readLW link

(bounded-regret.ghost.io)

Robustness to Scaling Down: More Important Than I Thought

adamShimiJul 23, 2022, 11:40 AM

37 points

5 comments3 min readLW link

Comparing Four Approaches to Inner Alignment

Lucas TeixeiraJul 29, 2022, 9:06 PM

33 points

1 comment9 min readLW link

Where are the red lines for AI?

Karl von WendtAug 5, 2022, 9:34 AM

23 points

8 comments6 min readLW link

Jack Clark on the realities of AI policy

Kaj_SotalaAug 7, 2022, 8:44 AM

66 points

3 comments3 min readLW link

(threadreaderapp.com)

GD’s Implicit Bias on Separable Data

Xander DaviesOct 17, 2022, 4:13 AM

23 points

0 comments7 min readLW link

AI Transparency: Why it’s critical and how to obtain it.

Zohar JacksonAug 14, 2022, 10:31 AM

6 points

1 comment5 min readLW link

Brain-like AGI project “aintelope”

Gunnar_ZarnckeAug 14, 2022, 4:33 PM

48 points

2 comments1 min readLW link

A Mechanistic Interpretability Analysis of Grokking

Neel Nanda and Tom Lieberum

Aug 15, 2022, 2:41 AM

338 points

39 comments42 min readLW link

(colab.research.google.com)

What if we approach AI safety like a technical engineering safety problem

zeshenAug 20, 2022, 10:29 AM

30 points

5 comments7 min readLW link

AI art isn’t “about to shake things up”. It’s already here.

Davis_KingsleyAug 22, 2022, 11:17 AM

65 points

19 comments3 min readLW link

Some conceptual alignment research projects

Richard_NgoAug 25, 2022, 10:51 PM

168 points

14 comments3 min readLW link

Levelling Up in AI Safety Research Engineering

Gabe MSep 2, 2022, 4:59 AM

40 points

7 comments17 min readLW link

The shard theory of human values

Quintin Pope and TurnTrout

Sep 4, 2022, 4:28 AM

202 points

57 comments24 min readLW link

Quintin’s alignment papers roundup—week 1

Quintin PopeSep 10, 2022, 6:39 AM

119 points

5 comments9 min readLW link

LOVE in a simbox is all you need

jacob_cannellSep 28, 2022, 6:25 PM

59 points

69 comments44 min readLW link

A shot at the diamond-alignment problem

TurnTroutOct 6, 2022, 6:29 PM

77 points

53 comments15 min readLW link

More examples of goal misgeneralization

Rohin Shah and Vikrant Varma

Oct 7, 2022, 2:38 PM

51 points

8 comments2 min readLW link

(deepmindsafetyresearch.medium.com)

[Crosspost] AlphaTensor, Taste, and the Scalability of AI

jamierumbelowOct 9, 2022, 7:42 PM

16 points

4 comments1 min readLW link

(jamieonsoftware.com)

QAPR 4: Inductive biases

Quintin PopeOct 10, 2022, 10:08 PM

63 points

2 comments18 min readLW link

Infinite Possibility Space and the Shutdown Problem

magfrumpOct 18, 2022, 5:37 AM

6 points

0 comments2 min readLW link

(www.magfrump.net)

Cruxes in Katja Grace’s Counterarguments

azsantoskOct 16, 2022, 8:44 AM

16 points

0 comments7 min readLW link

DeepMind on Stratego, an imperfect information game

sanxiynOct 24, 2022, 5:57 AM

15 points

9 comments1 min readLW link

(arxiv.org)

Announcing: What Future World? - Growing the AI Governance Community

DavidCorfieldNov 2, 2022, 1:24 AM

1 point

0 comments1 min readLW link

Poster Session on AI Safety

Neil CrawfordNov 12, 2022, 3:50 AM

7 points

6 comments1 min readLW link

AI will change the world, but won’t take it over by playing “3-dimensional chess”.

boazbarak and benedelman

Nov 22, 2022, 6:57 PM

103 points

86 comments24 min readLW link

A challenge for AGI organizations, and a challenge for readers

Rob Bensinger and Eliezer Yudkowsky

Dec 1, 2022, 11:11 PM

265 points

30 comments2 min readLW link

Towards Hodge-podge Alignment

Cleo NardoDec 19, 2022, 8:12 PM

65 points

26 comments9 min readLW link

[AN #94]: AI alignment as translation between humans and machines

Rohin ShahApr 8, 2020, 5:10 PM

11 points

0 comments7 min readLW link

(mailchi.mp)

[Question] What are the relative speeds of AI capabilities and AI safety?

NunoSempereApr 24, 2020, 6:21 PM

8 points

2 comments1 min readLW link

Seeking Power is Often Convergently Instrumental in MDPs

TurnTrout and Logan Riggs

Dec 5, 2019, 2:33 AM

153 points

38 comments16 min readLW link 2 reviews

(arxiv.org)

“Don’t even think about hell”

emmabMay 2, 2020, 8:06 AM

6 points

2 comments1 min readLW link

[Question] AI Boxing for Hardware-bound agents (aka the China alignment problem)

Logan ZoellnerMay 8, 2020, 3:50 PM

11 points

27 comments10 min readLW link

Pointing to a Flower

johnswentworthMay 18, 2020, 6:54 PM

59 points

18 comments9 min readLW link

Learning and manipulating learning

Stuart_ArmstrongMay 19, 2020, 1:02 PM

39 points

5 comments10 min readLW link

[Question] Why aren’t we testing general intelligence distribution?

B JacobsMay 26, 2020, 4:07 PM

25 points

7 comments1 min readLW link

OpenAI announces GPT-3

gwernMay 29, 2020, 1:49 AM

67 points

23 comments1 min readLW link

(arxiv.org)

GPT-3: a disappointing paper

nostalgebraistMay 29, 2020, 7:06 PM

65 points

44 comments8 min readLW link 1 review

Introduction to Existential Risks from Artificial Intelligence, for an EA audience

JoshuaFoxJun 2, 2020, 8:30 AM

10 points

1 comment1 min readLW link

Preparing for “The Talk” with AI projects

Daniel KokotajloJun 13, 2020, 11:01 PM

64 points

16 comments3 min readLW link

[Question] What are the high-level approaches to AI alignment?

Gordon Seidoh WorleyJun 16, 2020, 5:10 PM

12 points

13 comments1 min readLW link

Results of $1,000 Oracle contest!

Stuart_ArmstrongJun 17, 2020, 5:44 PM

58 points

2 comments1 min readLW link

[Question] Likelihood of hyperexistential catastrophe from a bug?

AnirandisJun 18, 2020, 4:23 PM

13 points

27 comments1 min readLW link

AI Benefits Post 1: Introducing “AI Benefits”

CullenJun 22, 2020, 4:59 PM

11 points

3 comments3 min readLW link

Goals and short descriptions

Michele CampoloJul 2, 2020, 5:41 PM

14 points

8 comments5 min readLW link

Research ideas to study humans with AI Safety in mind

Riccardo VolpatoJul 3, 2020, 4:01 PM

23 points

2 comments5 min readLW link

AI Benefits Post 3: Direct and Indirect Approaches to AI Benefits

CullenJul 6, 2020, 6:48 PM

8 points

0 comments2 min readLW link

Antitrust-Compliant AI Industry Self-Regulation

CullenJul 7, 2020, 8:53 PM

9 points

3 comments1 min readLW link

(cullenokeefe.com)

Should AI Be Open?

Scott AlexanderDec 17, 2015, 8:25 AM

20 points

3 comments13 min readLW link

Meta Programming GPT: A route to Superintelligence?

dmteaJul 11, 2020, 2:51 PM

10 points

7 comments4 min readLW link

The Dilemma of Worse Than Death Scenarios

arkaeikJul 10, 2018, 9:18 AM

5 points

18 comments4 min readLW link

[Question] What are the mostly likely ways AGI will emerge?

Craig QuiterJul 14, 2020, 12:58 AM

3 points

7 comments1 min readLW link

AI Benefits Post 4: Outstanding Questions on Selecting Benefits

CullenJul 14, 2020, 5:26 PM

4 points

4 comments5 min readLW link

Solving Math Problems by Relay

Ben Goldhaber and Owain_Evans

Jul 17, 2020, 3:32 PM

98 points

26 comments7 min readLW link

AI Benefits Post 5: Outstanding Questions on Governing Benefits

CullenJul 21, 2020, 4:46 PM

4 points

0 comments4 min readLW link

[Question] Why is pseudo-alignment “worse” than other ways ML can fail to generalize?

nostalgebraistJul 18, 2020, 10:54 PM

45 points

10 comments2 min readLW link

[Question] “Do Nothing” utility function, 3½ years later?

niplavJul 20, 2020, 11:09 AM

5 points

3 comments1 min readLW link

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Rohin ShahJan 2, 2020, 6:20 PM

34 points

94 comments10 min readLW link

(mailchi.mp)

Access to AI: a human right?

dmteaJul 25, 2020, 9:38 AM

5 points

3 comments2 min readLW link

The Rise of Commonsense Reasoning

DragonGodJul 27, 2020, 7:01 PM

8 points

0 comments1 min readLW link

(www.reddit.com)

AI and Efficiency

DragonGodJul 27, 2020, 8:58 PM

9 points

1 comment1 min readLW link

(openai.com)

FHI Report: How Will National Security Considerations Affect Antitrust Decisions in AI? An Examination of Historical Precedents

CullenJul 28, 2020, 6:34 PM

2 points

0 comments1 min readLW link

(www.fhi.ox.ac.uk)

The “best predictor is malicious optimiser” problem

Donald HobsonJul 29, 2020, 11:49 AM

14 points

10 comments2 min readLW link

Sufficiently Advanced Language Models Can Do Reinforcement Learning

Past AccountAug 2, 2020, 3:32 PM

21 points

7 comments7 min readLW link

[Question] What are the most important papers/post/resources to read to understand more of GPT-3?

adamShimiAug 2, 2020, 8:53 PM

22 points

4 comments1 min readLW link

[Question] What should an Einstein-like figure in Machine Learning do?

RaziedAug 5, 2020, 11:52 PM

3 points

3 comments1 min readLW link

Book review: Architects of Intelligence by Martin Ford (2018)

OferAug 11, 2020, 5:30 PM

15 points

0 comments2 min readLW link

[Question] Will OpenAI’s work unintentionally increase existential risks related to AI?

adamShimiAug 11, 2020, 6:16 PM

50 points

56 comments1 min readLW link

Blog post: A tale of two research communities

Aryeh EnglanderAug 12, 2020, 8:41 PM

14 points

0 comments4 min readLW link

Mapping Out Alignment

Logan Riggs, adamShimi, Gurkenglas, AlexMennen and Gyrodiot

Aug 15, 2020, 1:02 AM

42 points

0 comments5 min readLW link

My Understanding of Paul Christiano’s Iterated Amplification AI Safety Research Agenda

Chi NguyenAug 15, 2020, 8:02 PM

119 points

21 comments39 min readLW link

GPT-3, belief, and consistency

skybrianAug 16, 2020, 11:12 PM

18 points

7 comments2 min readLW link

[Question] What precisely do we mean by AI alignment?

Gordon Seidoh WorleyDec 9, 2018, 2:23 AM

27 points

8 comments1 min readLW link

Thoughts on the Feasibility of Prosaic AGI Alignment?

iamthouthouartiAug 21, 2020, 11:25 PM

8 points

10 comments1 min readLW link

[Question] Forecasting Thread: AI Timelines

Amandango, Daniel Kokotajlo and Ben Pace

Aug 22, 2020, 2:33 AM

133 points

95 comments2 min readLW link

Learning human preferences: black-box, white-box, and structured white-box access

Stuart_ArmstrongAug 24, 2020, 11:42 AM

25 points

9 comments6 min readLW link

Proofs Section 2.3 (Updates, Decision Theory)

DiffractorAug 27, 2020, 7:49 AM

7 points

0 comments31 min readLW link

Proofs Section 2.2 (Isomorphism to Expectations)

DiffractorAug 27, 2020, 7:52 AM

7 points

0 comments46 min readLW link

Proofs Section 2.1 (Theorem 1, Lemmas)

DiffractorAug 27, 2020, 7:54 AM

7 points

0 comments36 min readLW link

Proofs Section 1.1 (Initial results to LF-duality)

DiffractorAug 27, 2020, 7:59 AM

7 points

0 comments20 min readLW link

Proofs Section 1.2 (Mixtures, Updates, Pushforwards)

DiffractorAug 27, 2020, 7:57 AM

7 points

0 comments14 min readLW link

Basic Inframeasure Theory

DiffractorAug 27, 2020, 8:02 AM

35 points

16 comments25 min readLW link

Belief Functions And Decision Theory

DiffractorAug 27, 2020, 8:00 AM

15 points

8 comments39 min readLW link

Technical model refinement formalism

Stuart_ArmstrongAug 27, 2020, 11:54 AM

19 points

0 comments6 min readLW link

Pong from pixels without reading “Pong from Pixels”

Ian McKenzieAug 29, 2020, 5:26 PM

15 points

1 comment7 min readLW link

Reflections on AI Timelines Forecasting Thread

AmandangoSep 1, 2020, 1:42 AM

53 points

7 comments5 min readLW link

on “learning to summarize”

nostalgebraistSep 12, 2020, 3:20 AM

25 points

13 comments8 min readLW link

(nostalgebraist.tumblr.com)

[Question] The universality of computation and mind design space

alanfSep 12, 2020, 2:58 PM

1 point

7 comments1 min readLW link

Clarifying “What failure looks like”

Sam ClarkeSep 20, 2020, 8:40 PM

95 points

14 comments17 min readLW link

Human Biases that Obscure AI Progress

Danielle EnsignSep 25, 2020, 12:24 AM

42 points

2 comments4 min readLW link

[Question] Competence vs Alignment

kwiat.devSep 30, 2020, 9:03 PM

6 points

4 comments1 min readLW link

AGI safety from first principles: Alignment

Richard_NgoOct 1, 2020, 3:13 AM

56 points

2 comments13 min readLW link

[Question] GPT-3 + GAN

stick109Oct 17, 2020, 7:58 AM

4 points

4 comments1 min readLW link

Book Review: Reinforcement Learning by Sutton and Barto

billmeiOct 20, 2020, 7:40 PM

52 points

3 comments10 min readLW link

GPT-X, Paperclip Maximizer? Analyzing AGI and Final Goals

meanderingmooseOct 22, 2020, 2:33 PM

8 points

1 comment6 min readLW link

Containing the AI… Inside a Simulated Reality

HumaneAutomationOct 31, 2020, 4:16 PM

1 point

9 comments2 min readLW link

Why those who care about catastrophic and existential risk should care about autonomous weapons

aaguirreNov 11, 2020, 3:22 PM

60 points

20 comments19 min readLW link

European Master’s Programs in Machine Learning, Artificial Intelligence, and related fields

Master Programs ML/AINov 14, 2020, 3:51 PM

32 points

8 comments1 min readLW link

Should we postpone AGI until we reach safety?

otto.bartenNov 18, 2020, 3:43 PM

27 points

36 comments3 min readLW link

Commitment and credibility in multipolar AI scenarios

anni_leskelaDec 4, 2020, 6:48 PM

25 points

3 comments18 min readLW link

[Question] AI Winter Is Coming—How to profit from it?

maximkazhenkovDec 5, 2020, 8:23 PM

10 points

7 comments1 min readLW link

Announcing the Technical AI Safety Podcast

QuinnDec 7, 2020, 6:51 PM

42 points

6 comments2 min readLW link

(technical-ai-safety.libsyn.com)

All GPT skills are translation

p.b.Dec 13, 2020, 8:06 PM

4 points

0 comments2 min readLW link

[Question] Judging AGI Output

cy6erlionDec 14, 2020, 12:43 PM

3 points

0 comments2 min readLW link

Risk Map of AI Systems

VojtaKovarik and Jan_Kulveit

Dec 15, 2020, 9:16 AM

25 points

3 comments8 min readLW link

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

xuanJan 1, 2021, 12:08 AM

30 points

21 comments20 min readLW link

Are we all misaligned?

Mateusz MazurkiewiczJan 3, 2021, 2:42 AM

11 points

0 comments5 min readLW link

[Question] What do we really expect from a well-aligned AI?

Jan BetleyJan 4, 2021, 8:57 PM

8 points

10 comments1 min readLW link

Eight claims about multi-agent AGI safety

Richard_NgoJan 7, 2021, 1:34 PM

73 points

18 comments5 min readLW link

Imitative Generalisation (AKA ‘Learning the Prior’)

Beth BarnesJan 10, 2021, 12:30 AM

92 points

14 comments12 min readLW link

Prediction can be Outer Aligned at Optimum

Lukas FinnvedenJan 10, 2021, 6:48 PM

15 points

12 comments11 min readLW link

[Question] Poll: Which variables are most strategically relevant?

Daniel Kokotajlo and Noa Nabeshima

Jan 22, 2021, 5:17 PM

32 points

34 comments1 min readLW link

AISU 2021

Linda LinseforsJan 30, 2021, 5:40 PM

28 points

2 comments1 min readLW link

Deepmind has made a general inductor (“Making sense of sensory input”)

mako yassFeb 2, 2021, 2:54 AM

48 points

10 comments1 min readLW link

(www.sciencedirect.com)

Counterfactual Planning in AGI Systems

Koen.HoltmanFeb 3, 2021, 1:54 PM

7 points

0 comments5 min readLW link

[AN #136]: How well will GPT-N perform on downstream tasks?

Rohin ShahFeb 3, 2021, 6:10 PM

21 points

2 comments9 min readLW link

(mailchi.mp)

Formal Solution to the Inner Alignment Problem

michaelcohenFeb 18, 2021, 2:51 PM

47 points

123 comments2 min readLW link

TASP Ep 3 - Optimal Policies Tend to Seek Power

QuinnMar 11, 2021, 1:44 AM

24 points

0 comments1 min readLW link

(technical-ai-safety.libsyn.com)

Phylactery Decision Theory

BunthutApr 2, 2021, 8:55 PM

14 points

6 comments2 min readLW link

Predictive Coding has been Unified with Backpropagation

lsusrApr 2, 2021, 9:42 PM

166 points

44 comments2 min readLW link

[Question] What if we could use the theory of Mechanism Design from Game Theory as a medium achieve AI Alignment?

farari7Apr 4, 2021, 12:56 PM

4 points

0 comments1 min readLW link

Another (outer) alignment failure story

paulfchristianoApr 7, 2021, 8:12 PM

210 points

38 comments12 min readLW link

A System For Evolving Increasingly General Artificial Intelligence From Current Technologies

Tsang Chung ShuApr 8, 2021, 9:37 PM

1 point

3 comments11 min readLW link

April 2021 Deep Dive: Transformers and GPT-3

adamShimiMay 1, 2021, 11:18 AM

30 points

6 comments7 min readLW link

[Question] [timeboxed exercise] write me your model of AI human-existential safety and the alignment problems in 15 minutes

QuinnMay 4, 2021, 7:10 PM

6 points

2 comments1 min readLW link

Mostly questions about Dumb AI Kernels

HorizonHeldMay 12, 2021, 10:00 PM

1 point

1 comment9 min readLW link

Thoughts on Iterated Distillation and Amplification

WaddingtonMay 11, 2021, 9:32 PM

9 points

2 comments20 min readLW link

How do we build organisations that want to build safe AI?

sxaeMay 12, 2021, 3:08 PM

4 points

4 comments9 min readLW link

[Question] Who has argued in detail that a current AI system is phenomenally conscious?

RobboMay 14, 2021, 10:03 PM

3 points

2 comments1 min readLW link

How I Learned to Stop Worrying and Love MUM

WaddingtonMay 20, 2021, 7:57 AM

2 points

0 comments3 min readLW link

AI Safety Research Project Ideas

Owain_EvansMay 21, 2021, 1:39 PM

58 points

2 comments3 min readLW link

[Question] How one uses set theory for alignment problem?

Valentin2026May 29, 2021, 12:28 AM

8 points

6 comments1 min readLW link

Reflection of Hierarchical Relationship via Nuanced Conditioning of Game Theory Approach for AI Development and Utilization

Kyoung-cheol KimJun 4, 2021, 7:20 AM

2 points

2 comments9 min readLW link

Review of “Learning Normativity: A Research Agenda”

Gyrodiot, adamShimi and Joe Collman

Jun 6, 2021, 1:33 PM

34 points

0 comments6 min readLW link

Hardware for Transformative AI

MrThinkJun 22, 2021, 6:13 PM

17 points

7 comments2 min readLW link

Alex Turner’s Research, Comprehensive Information Gathering

adamShimiJun 23, 2021, 9:44 AM

15 points

3 comments3 min readLW link

Discussion: Objective Robustness and Inner Alignment Terminology

jbkjr and Lauro Langosco

Jun 23, 2021, 11:25 PM

70 points

7 comments9 min readLW link

The Language of Bird

johnswentworthJun 27, 2021, 4:44 AM

44 points

9 comments2 min readLW link

[Question] What are some claims or opinions about multi-multi delegation you’ve seen in the memeplex that you think deserve scrutiny?

QuinnJun 27, 2021, 5:44 PM

17 points

6 comments2 min readLW link

An examination of Metaculus’ resolved AI predictions and their implications for AI timelines

CharlesDJul 20, 2021, 9:08 AM

28 points

0 comments7 min readLW link

[Question] How should my timelines influence my career choice?

Tom LieberumAug 3, 2021, 10:14 AM

13 points

10 comments1 min readLW link

What is the problem?

Carlos RamirezAug 11, 2021, 10:33 PM

7 points

0 comments6 min readLW link

OpenAI Codex: First Impressions

specbugAug 13, 2021, 4:52 PM

49 points

8 comments4 min readLW link

(sixeleven.in)

[Question] 1h-volunteers needed for a small AI Safety-related research project

PabloAMCAug 16, 2021, 5:53 PM

2 points

0 comments1 min readLW link

Extraction of human preferences 👨→🤖

arunraja-hubAug 24, 2021, 4:34 PM

18 points

2 comments5 min readLW link

Call for research on evaluating alignment (funding + advice available)

Beth BarnesAug 31, 2021, 11:28 PM

105 points

11 comments5 min readLW link

Obstacles to gradient hacking

leogaoSep 5, 2021, 10:42 PM

21 points

11 comments4 min readLW link

[Question] Conditional on the first AGI being aligned correctly, is a good outcome even still likely?

iamthouthouartiSep 6, 2021, 5:30 PM

2 points

1 comment1 min readLW link

Distinguishing AI takeover scenarios

Sam Clarke and Sammy Martin

Sep 8, 2021, 4:19 PM

67 points

11 comments14 min readLW link

Paths To High-Level Machine Intelligence

Daniel_EthSep 10, 2021, 1:21 PM

67 points

8 comments33 min readLW link

How truthful is GPT-3? A benchmark for language models

Owain_EvansSep 16, 2021, 10:09 AM

56 points

24 comments6 min readLW link

Investigating AI Takeover Scenarios

Sammy MartinSep 17, 2021, 6:47 PM

27 points

1 comment27 min readLW link

A sufficiently paranoid non-Friendly AGI might self-modify itself to become Friendly

RomanSSep 22, 2021, 6:29 AM

5 points

2 comments1 min readLW link

Towards Deconfusing Gradient Hacking

leogaoOct 24, 2021, 12:43 AM

25 points

1 comment12 min readLW link

A brief review of the reasons multi-objective RL could be important in AI Safety Research

Ben SmithSep 29, 2021, 5:09 PM

27 points

8 comments10 min readLW link

Meta learning to gradient hack

Quintin PopeOct 1, 2021, 7:25 PM

54 points

11 comments3 min readLW link

Proposal: Scaling laws for RL generalization

axiomanOct 1, 2021, 9:32 PM

14 points

10 comments11 min readLW link

A Framework of Prediction Technologies

isaduanOct 3, 2021, 10:26 AM

8 points

2 comments9 min readLW link

AI Prediction Services and Risks of War

isaduanOct 3, 2021, 10:26 AM

3 points

2 comments10 min readLW link

Possible Worlds after Prediction Take-off

isaduanOct 3, 2021, 10:26 AM

5 points

0 comments4 min readLW link

[Proposal] Method of locating useful subnets in large models

Quintin PopeOct 13, 2021, 8:52 PM

9 points

0 comments2 min readLW link

Commentary on “AGI Safety From First Principles by Richard Ngo, September 2020”

Robert KralischOct 14, 2021, 3:11 PM

3 points

0 comments20 min readLW link

The AGI needs to be honest

rokosbasiliskOct 16, 2021, 7:24 PM

2 points

12 comments2 min readLW link

“Redundant” AI Alignment

Mckay JensenOct 16, 2021, 9:32 PM

12 points

3 comments1 min readLW link

(quevivasbien.github.io)

[MLSN #1]: ICLR Safety Paper Roundup

Dan_HOct 18, 2021, 3:19 PM

59 points

1 comment2 min readLW link

AMA on Truthful AI: Owen Cotton-Barratt, Owain Evans & co-authors

Owain_EvansOct 22, 2021, 4:23 PM

31 points

15 comments1 min readLW link

Hegel vs. GPT-3

BezziOct 27, 2021, 5:55 AM

9 points

21 comments2 min readLW link

Google announces Pathways: new generation multitask AI Architecture

OzyrusOct 29, 2021, 11:55 AM

6 points

1 comment1 min readLW link

(blog.google)

What is the most evil AI that we could build, today?

ThomasJNov 1, 2021, 7:58 PM

−2 points

14 comments1 min readLW link

Why we need prosocial agents

Akbir KhanNov 2, 2021, 3:19 PM

6 points

0 comments2 min readLW link

Possible research directions to improve the mechanistic explanation of neural networks

delton137Nov 9, 2021, 2:36 AM

29 points

8 comments9 min readLW link

What are red flags for Neural Network suffering?

Marius HobbhahnNov 8, 2021, 12:51 PM

26 points

15 comments12 min readLW link

Using Brain-Computer Interfaces to get more data for AI alignment

RobboNov 7, 2021, 12:00 AM

35 points

10 comments7 min readLW link

Hardcode the AGI to need our approval indefinitely?

MichaelStJulesNov 11, 2021, 7:04 AM

2 points

2 comments1 min readLW link

Stop button: towards a causal solution

tailcalledNov 12, 2021, 7:09 PM

23 points

37 comments9 min readLW link

A FLI postdoctoral grant application: AI alignment via causal analysis and design of agents

PabloAMCNov 13, 2021, 1:44 AM

4 points

0 comments7 min readLW link

What would we do if alignment were futile?

Grant DemareeNov 14, 2021, 8:09 AM

73 points

43 comments3 min readLW link

Attempted Gears Analysis of AGI Intervention Discussion With Eliezer

ZviNov 15, 2021, 3:50 AM

204 points

48 comments16 min readLW link

(thezvi.wordpress.com)

A positive case for how we might succeed at prosaic AI alignment

evhubNov 16, 2021, 1:49 AM

78 points

47 comments6 min readLW link

Super intelligent AIs that don’t require alignment

Yair HalberstadtNov 16, 2021, 7:55 PM

10 points

2 comments6 min readLW link

Some real examples of gradient hacking

Oliver SourbutNov 22, 2021, 12:11 AM

15 points

8 comments2 min readLW link

[linkpost] Acquisition of Chess Knowledge in AlphaZero

Quintin PopeNov 23, 2021, 7:55 AM

8 points

1 comment1 min readLW link

AI Tracker: monitoring current and near-future risks from superscale models

Edouard Harris and Jeremie Harris

Nov 23, 2021, 7:16 PM

64 points

13 comments3 min readLW link

(aitracker.org)

AI Safety Needs Great Engineers

Andy JonesNov 23, 2021, 3:40 PM

78 points

45 comments4 min readLW link

HIRING: Inform and shape a new project on AI safety at Partnership on AI

Madhulika SrikumarNov 24, 2021, 8:27 AM

6 points

0 comments1 min readLW link

How to measure FLOP/s for Neural Networks empirically?

Marius HobbhahnNov 29, 2021, 3:18 PM

16 points

5 comments7 min readLW link

AI Governance Fundamentals—Curriculum and Application

MauNov 30, 2021, 2:19 AM

17 points

0 comments16 min readLW link

Behavior Cloning is Miscalibrated

leogaoDec 5, 2021, 1:36 AM

53 points

3 comments3 min readLW link

ML Alignment Theory Program under Evan Hubinger

ozhang, evhub and Victor W

Dec 6, 2021, 12:03 AM

82 points

3 comments2 min readLW link

Information bottleneck for counterfactual corrigibility

tailcalledDec 6, 2021, 5:11 PM

8 points

1 comment7 min readLW link

Modeling Failure Modes of High-Level Machine Intelligence

Ben Cottier, Daniel_Eth and Sammy Martin

Dec 6, 2021, 1:54 PM

54 points

1 comment12 min readLW link

Finding the multiple ground truths of CoinRun and image classification

Stuart_ArmstrongDec 8, 2021, 6:13 PM

15 points

3 comments2 min readLW link

[Question] What alignment-related concepts should be better known in the broader ML community?

Lauro LangoscoDec 9, 2021, 8:44 PM

6 points

4 comments1 min readLW link

Understanding Gradient Hacking

peterbarnettDec 10, 2021, 3:58 PM

30 points

5 comments30 min readLW link

What’s the backward-forward FLOP ratio for Neural Networks?

Marius Hobbhahn and Jsevillamol

Dec 13, 2021, 8:54 AM

17 points

8 comments10 min readLW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel NandaDec 15, 2021, 11:44 PM

111 points

9 comments15 min readLW link

Disentangling Perspectives On Strategy-Stealing in AI Safety

shawnghuDec 18, 2021, 8:13 PM

20 points

1 comment11 min readLW link

Demanding and Designing Aligned Cognitive Architectures

Koen.HoltmanDec 21, 2021, 5:32 PM

8 points

5 comments5 min readLW link

Potential gears level explanations of smooth progress

ryan_greenblattDec 22, 2021, 6:05 PM

4 points

2 comments2 min readLW link

Transformer Circuits

evhubDec 22, 2021, 9:09 PM

142 points

4 comments3 min readLW link

(transformer-circuits.pub)

Gradient Hacking via Schelling Goals

Adam ScherlisDec 28, 2021, 8:38 PM

33 points

4 comments4 min readLW link

Reader-generated Essays

Henrik KarlssonJan 3, 2022, 8:56 AM

17 points

0 comments6 min readLW link

(escapingflatland.substack.com)

Brain Efficiency: Much More than You Wanted to Know

jacob_cannellJan 6, 2022, 3:38 AM

195 points

87 comments28 min readLW link

Understanding the two-head strategy for teaching ML to answer questions honestly

Adam ScherlisJan 11, 2022, 11:24 PM

28 points

1 comment10 min readLW link

Plan B in AI Safety approach

avturchinJan 13, 2022, 12:03 PM

33 points

9 comments2 min readLW link

Truthful LMs as a warm-up for aligned AGI

Jacob_HiltonJan 17, 2022, 4:49 PM

65 points

14 comments13 min readLW link

How I’m thinking about GPT-N

delton137Jan 17, 2022, 5:11 PM

46 points

21 comments18 min readLW link

Alignment Problems All the Way Down

peterbarnettJan 22, 2022, 12:19 AM

26 points

7 comments10 min readLW link

[Question] How feasible/costly would it be to train a very large AI model on distributed clusters of GPUs?

AnonymousJan 25, 2022, 7:20 PM

7 points

4 comments1 min readLW link

Causality, Transformative AI and alignment—part I

Marius HobbhahnJan 27, 2022, 4:18 PM

13 points

11 comments8 min readLW link

2+2: Ontological Framework

LyrialtusFeb 1, 2022, 1:07 AM

−15 points

2 comments12 min readLW link

QNR prospects are important for AI alignment research

Eric DrexlerFeb 3, 2022, 3:20 PM

82 points

10 comments11 min readLW link

Paradigm-building: Introduction

Cameron BergFeb 8, 2022, 12:06 AM

25 points

0 comments2 min readLW link

Paradigm-building: The hierarchical question framework

Cameron BergFeb 9, 2022, 4:47 PM

11 points

16 comments3 min readLW link

Question 1: Predicted architecture of AGI learning algorithm(s)

Cameron BergFeb 10, 2022, 5:22 PM

12 points

1 comment7 min readLW link

Question 2: Predicted bad outcomes of AGI learning architecture

Cameron BergFeb 11, 2022, 10:23 PM

5 points

1 comment10 min readLW link

Question 3: Control proposals for minimizing bad outcomes

Cameron BergFeb 12, 2022, 7:13 PM

5 points

1 comment7 min readLW link

Question 4: Implementing the control proposals

Cameron BergFeb 13, 2022, 5:12 PM

6 points

2 comments5 min readLW link

Question 5: The timeline hyperparameter

Cameron BergFeb 14, 2022, 4:38 PM

5 points

3 comments7 min readLW link

Paradigm-building: Conclusion and practical takeaways

Cameron BergFeb 15, 2022, 4:11 PM

2 points

1 comment2 min readLW link

How complex are myopic imitators?

Vivek HebbarFeb 8, 2022, 12:00 PM

23 points

1 comment15 min readLW link

Metaculus launches contest for essays with quantitative predictions about AI

Tamay Besiroglu and Metaculus

Feb 8, 2022, 4:07 PM

25 points

2 comments1 min readLW link

(www.metaculus.com)

Hypothesis: gradient descent prefers general circuits

Quintin PopeFeb 8, 2022, 9:12 PM

40 points

26 comments11 min readLW link

Compute Trends Across Three eras of Machine Learning

Jsevillamol, Pablo Villalobos, lennart, Marius Hobbhahn, Tamay Besiroglu and anson.ho

Feb 16, 2022, 2:18 PM

91 points

13 comments2 min readLW link

[Question] Is the competition/cooperation between symbolic AI and statistical AI (ML) about historical approach to research / engineering, or is it more fundamentally about what intelligent agents “are”?

Edward HammondFeb 17, 2022, 11:11 PM

1 point

1 comment2 min readLW link

HCH and Adversarial Questions

David UdellFeb 19, 2022, 12:52 AM

15 points

7 comments26 min readLW link

Thoughts on Dangerous Learned Optimization

peterbarnettFeb 19, 2022, 10:46 AM

4 points

2 comments4 min readLW link

Relativized Definitions as a Method to Sidestep the Löbian Obstacle

homotowatFeb 27, 2022, 6:37 AM

27 points

4 comments7 min readLW link

What we know about machine learning’s replication crisis

Younes KamelMar 5, 2022, 11:55 PM

35 points

4 comments6 min readLW link

(youneskamel.substack.com)

Projecting compute trends in Machine Learning

Tamay, lennart and Jsevillamol

Mar 7, 2022, 3:32 PM

59 points

5 comments6 min readLW link

[Survey] Expectations of a Post-ASI Order

Lone PineMar 9, 2022, 7:17 PM

5 points

0 comments1 min readLW link

A Longlist of Theories of Impact for Interpretability

Neel NandaMar 11, 2022, 2:55 PM

106 points

29 comments5 min readLW link

New GPT3 Impressive Capabilities—InstructGPT3 [1/2]

simeon_cMar 13, 2022, 10:58 AM

71 points

10 comments7 min readLW link

Phase transitions and AGI

Ege Erdil and Metaculus

Mar 17, 2022, 5:22 PM

44 points

19 comments9 min readLW link

(www.metaculus.com)

Can we simulate human evolution to create a somewhat aligned AGI?

Thomas KwaMar 28, 2022, 10:55 PM

21 points

7 comments7 min readLW link

Project Intro: Selection Theorems for Modularity

CallumMcDougall, Avery and Lucius Bushnaq

Apr 4, 2022, 12:59 PM

69 points

20 comments16 min readLW link

My agenda for research into transformer capabilities—Introduction

p.b.Apr 5, 2022, 9:23 PM

11 points

1 comment3 min readLW link

Research agenda: Can transformers do system 2 thinking?

p.b.Apr 6, 2022, 1:31 PM

20 points

0 comments2 min readLW link

PaLM in “Extrapolating GPT-N performance”

Lukas FinnvedenApr 6, 2022, 1:05 PM

80 points

19 comments2 min readLW link

Research agenda—Building a multi-modal chess-language model

p.b.Apr 7, 2022, 12:25 PM

8 points

2 comments2 min readLW link

Is GPT3 a Good Rationalist? - InstructGPT3 [2/2]

simeon_cApr 7, 2022, 1:46 PM

11 points

0 comments7 min readLW link

Playing with DALL·E 2

Dave OrrApr 7, 2022, 6:49 PM

165 points

116 comments6 min readLW link

Progress Report 4: logit lens redux

Nathan Helm-BurgerApr 8, 2022, 6:35 PM

3 points

0 comments2 min readLW link

Hyperbolic takeoff

Ege ErdilApr 9, 2022, 3:57 PM

17 points

8 comments10 min readLW link

(www.metaculus.com)

Elicit: Language Models as Research Assistants

stuhlmueller and jungofthewon

Apr 9, 2022, 2:56 PM

70 points

7 comments13 min readLW link

Is it time to start thinking about what AI Friendliness means?

Victor NovikovApr 11, 2022, 9:32 AM

18 points

6 comments3 min readLW link

What more compute does for brain-like models: response to Rohin

Nathan Helm-BurgerApr 13, 2022, 3:40 AM

22 points

14 comments11 min readLW link

Alignment and Deep Learning

AiyenApr 17, 2022, 12:02 AM

44 points

35 comments8 min readLW link

[$20K in Prizes] AI Safety Arguments Competition

Dan H, Kevin Liu, ozhang, TW123 and Sidney Hough

Apr 26, 2022, 4:13 PM

74 points

543 comments3 min readLW link

SERI ML Alignment Theory Scholars Program 2022

Ryan Kidd, Victor Warlop and ozhang

Apr 27, 2022, 12:43 AM

56 points

6 comments3 min readLW link

[Question] What is a training “step” vs. “episode” in machine learning?

Evan R. MurphyApr 28, 2022, 9:53 PM

9 points

4 comments1 min readLW link

Prize for Alignment Research Tasks

stuhlmueller and William_S

Apr 29, 2022, 8:57 AM

63 points

36 comments10 min readLW link

Quick Thoughts on A.I. Governance

Nicholas / Heather KrossApr 30, 2022, 2:49 PM

66 points

8 comments2 min readLW link

(www.thinkingmuchbetter.com)

What DALL-E 2 can and cannot do

Swimmer963 (Miranda Dixon-Luinenburg) May 1, 2022, 11:51 PM

351 points

305 comments9 min readLW link

Open Problems in Negative Side Effect Minimization

Fabian Schimpf and Lukas Fluri

May 6, 2022, 9:37 AM

12 points

7 comments17 min readLW link

[Linkpost] diffusion magnetizes manifolds (DALL-E 2 intuition building)

Paul BricmanMay 7, 2022, 11:01 AM

1 point

0 comments1 min readLW link

(paulbricman.com)

Updating Utility Functions

JustinShovelain and Joar Skalse

May 9, 2022, 9:44 AM

36 points

7 comments8 min readLW link

Conditions for mathematical equivalence of Stochastic Gradient Descent and Natural Selection

Oliver SourbutMay 9, 2022, 9:38 PM

54 points

12 comments10 min readLW link

AI safety should be made more accessible using non text-based media

MassimogMay 10, 2022, 3:14 AM

2 points

4 comments4 min readLW link

The limits of AI safety via debate

Marius HobbhahnMay 10, 2022, 1:33 PM

28 points

7 comments10 min readLW link

Introduction to the sequence: Interpretability Research for the Most Important Century

Evan R. MurphyMay 12, 2022, 7:59 PM

16 points

0 comments8 min readLW link

Gato as the Dawn of Early AGI

David UdellMay 15, 2022, 6:52 AM

84 points

29 comments12 min readLW link

Is AI Progress Impossible To Predict?

alyssavanceMay 15, 2022, 6:30 PM

276 points

38 comments2 min readLW link

DeepMind’s generalist AI, Gato: A non-technical explainer

frances_lorenz, Nora Belrose and jonmenaster

May 16, 2022, 9:21 PM

57 points

6 comments6 min readLW link

Gato’s Generalisation: Predictions and Experiments I’d Like to See

Oliver SourbutMay 18, 2022, 7:15 AM

43 points

3 comments10 min readLW link

Understanding Gato’s Supervised Reinforcement Learning

lorepieriMay 18, 2022, 11:08 AM

3 points

5 comments1 min readLW link

(lorenzopieri.com)

A Story of AI Risk: InstructGPT-N

peterbarnettMay 26, 2022, 11:22 PM

24 points

0 comments8 min readLW link

[Linkpost] A Chinese AI optimized for killing

RomanSJun 3, 2022, 9:17 AM

−2 points

4 comments1 min readLW link

Give the AI safe tools

Adam JermynJun 3, 2022, 5:04 PM

3 points

0 comments4 min readLW link

Towards a Formalisation of Returns on Cognitive Reinvestment (Part 1)

DragonGodJun 4, 2022, 6:42 PM

17 points

8 comments13 min readLW link

Give the model a model-builder

Adam JermynJun 6, 2022, 12:21 PM

3 points

0 comments5 min readLW link

AGI Safety FAQ / all-dumb-questions-allowed thread

Aryeh EnglanderJun 7, 2022, 5:47 AM

221 points

515 comments4 min readLW link

Embodiment is Indispensable for AGI

P. G. Keerthana GopalakrishnanJun 7, 2022, 9:31 PM

6 points

1 comment6 min readLW link

(keerthanapg.com)

You Only Get One Shot: an Intuition Pump for Embedded Agency

Oliver SourbutJun 9, 2022, 9:38 PM

22 points

4 comments2 min readLW link

Summary of “AGI Ruin: A List of Lethalities”

Stephen McAleeseJun 10, 2022, 10:35 PM

32 points

2 comments8 min readLW link

Poorly-Aimed Death Rays

Thane RuthenisJun 11, 2022, 6:29 PM

43 points

5 comments4 min readLW link

ELK Proposal—Make the Reporter care about the Predictor’s beliefs

Adam Jermyn and Nicholas Schiefer

Jun 11, 2022, 10:53 PM

8 points

0 comments6 min readLW link

Grokking “Semi-informative priors over AI timelines”

anson.hoJun 12, 2022, 10:17 PM

15 points

7 comments14 min readLW link

[Question] Favourite new AI productivity tools?

Gabe MJun 15, 2022, 1:08 AM

14 points

5 comments1 min readLW link

Contra Hofstadter on GPT-3 Nonsense

ricticJun 15, 2022, 9:53 PM

235 points

22 comments2 min readLW link

[Question] What if LaMDA is indeed sentient / self-aware / worth having rights?

RomanSJun 16, 2022, 9:10 AM

22 points

13 comments1 min readLW link

Ten experiments in modularity, which we’d like you to run!

CallumMcDougall, Lucius Bushnaq and Avery

Jun 16, 2022, 9:17 AM

59 points

2 comments9 min readLW link

Alignment research for “meta” purposes

acylhalideJun 16, 2022, 2:03 PM

15 points

0 comments1 min readLW link

[Question] AI misalignment risk from GPT-like systems?

fiso64Jun 19, 2022, 5:35 PM

10 points

8 comments1 min readLW link

Half-baked alignment idea: training to generalize

Aaron BergmanJun 19, 2022, 8:16 PM

7 points

2 comments4 min readLW link

Getting from an unaligned AGI to an aligned AGI?

Tor Økland BarstadJun 21, 2022, 12:36 PM

9 points

7 comments9 min readLW link

Mitigating the damage from unaligned ASI by cooperating with aliens that don’t exist yet

MSRayneJun 21, 2022, 4:12 PM

−8 points

7 comments6 min readLW link

AI Training Should Allow Opt-Out

alyssavanceJun 23, 2022, 1:33 AM

76 points

13 comments6 min readLW link

Updated Deference is not a strong argument against the utility uncertainty approach to alignment

Ivan VendrovJun 24, 2022, 7:32 PM

20 points

8 comments4 min readLW link

SunPJ in Alenia

FlorianHJun 25, 2022, 7:39 PM

7 points

19 comments8 min readLW link

(plausiblestuff.com)

Conditioning Generative Models

Adam JermynJun 25, 2022, 10:15 PM

22 points

18 comments10 min readLW link

Training Trace Priors and Speed Priors

Adam JermynJun 26, 2022, 6:07 PM

17 points

0 comments3 min readLW link

Deliberation Everywhere: Simple Examples

Oliver SourbutJun 27, 2022, 5:26 PM

14 points

0 comments15 min readLW link

Deliberation, Reactions, and Control: Tentative Definitions and a Restatement of Instrumental Convergence

Oliver SourbutJun 27, 2022, 5:25 PM

10 points

0 comments11 min readLW link

Formal Philosophy and Alignment Possible Projects

Daniel HerrmannJun 30, 2022, 10:42 AM

33 points

5 comments8 min readLW link

Reframing the AI Risk

Thane RuthenisJul 1, 2022, 6:44 PM

26 points

7 comments6 min readLW link

Trends in GPU price-performance

Marius Hobbhahn and Tamay

Jul 1, 2022, 3:51 PM

85 points

10 comments1 min readLW link

(epochai.org)

Follow along with Columbia EA’s Advanced AI Safety Fellowship!

RohanSJul 2, 2022, 5:45 PM

3 points

0 comments2 min readLW link

(forum.effectivealtruism.org)

Can we achieve AGI Alignment by balancing multiple human objectives?

Ben SmithJul 3, 2022, 2:51 AM

11 points

1 comment4 min readLW link

We Need a Consolidated List of Bad AI Alignment Solutions

DoubleJul 4, 2022, 6:54 AM

9 points

14 comments1 min readLW link

A compressed take on recent disagreements

kmanJul 4, 2022, 4:39 AM

33 points

9 comments1 min readLW link

My Most Likely Reason to Die Young is AI X-Risk

AISafetyIsNotLongtermistJul 4, 2022, 5:08 PM

61 points

24 comments4 min readLW link

(forum.effectivealtruism.org)

The curious case of Pretty Good human inner/outer alignment

PavleMihaJul 5, 2022, 7:04 PM

41 points

45 comments4 min readLW link

Introducing the Fund for Alignment Research (We’re Hiring!)

AdamGleave, Scott Emmons, Ethan Perez and Claudia Shi

Jul 6, 2022, 2:07 AM

59 points

0 comments4 min readLW link

Outer vs inner misalignment: three framings

Richard_NgoJul 6, 2022, 7:46 PM

43 points

4 comments9 min readLW link

Response to Blake Richards: AGI, generality, alignment, & loss functions

Steven ByrnesJul 12, 2022, 1:56 PM

59 points

9 comments15 min readLW link

Goal Alignment Is Robust To the Sharp Left Turn

Thane RuthenisJul 13, 2022, 8:23 PM

45 points

15 comments4 min readLW link

Deception?! I ain’t got time for that!

Paul CologneseJul 18, 2022, 12:06 AM

50 points

5 comments13 min readLW link

Four questions I ask AI safety researchers

Orpheus16Jul 17, 2022, 5:25 PM

17 points

0 comments1 min readLW link

A distillation of Evan Hubinger’s training stories (for SERI MATS)

Daphne_WJul 18, 2022, 3:38 AM

15 points

1 comment10 min readLW link

Conditioning Generative Models for Alignment

JozdienJul 18, 2022, 7:11 AM

40 points

8 comments22 min readLW link

Information theoretic model analysis may not lend much insight, but we may have been doing them wrong!

Garrett BakerJul 24, 2022, 12:42 AM

7 points

0 comments10 min readLW link

How to Diversify Conceptual Alignment: the Model Behind Refine

adamShimiJul 20, 2022, 10:44 AM

78 points

11 comments8 min readLW link

Our Existing Solutions to AGI Alignment (semi-safe)

Michael SoareverixJul 21, 2022, 7:00 PM

12 points

1 comment3 min readLW link

Reward is not the optimization target

TurnTroutJul 25, 2022, 12:03 AM

252 points

97 comments10 min readLW link

What Environment Properties Select Agents For World-Modeling?

Thane RuthenisJul 23, 2022, 7:27 PM

24 points

1 comment12 min readLW link

AGI Safety Needs People With All Skillsets!

Severin T. SeehrichJul 25, 2022, 1:32 PM

28 points

0 comments2 min readLW link

Conjecture: Internal Infohazard Policy

Connor Leahy, Sid Black, Chris Scammell and Andrea_Miotti

Jul 29, 2022, 7:07 PM

119 points

6 comments19 min readLW link

Humans Reflecting on HRH

leogaoJul 29, 2022, 9:56 PM

20 points

4 comments2 min readLW link

[Question] Would “Manhattan Project” style be beneficial or deleterious for AI Alignment?

Valentin2026Aug 4, 2022, 7:12 PM

5 points

1 comment1 min readLW link

Convergence Towards World-Models: A Gears-Level Model

Thane RuthenisAug 4, 2022, 11:31 PM

37 points

1 comment13 min readLW link

How To Go From Interpretability To Alignment: Just Retarget The Search

johnswentworthAug 10, 2022, 4:08 PM

143 points

30 comments3 min readLW link

Formalizing Alignment

Marv KAug 10, 2022, 6:50 PM

3 points

0 comments2 min readLW link

My summary of the alignment problem

Peter HroššoAug 11, 2022, 7:42 PM

16 points

3 comments2 min readLW link

(threadreaderapp.com)

Artificial intelligence wireheading

Big TonyAug 12, 2022, 3:06 AM

3 points

2 comments1 min readLW link

Infant AI Scenario

Nathan1123Aug 12, 2022, 9:20 PM

1 point

0 comments3 min readLW link

Gradient descent doesn’t select for inner search

Ivan VendrovAug 13, 2022, 4:15 AM

36 points

23 comments4 min readLW link

No shortcuts to knowledge: Why AI needs to ease up on scaling and learn how to code

YldedlyAug 15, 2022, 8:42 AM

4 points

0 comments1 min readLW link

(deoxyribose.github.io)

Mesa-optimization for goals defined only within a training environment is dangerous

Rubi J. HudsonAug 17, 2022, 3:56 AM

6 points

2 comments4 min readLW link

The longest training run

Jsevillamol, Tamay, Owen D and anson.ho

Aug 17, 2022, 5:18 PM

68 points

11 comments9 min readLW link

(epochai.org)

Matt Yglesias on AI Policy

Grant DemareeAug 17, 2022, 11:57 PM

25 points

1 comment1 min readLW link

(www.slowboring.com)

Epistemic Artefacts of (conceptual) AI alignment research

Nora_Ammann and particlemania

Aug 19, 2022, 5:18 PM

30 points

1 comment5 min readLW link

A Bite Sized Introduction to ELK

Luk27182Sep 17, 2022, 12:28 AM

5 points

0 comments6 min readLW link

Benchmarking Proposals on Risk Scenarios

Paul BricmanAug 20, 2022, 10:01 AM

25 points

2 comments14 min readLW link

The ‘Bitter Lesson’ is Wrong

deepthoughtlifeAug 20, 2022, 4:15 PM

−9 points

14 comments2 min readLW link

My Plan to Build Aligned Superintelligence

apollonianbluesAug 21, 2022, 1:16 PM

18 points

7 comments8 min readLW link

Beliefs and Disagreements about Automating Alignment Research

Ian McKenzieAug 24, 2022, 6:37 PM

92 points

4 comments7 min readLW link

Google AI integrates PaLM with robotics: SayCan update [Linkpost]

Evan R. MurphyAug 24, 2022, 8:54 PM

25 points

0 comments1 min readLW link

(sites.research.google)

The Shard Theory Alignment Scheme

David UdellAug 25, 2022, 4:52 AM

47 points

33 comments2 min readLW link

[Question] What would you expect a massive multimodal online federated learner to be capable of?

Aryeh EnglanderAug 27, 2022, 5:31 PM

13 points

4 comments1 min readLW link

(My understanding of) What Everyone in Technical Alignment is Doing and Why

Thomas Larsen and elifland

Aug 29, 2022, 1:23 AM

345 points

83 comments38 min readLW link

Breaking down the training/deployment dichotomy

Erik JennerAug 28, 2022, 9:45 PM

29 points

4 comments3 min readLW link

Strategy For Conditioning Generative Models

james.lucassen and evhub

Sep 1, 2022, 4:34 AM

28 points

4 comments18 min readLW link

Gradient Hacker Design Principles From Biology

johnswentworthSep 1, 2022, 7:03 PM

52 points

13 comments3 min readLW link

No, human brains are not (much) more efficient than computers

Jesse HooglandSep 6, 2022, 1:53 PM

19 points

16 comments4 min readLW link

(www.jessehoogland.com)

Can “Reward Economics” solve AI Alignment?

Q HomeSep 7, 2022, 7:58 AM

3 points

15 comments18 min readLW link

Generators Of Disagreement With AI Alignment

George3d6Sep 7, 2022, 6:15 PM

26 points

9 comments9 min readLW link

(www.epistem.ink)

Searching for Modularity in Large Language Models

NickyP and Stephen Fowler

Sep 8, 2022, 2:25 AM

43 points

3 comments14 min readLW link

We may be able to see sharp left turns coming

Ethan Perez and Neel Nanda

Sep 3, 2022, 2:55 AM

50 points

26 comments1 min readLW link

Gatekeeper Victory: AI Box Reflection

Double and DaemonicSigil

Sep 9, 2022, 9:38 PM

4 points

5 comments9 min readLW link

Can you force a neural network to keep generalizing?

Q HomeSep 12, 2022, 10:14 AM

2 points

10 comments5 min readLW link

Alignment via prosocial brain algorithms

Cameron BergSep 12, 2022, 1:48 PM

42 points

28 comments6 min readLW link

[Linkpost] A survey on over 300 works about interpretability in deep networks

scasperSep 12, 2022, 7:07 PM

96 points

7 comments2 min readLW link

(arxiv.org)

Trying to find the underlying structure of computational systems

Matthias G. MayerSep 13, 2022, 9:16 PM

17 points

9 comments4 min readLW link

[Question] Are Speed Superintelligences Feasible for Modern ML Techniques?

DragonGodSep 14, 2022, 12:59 PM

8 points

5 comments1 min readLW link

The Defender’s Advantage of Interpretability

Marius HobbhahnSep 14, 2022, 2:05 PM

41 points

4 comments6 min readLW link

When does technical work to reduce AGI conflict make a difference?: Introduction

JesseClifton, Sammy Martin and Anthony DiGiovanni

Sep 14, 2022, 7:38 PM

42 points

3 comments6 min readLW link

ACT-1: Transformer for Actions

Daniel KokotajloSep 14, 2022, 7:09 PM

52 points

4 comments1 min readLW link

(www.adept.ai)

[Question] Forecasting thread: How does AI risk level vary based on timelines?

eliflandSep 14, 2022, 11:56 PM

33 points

7 comments1 min readLW link

General advice for transitioning into Theoretical AI Safety

Martín SotoSep 15, 2022, 5:23 AM

9 points

0 comments10 min readLW link

Why deceptive alignment matters for AGI safety

Marius HobbhahnSep 15, 2022, 1:38 PM

48 points

12 comments13 min readLW link

Understanding Conjecture: Notes from Connor Leahy interview

Orpheus16Sep 15, 2022, 6:37 PM

103 points

24 comments15 min readLW link

ordering capability thresholds

Tamsin LeakeSep 16, 2022, 4:36 PM

27 points

0 comments4 min readLW link

(carado.moe)

Levels of goals and alignment

zeshenSep 16, 2022, 4:44 PM

27 points

4 comments6 min readLW link

Katja Grace on Slowing Down AI, AI Expert Surveys And Estimating AI Risk

Michaël TrazziSep 16, 2022, 5:45 PM

40 points

2 comments3 min readLW link

(theinsideview.ai)

Summaries: Alignment Fundamentals Curriculum

Leon LangSep 18, 2022, 1:08 PM

43 points

3 comments1 min readLW link

(docs.google.com)

Leveraging Legal Informatics to Align AI

John NaySep 18, 2022, 8:39 PM

11 points

0 comments3 min readLW link

(forum.effectivealtruism.org)

Alignment Org Cheat Sheet

Orpheus16 and Thomas Larsen

Sep 20, 2022, 5:36 PM

63 points

6 comments4 min readLW link

Public-facing Censorship Is Safety Theater, Causing Reputational Damage

YitzSep 23, 2022, 5:08 AM

144 points

42 comments6 min readLW link

Nearcast-based “deployment problem” analysis

HoldenKarnofskySep 21, 2022, 6:52 PM

78 points

2 comments26 min readLW link

Mathematical Circuits in Neural Networks

Sean OsierSep 22, 2022, 3:48 AM

34 points

4 comments1 min readLW link

(www.youtube.com)

Understanding Infra-Bayesianism: A Beginner-Friendly Video Series

Jack Parker and Connall Garrod

Sep 22, 2022, 1:25 PM

114 points

6 comments2 min readLW link

Interlude: But Who Optimizes The Optimizer?

Paul BricmanSep 23, 2022, 3:30 PM

15 points

0 comments10 min readLW link

[Question] What Do AI Safety Pitches Not Get About Your Field?

ArisSep 22, 2022, 9:27 PM

28 points

3 comments1 min readLW link

Let’s Compare Notes

Shoshannah TekofskySep 22, 2022, 8:47 PM

17 points

3 comments6 min readLW link

Brain-over-body biases, and the embodied value problem in AI alignment

geoffreymillerSep 24, 2022, 10:24 PM

10 points

6 comments25 min readLW link

Brief Notes on Transformers

Adam JermynSep 26, 2022, 2:46 PM

32 points

2 comments2 min readLW link

You are Underestimating The Likelihood That Convergent Instrumental Subgoals Lead to Aligned AGI

Mark NeyerSep 26, 2022, 2:22 PM

3 points

6 comments3 min readLW link

7 traps that (we think) new alignment researchers often fall into

Orpheus16 and Thomas Larsen

Sep 27, 2022, 11:13 PM

157 points

10 comments4 min readLW link

Threat-Resistant Bargaining Megapost: Introducing the ROSE Value

DiffractorSep 28, 2022, 1:20 AM

89 points

11 comments53 min readLW link

Failure modes in a shard theory alignment plan

Thomas KwaSep 27, 2022, 10:34 PM

24 points

2 comments7 min readLW link

QAPR 3: interpretability-guided training of neural nets

Quintin PopeSep 28, 2022, 4:02 PM

47 points

2 comments10 min readLW link

[Question] What’s the actual evidence that AI marketing tools are changing preferences in a way that makes them easier to predict?

EmrikOct 1, 2022, 3:21 PM

10 points

7 comments1 min readLW link

[Question] Any further work on AI Safety Success Stories?

KriegerOct 2, 2022, 9:53 AM

7 points

6 comments1 min readLW link

AI Timelines via Cumulative Optimization Power: Less Long, More Short

jacob_cannellOct 6, 2022, 12:21 AM

111 points

32 comments6 min readLW link

confusion about alignment requirements

Tamsin LeakeOct 6, 2022, 10:32 AM

28 points

10 comments3 min readLW link

(carado.moe)

Good ontologies induce commutative diagrams

Erik JennerOct 9, 2022, 12:06 AM

40 points

5 comments14 min readLW link

Uncontrollable AI as an Existential Risk

Karl von WendtOct 9, 2022, 10:36 AM

19 points

0 comments20 min readLW link

Objects in Mirror Are Closer Than They Appear...

VestoziaOct 11, 2022, 4:34 AM

2 points

7 comments9 min readLW link

Misalignment Harms Can Be Caused by Low Intelligence Systems

DialecticEelOct 11, 2022, 1:39 PM

11 points

3 comments1 min readLW link

Building a transformer from scratch—AI safety up-skilling challenge

Marius HobbhahnOct 12, 2022, 3:40 PM

42 points

1 comment5 min readLW link

Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small

Haoxing Du and Buck

Oct 12, 2022, 9:25 PM

49 points

11 comments4 min readLW link

Science of Deep Learning—a technical agenda

Marius HobbhahnOct 18, 2022, 2:54 PM

35 points

7 comments4 min readLW link

Response to Katja Grace’s AI x-risk counterarguments

Erik Jenner and Johannes Treutlein

Oct 19, 2022, 1:17 AM

75 points

18 comments15 min readLW link

[Question] What Does AI Alignment Success Look Like?

ShmiOct 20, 2022, 12:32 AM

23 points

7 comments1 min readLW link

AI Research Program Prediction Markets

tailcalledOct 20, 2022, 1:42 PM

38 points

10 comments1 min readLW link

Learning societal values from law as part of an AGI alignment strategy

John NayOct 21, 2022, 2:03 AM

3 points

18 comments54 min readLW link

Improved Security to Prevent Hacker-AI and Digital Ghosts

Erland WittkotterOct 21, 2022, 10:11 AM

4 points

3 comments12 min readLW link

What will the scaled up GATO look like? (Updated with questions)

Amal Oct 25, 2022, 12:44 PM

33 points

20 comments1 min readLW link

Intent alignment should not be the goal for AGI x-risk reduction

John NayOct 26, 2022, 1:24 AM

−6 points

10 comments3 min readLW link

Resources that (I think) new alignment researchers should know about

Orpheus16Oct 28, 2022, 10:13 PM

69 points

8 comments4 min readLW link

Boundaries vs Frames

Scott GarrabrantOct 31, 2022, 3:14 PM

47 points

7 comments7 min readLW link

Adversarial Policies Beat Professional-Level Go AIs

sanxiynNov 3, 2022, 1:27 PM

31 points

35 comments1 min readLW link

(goattack.alignmentfund.org)

The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren and Sid Black

Nov 28, 2022, 12:54 PM

159 points

27 comments31 min readLW link

Simple Way to Prevent Power-Seeking AI

research_prime_spaceDec 7, 2022, 12:26 AM

7 points

1 comment1 min readLW link

You can still fetch the coffee today if you’re dead tomorrow

davidadDec 9, 2022, 2:06 PM

58 points

15 comments5 min readLW link

Extracting and Evaluating Causal Direction in LLMs’ Activations

Fabien Roger and simeon_c

Dec 14, 2022, 2:33 PM

22 points

2 comments11 min readLW link

Realism about rationality

Richard_NgoSep 16, 2018, 10:46 AM

180 points

145 comments4 min readLW link 3 reviews

(thinkingcomplete.blogspot.com)

Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Ben PaceOct 4, 2019, 4:08 AM

205 points

60 comments15 min readLW link 2 reviews

The Parable of Predict-O-Matic

abramdemskiOct 15, 2019, 12:49 AM

291 points

42 comments14 min readLW link 2 reviews

2018 AI Alignment Literature Review and Charity Comparison

LarksDec 18, 2018, 4:46 AM

190 points

26 comments62 min readLW link 1 review

An Orthodox Case Against Utility Functions

abramdemskiApr 7, 2020, 7:18 PM

128 points

53 comments8 min readLW link 2 reviews

“How conservative” should the partial maximisers be?

Stuart_ArmstrongApr 13, 2020, 3:50 PM

30 points

8 comments2 min readLW link

[AN #95]: A framework for thinking about how to make AI go well

Rohin ShahApr 15, 2020, 5:10 PM

20 points

2 comments10 min readLW link

(mailchi.mp)

AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

Palus AstraApr 16, 2020, 12:50 AM

58 points

27 comments89 min readLW link

Open question: are minimal circuits daemon-free?

paulfchristianoMay 5, 2018, 10:40 PM

81 points

70 comments2 min readLW link 1 review

Disentangling arguments for the importance of AI safety

Richard_NgoJan 21, 2019, 12:41 PM

129 points

23 comments8 min readLW link

Integrating Hidden Variables Improves Approximation

johnswentworthApr 16, 2020, 9:43 PM

15 points

4 comments1 min readLW link

AI Services as a Research Paradigm

VojtaKovarikApr 20, 2020, 1:00 PM

30 points

12 comments4 min readLW link

(docs.google.com)

Databases of human behaviour and preferences?

Stuart_ArmstrongApr 21, 2020, 6:06 PM

10 points

9 comments1 min readLW link

Critch on career advice for junior AI-x-risk-concerned researchers

Rob BensingerMay 12, 2018, 2:13 AM

117 points

25 comments4 min readLW link

Reframing Impact

TurnTroutSep 20, 2019, 7:03 PM

90 points

15 comments3 min readLW link 1 review

Description vs simulated prediction

Richard Korzekwa Apr 22, 2020, 4:40 PM

26 points

0 comments5 min readLW link

(aiimpacts.org)

DeepMind team on specification gaming

JoshuaFoxApr 23, 2020, 8:01 AM

30 points

2 comments1 min readLW link

(deepmind.com)

[Question] Does Agent-like Behavior Imply Agent-like Architecture?

Scott GarrabrantAug 23, 2019, 2:01 AM

54 points

7 comments1 min readLW link

Risks from Learned Optimization: Conclusion and Related Work

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

Jun 7, 2019, 7:53 PM

78 points

4 comments6 min readLW link

Deceptive Alignment

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

Jun 5, 2019, 8:16 PM

97 points

11 comments17 min readLW link

The Inner Alignment Problem

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

Jun 4, 2019, 1:20 AM

99 points

17 comments13 min readLW link

How the MtG Color Wheel Explains AI Safety

Scott GarrabrantFeb 15, 2019, 11:42 PM

57 points

4 comments6 min readLW link

[Question] How does Gradient Descent Interact with Goodhart?

Scott GarrabrantFeb 2, 2019, 12:14 AM

68 points

19 comments4 min readLW link

Formal Open Problem in Decision Theory

Scott GarrabrantNov 29, 2018, 3:25 AM

35 points

11 comments4 min readLW link

The Ubiquitous Converse Lawvere Problem

Scott GarrabrantNov 29, 2018, 3:16 AM

21 points

0 comments2 min readLW link

Embedded Curiosities

Scott Garrabrant and abramdemski

Nov 8, 2018, 2:19 PM

88 points

1 comment2 min readLW link

Subsystem Alignment

abramdemski and Scott Garrabrant

Nov 6, 2018, 4:16 PM

100 points

12 comments1 min readLW link

Robust Delegation

abramdemski and Scott Garrabrant

Nov 4, 2018, 4:38 PM

110 points

10 comments1 min readLW link

Embedded World-Models

abramdemski and Scott Garrabrant

Nov 2, 2018, 4:07 PM

87 points

16 comments1 min readLW link

Decision Theory

abramdemski and Scott Garrabrant

Oct 31, 2018, 6:41 PM

114 points

46 comments1 min readLW link

(A → B) → A

Scott GarrabrantSep 11, 2018, 10:38 PM

62 points

11 comments2 min readLW link

History of the Development of Logical Induction

Scott GarrabrantAug 29, 2018, 3:15 AM

89 points

4 comments5 min readLW link

Optimization Amplifies

Scott GarrabrantJun 27, 2018, 1:51 AM

98 points

12 comments4 min readLW link

What makes counterfactuals comparable?

Chris_LeongApr 24, 2020, 10:47 PM

11 points

6 comments3 min readLW link

New Paper Expanding on the Goodhart Taxonomy

Scott GarrabrantMar 14, 2018, 9:01 AM

17 points

4 comments1 min readLW link

(arxiv.org)

Sources of intuitions and data on AGI

Scott GarrabrantJan 31, 2018, 11:30 PM

84 points

26 comments3 min readLW link

Corrigibility

paulfchristianoNov 27, 2018, 9:50 PM

52 points

7 comments6 min readLW link

AI prediction case study 5: Omohundro’s AI drives

Stuart_ArmstrongMar 15, 2013, 9:09 AM

10 points

5 comments8 min readLW link

Toy model: convergent instrumental goals

Stuart_ArmstrongFeb 25, 2016, 2:03 PM

15 points

2 comments4 min readLW link

AI-created pseudo-deontology

Stuart_ArmstrongFeb 12, 2015, 9:11 PM

10 points

35 comments1 min readLW link

Ethical Injunctions

Eliezer YudkowskyOct 20, 2008, 11:00 PM

66 points

76 comments9 min readLW link

Motivating Abstraction-First Decision Theory

johnswentworthApr 29, 2020, 5:47 PM

42 points

16 comments5 min readLW link

[AN #97]: Are there historical examples of large, robust discontinuities?

Rohin ShahApr 29, 2020, 5:30 PM

15 points

0 comments10 min readLW link

(mailchi.mp)

My Updating Thoughts on AI policy

Ben PaceMar 1, 2020, 7:06 AM

20 points

1 comment9 min readLW link

Useful Does Not Mean Secure

Ben PaceNov 30, 2019, 2:05 AM

46 points

12 comments11 min readLW link

[Question] What is the alternative to intent alignment called?

Richard_NgoApr 30, 2020, 2:16 AM

12 points

6 comments1 min readLW link

Optimising Society to Constrain Risk of War from an Artificial Superintelligence

JohnCDraperApr 30, 2020, 10:47 AM

3 points

1 comment51 min readLW link

Stanford Encyclopedia of Philosophy on AI ethics and superintelligence

Kaj_SotalaMay 2, 2020, 7:35 AM

43 points

19 comments7 min readLW link

(plato.stanford.edu)

[Question] How does iterated amplification exceed human abilities?

riceissaMay 2, 2020, 11:44 PM

19 points

9 comments2 min readLW link

How uniform is the neocortex?

zhukeepaMay 4, 2020, 2:16 AM

78 points

23 comments11 min readLW link 1 review

Scott Garrabrant’s problem on recovering Brouwer as a corollary of Lawvere

RupertMay 4, 2020, 10:01 AM

26 points

2 comments2 min readLW link

“AI and Efficiency”, OA (44✕ improvement in CNNs since 2012)

gwernMay 5, 2020, 4:32 PM

47 points

0 comments1 min readLW link

(openai.com)

Competitive safety via gradated curricula

Richard_NgoMay 5, 2020, 6:11 PM

38 points

5 comments5 min readLW link

Modeling naturalized decision problems in linear logic

jessicataMay 6, 2020, 12:15 AM

14 points

2 comments6 min readLW link

(unstableontology.com)

[AN #98]: Understanding neural net training by seeing which gradients were helpful

Rohin ShahMay 6, 2020, 5:10 PM

22 points

3 comments9 min readLW link

(mailchi.mp)

[Question] Is AI safety research less parallelizable than AI research?

Mati_RoyMay 10, 2020, 8:43 PM

9 points

5 comments1 min readLW link

Thoughts on implementing corrigible robust alignment

Steven ByrnesNov 26, 2019, 2:06 PM

26 points

2 comments6 min readLW link

Wireheading is in the eye of the beholder

Stuart_ArmstrongJan 30, 2019, 6:23 PM

26 points

10 comments1 min readLW link

Wireheading as a potential problem with the new impact measure

Stuart_ArmstrongSep 25, 2018, 2:15 PM

25 points

20 comments4 min readLW link

Wireheading and discontinuity

Michele CampoloFeb 18, 2020, 10:49 AM

21 points

4 comments3 min readLW link

[AN #99]: Doubling times for the efficiency of AI algorithms

Rohin ShahMay 13, 2020, 5:20 PM

29 points

0 comments10 min readLW link

(mailchi.mp)

How should AIs update a prior over human preferences?

Stuart_ArmstrongMay 15, 2020, 1:14 PM

17 points

9 comments2 min readLW link

Conjecture Workshop

johnswentworthMay 15, 2020, 10:41 PM

34 points

2 comments2 min readLW link

Multi-agent safety

Richard_NgoMay 16, 2020, 1:59 AM

31 points

8 comments5 min readLW link

The Mechanistic and Normative Structure of Agency

Gordon Seidoh WorleyMay 18, 2020, 4:03 PM

15 points

4 comments1 min readLW link

(philpapers.org)

“Starwink” by Alicorn

Zack_M_DavisMay 18, 2020, 8:17 AM

44 points

1 comment1 min readLW link

(alicorn.elcenia.com)

[AN #100]: What might go wrong if you learn a reward function while acting

Rohin ShahMay 20, 2020, 5:30 PM

33 points

2 comments12 min readLW link

(mailchi.mp)

Probabilities, weights, sums: pretty much the same for reward functions

Stuart_ArmstrongMay 20, 2020, 3:19 PM

11 points

1 comment2 min readLW link

[Question] Source code size vs learned model size in ML and in humans?

riceissaMay 20, 2020, 8:47 AM

11 points

6 comments1 min readLW link

Comparing reward learning/reward tampering formalisms

Stuart_ArmstrongMay 21, 2020, 12:03 PM

9 points

3 comments3 min readLW link

AGIs as collectives

Richard_NgoMay 22, 2020, 8:36 PM

22 points

23 comments4 min readLW link

[AN #101]: Why we should rigorously measure and forecast AI progress

Rohin ShahMay 27, 2020, 5:20 PM

15 points

0 comments10 min readLW link

(mailchi.mp)

AI Safety Discussion Days

Linda LinseforsMay 27, 2020, 4:54 PM

13 points

1 comment3 min readLW link

Building brain-inspired AGI is infinitely easier than understanding the brain

Steven ByrnesJun 2, 2020, 2:13 PM

51 points

14 comments7 min readLW link

Sparsity and interpretability?

Ada Böhm, RobertKirk and Tomáš Gavenčiak

Jun 1, 2020, 1:25 PM

41 points

3 comments7 min readLW link

GPT-3: A Summary

leogaoJun 2, 2020, 6:14 PM

20 points

0 comments1 min readLW link

(leogao.dev)

Inaccessible information

paulfchristianoJun 3, 2020, 5:10 AM

84 points

17 comments14 min readLW link 2 reviews

(ai-alignment.com)

[AN #102]: Meta learning by GPT-3, and a list of full proposals for AI alignment

Rohin ShahJun 3, 2020, 5:20 PM

38 points

6 comments10 min readLW link

(mailchi.mp)

Feedback is central to agency

Alex FlintJun 1, 2020, 12:56 PM

28 points

1 comment3 min readLW link

Thinking About Super-Human AI: An Examination of Likely Paths and Ultimate Constitution

meanderingmooseJun 4, 2020, 11:22 PM

−3 points

0 comments7 min readLW link

Emergence and Control: An examination of our ability to govern the behavior of intelligent systems

meanderingmooseJun 5, 2020, 5:10 PM

1 point

0 comments6 min readLW link

GAN Discriminators Don’t Generalize?

tryactionsJun 8, 2020, 8:36 PM

18 points

7 comments2 min readLW link

More on disambiguating “discontinuity”

Aryeh EnglanderJun 9, 2020, 3:16 PM

16 points

1 comment3 min readLW link

[AN #103]: ARCHES: an agenda for existential safety, and combining natural language with deep RL

Rohin ShahJun 10, 2020, 5:20 PM

27 points

1 comment10 min readLW link

(mailchi.mp)

Dutch-Booking CDT: Revised Argument

abramdemskiOct 27, 2020, 4:31 AM

50 points

22 comments16 min readLW link

[Question] List of public predictions of what GPT-X can or can’t do?

Daniel KokotajloJun 14, 2020, 2:25 PM

20 points

9 comments1 min readLW link

Achieving AI alignment through deliberate uncertainty in multiagent systems

Florian DietzJun 15, 2020, 12:19 PM

3 points

10 comments7 min readLW link

Superexponential Historic Growth, by David Roodman

Ben PaceJun 15, 2020, 9:49 PM

43 points

6 comments5 min readLW link

(www.openphilanthropy.org)

Relating HCH and Logical Induction

abramdemskiJun 16, 2020, 10:08 PM

47 points

4 comments5 min readLW link

Image GPT

Daniel KokotajloJun 18, 2020, 11:41 AM

29 points

27 comments1 min readLW link

(openai.com)

[AN #104]: The perils of inaccessible information, and what we can learn about AI alignment from COVID

Rohin ShahJun 18, 2020, 5:10 PM

19 points

5 comments8 min readLW link

(mailchi.mp)

[Question] If AI is based on GPT, how to ensure its safety?

avturchinJun 18, 2020, 8:33 PM

20 points

11 comments1 min readLW link

What’s Your Cognitive Algorithm?

RaemonJun 18, 2020, 10:16 PM

71 points

23 comments13 min readLW link

Relevant pre-AGI possibilities

Daniel KokotajloJun 20, 2020, 10:52 AM

38 points

7 comments19 min readLW link

(aiimpacts.org)

Plausible cases for HRAD work, and locating the crux in the “realism about rationality” debate

riceissaJun 22, 2020, 1:10 AM

85 points

15 comments10 min readLW link

The Indexing Problem

johnswentworthJun 22, 2020, 7:11 PM

35 points

2 comments4 min readLW link

[Question] Requesting feedback/advice: what Type Theory to study for AI safety?

rvnntJun 23, 2020, 5:03 PM

7 points

4 comments3 min readLW link

Locality of goals

adamShimiJun 22, 2020, 9:56 PM

16 points

8 comments6 min readLW link

[Question] What is “Instrumental Corrigibility”?

joebernsteinJun 23, 2020, 8:24 PM

4 points

1 comment1 min readLW link

Models, myths, dreams, and Cheshire cat grins

Stuart_ArmstrongJun 24, 2020, 10:50 AM

21 points

7 comments2 min readLW link

[AN #105]: The economic trajectory of humanity, and what we might mean by optimization

Rohin ShahJun 24, 2020, 5:30 PM

24 points

3 comments11 min readLW link

(mailchi.mp)

There’s an Awesome AI Ethics List and it’s a little thin

AABoylesJun 25, 2020, 1:43 PM

13 points

1 comment1 min readLW link

(github.com)

GPT-3 Fiction Samples

gwernJun 25, 2020, 4:12 PM

63 points

18 comments1 min readLW link

(www.gwern.net)

Walkthrough: The Transformer Architecture [Part 1/2]

Matthew BarnettJul 30, 2019, 1:54 PM

35 points

0 comments6 min readLW link

Robustness as a Path to AI Alignment

abramdemskiOct 10, 2017, 8:14 AM

45 points

9 comments9 min readLW link

Radical Probabilism [Transcript]

abramdemski and Ben Pace

Jun 26, 2020, 10:14 PM

46 points

12 comments6 min readLW link

AI safety via market making

evhubJun 26, 2020, 11:07 PM

55 points

45 comments11 min readLW link

[Question] Have general decomposers been formalized?

QuinnJun 27, 2020, 6:09 PM

8 points

5 comments1 min readLW link

Gary Marcus vs Cortical Uniformity

Steven ByrnesJun 28, 2020, 6:18 PM

18 points

0 comments8 min readLW link

Web AI discussion Groups

Donald HobsonJun 30, 2020, 11:22 AM

11 points

0 comments2 min readLW link

Comparing AI Alignment Approaches to Minimize False Positive Risk

Gordon Seidoh WorleyJun 30, 2020, 7:34 PM

5 points

0 comments9 min readLW link

AvE: Assistance via Empowerment

FactorialCodeJun 30, 2020, 10:07 PM

12 points

1 comment1 min readLW link

(arxiv.org)

Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

Palus AstraJul 1, 2020, 5:30 PM

35 points

4 comments67 min readLW link

[AN #106]: Evaluating generalization ability of learned reward models

Rohin ShahJul 1, 2020, 5:20 PM

14 points

2 comments11 min readLW link

(mailchi.mp)

The “AI Debate” Debate

michaelcohenJul 2, 2020, 10:16 AM

20 points

20 comments3 min readLW link

Idea: Imitation/Value Learning AIXI

Past AccountJul 3, 2020, 5:10 PM

3 points

6 comments1 min readLW link

Splitting Debate up into Two Subsystems

NandiJul 3, 2020, 8:11 PM

13 points

5 comments4 min readLW link

AI Unsafety via Non-Zero-Sum Debate

VojtaKovarikJul 3, 2020, 10:03 PM

25 points

10 comments5 min readLW link

Classifying games like the Prisoner’s Dilemma

philhJul 4, 2020, 5:10 PM

100 points

28 comments6 min readLW link 1 review

(reasonableapproximation.net)

AI-Feynman as a benchmark for what we should be aiming for

Faustus2Jul 4, 2020, 9:24 AM

8 points

1 comment2 min readLW link

Learning the prior

paulfchristianoJul 5, 2020, 9:00 PM

79 points

29 comments8 min readLW link

(ai-alignment.com)

Better priors as a safety problem

paulfchristianoJul 5, 2020, 9:20 PM

64 points

7 comments5 min readLW link

(ai-alignment.com)

[Question] How far is AGI?

Roko JelavićJul 5, 2020, 5:58 PM

6 points

5 comments1 min readLW link

Classifying specification problems as variants of Goodhart’s Law

VikaAug 19, 2019, 8:40 PM

70 points

5 comments5 min readLW link 1 review

New safety research agenda: scalable agent alignment via reward modeling

VikaNov 20, 2018, 5:29 PM

34 points

13 comments1 min readLW link

(medium.com)

Designing agent incentives to avoid side effects

Vika and TurnTrout

Mar 11, 2019, 8:55 PM

29 points

0 comments2 min readLW link

(medium.com)

Discussion on the machine learning approach to AI safety

VikaNov 1, 2018, 8:54 PM

26 points

3 comments4 min readLW link

Specification gaming examples in AI

VikaApr 3, 2018, 12:30 PM

43 points

9 comments1 min readLW link 2 reviews

[Question] (answered: yes) Has anyone written up a consideration of Downs’s “Paradox of Voting” from the perspective of MIRI-ish decision theories (UDT, FDT, or even just EDT)?

Jameson QuinnJul 6, 2020, 6:26 PM

10 points

24 comments1 min readLW link

New DeepMind AI Safety Research Blog

VikaSep 27, 2018, 4:28 PM

43 points

0 comments1 min readLW link

(medium.com)

Contest: $1,000 for good questions to ask to an Oracle AI

Stuart_ArmstrongJul 31, 2019, 6:48 PM

57 points

156 comments3 min readLW link

Deconfusing Human Values Research Agenda v1

Gordon Seidoh WorleyMar 23, 2020, 4:25 PM

27 points

12 comments4 min readLW link

[Question] How “honest” is GPT-3?

abramdemskiJul 8, 2020, 7:38 PM

72 points

18 comments5 min readLW link

What does it mean to apply decision theory?

abramdemskiJul 8, 2020, 8:31 PM

51 points

5 comments8 min readLW link

AI Research Considerations for Human Existential Safety (ARCHES)

habrykaJul 9, 2020, 2:49 AM

60 points

8 comments1 min readLW link

(arxiv.org)

The Unreasonable Effectiveness of Deep Learning

Richard_NgoSep 30, 2018, 3:48 PM

85 points

5 comments13 min readLW link

(thinkingcomplete.blogspot.com)

mAIry’s room: AI reasoning to solve philosophical problems

Stuart_ArmstrongMar 5, 2019, 8:24 PM

92 points

41 comments6 min readLW link 2 reviews

Failures of an embodied AIXI

So8resJun 15, 2014, 6:29 PM

48 points

46 comments12 min readLW link

The Problem with AIXI

Rob BensingerMar 18, 2014, 1:55 AM

43 points

78 comments23 min readLW link

Versions of AIXI can be arbitrarily stupid

Stuart_ArmstrongAug 10, 2015, 1:23 PM

29 points

59 comments1 min readLW link

Reflective AIXI and Anthropics

DiffractorSep 24, 2018, 2:15 AM

17 points

13 comments8 min readLW link

AIXI and Existential Despair

paulfchristianoDec 8, 2011, 8:03 PM

23 points

38 comments6 min readLW link

How to make AIXI-tl incapable of learning

itaibn0Jan 27, 2014, 12:05 AM

7 points

5 comments2 min readLW link

Help request: What is the Kolmogorov complexity of computable approximations to AIXI?

AnnaSalamonDec 5, 2010, 10:23 AM

9 points

9 comments1 min readLW link

“AIXIjs: A Software Demo for General Reinforcement Learning”, Aslanides 2017

gwernMay 29, 2017, 9:09 PM

7 points

1 comment1 min readLW link

(arxiv.org)

Can AIXI be trained to do anything a human can?

Stuart_ArmstrongOct 20, 2014, 1:12 PM

5 points

9 comments2 min readLW link

Shaping economic incentives for collaborative AGI

Kaj_SotalaJun 29, 2018, 4:26 PM

45 points

15 comments4 min readLW link

Is the Star Trek Federation really incapable of building AI?

Kaj_SotalaMar 18, 2018, 10:30 AM

19 points

4 comments2 min readLW link

(kajsotala.fi)

Some conceptual highlights from “Disjunctive Scenarios of Catastrophic AI Risk”

Kaj_SotalaFeb 12, 2018, 12:30 PM

33 points

4 comments6 min readLW link

(kajsotala.fi)

Misconceptions about continuous takeoff

Matthew BarnettOct 8, 2019, 9:31 PM

79 points

38 comments4 min readLW link

Distinguishing definitions of takeoff

Matthew BarnettFeb 14, 2020, 12:16 AM

60 points

6 comments6 min readLW link

Book review: Artificial Intelligence Safety and Security

PeterMcCluskeyDec 8, 2018, 3:47 AM

27 points

3 comments8 min readLW link

(www.bayesianinvestor.com)

Why AI may not foom

John_MaxwellMar 24, 2013, 8:11 AM

29 points

81 comments12 min readLW link

Humans Who Are Not Concentrating Are Not General Intelligences

sarahconstantinFeb 25, 2019, 8:40 PM

181 points

35 comments6 min readLW link 1 review

(srconstantin.wordpress.com)

The Hacker Learns to Trust

Ben PaceJun 22, 2019, 12:27 AM

80 points

18 comments8 min readLW link

(medium.com)

Book Review: Human Compatible

Scott AlexanderJan 31, 2020, 5:20 AM

77 points

6 comments16 min readLW link

(slatestarcodex.com)

SSC Journal Club: AI Timelines

Scott AlexanderJun 8, 2017, 7:00 PM

12 points

15 comments8 min readLW link

Arguments against myopic training

Richard_NgoJul 9, 2020, 4:07 PM

56 points

39 comments12 min readLW link

On motivations for MIRI’s highly reliable agent design research

jessicataJan 29, 2017, 7:34 PM

27 points

1 comment5 min readLW link

Why is the impact penalty time-inconsistent?

Stuart_ArmstrongJul 9, 2020, 5:26 PM

16 points

1 comment2 min readLW link

My current take on the Paul-MIRI disagreement on alignability of messy AI

jessicataJan 29, 2017, 8:52 PM

21 points

0 comments10 min readLW link

Ben Goertzel: The Singularity Institute’s Scary Idea (and Why I Don’t Buy It)

Paul CrowleyOct 30, 2010, 9:31 AM

42 points

442 comments1 min readLW link

An Analytic Perspective on AI Alignment

DanielFilanMar 1, 2020, 4:10 AM

54 points

45 comments8 min readLW link

(danielfilan.com)

Mechanistic Transparency for Machine Learning

DanielFilanJul 11, 2018, 12:34 AM

54 points

9 comments4 min readLW link

A model I use when making plans to reduce AI x-risk

Ben PaceJan 19, 2018, 12:21 AM

69 points

41 comments6 min readLW link

AI Researchers On AI Risk

Scott AlexanderMay 22, 2015, 11:16 AM

18 points

0 comments16 min readLW link

Mini advent calendar of Xrisks: Artificial Intelligence

Stuart_ArmstrongDec 7, 2012, 11:26 AM

5 points

5 comments1 min readLW link

For FAI: Is “Molecular Nanotechnology” putting our best foot forward?

leplenJun 22, 2013, 4:44 AM

79 points

118 comments3 min readLW link

UFAI cannot be the Great Filter

ThrasymachusDec 22, 2012, 11:26 AM

59 points

92 comments3 min readLW link

Don’t Fear The Filter

Scott AlexanderMay 29, 2014, 12:45 AM

11 points

18 comments6 min readLW link

The Great Filter is early, or AI is hard

Stuart_ArmstrongAug 29, 2014, 4:17 PM

32 points

76 comments1 min readLW link

Talk: Key Issues In Near-Term AI Safety Research

Aryeh EnglanderJul 10, 2020, 6:36 PM

22 points

1 comment1 min readLW link

Mesa-Optimizers vs “Steered Optimizers”

Steven ByrnesJul 10, 2020, 4:49 PM

45 points

7 comments8 min readLW link

AlphaStar: Impressive for RL progress, not for AGI progress

orthonormalNov 2, 2019, 1:50 AM

113 points

58 comments2 min readLW link 1 review

The Catastrophic Convergence Conjecture

TurnTroutFeb 14, 2020, 9:16 PM

44 points

15 comments8 min readLW link

[Question] How well can the GPT architecture solve the parity task?

FactorialCodeJul 11, 2020, 7:02 PM

19 points

3 comments1 min readLW link

Sunday July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stuart_Armstrong

Bird Concept and Ben Pace

Jul 8, 2020, 12:27 AM

19 points

2 comments1 min readLW link

[Link] Word-vector based DL system achieves human parity in verbal IQ tests

jacob_cannellJun 13, 2015, 11:38 PM

17 points

8 comments1 min readLW link

The Power of Intelligence

Eliezer YudkowskyJan 1, 2007, 8:00 PM

66 points

4 comments4 min readLW link

Comments on CAIS

Richard_NgoJan 12, 2019, 3:20 PM

76 points

14 comments7 min readLW link

[Question] What are CAIS’ boldest near/medium-term predictions?

Bird ConceptMar 28, 2019, 1:14 PM

31 points

17 comments1 min readLW link

Drexler on AI Risk

PeterMcCluskeyFeb 1, 2019, 5:11 AM

34 points

10 comments9 min readLW link

(www.bayesianinvestor.com)

Six AI Risk/Strategy Ideas

Wei DaiAug 27, 2019, 12:40 AM

64 points

18 comments4 min readLW link 1 review

New report: Intelligence Explosion Microeconomics

Eliezer YudkowskyApr 29, 2013, 11:14 PM

72 points

251 comments3 min readLW link

Book review: Human Compatible

PeterMcCluskeyJan 19, 2020, 3:32 AM

37 points

2 comments5 min readLW link

(www.bayesianinvestor.com)

Thoughts on “Human-Compatible”

TurnTroutOct 10, 2019, 5:24 AM

63 points

35 comments5 min readLW link

Book Review: The AI Does Not Hate You

PeterMcCluskeyOct 28, 2019, 5:45 PM

26 points

0 comments5 min readLW link

(www.bayesianinvestor.com)

[Link] Book Review: ‘The AI Does Not Hate You’ by Tom Chivers (Scott Aaronson)

eigenOct 7, 2019, 6:16 PM

19 points

0 comments1 min readLW link

Book Review: Life 3.0: Being Human in the Age of Artificial Intelligence

J Thomas MorosJan 18, 2018, 5:18 PM

8 points

0 comments1 min readLW link

(ferocioustruth.com)

Book Review: Weapons of Math Destruction

ZviJun 4, 2017, 9:20 PM

1 point

0 comments16 min readLW link

DARPA Digital Tutor: Four Months to Total Technical Expertise?

SebastianG Jul 6, 2020, 11:34 PM

200 points

19 comments7 min readLW link

Paper: Superintelligence as a Cause or Cure for Risks of Astronomical Suffering

Kaj_SotalaJan 3, 2018, 2:39 PM

1 point

6 comments1 min readLW link

(www.informatica.si)

Preventing s-risks via indexical uncertainty, acausal trade and domination in the multiverse

avturchinSep 27, 2018, 10:09 AM

11 points

6 comments4 min readLW link

Preface to CLR’s Research Agenda on Cooperation, Conflict, and TAI

JesseCliftonDec 13, 2019, 9:02 PM

59 points

10 comments2 min readLW link

Sections 1 & 2: Introduction, Strategy and Governance

JesseCliftonDec 17, 2019, 9:27 PM

34 points

5 comments14 min readLW link

Sections 3 & 4: Credibility, Peaceful Bargaining Mechanisms

JesseCliftonDec 17, 2019, 9:46 PM

19 points

2 comments12 min readLW link

Sections 5 & 6: Contemporary Architectures, Humans in the Loop

JesseCliftonDec 20, 2019, 3:52 AM

27 points

4 comments10 min readLW link

Section 7: Foundations of Rational Agency

JesseCliftonDec 22, 2019, 2:05 AM

14 points

4 comments8 min readLW link

What counts as defection?

TurnTroutJul 12, 2020, 10:03 PM

81 points

21 comments5 min readLW link 1 review

The Commitment Races problem

Daniel KokotajloAug 23, 2019, 1:58 AM

122 points

39 comments5 min readLW link

Alignment Newsletter #36

Rohin ShahDec 12, 2018, 1:10 AM

21 points

0 comments11 min readLW link

(mailchi.mp)

Alignment Newsletter #47

Rohin ShahMar 4, 2019, 4:30 AM

18 points

0 comments8 min readLW link

(mailchi.mp)

Understanding “Deep Double Descent”

evhubDec 6, 2019, 12:00 AM

135 points

51 comments5 min readLW link 4 reviews

[LINK] Strong AI Startup Raises $15M

olalondeAug 21, 2012, 8:47 PM

24 points

13 comments1 min readLW link

Announcing the AI Alignment Prize

cousin_itNov 3, 2017, 3:47 PM

95 points

78 comments1 min readLW link

I’m leaving AI alignment – you better stay

rmoehnMar 12, 2020, 5:58 AM

150 points

19 comments5 min readLW link

New paper: AGI Agent Safety by Iteratively Improving the Utility Function

Koen.HoltmanJul 15, 2020, 2:05 PM

21 points

2 comments6 min readLW link

[Question] How should AI debate be judged?

abramdemskiJul 15, 2020, 10:20 PM

49 points

27 comments6 min readLW link

Alignment proposals and complexity classes

evhubJul 16, 2020, 12:27 AM

33 points

26 comments13 min readLW link

[AN #107]: The convergent instrumental subgoals of goal-directed agents

Rohin ShahJul 16, 2020, 6:47 AM

13 points

1 comment8 min readLW link

(mailchi.mp)

[AN #108]: Why we should scrutinize arguments for AI risk

Rohin ShahJul 16, 2020, 6:47 AM

19 points

6 comments12 min readLW link

(mailchi.mp)

Environments as a bottleneck in AGI development

Richard_NgoJul 17, 2020, 5:02 AM

36 points

19 comments6 min readLW link

[Question] Can an agent use interactive proofs to check the alignment of successors?

PabloAMCJul 17, 2020, 7:07 PM

7 points

2 comments1 min readLW link

Lessons on AI Takeover from the conquistadors

Daniel Kokotajlo and Bird Concept

Jul 17, 2020, 10:35 PM

58 points

30 comments5 min readLW link

What Would I Do? Self-prediction in Simple Algorithms

Scott GarrabrantJul 20, 2020, 4:27 AM

54 points

13 comments5 min readLW link

Writeup: Progress on AI Safety via Debate

Beth Barnes and paulfchristiano

Feb 5, 2020, 9:04 PM

94 points

18 comments33 min readLW link

Operationalizing Interpretability

lifelonglearnerJul 20, 2020, 5:22 AM

20 points

0 comments4 min readLW link

Learning Values in Practice

Stuart_ArmstrongJul 20, 2020, 6:38 PM

24 points

0 comments5 min readLW link

Parallels Between AI Safety by Debate and Evidence Law

CullenJul 20, 2020, 10:52 PM

10 points

1 comment2 min readLW link

(cullenokeefe.com)

The Rediscovery of Interiority in Machine Learning

DanBJul 21, 2020, 5:02 AM

5 points

4 comments1 min readLW link

(danburfoot.net)

The “AI Dungeons” Dragon Model is heavily path dependent (testing GPT-3 on ethics)

Rafael HarthJul 21, 2020, 12:14 PM

44 points

9 comments6 min readLW link

How good is humanity at coordination?

BuckJul 21, 2020, 8:01 PM

78 points

44 comments3 min readLW link

Alignment As A Bottleneck To Usefulness Of GPT-3

johnswentworthJul 21, 2020, 8:02 PM

111 points

57 comments3 min readLW link

$1000 bounty for OpenAI to show whether GPT3 was “deliberately” pretending to be stupider than it is

Bird ConceptJul 21, 2020, 6:42 PM

59 points

40 comments2 min readLW link

(twitter.com)

[Preprint] The Computational Limits of Deep Learning

Gordon Seidoh WorleyJul 21, 2020, 9:25 PM

9 points

2 comments1 min readLW link

(arxiv.org)

[AN #109]: Teaching neural nets to generalize the way humans would

Rohin ShahJul 22, 2020, 5:10 PM

17 points

3 comments9 min readLW link

(mailchi.mp)

Research agenda for AI safety and a better civilization

agilecavemanJul 22, 2020, 6:35 AM

12 points

2 comments16 min readLW link

Weak HCH accesses EXP

evhubJul 22, 2020, 10:36 PM

14 points

0 comments3 min readLW link

GPT-3 Gems

TurnTroutJul 23, 2020, 12:46 AM

33 points

10 comments48 min readLW link

Optimizing arbitrary expressions with a linear number of queries to a Logical Induction Oracle (Cartoon Guide)

Donald HobsonJul 23, 2020, 9:37 PM

3 points

2 comments2 min readLW link

[Question] Construct a portfolio to profit from AI progress.

sapphireJul 25, 2020, 8:18 AM

29 points

13 comments1 min readLW link

Thinking soberly about the context and consequences of Friendly AI

Mitchell_PorterOct 16, 2012, 4:33 AM

21 points

39 comments1 min readLW link

Goal retention discussion with Eliezer

Max TegmarkSep 4, 2014, 10:23 PM

93 points

26 comments6 min readLW link

[Question] Where do people discuss doing things with GPT-3?

skybrianJul 26, 2020, 2:31 PM

2 points

7 comments1 min readLW link

You Can Probably Amplify GPT3 Directly

Past AccountJul 26, 2020, 9:58 PM

34 points

14 comments6 min readLW link

[updated] how does gpt2’s training corpus capture internet discussion? not well

nostalgebraistJul 27, 2020, 10:30 PM

25 points

3 comments2 min readLW link

(nostalgebraist.tumblr.com)

Agentic Language Model Memes

FactorialCodeAug 1, 2020, 6:03 PM

16 points

1 comment2 min readLW link

A community-curated repository of interesting GPT-3 stuff

Rudi CJul 28, 2020, 2:16 PM

8 points

0 comments1 min readLW link

(github.com)

[Question] Does the lottery ticket hypothesis suggest the scaling hypothesis?

Daniel KokotajloJul 28, 2020, 7:52 PM

14 points

17 comments1 min readLW link

[Question] To what extent are the scaling properties of Transformer networks exceptional?

abramdemskiJul 28, 2020, 8:06 PM

30 points

1 comment1 min readLW link

[Question] What happens to variance as neural network training is scaled? What does it imply about “lottery tickets”?

abramdemskiJul 28, 2020, 8:22 PM

25 points

4 comments1 min readLW link

[Question] How will internet forums like LW be able to defend against GPT-style spam?

ChristianKlJul 28, 2020, 8:12 PM

14 points

18 comments1 min readLW link

Predictions for GPT-N

hippkeJul 29, 2020, 1:16 AM

36 points

31 comments1 min readLW link

Announcement: AI alignment prize winners and next round

cousin_itJan 15, 2018, 2:33 PM

80 points

68 comments2 min readLW link

Jeff Hawkins on neuromorphic AGI within 20 years

Steven ByrnesJul 15, 2019, 7:16 PM

167 points

24 comments12 min readLW link

Cascades, Cycles, Insight...

Eliezer YudkowskyNov 24, 2008, 9:33 AM

31 points

31 comments8 min readLW link

...Recursion, Magic

Eliezer YudkowskyNov 25, 2008, 9:10 AM

27 points

28 comments5 min readLW link

References & Resources for LessWrong

XiXiDuOct 10, 2010, 2:54 PM

153 points

106 comments20 min readLW link

[Question] A game designed to beat AI?

Long tryMar 17, 2020, 3:51 AM

13 points

29 comments1 min readLW link

Truly Part Of You

Eliezer YudkowskyNov 21, 2007, 2:18 AM

149 points

59 comments4 min readLW link

[AN #110]: Learning features from human feedback to enable reward learning

Rohin ShahJul 29, 2020, 5:20 PM

13 points

2 comments10 min readLW link

(mailchi.mp)

Structured Tasks for Language Models

Past AccountJul 29, 2020, 2:17 PM

5 points

0 comments1 min readLW link

Engaging Seriously with Short Timelines

sapphireJul 29, 2020, 7:21 PM

43 points

23 comments3 min readLW link

What Failure Looks Like: Distilling the Discussion

Ben PaceJul 29, 2020, 9:49 PM

79 points

14 comments7 min readLW link

Learning the prior and generalization

evhubJul 29, 2020, 10:49 PM

34 points

16 comments4 min readLW link

[Question] Is the work on AI alignment relevant to GPT?

Richard_KennawayJul 30, 2020, 12:23 PM

20 points

5 comments1 min readLW link

Verification and Transparency

DanielFilanAug 8, 2019, 1:50 AM

34 points

6 comments2 min readLW link

(danielfilan.com)

Robin Hanson on Lumpiness of AI Services

DanielFilanFeb 17, 2019, 11:08 PM

15 points

2 comments2 min readLW link

(www.overcomingbias.com)

One Way to Think About ML Transparency

Matthew BarnettSep 2, 2019, 11:27 PM

26 points

28 comments5 min readLW link

What is Interpretability?

RobertKirk, Tomáš Gavenčiak and Ada Böhm

Mar 17, 2020, 8:23 PM

34 points

0 comments11 min readLW link

Relaxed adversarial training for inner alignment

evhubSep 10, 2019, 11:03 PM

61 points

28 comments1 min readLW link

Conclusion to ‘Reframing Impact’

TurnTroutFeb 28, 2020, 4:05 PM

39 points

17 comments2 min readLW link

Bayesian Evolving-to-Extinction

abramdemskiFeb 14, 2020, 11:55 PM

38 points

13 comments5 min readLW link

Do Sufficiently Advanced Agents Use Logic?

abramdemskiSep 13, 2019, 7:53 PM

41 points

11 comments9 min readLW link

World State is the Wrong Abstraction for Impact

TurnTroutOct 1, 2019, 9:03 PM

62 points

19 comments2 min readLW link

Attainable Utility Preservation: Concepts

TurnTroutFeb 17, 2020, 5:20 AM

38 points

20 comments1 min readLW link

Attainable Utility Preservation: Empirical Results

TurnTrout and nealeratzlaff

Feb 22, 2020, 12:38 AM

61 points

8 comments10 min readLW link 1 review

How Low Should Fruit Hang Before We Pick It?

TurnTroutFeb 25, 2020, 2:08 AM

28 points

9 comments12 min readLW link

Attainable Utility Preservation: Scaling to Superhuman

TurnTroutFeb 27, 2020, 12:52 AM

28 points

21 comments8 min readLW link

Reasons for Excitement about Impact of Impact Measure Research

TurnTroutFeb 27, 2020, 9:42 PM

33 points

8 comments4 min readLW link

Power as Easily Exploitable Opportunities

TurnTroutAug 1, 2020, 2:14 AM

30 points

5 comments6 min readLW link

[Question] Would AGIs parent young AGIs?

Vishrut AryaAug 2, 2020, 12:57 AM

3 points

6 comments1 min readLW link

If I were a well-intentioned AI… I: Image classifier

Stuart_ArmstrongFeb 26, 2020, 12:39 PM

35 points

4 comments5 min readLW link

Non-Consequentialist Cooperation?

abramdemskiJan 11, 2019, 9:15 AM

48 points

15 comments7 min readLW link

Curiosity Killed the Cat and the Asymptotically Optimal Agent

michaelcohenFeb 20, 2020, 5:28 PM

27 points

15 comments1 min readLW link

If I were a well-intentioned AI… IV: Mesa-optimising

Stuart_ArmstrongMar 2, 2020, 12:16 PM

26 points

2 comments6 min readLW link

Response to Oren Etzioni’s “How to know if artificial intelligence is about to destroy civilization”

Daniel KokotajloFeb 27, 2020, 6:10 PM

27 points

5 comments8 min readLW link

Clarifying Power-Seeking and Instrumental Convergence

TurnTroutDec 20, 2019, 7:59 PM

42 points

7 comments3 min readLW link

How important are MDPs for AGI (Safety)?

michaelcohenMar 26, 2020, 8:32 PM

14 points

8 comments2 min readLW link

Synthesizing amplification and debate

evhubFeb 5, 2020, 10:53 PM

33 points

10 comments4 min readLW link

is gpt-3 few-shot ready for real applications?

nostalgebraistAug 3, 2020, 7:50 PM

31 points

5 comments9 min readLW link

(nostalgebraist.tumblr.com)

Interpretability in ML: A Broad Overview

lifelonglearnerAug 4, 2020, 7:03 PM

52 points

5 comments15 min readLW link

Infinite Data/Compute Arguments in Alignment

johnswentworthAug 4, 2020, 8:21 PM

49 points

6 comments2 min readLW link

Four Ways An Impact Measure Could Help Alignment

Matthew BarnettAug 8, 2019, 12:10 AM

21 points

1 comment8 min readLW link

Understanding Recent Impact Measures

Matthew BarnettAug 7, 2019, 4:57 AM

16 points

6 comments7 min readLW link

A Survey of Early Impact Measures

Matthew BarnettAug 6, 2019, 1:22 AM

23 points

0 comments8 min readLW link

Optimization Regularization through Time Penalty

Linda LinseforsJan 1, 2019, 1:05 PM

11 points

4 comments3 min readLW link

Stable Pointers to Value III: Recursive Quantilization

abramdemskiJul 21, 2018, 8:06 AM

19 points

4 comments4 min readLW link

Thoughts on Quantilizers

Stuart_ArmstrongJun 2, 2017, 4:24 PM

2 points

0 comments2 min readLW link

Quantilizers maximize expected utility subject to a conservative cost constraint

jessicataSep 28, 2015, 2:17 AM

25 points

0 comments5 min readLW link

Quantilal control for finite MDPs

Vanessa KosoyApr 12, 2018, 9:21 AM

14 points

0 comments13 min readLW link

The limits of corrigibility

Stuart_ArmstrongApr 10, 2018, 10:49 AM

27 points

9 comments4 min readLW link

Alignment Newsletter #16: 07/23/18

Rohin ShahJul 23, 2018, 4:20 PM

42 points

0 comments12 min readLW link

(mailchi.mp)

Measuring hardware overhang

hippkeAug 5, 2020, 7:59 PM

106 points

14 comments4 min readLW link

[AN #111]: The Circuits hypotheses for deep learning

Rohin ShahAug 5, 2020, 5:40 PM

23 points

0 comments9 min readLW link

(mailchi.mp)

Self-Fulfilling Prophecies Aren’t Always About Self-Awareness

John_MaxwellNov 18, 2019, 11:11 PM

14 points

7 comments4 min readLW link

The Goodhart Game

John_MaxwellNov 18, 2019, 11:22 PM

13 points

5 comments5 min readLW link

Why don’t singularitarians bet on the creation of AGI by buying stocks?

John_MaxwellMar 11, 2020, 4:27 PM

43 points

20 comments4 min readLW link

The Dualist Predict-O-Matic ($100 prize)

John_MaxwellOct 17, 2019, 6:45 AM

16 points

35 comments5 min readLW link

[Question] What AI safety problems need solving for safe AI research assistants?

John_MaxwellNov 5, 2019, 2:09 AM

14 points

13 comments1 min readLW link

Refining the Evolutionary Analogy to AI

lberglundAug 7, 2020, 11:13 PM

9 points

2 comments4 min readLW link

The Fusion Power Generator Scenario

johnswentworthAug 8, 2020, 6:31 PM

136 points

29 comments3 min readLW link

[Question] How much is known about the “inference rules” of logical induction?

Eigil RischelAug 8, 2020, 10:45 AM

11 points

7 comments1 min readLW link

If I were a well-intentioned AI… II: Acting in a world

Stuart_ArmstrongFeb 27, 2020, 11:58 AM

20 points

0 comments3 min readLW link

If I were a well-intentioned AI… III: Extremal Goodhart

Stuart_ArmstrongFeb 28, 2020, 11:24 AM

22 points

0 comments5 min readLW link

Towards a Formalisation of Logical Counterfactuals

BunthutAug 8, 2020, 10:14 PM

6 points

2 comments2 min readLW link

[Question] 10/50/90% chance of GPT-N Transformative AI?

human_generated_textAug 9, 2020, 12:10 AM

24 points

8 comments1 min readLW link

[Question] Can we expect more value from AI alignment than from an ASI with the goal of running alternate trajectories of our universe?

Maxime RichéAug 9, 2020, 5:17 PM

2 points

5 comments1 min readLW link

In defense of Oracle (“Tool”) AI research

Steven ByrnesAug 7, 2019, 7:14 PM

21 points

11 comments4 min readLW link

How GPT-N will escape from its AI-box

hippkeAug 12, 2020, 7:34 PM

7 points

9 comments1 min readLW link

Strong implication of preference uncertainty

Stuart_ArmstrongAug 12, 2020, 7:02 PM

20 points

3 comments2 min readLW link

[AN #112]: Engineering a Safer World

Rohin ShahAug 13, 2020, 5:20 PM

25 points

2 comments12 min readLW link

(mailchi.mp)

Room and Board for People Self-Learning ML or Doing Independent ML Research

SamuelKnocheAug 14, 2020, 5:19 PM

7 points

1 comment1 min readLW link

Talk and Q&A—Dan Hendrycks—Paper: Aligning AI With Shared Human Values. On Discord at Aug 28, 2020 8:00-10:00 AM GMT+8.

wassnameAug 14, 2020, 11:57 PM

1 point

0 comments1 min readLW link

Search versus design

Alex FlintAug 16, 2020, 4:53 PM

89 points

41 comments36 min readLW link 1 review

Work on Security Instead of Friendliness?

Wei DaiJul 21, 2012, 6:28 PM

51 points

107 comments2 min readLW link

Goal-Directedness: What Success Looks Like

adamShimiAug 16, 2020, 6:33 PM

9 points

0 comments2 min readLW link

[Question] A way to beat superrational/EDT agents?

Abhimanyu Pallavi SudhirAug 17, 2020, 2:33 PM

5 points

13 comments1 min readLW link

Learning human preferences: optimistic and pessimistic scenarios

Stuart_ArmstrongAug 18, 2020, 1:05 PM

27 points

6 comments6 min readLW link

Mesa-Search vs Mesa-Control

abramdemskiAug 18, 2020, 6:51 PM

54 points

45 comments7 min readLW link

Why we want unbiased learning processes

Stuart_ArmstrongFeb 20, 2018, 2:48 PM

13 points

3 comments3 min readLW link

Intuitive examples of reward function learning?

Stuart_ArmstrongMar 6, 2018, 4:54 PM

7 points

3 comments2 min readLW link

Open-Category Classification

TurnTroutMar 28, 2018, 2:49 PM

13 points

6 comments10 min readLW link

Looking for adversarial collaborators to test our Debate protocol

Beth BarnesAug 19, 2020, 3:15 AM

52 points

5 comments1 min readLW link

Walkthrough of ‘Formalizing Convergent Instrumental Goals’

TurnTroutFeb 26, 2018, 2:20 AM

10 points

2 comments10 min readLW link

Ambiguity Detection

TurnTroutMar 1, 2018, 4:23 AM

11 points

9 comments4 min readLW link

Penalizing Impact via Attainable Utility Preservation

TurnTroutDec 28, 2018, 9:46 PM

24 points

0 comments3 min readLW link

(arxiv.org)

What You See Isn’t Always What You Want

TurnTroutSep 13, 2019, 4:17 AM

30 points

12 comments3 min readLW link

[Question] Instrumental Occam?

abramdemskiJan 31, 2020, 7:27 PM

30 points

15 comments1 min readLW link

Compact vs. Wide Models

VaniverJul 16, 2018, 4:09 AM

31 points

5 comments3 min readLW link

Alex Irpan: “My AI Timelines Have Sped Up”

VaniverAug 19, 2020, 4:23 PM

43 points

20 comments1 min readLW link

(www.alexirpan.com)

[AN #113]: Checking the ethical intuitions of large language models

Rohin ShahAug 19, 2020, 5:10 PM

23 points

0 comments9 min readLW link

(mailchi.mp)

AI safety as featherless bipeds with broad flat nails

Stuart_ArmstrongAug 19, 2020, 10:22 AM

37 points

1 comment1 min readLW link

Time Magazine has an article about the Singularity...

RaemonFeb 11, 2011, 2:20 AM

40 points

13 comments1 min readLW link

How rapidly are GPUs improving in price performance?

gallabytesNov 25, 2018, 7:54 PM

31 points

9 comments1 min readLW link

(mediangroup.org)

Our values are underdefined, changeable, and manipulable

Stuart_ArmstrongNov 2, 2017, 11:09 AM

25 points

6 comments3 min readLW link

[Question] What funding sources exist for technical AI safety research?

johnswentworthOct 1, 2019, 3:30 PM

26 points

5 comments1 min readLW link

Humans can drive cars

ApprenticeJan 30, 2014, 11:55 AM

53 points

89 comments2 min readLW link

A Less Wrong singularity article?

Kaj_SotalaNov 17, 2009, 2:15 PM

31 points

215 comments1 min readLW link

The Bayesian Tyrant

abramdemskiAug 20, 2020, 12:08 AM

132 points

20 comments6 min readLW link 1 review

Concept Safety: Producing similar AI-human concept spaces

Kaj_SotalaApr 14, 2015, 8:39 PM

50 points

45 comments8 min readLW link

[LINK] What should a reasonable person believe about the Singularity?

Kaj_SotalaJan 13, 2011, 9:32 AM

38 points

14 comments2 min readLW link

The many ways AIs behave badly

Stuart_ArmstrongApr 24, 2018, 11:40 AM

10 points

3 comments2 min readLW link

July 2020 gwern.net newsletter

gwernAug 20, 2020, 4:39 PM

29 points

0 comments1 min readLW link

(www.gwern.net)

Do what we mean vs. do what we say

Rohin ShahAug 30, 2018, 10:03 PM

34 points

14 comments1 min readLW link

[Question] What’s a Decomposable Alignment Topic?

Logan RiggsAug 21, 2020, 10:57 PM

26 points

16 comments1 min readLW link

Tools versus agents

Stuart_ArmstrongMay 16, 2012, 1:00 PM

47 points

39 comments5 min readLW link

An unaligned benchmark

paulfchristianoNov 17, 2018, 3:51 PM

31 points

0 comments9 min readLW link

Following human norms

Rohin ShahJan 20, 2019, 11:59 PM

30 points

10 comments5 min readLW link

nostalgebraist: Recursive Goodhart’s Law

Kaj_SotalaAug 26, 2020, 11:07 AM

53 points

27 comments1 min readLW link

(nostalgebraist.tumblr.com)

[AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents

Rohin ShahAug 26, 2020, 5:20 PM

21 points

3 comments8 min readLW link

(mailchi.mp)

[Question] How hard would it be to change GPT-3 in a way that allows audio?

ChristianKlAug 28, 2020, 2:42 PM

8 points

5 comments1 min readLW link

Safe Scrambling?

HoagyAug 29, 2020, 2:31 PM

3 points

1 comment2 min readLW link

(Humor) AI Alignment Critical Failure Table

Kaj_SotalaAug 31, 2020, 7:51 PM

24 points

2 comments1 min readLW link

(sl4.org)

What is ambitious value learning?

Rohin ShahNov 1, 2018, 4:20 PM

49 points

28 comments2 min readLW link

The easy goal inference problem is still hard

paulfchristianoNov 3, 2018, 2:41 PM

50 points

19 comments4 min readLW link

[AN #115]: AI safety research problems in the AI-GA framework

Rohin ShahSep 2, 2020, 5:10 PM

19 points

16 comments6 min readLW link

(mailchi.mp)

Emotional valence vs RL reward: a video game analogy

Steven ByrnesSep 3, 2020, 3:28 PM

12 points

6 comments4 min readLW link

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

Logan Riggs and Gurkenglas

Sep 3, 2020, 6:27 PM

67 points

12 comments2 min readLW link

“Learning to Summarize with Human Feedback”—OpenAI

[deleted]Sep 7, 2020, 5:59 PM

57 points

3 comments1 min readLW link

[AN #116]: How to make explanations of neurons compositional

Rohin ShahSep 9, 2020, 5:20 PM

21 points

2 comments9 min readLW link

(mailchi.mp)

Safer sandboxing via collective separation

Richard_NgoSep 9, 2020, 7:49 PM

24 points

6 comments4 min readLW link

[Question] Do mesa-optimizer risk arguments rely on the train-test paradigm?

Ben CottierSep 10, 2020, 3:36 PM

12 points

7 comments1 min readLW link

Safety via selection for obedience

Richard_NgoSep 10, 2020, 10:04 AM

31 points

1 comment5 min readLW link

How Much Computational Power Does It Take to Match the Human Brain?

habrykaSep 12, 2020, 6:38 AM

44 points

1 comment1 min readLW link

(www.openphilanthropy.org)

Decision Theory is multifaceted

Michele CampoloSep 13, 2020, 10:30 PM

7 points

12 comments8 min readLW link

AI Safety Discussion Day

Linda LinseforsSep 15, 2020, 2:40 PM

20 points

0 comments1 min readLW link

[AN #117]: How neural nets would fare under the TEVV framework

Rohin ShahSep 16, 2020, 5:20 PM

27 points

0 comments7 min readLW link

(mailchi.mp)

Applying the Counterfactual Prisoner’s Dilemma to Logical Uncertainty

Chris_LeongSep 16, 2020, 10:34 AM

9 points

5 comments2 min readLW link

Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem

Zack_M_DavisSep 17, 2020, 2:23 AM

72 points

12 comments5 min readLW link

(aima.cs.berkeley.edu)

The “Backchaining to Local Search” Technique in AI Alignment

adamShimiSep 18, 2020, 3:05 PM

28 points

1 comment2 min readLW link

Draft report on AI timelines

Ajeya CotraSep 18, 2020, 11:47 PM

207 points

56 comments1 min readLW link 1 review

Why GPT wants to mesa-optimize & how we might change this

John_MaxwellSep 19, 2020, 1:48 PM

55 points

32 comments9 min readLW link

My (Mis)Adventures With Algorithmic Machine Learning

AHartNtknSep 20, 2020, 5:31 AM

16 points

4 comments41 min readLW link

[Question] What AI companies would be most likely to have a positive long-term impact on the world as a result of investing in them?

MikkWSep 21, 2020, 11:41 PM

8 points

2 comments2 min readLW link

Anthropomorphisation vs value learning: type 1 vs type 2 errors

Stuart_ArmstrongSep 22, 2020, 10:46 AM

16 points

10 comments1 min readLW link

AI Advantages [Gems from the Wiki]

habryka and Kaj_Sotala

Sep 22, 2020, 10:44 PM

22 points

7 comments2 min readLW link

(www.lesswrong.com)

A long reply to Ben Garfinkel on Scrutinizing Classic AI Risk Arguments

Søren ElverlinSep 27, 2020, 5:51 PM

17 points

6 comments1 min readLW link

Dehumanisation errors

Stuart_ArmstrongSep 23, 2020, 9:51 AM

13 points

0 comments1 min readLW link

[AN #118]: Risks, solutions, and prioritization in a world with many AI systems

Rohin ShahSep 23, 2020, 6:20 PM

15 points

6 comments10 min readLW link

(mailchi.mp)

[Question] David Deutsch on Universal Explainers and AI

alanfSep 24, 2020, 7:50 AM

3 points

8 comments2 min readLW link

KL Divergence as Code Patching Efficiency

Past AccountSep 27, 2020, 4:06 PM

17 points

0 comments8 min readLW link

[Question] What to do with imitation humans, other than asking them what the right thing to do is?

Charlie SteinerSep 27, 2020, 9:51 PM

10 points

6 comments1 min readLW link

[Question] What Decision Theory is Implied By Predictive Processing?

johnswentworthSep 28, 2020, 5:20 PM

55 points

17 comments1 min readLW link

AGI safety from first principles: Superintelligence

Richard_NgoSep 28, 2020, 7:53 PM

80 points

6 comments9 min readLW link

AGI safety from first principles: Introduction

Richard_NgoSep 28, 2020, 7:53 PM

109 points

18 comments2 min readLW link 1 review

[Question] Examples of self-governance to reduce technology risk?

JiaSep 29, 2020, 7:31 PM

10 points

4 comments1 min readLW link

AGI safety from first principles: Goals and Agency

Richard_NgoSep 29, 2020, 7:06 PM

70 points

15 comments15 min readLW link

“Unsupervised” translation as an (intent) alignment problem

paulfchristianoSep 30, 2020, 12:50 AM

61 points

15 comments4 min readLW link

(ai-alignment.com)

[AN #119]: AI safety when agents are shaped by environments, not rewards

Rohin ShahSep 30, 2020, 5:10 PM

11 points

0 comments11 min readLW link

(mailchi.mp)

AGI safety from first principles: Control

Richard_NgoOct 2, 2020, 9:51 PM

61 points

4 comments9 min readLW link

AI race considerations in a report by the U.S. House Committee on Armed Services

NunoSempereOct 4, 2020, 12:11 PM

42 points

4 comments13 min readLW link

[Question] Is there any work on incorporating aleatoric uncertainty and/or inherent randomness into AIXI?

David Scott Krueger (formerly: capybaralet)Oct 4, 2020, 8:10 AM

9 points

7 comments1 min readLW link

AGI safety from first principles: Conclusion

Richard_NgoOct 4, 2020, 11:06 PM

65 points

4 comments3 min readLW link

Universal Eudaimonia

hg00Oct 5, 2020, 1:45 PM

19 points

6 comments2 min readLW link

The Alignment Problem: Machine Learning and Human Values

Rohin ShahOct 6, 2020, 5:41 PM

120 points

7 comments6 min readLW link 1 review

(www.amazon.com)

[AN #120]: Tracing the intellectual roots of AI and AI alignment

Rohin ShahOct 7, 2020, 5:10 PM

13 points

4 comments10 min readLW link

(mailchi.mp)

[Question] Brainstorming positive visions of AI

jungofthewonOct 7, 2020, 4:09 PM

52 points

25 comments1 min readLW link

[Question] How can an AI demonstrate purely through chat that it is an AI, and not a human?

hugh.mannOct 7, 2020, 5:53 PM

3 points

4 comments1 min readLW link

[Question] Why isn’t JS a popular language for deep learning?

Will ClarkOct 8, 2020, 2:36 PM

12 points

21 comments1 min readLW link

[Question] If GPT-6 is human-level AGI but costs $200 per page of output, what would happen?

Daniel KokotajloOct 9, 2020, 12:00 PM

28 points

30 comments1 min readLW link

[Question] Shouldn’t there be a Chinese translation of Human Compatible?

mako yassOct 9, 2020, 8:47 AM

18 points

13 comments1 min readLW link

Idealized Factored Cognition

Rafael HarthNov 30, 2020, 6:49 PM

34 points

6 comments11 min readLW link

[Question] Reviews of the book ‘The Alignment Problem’

Mati_RoyOct 11, 2020, 7:41 AM

8 points

3 comments1 min readLW link

[Question] Reviews of TV show NeXt (about AI safety)

Mati_RoyOct 11, 2020, 4:31 AM

25 points

4 comments1 min readLW link

The Achilles Heel Hypothesis for AI

scasperOct 13, 2020, 2:35 PM

20 points

6 comments1 min readLW link

Toy Problem: Detective Story Alignment

johnswentworthOct 13, 2020, 9:02 PM

34 points

4 comments2 min readLW link

[Question] Does anyone worry about A.I. forums like this where they reinforce each other’s biases/ are led by big tech?

misabella16Oct 13, 2020, 3:14 PM

4 points

3 comments1 min readLW link

[AN #121]: Forecasting transformative AI timelines using biological anchors

Rohin ShahOct 14, 2020, 5:20 PM

27 points

5 comments14 min readLW link

(mailchi.mp)

Gradient hacking

evhubOct 16, 2019, 12:53 AM

99 points

39 comments3 min readLW link 2 reviews

Impact measurement and value-neutrality verification

evhubOct 15, 2019, 12:06 AM

31 points

13 comments6 min readLW link

Outer alignment and imitative amplification

evhubJan 10, 2020, 12:26 AM

24 points

11 comments9 min readLW link

Safe exploration and corrigibility

evhubDec 28, 2019, 11:12 PM

17 points

4 comments4 min readLW link

[Question] What are some non-purely-sampling ways to do deep RL?

evhubDec 5, 2019, 12:09 AM

15 points

9 comments2 min readLW link

More variations on pseudo-alignment

evhubNov 4, 2019, 11:24 PM

26 points

8 comments3 min readLW link

Towards an empirical investigation of inner alignment

evhubSep 23, 2019, 8:43 PM

44 points

9 comments6 min readLW link

Are minimal circuits deceptive?

evhubSep 7, 2019, 6:11 PM

66 points

11 comments8 min readLW link

Concrete experiments in inner alignment

evhubSep 6, 2019, 10:16 PM

63 points

12 comments6 min readLW link

Towards a mechanistic understanding of corrigibility

evhubAug 22, 2019, 11:20 PM

44 points

26 comments6 min readLW link

A Concrete Proposal for Adversarial IDA

evhubMar 26, 2019, 7:50 PM

16 points

5 comments5 min readLW link

Nuances with ascription universality

evhubFeb 12, 2019, 11:38 PM

20 points

1 comment2 min readLW link

Box inversion hypothesis

Jan KulveitOct 20, 2020, 4:20 PM

59 points

4 comments3 min readLW link

[Question] Has anyone researched specification gaming with biological animals?

David Scott Krueger (formerly: capybaralet)Oct 21, 2020, 12:20 AM

9 points

3 comments1 min readLW link

Sunday October 25, 12:00PM (PT) — Scott Garrabrant on “Cartesian Frames”

Ben PaceOct 21, 2020, 3:27 AM

48 points

3 comments2 min readLW link

[Question] Could we use recommender systems to figure out human values?

Olga BabeevaOct 20, 2020, 9:35 PM

7 points

2 comments1 min readLW link

[Question] When was the term “AI alignment” coined?

David Scott Krueger (formerly: capybaralet)Oct 21, 2020, 6:27 PM

11 points

8 comments1 min readLW link

[AN #122]: Arguing for AGI-driven existential risk from first principles

Rohin ShahOct 21, 2020, 5:10 PM

28 points

0 comments9 min readLW link

(mailchi.mp)

[Question] What’s the difference between GAI and a government?

DirectedEvolutionOct 21, 2020, 11:04 PM

11 points

5 comments1 min readLW link

Moral AI: Options

ManfredJul 11, 2015, 9:46 PM

14 points

6 comments4 min readLW link

Can few-shot learning teach AI right from wrong?

Charlie SteinerJul 20, 2018, 7:45 AM

13 points

3 comments6 min readLW link

Some Comments on Stuart Armstrong’s “Research Agenda v0.9”

Charlie SteinerJul 8, 2019, 7:03 PM

21 points

12 comments4 min readLW link

The Artificial Intentional Stance

Charlie SteinerJul 27, 2019, 7:00 AM

12 points

0 comments4 min readLW link

What’s the dream for giving natural language commands to AI?

Charlie SteinerOct 8, 2019, 1:42 PM

8 points

8 comments7 min readLW link

Supervised learning of outputs in the brain

Steven ByrnesOct 26, 2020, 2:32 PM

27 points

9 comments10 min readLW link

Humans are stunningly rational and stunningly irrational

Stuart_ArmstrongOct 23, 2020, 2:13 PM

21 points

4 comments2 min readLW link

Reply to Jebari and Lundborg on Artificial Superintelligence

Richard_NgoOct 25, 2020, 1:50 PM

31 points

4 comments5 min readLW link

(thinkingcomplete.blogspot.com)

Additive Operations on Cartesian Frames

Scott GarrabrantOct 26, 2020, 3:12 PM

61 points

6 comments11 min readLW link

Security Mindset and Takeoff Speeds

DanielFilanOct 27, 2020, 3:20 AM

54 points

23 comments8 min readLW link

(danielfilan.com)

Biextensional Equivalence

Scott GarrabrantOct 28, 2020, 2:07 PM

43 points

13 comments10 min readLW link

Draft papers for REALab and Decoupled Approval on tampering

Jonathan Uesato and Ramana Kumar

Oct 28, 2020, 4:01 PM

47 points

2 comments1 min readLW link

[AN #123]: Inferring what is valuable in order to align recommender systems

Rohin ShahOct 28, 2020, 5:00 PM

20 points

1 comment8 min readLW link

(mailchi.mp)

“Scaling Laws for Autoregressive Generative Modeling”, Henighan et al 2020 {OA}

gwernOct 29, 2020, 1:45 AM

26 points

11 comments1 min readLW link

(arxiv.org)

Controllables and Observables, Revisited

Scott GarrabrantOct 29, 2020, 4:38 PM

34 points

5 comments8 min readLW link

AI risk hub in Singapore?

Daniel KokotajloOct 29, 2020, 11:45 AM

57 points

18 comments4 min readLW link

Functors and Coarse Worlds

Scott GarrabrantOct 30, 2020, 3:19 PM

50 points

4 comments8 min readLW link

[Question] Responses to Christiano on takeoff speeds?

Richard_NgoOct 30, 2020, 3:16 PM

29 points

8 comments1 min readLW link

/r/MLScaling: new subreddit for NN scaling research/discussion

gwernOct 30, 2020, 8:50 PM

20 points

0 comments1 min readLW link

(www.reddit.com)

“Inner Alignment Failures” Which Are Actually Outer Alignment Failures

johnswentworthOct 31, 2020, 8:18 PM

61 points

38 comments5 min readLW link

Automated intelligence is not AI

KatjaGraceNov 1, 2020, 11:30 PM

54 points

10 comments2 min readLW link

(meteuphoric.com)

Confucianism in AI Alignment

johnswentworthNov 2, 2020, 9:16 PM

33 points

28 comments6 min readLW link

[AN #124]: Provably safe exploration through shielding

Rohin ShahNov 4, 2020, 6:20 PM

13 points

0 comments9 min readLW link

(mailchi.mp)

Defining capability and alignment in gradient descent

Edouard HarrisNov 5, 2020, 2:36 PM

22 points

6 comments10 min readLW link

Sub-Sums and Sub-Tensors

Scott GarrabrantNov 5, 2020, 6:06 PM

34 points

4 comments8 min readLW link

Multiplicative Operations on Cartesian Frames

Scott GarrabrantNov 3, 2020, 7:27 PM

34 points

23 comments12 min readLW link

Subagents of Cartesian Frames

Scott GarrabrantNov 2, 2020, 10:02 PM

48 points

5 comments8 min readLW link

[Question] What considerations influence whether I have more influence over short or long timelines?

Daniel KokotajloNov 5, 2020, 7:56 PM

27 points

30 comments1 min readLW link

Additive and Multiplicative Subagents

Scott GarrabrantNov 6, 2020, 2:26 PM

20 points

7 comments12 min readLW link

Committing, Assuming, Externalizing, and Internalizing

Scott GarrabrantNov 9, 2020, 4:59 PM

31 points

25 comments10 min readLW link

Building AGI Using Language Models

leogaoNov 9, 2020, 4:33 PM

11 points

1 comment1 min readLW link

(leogao.dev)

Why You Should Care About Goal-Directedness

adamShimiNov 9, 2020, 12:48 PM

37 points

15 comments9 min readLW link

Clarifying inner alignment terminology

evhubNov 9, 2020, 8:40 PM

98 points

17 comments3 min readLW link 1 review

Eight Definitions of Observability

Scott GarrabrantNov 10, 2020, 11:37 PM

34 points

26 comments12 min readLW link

[AN #125]: Neural network scaling laws across multiple modalities

Rohin ShahNov 11, 2020, 6:20 PM

25 points

7 comments9 min readLW link

(mailchi.mp)

Time in Cartesian Frames

Scott GarrabrantNov 11, 2020, 8:25 PM

48 points

16 comments7 min readLW link

Learning Normativity: A Research Agenda

abramdemskiNov 11, 2020, 9:59 PM

76 points

18 comments19 min readLW link

[Question] Any work on honeypots (to detect treacherous turn attempts)?

David Scott Krueger (formerly: capybaralet)Nov 12, 2020, 5:41 AM

17 points

4 comments1 min readLW link

Misalignment and misuse: whose values are manifest?

KatjaGraceNov 13, 2020, 10:10 AM

42 points

7 comments2 min readLW link

(meteuphoric.com)

A Self-Embedded Probabilistic Model

johnswentworthNov 13, 2020, 8:36 PM

30 points

2 comments5 min readLW link

TU Darmstadt, Computer Science Master’s with a focus on Machine Learning

Master Programs ML/AINov 14, 2020, 3:50 PM

6 points

0 comments8 min readLW link

EPF Lausanne, ML related MSc programs

Master Programs ML/AINov 14, 2020, 3:51 PM

3 points

0 comments4 min readLW link

ETH Zurich, ML related MSc programs

Master Programs ML/AINov 14, 2020, 3:49 PM

3 points

0 comments10 min readLW link

University of Oxford, Master’s Statistical Science

Master Programs ML/AINov 14, 2020, 3:51 PM

3 points

0 comments3 min readLW link

University of Edinburgh, Master’s Artificial Intelligence

Master Programs ML/AINov 14, 2020, 3:49 PM

4 points

0 comments12 min readLW link

University of Amsterdam (UvA), Master’s Artificial Intelligence

Master Programs ML/AINov 14, 2020, 3:49 PM

16 points

6 comments21 min readLW link

University of Tübingen, Master’s Machine Learning

Master Programs ML/AINov 14, 2020, 3:50 PM

14 points

0 comments7 min readLW link

A guide to Iterated Amplification & Debate

Rafael HarthNov 15, 2020, 5:14 PM

68 points

10 comments15 min readLW link

Solomonoff Induction and Sleeping Beauty

ikeNov 17, 2020, 2:28 AM

7 points

0 comments2 min readLW link

The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables

johnswentworthNov 18, 2020, 5:47 PM

104 points

43 comments11 min readLW link 2 reviews

The ethics of AI for the Routledge Encyclopedia of Philosophy

Stuart_ArmstrongNov 18, 2020, 5:55 PM

45 points

8 comments1 min readLW link

Persuasion Tools: AI takeover without AGI or agency?

Daniel KokotajloNov 20, 2020, 4:54 PM

74 points

24 comments11 min readLW link 1 review

UDT might not pay a Counterfactual Mugger

winwonceNov 21, 2020, 11:27 PM

5 points

18 comments2 min readLW link

Changing the AI race payoff matrix

GurkenglasNov 22, 2020, 10:25 PM

7 points

2 comments1 min readLW link

Syntax, semantics, and symbol grounding, simplified

Stuart_ArmstrongNov 23, 2020, 4:12 PM

30 points

4 comments9 min readLW link

Commentary on AGI Safety from First Principles

Richard_NgoNov 23, 2020, 9:37 PM

80 points

4 comments54 min readLW link

[Question] Critiques of the Agent Foundations agenda?

JsevillamolNov 24, 2020, 4:11 PM

16 points

3 comments1 min readLW link

[Question] How should OpenAI communicate about the commercial performances of the GPT-3 API?

Maxime RichéNov 24, 2020, 8:34 AM

2 points

0 comments1 min readLW link

[AN #126]: Avoiding wireheading by decoupling action feedback from action effects

Rohin ShahNov 26, 2020, 11:20 PM

24 points

1 comment10 min readLW link

(mailchi.mp)

[Question] Is this a good way to bet on short timelines?

Daniel KokotajloNov 28, 2020, 12:51 PM

16 points

8 comments1 min readLW link

Preface to the Sequence on Factored Cognition

Rafael HarthNov 30, 2020, 6:49 PM

35 points

7 comments2 min readLW link

[Linkpost] AlphaFold: a solution to a 50-year-old grand challenge in biology

adamShimiNov 30, 2020, 5:33 PM

54 points

22 comments1 min readLW link

(deepmind.com)

What is “protein folding”? A brief explanation

jasoncrawfordDec 1, 2020, 2:46 AM

69 points

9 comments4 min readLW link

(rootsofprogress.org)

[Question] In a multipolar scenario, how do people expect systems to be trained to interact with systems developed by other labs?

JesseCliftonDec 1, 2020, 8:04 PM

11 points

6 comments1 min readLW link

[AN #127]: Rethinking agency: Cartesian frames as a formalization of ways to carve up the world into an agent and its environment

Rohin ShahDec 2, 2020, 6:20 PM

46 points

0 comments13 min readLW link

(mailchi.mp)

Beyond 175 billion parameters: Can we anticipate future GPT-X Capabilities?

bakztfutureDec 4, 2020, 11:42 PM

−1 points

1 comment2 min readLW link

Thoughts on Robin Hanson’s AI Impacts interview

Steven ByrnesNov 24, 2019, 1:40 AM

25 points

3 comments7 min readLW link

[RXN#7] Russian x-risks newsletter fall 2020

avturchinDec 5, 2020, 4:28 PM

12 points

0 comments3 min readLW link

The AI Safety Game (UPDATED)

Daniel KokotajloDec 5, 2020, 10:27 AM

44 points

9 comments3 min readLW link

Values Form a Shifting Landscape (and why you might care)

VojtaKovarikDec 5, 2020, 11:56 PM

28 points

6 comments4 min readLW link

AI Problems Shared by Non-AI Systems

VojtaKovarikDec 5, 2020, 10:15 PM

7 points

2 comments4 min readLW link

Chance that “AI safety basically [doesn’t need] to be solved, we’ll just solve it by default unless we’re completely completely careless”

Quinn, Aidan_Kierans, Morpheus and Nicholas Turner

Dec 8, 2020, 9:08 PM

27 points

0 comments5 min readLW link

Minimal Maps, Semi-Decisions, and Neural Representations

Past AccountDec 6, 2020, 3:15 PM

30 points

2 comments4 min readLW link

Launching the Forecasting AI Progress Tournament

TamayDec 7, 2020, 2:08 PM

20 points

0 comments1 min readLW link

(www.metaculus.com)

[AN #128]: Prioritizing research on AI existential safety based on its application to governance demands

Rohin ShahDec 9, 2020, 6:20 PM

16 points

2 comments10 min readLW link

(mailchi.mp)

Summary of AI Research Considerations for Human Existential Safety (ARCHES)

peterbarnettDec 9, 2020, 11:28 PM

10 points

0 comments13 min readLW link

Clarifying Factored Cognition

Rafael HarthDec 13, 2020, 8:02 PM

23 points

2 comments3 min readLW link

Homogeneity vs. heterogeneity in AI takeoff scenarios

evhubDec 16, 2020, 1:37 AM

95 points

48 comments4 min readLW link

LBIT Proofs 8: Propositions 53-58

DiffractorDec 16, 2020, 3:29 AM

7 points

0 comments18 min readLW link

LBIT Proofs 6: Propositions 39-47

DiffractorDec 16, 2020, 3:33 AM

7 points

0 comments23 min readLW link

LBIT Proofs 5: Propositions 29-38

DiffractorDec 16, 2020, 3:35 AM

7 points

0 comments21 min readLW link

LBIT Proofs 3: Propositions 19-22

DiffractorDec 16, 2020, 3:40 AM

7 points

0 comments17 min readLW link

LBIT Proofs 2: Propositions 10-18

DiffractorDec 16, 2020, 3:45 AM

7 points

0 comments20 min readLW link

LBIT Proofs 1: Propositions 1-9

DiffractorDec 16, 2020, 3:48 AM

7 points

0 comments25 min readLW link

LBIT Proofs 4: Propositions 22-28

DiffractorDec 16, 2020, 3:38 AM

7 points

0 comments17 min readLW link

LBIT Proofs 7: Propositions 48-52

DiffractorDec 16, 2020, 3:31 AM

7 points

0 comments20 min readLW link

Less Basic Inframeasure Theory

DiffractorDec 16, 2020, 3:52 AM

22 points

1 comment61 min readLW link

[AN #129]: Explaining double descent by measuring bias and variance

Rohin ShahDec 16, 2020, 6:10 PM

14 points

1 comment7 min readLW link

(mailchi.mp)

Machine learning could be fundamentally unexplainable

George3d6Dec 16, 2020, 1:32 PM

26 points

15 comments15 min readLW link

(cerebralab.com)

Beta test GPT-3 based research assistant

jungofthewonDec 16, 2020, 1:42 PM

34 points

2 comments1 min readLW link

[Question] How long till Inverse AlphaFold?

Daniel KokotajloDec 17, 2020, 7:56 PM

41 points

18 comments1 min readLW link

Hierarchical planning: context agents

Charlie SteinerDec 19, 2020, 11:24 AM

21 points

6 comments9 min readLW link

[Question] Is there a community aligned with the idea of creating species of AGI systems for them to become our successors?

iamhefestoDec 20, 2020, 7:06 PM

−2 points

7 comments1 min readLW link

Intuition

Rafael HarthDec 20, 2020, 9:49 PM

26 points

1 comment6 min readLW link

2020 AI Alignment Literature Review and Charity Comparison

LarksDec 21, 2020, 3:27 PM

137 points

14 comments68 min readLW link

TAI Safety Bibliographic Database

JessRiedelDec 22, 2020, 5:42 PM

70 points

10 comments17 min readLW link

Announcing AXRP, the AI X-risk Research Podcast

DanielFilanDec 23, 2020, 8:00 PM

54 points

6 comments1 min readLW link

(danielfilan.com)

[AN #130]: A new AI x-risk podcast, and reviews of the field

Rohin ShahDec 24, 2020, 6:20 PM

8 points

0 comments7 min readLW link

(mailchi.mp)

Can we model technological singularity as the phase transition?

Valentin2026Dec 26, 2020, 3:20 AM

4 points

3 comments4 min readLW link

AGI Alignment Should Solve Corporate Alignment

magfrumpDec 27, 2020, 2:23 AM

19 points

6 comments6 min readLW link

Against GDP as a metric for timelines and takeoff speeds

Daniel KokotajloDec 29, 2020, 5:42 PM

131 points

16 comments14 min readLW link 1 review

AXRP Episode 3 - Negotiable Reinforcement Learning with Andrew Critch

DanielFilanDec 29, 2020, 8:45 PM

26 points

0 comments27 min readLW link

AXRP Episode 1 - Adversarial Policies with Adam Gleave

DanielFilanDec 29, 2020, 8:41 PM

12 points

5 comments33 min readLW link

AXRP Episode 2 - Learning Human Biases with Rohin Shah

DanielFilanDec 29, 2020, 8:43 PM

13 points

0 comments35 min readLW link

Dario Amodei leaves OpenAI

Daniel KokotajloDec 29, 2020, 7:31 PM

69 points

12 comments1 min readLW link

[Question] What Are Some Alternative Approaches to Understanding Agency/Intelligence?

intersticeDec 29, 2020, 11:21 PM

15 points

12 comments1 min readLW link

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Joar SkalseDec 29, 2020, 1:33 PM

67 points

58 comments1 min readLW link 1 review

Debate Minus Factored Cognition

abramdemskiDec 29, 2020, 10:59 PM

37 points

42 comments11 min readLW link

[AN #131]: Formalizing the argument of ignored attributes in a utility function

Rohin ShahDec 31, 2020, 6:20 PM

13 points

4 comments9 min readLW link

(mailchi.mp)

Reflections on Larks’ 2020 AI alignment literature review

Alex FlintJan 1, 2021, 10:53 PM

79 points

8 comments6 min readLW link

Mental subagent implications for AI Safety

moridinamaelJan 3, 2021, 6:59 PM

11 points

0 comments3 min readLW link

The National Defense Authorization Act Contains AI Provisions

ryan_bJan 5, 2021, 3:51 PM

30 points

24 comments1 min readLW link

The Pointers Problem: Clarifications/Variations

abramdemskiJan 5, 2021, 5:29 PM

50 points

14 comments18 min readLW link

[AN #132]: Complex and subtly incorrect arguments as an obstacle to debate

Rohin ShahJan 6, 2021, 6:20 PM

19 points

1 comment19 min readLW link

(mailchi.mp)

Out-of-body reasoning (OOBR)

Jon ZeroJan 9, 2021, 4:10 PM

5 points

0 comments4 min readLW link

Review of Soft Takeoff Can Still Lead to DSA

Daniel KokotajloJan 10, 2021, 6:10 PM

75 points

15 comments6 min readLW link

Review of ‘Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More’

TurnTroutJan 12, 2021, 3:57 AM

40 points

1 comment2 min readLW link

[AN #133]: Building machines that can cooperate (with humans, institutions, or other machines)

Rohin ShahJan 13, 2021, 6:10 PM

14 points

0 comments9 min readLW link

(mailchi.mp)

An Exploratory Toy AI Takeoff Model

niplavJan 13, 2021, 6:13 PM

10 points

3 comments12 min readLW link

Some recent survey papers on (mostly near-term) AI safety, security, and assurance

Aryeh EnglanderJan 13, 2021, 9:50 PM

11 points

0 comments3 min readLW link

Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment

Alex FlintJan 14, 2021, 12:58 PM

35 points

14 comments4 min readLW link

Why I’m excited about Debate

Richard_NgoJan 15, 2021, 11:37 PM

73 points

12 comments7 min readLW link

Excerpt from Arbital Solomonoff induction dialogue

Richard_NgoJan 17, 2021, 3:49 AM

36 points

6 comments5 min readLW link

(arbital.com)

Short summary of mAIry’s room

Stuart_ArmstrongJan 18, 2021, 6:11 PM

26 points

2 comments4 min readLW link

DALL-E does symbol grounding

p.b.Jan 17, 2021, 9:20 PM

6 points

0 comments1 min readLW link

Some thoughts on risks from narrow, non-agentic AI

Richard_NgoJan 19, 2021, 12:04 AM

35 points

21 comments16 min readLW link

Against the Backward Approach to Goal-Directedness

adamShimiJan 19, 2021, 6:46 PM

19 points

6 comments4 min readLW link

[AN #134]: Underspecification as a cause of fragility to distribution shift

Rohin ShahJan 21, 2021, 6:10 PM

13 points

0 comments7 min readLW link

(mailchi.mp)

Counterfactual control incentives

Stuart_ArmstrongJan 21, 2021, 4:54 PM

21 points

10 comments9 min readLW link

Policy restrictions and Secret keeping AI

Donald HobsonJan 24, 2021, 8:59 PM

6 points

3 comments3 min readLW link

FC final: Can Factored Cognition schemes scale?

Rafael HarthJan 24, 2021, 10:18 PM

15 points

0 comments17 min readLW link

[AN #135]: Five properties of goal-directed systems

Rohin ShahJan 27, 2021, 6:10 PM

33 points

0 comments8 min readLW link

(mailchi.mp)

AMA on EA Forum: Ajeya Cotra, researcher at Open Phil

Ajeya CotraJan 29, 2021, 11:05 PM

23 points

0 comments1 min readLW link

(forum.effectivealtruism.org)

Play with neural net

KatjaGraceJan 30, 2021, 10:50 AM

17 points

0 comments1 min readLW link

(worldspiritsockpuppet.com)

A Critique of Non-Obstruction

Joe CollmanFeb 3, 2021, 8:45 AM

13 points

10 comments4 min readLW link

Distinguishing claims about training vs deployment

Richard_NgoFeb 3, 2021, 11:30 AM

61 points

30 comments9 min readLW link

Graphical World Models, Counterfactuals, and Machine Learning Agents

Koen.HoltmanFeb 17, 2021, 11:07 AM

6 points

2 comments10 min readLW link

OpenAI: “Scaling Laws for Transfer”, Hernandez et al.

Lukas FinnvedenFeb 4, 2021, 12:49 PM

13 points

3 comments1 min readLW link

(arxiv.org)

Evolutions Building Evolutions: Layers of Generate and Test

plexFeb 5, 2021, 6:21 PM

11 points

1 comment6 min readLW link

Epistemology of HCH

adamShimiFeb 9, 2021, 11:46 AM

16 points

2 comments10 min readLW link

[Question] Mathematical Models of Progress?

abramdemskiFeb 16, 2021, 12:21 AM

28 points

8 comments2 min readLW link

[Question] Suggestions of posts on the AF to review

adamShimiFeb 16, 2021, 12:40 PM

56 points

20 comments1 min readLW link

Disentangling Corrigibility: 2015-2021

Koen.HoltmanFeb 16, 2021, 6:01 PM

17 points

20 comments9 min readLW link

Cartesian frames as generalised models

Stuart_ArmstrongFeb 16, 2021, 4:09 PM

20 points

0 comments5 min readLW link

[AN #138]: Why AI governance should find problems rather than just solving them

Rohin ShahFeb 17, 2021, 6:50 PM

12 points

0 comments9 min readLW link

(mailchi.mp)

Safely controlling the AGI agent reward function

Koen.HoltmanFeb 17, 2021, 2:47 PM

7 points

0 comments5 min readLW link

AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger

DanielFilanFeb 18, 2021, 12:03 AM

41 points

10 comments86 min readLW link

Utility Maximization = Description Length Minimization

johnswentworthFeb 18, 2021, 6:04 PM

183 points

40 comments5 min readLW link

Google’s Ethical AI team and AI Safety

magfrumpFeb 20, 2021, 9:42 AM

12 points

16 comments7 min readLW link

AI Safety Beginners Meetup (European Time)

Linda LinseforsFeb 20, 2021, 1:20 PM

8 points

2 comments1 min readLW link

Minimal Map Constraints

Past AccountFeb 21, 2021, 5:49 PM

6 points

0 comments3 min readLW link

[AN #139]: How the simplicity of reality explains the success of neural nets

Rohin ShahFeb 24, 2021, 6:30 PM

26 points

6 comments12 min readLW link

(mailchi.mp)

My Thoughts on the Apperception Engine

J BostockFeb 25, 2021, 7:43 PM

4 points

1 comment3 min readLW link

The Case for Privacy Optimism

bmgarfinkelMar 10, 2020, 8:30 PM

43 points

1 comment32 min readLW link

(benmgarfinkel.wordpress.com)

[Question] How might cryptocurrencies affect AGI timelines?

Dawn DrescherFeb 28, 2021, 7:16 PM

13 points

40 comments2 min readLW link

Fun with +12 OOMs of Compute

Daniel KokotajloMar 1, 2021, 1:30 PM

212 points

78 comments12 min readLW link 1 review

Links for Feb 2021

ikeMar 1, 2021, 5:13 AM

6 points

0 comments6 min readLW link

(misinfounderload.substack.com)

Introduction to Reinforcement Learning

Dr. BirdbrainFeb 28, 2021, 11:03 PM

4 points

1 comment3 min readLW link

Curiosity about Aligning Values

esweetMar 3, 2021, 12:22 AM

3 points

7 comments1 min readLW link

How does bee learning compare with machine learning?

eleniMar 4, 2021, 1:59 AM

62 points

15 comments24 min readLW link

Some recent interviews with AI/math luminaries.

fowlertmMar 4, 2021, 1:26 AM

2 points

0 comments1 min readLW link

A Semitechnical Introductory Dialogue on Solomonoff Induction

Eliezer YudkowskyMar 4, 2021, 5:27 PM

127 points

34 comments54 min readLW link

Connecting the good regulator theorem with semantics and symbol grounding

Stuart_ArmstrongMar 4, 2021, 2:35 PM

11 points

0 comments2 min readLW link

[AN #140]: Theoretical models that predict scaling laws

Rohin ShahMar 4, 2021, 6:10 PM

45 points

0 comments10 min readLW link

(mailchi.mp)

Takeaways from the Intelligence Rising RPG

Quinn and Viktor Rehnberg

Mar 5, 2021, 10:27 AM

50 points

8 comments12 min readLW link

GPT-3 and the future of knowledge work

fowlertmMar 5, 2021, 5:40 PM

16 points

0 comments2 min readLW link

The case for aligning narrowly superhuman models

Ajeya CotraMar 5, 2021, 10:29 PM

187 points

74 comments38 min readLW link

MIRI comments on Cotra’s “Case for Aligning Narrowly Superhuman Models”

Rob BensingerMar 5, 2021, 11:43 PM

136 points

13 comments26 min readLW link

[Question] What are the biggest current impacts of AI?

Sam ClarkeMar 7, 2021, 9:44 PM

15 points

5 comments1 min readLW link

CLR’s recent work on multi-agent systems

JesseCliftonMar 9, 2021, 2:28 AM

54 points

1 comment13 min readLW link

De-confusing myself about Pascal’s Mugging and Newcomb’s Problem

DirectedEvolutionMar 9, 2021, 8:45 PM

7 points

1 comment3 min readLW link

Open Problems with Myopia

Mark Xu and evhub

Mar 10, 2021, 6:38 PM

57 points

16 comments8 min readLW link

[AN #141]: The case for practicing alignment work on GPT-3 and other large models

Rohin ShahMar 10, 2021, 6:30 PM

27 points

4 comments8 min readLW link

(mailchi.mp)

[Link] Whittlestone et al., The Societal Implications of Deep Reinforcement Learning

Aryeh EnglanderMar 10, 2021, 6:13 PM

11 points

1 comment1 min readLW link

(jair.org)

Four Motivations for Learning Normativity

abramdemskiMar 11, 2021, 8:13 PM

42 points

7 comments5 min readLW link

[Question] What’s a good way to test basic machine learning code?

KennyMar 11, 2021, 9:27 PM

5 points

9 comments1 min readLW link

[Video] Intelligence and Stupidity: The Orthogonality Thesis

plexMar 13, 2021, 12:32 AM

5 points

1 comment1 min readLW link

(www.youtube.com)

AI x-risk reduction: why I chose academia over industry

David Scott Krueger (formerly: capybaralet)Mar 14, 2021, 5:25 PM

56 points

14 comments3 min readLW link

[Question] Partial-Consciousness as semantic/symbolic representational language model trained on NN

Joe KwonMar 16, 2021, 6:51 PM

2 points

3 comments1 min readLW link

[AN #142]: The quest to understand a network well enough to reimplement it by hand

Rohin ShahMar 17, 2021, 5:10 PM

34 points

4 comments8 min readLW link

(mailchi.mp)

Intermittent Distillations #1

Mark XuMar 17, 2021, 5:15 AM

25 points

1 comment10 min readLW link

HCH Speculation Post #2A

Charlie SteinerMar 17, 2021, 1:26 PM

42 points

7 comments9 min readLW link

The Age of Imaginative Machines

Yuli_BanMar 18, 2021, 12:35 AM

10 points

1 comment11 min readLW link

Generalizing POWER to multi-agent games

midco and TurnTrout

Mar 22, 2021, 2:41 AM

52 points

17 comments7 min readLW link

My research methodology

paulfchristianoMar 22, 2021, 9:20 PM

148 points

36 comments16 min readLW link

(ai-alignment.com)

“Infra-Bayesianism with Vanessa Kosoy” – Watch/Discuss Party

Ben PaceMar 22, 2021, 11:44 PM

27 points

45 comments1 min readLW link

Preferences and biases, the information argument

Stuart_ArmstrongMar 23, 2021, 12:44 PM

14 points

5 comments1 min readLW link

[AN #143]: How to make embedded agents that reason probabilistically about their environments

Rohin ShahMar 24, 2021, 5:20 PM

13 points

3 comments8 min readLW link

(mailchi.mp)

Toy model of preference, bias, and extra information

Stuart_ArmstrongMar 24, 2021, 10:14 AM

9 points

0 comments4 min readLW link

On language modeling and future abstract reasoning research

alexlyzhovMar 25, 2021, 5:43 PM

3 points

1 comment1 min readLW link

(docs.google.com)

Inframeasures and Domain Theory

DiffractorMar 28, 2021, 9:19 AM

27 points

3 comments33 min readLW link

Infra-Domain Proofs 2

DiffractorMar 28, 2021, 9:15 AM

13 points

0 comments21 min readLW link

Infra-Domain proofs 1

DiffractorMar 28, 2021, 9:16 AM

13 points

0 comments23 min readLW link

Scenarios and Warning Signs for Ajeya’s Aggressive, Conservative, and Best Guess AI Timelines

Kevin LiuMar 29, 2021, 1:38 AM

25 points

1 comment9 min readLW link

(kliu.io)

[Question] How do we prepare for final crunch time?

Eli TyreMar 30, 2021, 5:47 AM

116 points

30 comments8 min readLW link 1 review

[Question] TAI?

Logan ZoellnerMar 30, 2021, 12:41 PM

12 points

8 comments1 min readLW link

A use for Classical AI—Expert Systems

GlpusnaMar 31, 2021, 2:37 AM

1 point

2 comments2 min readLW link

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Andrew_CritchMar 31, 2021, 11:50 PM

203 points

60 comments22 min readLW link

AI and the Probability of Conflict

tonyoconnorApr 1, 2021, 7:00 AM

8 points

10 comments8 min readLW link

“AI and Compute” trend isn’t predictive of what is happening

alexlyzhovApr 2, 2021, 12:44 AM

133 points

15 comments1 min readLW link

[AN #144]: How language models can also be finetuned for non-language tasks

Rohin ShahApr 2, 2021, 5:20 PM

19 points

0 comments6 min readLW link

(mailchi.mp)

2012 Robin Hanson comment on “Intelligence Explosion: Evidence and Import”

Rob BensingerApr 2, 2021, 4:26 PM

28 points

4 comments3 min readLW link

My take on Michael Littman on “The HCI of HAI”

Alex FlintApr 2, 2021, 7:51 PM

59 points

4 comments7 min readLW link

[Question] How do scaling laws work for fine-tuning?

Daniel KokotajloApr 4, 2021, 12:18 PM

24 points

10 comments1 min readLW link

Averting suffering with sentience throttlers (proposal)

QuinnApr 5, 2021, 10:54 AM

8 points

7 comments3 min readLW link

Reflective Bayesianism

abramdemskiApr 6, 2021, 7:48 PM

58 points

27 comments13 min readLW link

[Question] What will GPT-4 be incapable of?

Michaël TrazziApr 6, 2021, 7:57 PM

34 points

32 comments1 min readLW link

I Trained a Neural Network to Play Helltaker

lsusrApr 7, 2021, 8:24 AM

29 points

5 comments3 min readLW link

[AN #145]: Our three year anniversary!

Rohin ShahApr 9, 2021, 5:48 PM

19 points

0 comments8 min readLW link

(mailchi.mp)

Alignment Newsletter Three Year Retrospective

Rohin ShahApr 7, 2021, 2:39 PM

55 points

0 comments5 min readLW link

Which counterfactuals should an AI follow?

Stuart_ArmstrongApr 7, 2021, 4:47 PM

19 points

5 comments7 min readLW link

Solving the whole AGI control problem, version 0.0001

Steven ByrnesApr 8, 2021, 3:14 PM

60 points

7 comments26 min readLW link

The Japanese Quiz: a Thought Experiment of Statistical Epistemology

DanBApr 8, 2021, 5:37 PM

11 points

0 comments9 min readLW link

A possible preference algorithm

Stuart_ArmstrongApr 8, 2021, 6:25 PM

22 points

0 comments4 min readLW link

If you don’t design for extrapolation, you’ll extrapolate poorly—possibly fatally

Stuart_ArmstrongApr 8, 2021, 6:10 PM

17 points

0 comments4 min readLW link

AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

DanielFilanApr 8, 2021, 9:20 PM

24 points

3 comments59 min readLW link

My Current Take on Counterfactuals

abramdemskiApr 9, 2021, 5:51 PM

53 points

57 comments25 min readLW link

Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

lifelonglearner and Peter Hase

Apr 9, 2021, 7:19 PM

139 points

16 comments102 min readLW link

Why unriggable almost implies uninfluenceable

Stuart_ArmstrongApr 9, 2021, 5:07 PM

11 points

0 comments4 min readLW link

Intermittent Distillations #2

Mark XuApr 14, 2021, 6:47 AM

32 points

4 comments9 min readLW link

Test Cases for Impact Regularisation Methods

DanielFilanFeb 6, 2019, 9:50 PM

58 points

5 comments12 min readLW link

(danielfilan.com)

Superrational Agents Kelly Bet Influence!

abramdemskiApr 16, 2021, 10:08 PM

41 points

5 comments5 min readLW link

Defining “optimizer”

ChantielApr 17, 2021, 3:38 PM

9 points

6 comments1 min readLW link

Alex Flint on “A software engineer’s perspective on logical induction”

RaemonApr 17, 2021, 6:56 AM

21 points

8 comments1 min readLW link

[Question] Parameter count of ML systems through time?

JsevillamolApr 19, 2021, 12:54 PM

31 points

4 comments1 min readLW link

Gradations of Inner Alignment Obstacles

abramdemskiApr 20, 2021, 10:18 PM

80 points

22 comments9 min readLW link

Where are intentions to be found?

Alex FlintApr 21, 2021, 12:51 AM

44 points

12 comments9 min readLW link

[AN #147]: An overview of the interpretability landscape

Rohin ShahApr 21, 2021, 5:10 PM

14 points

2 comments7 min readLW link

(mailchi.mp)

NTK/GP Models of Neural Nets Can’t Learn Features

intersticeApr 22, 2021, 3:01 AM

31 points

33 comments3 min readLW link

[Question] Is there anything that can stop AGI development in the near term?

Wulky WilkinsenApr 22, 2021, 8:37 PM

5 points

5 comments1 min readLW link

Probability theory and logical induction as lenses

Alex FlintApr 23, 2021, 2:41 AM

43 points

7 comments6 min readLW link

Naturalism and AI alignment

Michele CampoloApr 24, 2021, 4:16 PM

5 points

12 comments8 min readLW link

Malicious non-state actors and AI safety

ketiApr 25, 2021, 3:21 AM

2 points

13 comments2 min readLW link

Announcing the Alignment Research Center

paulfchristianoApr 26, 2021, 11:30 PM

177 points

6 comments1 min readLW link

(ai-alignment.com)

[Linkpost] Treacherous turns in the wild

Mark XuApr 26, 2021, 10:51 PM

31 points

6 comments1 min readLW link

(lukemuehlhauser.com)

FAQ: Advice for AI Alignment Researchers

Rohin ShahApr 26, 2021, 6:59 PM

67 points

2 comments1 min readLW link

(rohinshah.com)

Pitfalls of the agent model

Alex FlintApr 27, 2021, 10:19 PM

19 points

4 comments20 min readLW link

[AN #148]: Analyzing generalization across more axes than just accuracy or loss

Rohin ShahApr 28, 2021, 6:30 PM

24 points

5 comments11 min readLW link

(mailchi.mp)

AMA: Paul Christiano, alignment researcher

paulfchristianoApr 28, 2021, 6:55 PM

117 points

198 comments1 min readLW link

25 Min Talk on MetaEthical.AI with Questions from Stuart Armstrong

June KuApr 29, 2021, 3:38 PM

21 points

7 comments1 min readLW link

Low-stakes alignment

paulfchristianoApr 30, 2021, 12:10 AM

70 points

9 comments7 min readLW link 1 review

(ai-alignment.com)

[Weekly Event] Alignment Researcher Coffee Time (in Walled Garden)

adamShimiMay 2, 2021, 12:59 PM

37 points

0 comments1 min readLW link

Parsing Abram on Gradations of Inner Alignment Obstacles

Alex FlintMay 4, 2021, 5:44 PM

22 points

4 comments6 min readLW link

Mundane solutions to exotic problems

paulfchristianoMay 4, 2021, 6:20 PM

56 points

8 comments5 min readLW link

(ai-alignment.com)

April 15, 2040

NisanMay 4, 2021, 9:18 PM

97 points

19 comments2 min readLW link

[AN #149]: The newsletter’s editorial policy

Rohin ShahMay 5, 2021, 5:10 PM

19 points

3 comments8 min readLW link

(mailchi.mp)

Parsing Chris Mingard on Neural Networks

Alex FlintMay 6, 2021, 10:16 PM

67 points

27 comments6 min readLW link

Life and expanding steerable consequences

Alex FlintMay 7, 2021, 6:33 PM

46 points

3 comments4 min readLW link

Domain Theory and the Prisoner’s Dilemma: FairBot

GurkenglasMay 7, 2021, 7:33 AM

14 points

5 comments2 min readLW link

Pre-Training + Fine-Tuning Favors Deception

Mark XuMay 8, 2021, 6:36 PM

27 points

2 comments3 min readLW link

[Event] Weekly Alignment Research Coffee Time (05/10)

adamShimiMay 9, 2021, 11:05 AM

16 points

2 comments1 min readLW link

[Question] Is driving worth the risk?

Adam ZernerMay 11, 2021, 5:04 AM

26 points

29 comments7 min readLW link

Yampolskiy on AI Risk Skepticism

Gordon Seidoh WorleyMay 11, 2021, 2:50 PM

15 points

5 comments1 min readLW link

(www.researchgate.net)

Human priors, features and models, languages, and Solomonoff induction

Stuart_ArmstrongMay 10, 2021, 10:55 AM

16 points

2 comments4 min readLW link

[AN #150]: The subtypes of Cooperative AI research

Rohin ShahMay 12, 2021, 5:20 PM

15 points

0 comments6 min readLW link

(mailchi.mp)

Understanding the Lottery Ticket Hypothesis

Alex FlintMay 14, 2021, 12:25 AM

50 points

9 comments8 min readLW link

Concerning not getting lost

Alex FlintMay 14, 2021, 7:38 PM

50 points

9 comments4 min readLW link

[Event] Weekly Alignment Research Coffee Time (05/17)

adamShimiMay 15, 2021, 10:07 PM

7 points

0 comments1 min readLW link

Optimizers: To Define or not to Define

J BostockMay 16, 2021, 7:55 PM

4 points

0 comments4 min readLW link

Intermittent Distillations #3

Mark XuMay 15, 2021, 7:13 AM

19 points

1 comment11 min readLW link

AXRP Episode 7 - Side Effects with Victoria Krakovna

DanielFilanMay 14, 2021, 3:50 AM

34 points

6 comments43 min readLW link

Saving Time

Scott GarrabrantMay 18, 2021, 8:11 PM

131 points

19 comments4 min readLW link

[Question] Are there any methods for NNs or other ML systems to get information from knockout-like or assay-like experiments?

J BostockMay 18, 2021, 9:33 PM

2 points

1 comment1 min readLW link

SGD’s Bias

johnswentworthMay 18, 2021, 11:19 PM

60 points

16 comments3 min readLW link

This Sunday, 12PM PT: Scott Garrabrant on “Finite Factored Sets”

RaemonMay 19, 2021, 1:48 AM

33 points

4 comments1 min readLW link

[AN #151]: How sparsity in the final layer makes a neural net debuggable

Rohin ShahMay 19, 2021, 5:20 PM

19 points

0 comments6 min readLW link

(mailchi.mp)

The Variational Characterization of KL-Divergence, Error Catastrophes, and Generalization

Past AccountMay 20, 2021, 8:57 PM

38 points

5 comments3 min readLW link

Oracles, Informers, and Controllers

ozziegooenMay 25, 2021, 2:16 PM

15 points

2 comments3 min readLW link

Knowledge is not just map/territory resemblance

Alex FlintMay 25, 2021, 5:58 PM

28 points

4 comments3 min readLW link

MDP models are determined by the agent architecture and the environmental dynamics

TurnTroutMay 26, 2021, 12:14 AM

23 points

34 comments3 min readLW link

[Question] List of good AI safety project ideas?

Aryeh EnglanderMay 26, 2021, 10:36 PM

24 points

8 comments1 min readLW link

AXRP Episode 7.5 - Forecasting Transformative AI from Biological Anchors with Ajeya Cotra

DanielFilanMay 28, 2021, 12:20 AM

24 points

1 comment67 min readLW link

Predict responses to the “existential risk from AI” survey

Rob BensingerMay 28, 2021, 1:32 AM

44 points

6 comments2 min readLW link

Teaching ML to answer questions honestly instead of predicting human answers

paulfchristianoMay 28, 2021, 5:30 PM

53 points

18 comments16 min readLW link

(ai-alignment.com)

The blue-minimising robot and model splintering

Stuart_ArmstrongMay 28, 2021, 3:09 PM

13 points

4 comments3 min readLW link 1 review

[Question] Use of GPT-3 for identifying Phishing and other email based attacks?

jmhMay 29, 2021, 5:11 PM

6 points

0 comments1 min readLW link

[Event] Weekly Alignment Research Coffee Time

adamShimiMay 29, 2021, 1:26 PM

12 points

5 comments1 min readLW link

What is the most effective way to donate to AGI XRisk mitigation?

JoshuaFoxMay 30, 2021, 11:08 AM

44 points

11 comments1 min readLW link

“Existential risk from AI” survey results

Rob BensingerJun 1, 2021, 8:02 PM

56 points

8 comments11 min readLW link

April 2021 Gwern.net newsletter

gwernJun 3, 2021, 3:13 PM

20 points

0 comments1 min readLW link

(www.gwern.net)

The underlying model of a morphism

Stuart_ArmstrongJun 4, 2021, 10:29 PM

10 points

0 comments5 min readLW link

We need a standard set of community advice for how to financially prepare for AGI

GeneSmithJun 7, 2021, 7:24 AM

50 points

53 comments5 min readLW link

Some AI Governance Research Ideas

apc and markusanderljung

Jun 7, 2021, 2:40 PM

29 points

2 comments2 min readLW link

Big picture of phasic dopamine

Steven ByrnesJun 8, 2021, 1:07 PM

59 points

18 comments36 min readLW link

Bayeswatch 6: Mechwarrior

lsusrJun 7, 2021, 8:20 PM

47 points

8 comments2 min readLW link

Speculations against GPT-n writing alignment papers

Donald HobsonJun 7, 2021, 9:13 PM

31 points

6 comments2 min readLW link

The reverse Goodhart problem

Stuart_ArmstrongJun 8, 2021, 3:48 PM

16 points

22 comments1 min readLW link

Against intelligence

George3d6Jun 8, 2021, 1:03 PM

12 points

17 comments10 min readLW link

(cerebralab.com)

Dangerous optimisation includes variance minimisation

Stuart_ArmstrongJun 8, 2021, 11:34 AM

32 points

5 comments2 min readLW link

Survey on AI existential risk scenarios

Sam Clarke, apc and Jonas Schuett

Jun 8, 2021, 5:12 PM

60 points

11 comments7 min readLW link

AXRP Episode 8 - Assistance Games with Dylan Hadfield-Menell

DanielFilanJun 8, 2021, 11:20 PM

22 points

1 comment71 min readLW link

“Decision Transformer” (Tool AIs are secret Agent AIs)

gwernJun 9, 2021, 1:06 AM

37 points

4 comments1 min readLW link

(sites.google.com)

Evan Hubinger on Homogeneity in Takeoff Speeds, Learned Optimization and Interpretability

Michaël TrazziJun 8, 2021, 7:20 PM

28 points

0 comments55 min readLW link

A naive alignment strategy and optimism about generalization

paulfchristianoJun 10, 2021, 12:10 AM

44 points

4 comments3 min readLW link

(ai-alignment.com)

Knowledge is not just mutual information

Alex FlintJun 10, 2021, 1:01 AM

27 points

6 comments4 min readLW link

The Apprentice Experiment

johnswentworthJun 10, 2021, 3:29 AM

148 points

11 comments4 min readLW link

[Question] ML is now automating parts of chip R&D. How big a deal is this?

Daniel KokotajloJun 10, 2021, 9:51 AM

45 points

17 comments1 min readLW link

Oh No My AI (Filk)

Gordon Seidoh WorleyJun 11, 2021, 3:05 PM

42 points

7 comments1 min readLW link

May 2021 Gwern.net newsletter

gwernJun 11, 2021, 2:13 PM

31 points

0 comments1 min readLW link

(www.gwern.net)

[Question] What other problems would a successful AI safety algorithm solve?

DirectedEvolutionJun 13, 2021, 9:07 PM

12 points

4 comments1 min readLW link

Avoiding the instrumental policy by hiding information about humans

paulfchristianoJun 13, 2021, 8:00 PM

31 points

2 comments2 min readLW link

Answering questions honestly given world-model mismatches

paulfchristianoJun 13, 2021, 6:00 PM

34 points

2 comments16 min readLW link

(ai-alignment.com)

Vignettes Workshop (AI Impacts)

Daniel KokotajloJun 15, 2021, 12:05 PM

47 points

3 comments1 min readLW link

Three Paths to Existential Risk from AI

harsimonyJun 16, 2021, 1:37 AM

1 point

2 comments1 min readLW link

(harsimony.wordpress.com)

[AN #152]: How we’ve overestimated few-shot learning capabilities

Rohin ShahJun 16, 2021, 5:20 PM

22 points

6 comments8 min readLW link

(mailchi.mp)

AI-Based Code Generation Using GPT-J-6B

Tomás B.Jun 16, 2021, 3:05 PM

21 points

15 comments1 min readLW link

(minimaxir.com)

Insufficient Values

Jozdien, Jacob Abraham and Abraham Francis

Jun 16, 2021, 2:33 PM

29 points

15 comments5 min readLW link

[Question] Pros and cons of working on near-term technical AI safety and assurance

Aryeh EnglanderJun 17, 2021, 8:17 PM

11 points

1 comment2 min readLW link

Non-poisonous cake: anthropic updates are normal

Stuart_ArmstrongJun 18, 2021, 2:51 PM

27 points

11 comments2 min readLW link

Knowledge is not just precipitation of action

Alex FlintJun 18, 2021, 11:26 PM

21 points

6 comments7 min readLW link

I’m no longer sure that I buy dutch book arguments and this makes me skeptical of the “utility function” abstraction

Eli TyreJun 22, 2021, 3:53 AM

45 points

29 comments4 min readLW link

Frequent arguments about alignment

John SchulmanJun 23, 2021, 12:46 AM

95 points

16 comments5 min readLW link

Empirical Observations of Objective Robustness Failures

jbkjr and Lauro Langosco

Jun 23, 2021, 11:23 PM

63 points

5 comments9 min readLW link

[AN #153]: Experiments that demonstrate failures of objective robustness

Rohin ShahJun 26, 2021, 5:10 PM

25 points

1 comment8 min readLW link

(mailchi.mp)

Anthropics and Embedded Agency

dadadarrenJun 26, 2021, 1:45 AM

7 points

2 comments2 min readLW link

Deep limitations? Examining expert disagreement over deep learning

Richard_NgoJun 27, 2021, 12:55 AM

17 points

5 comments1 min readLW link

(link.springer.com)

Finite Factored Sets: LW transcript with running commentary

Rob Bensinger and Scott Garrabrant

Jun 27, 2021, 4:02 PM

30 points

0 comments51 min readLW link

Brute force searching for alignment

Donald HobsonJun 27, 2021, 9:54 PM

23 points

3 comments2 min readLW link

How teams went about their research at AI Safety Camp edition 5

RemmeltJun 28, 2021, 3:15 PM

24 points

0 comments6 min readLW link

Search by abstraction

p.b.Jun 29, 2021, 8:56 PM

4 points

0 comments1 min readLW link

[Question] Is there a “coherent decisions imply consistent utilities”-style argument for non-lexicographic preferences?

TetraspaceJun 29, 2021, 7:14 PM

3 points

20 comments1 min readLW link

Trying to approximate Statistical Models as Scoring Tables

JsevillamolJun 29, 2021, 5:20 PM

18 points

2 comments9 min readLW link

Do incoherent entities have stronger reason to become more coherent than less?

KatjaGraceJun 30, 2021, 5:50 AM

46 points

5 comments4 min readLW link

(worldspiritsockpuppet.com)

[AN #154]: What economic growth theory has to say about transformative AI

Rohin ShahJun 30, 2021, 5:20 PM

12 points

0 comments9 min readLW link

(mailchi.mp)

Progress on Causal Influence Diagrams

tom4everittJun 30, 2021, 3:34 PM

71 points

6 comments9 min readLW link

Could Advanced AI Drive Explosive Economic Growth?

Matthew BarnettJun 30, 2021, 10:17 PM

15 points

4 comments2 min readLW link

(www.openphilanthropy.org)

Experimentally evaluating whether honesty generalizes

paulfchristianoJul 1, 2021, 5:47 PM

99 points

23 comments9 min readLW link

Should VS Would and Newcomb’s Paradox

dadadarrenJul 3, 2021, 11:45 PM

5 points

36 comments2 min readLW link

Mauhn Releases AI Safety Documentation

Berg SeverensJul 3, 2021, 9:23 PM

4 points

0 comments1 min readLW link

Anthropic Effects in Estimating Evolution Difficulty

Mark XuJul 5, 2021, 4:02 AM

12 points

2 comments3 min readLW link

A simple example of conditional orthogonality in finite factored sets

DanielFilanJul 6, 2021, 12:36 AM

43 points

3 comments5 min readLW link

(danielfilan.com)

[Question] Is keeping AI “in the box” during training enough?

tgbJul 6, 2021, 3:17 PM

7 points

10 comments1 min readLW link

A second example of conditional orthogonality in finite factored sets

DanielFilanJul 7, 2021, 1:40 AM

46 points

0 comments2 min readLW link

(danielfilan.com)

Agency and the unreliable autonomous car

Alex FlintJul 7, 2021, 2:58 PM

29 points

24 comments10 min readLW link

How much chess engine progress is about adapting to bigger computers?

paulfchristianoJul 7, 2021, 10:35 PM

114 points

23 comments6 min readLW link

BASALT: A Benchmark for Learning from Human Feedback

Rohin ShahJul 8, 2021, 5:40 PM

56 points

20 comments2 min readLW link

(bair.berkeley.edu)

[AN #155]: A Minecraft benchmark for algorithms that learn without reward functions

Rohin ShahJul 8, 2021, 5:20 PM

21 points

5 comments7 min readLW link

(mailchi.mp)

Looking for Collaborators for an AGI Research Project

Rafael CosmanJul 8, 2021, 5:01 PM

3 points

5 comments3 min readLW link

Jackpot! An AI Vignette

Ben GoldhaberJul 8, 2021, 8:32 PM

13 points

0 comments2 min readLW link

Intermittent Distillations #4: Semiconductors, Economics, Intelligence, and Technological Progress.

Mark XuJul 8, 2021, 10:14 PM

81 points

9 comments10 min readLW link

Finite Factored Sets: Conditional Orthogonality

Scott GarrabrantJul 9, 2021, 6:01 AM

27 points

2 comments7 min readLW link

The accumulation of knowledge: literature review

Alex FlintJul 10, 2021, 6:36 PM

29 points

3 comments7 min readLW link

The inescapability of knowledge

Alex FlintJul 11, 2021, 10:59 PM

28 points

17 comments5 min readLW link

[Link] Musk’s non-missing mood

jimrandomhJul 12, 2021, 10:09 PM

70 points

21 comments1 min readLW link

(lukemuehlhauser.com)

[Question] What will the twenties look like if AGI is 30 years away?

Daniel KokotajloJul 13, 2021, 8:14 AM

29 points

18 comments1 min readLW link

Answering questions honestly instead of predicting human answers: lots of problems and some solutions

evhubJul 13, 2021, 6:49 PM

53 points

25 comments31 min readLW link

Model-based RL, Desires, Brains, Wireheading

Steven ByrnesJul 14, 2021, 3:11 PM

17 points

1 comment13 min readLW link

A closer look at chess scalings (into the past)

hippkeJul 15, 2021, 8:13 AM

49 points

14 comments4 min readLW link

AlphaFold 2 paper released: “Highly accurate protein structure prediction with AlphaFold”, Jumper et al 2021

gwernJul 15, 2021, 7:27 PM

39 points

10 comments1 min readLW link

(www.nature.com)

Benchmarking an old chess engine on new hardware

hippkeJul 16, 2021, 7:58 AM

71 points

3 comments5 min readLW link

[AN #156]: The scaling hypothesis: a plan for building AGI

Rohin ShahJul 16, 2021, 5:10 PM

44 points

20 comments8 min readLW link

(mailchi.mp)

Bayesianism versus conservatism versus Goodhart

Stuart_ArmstrongJul 16, 2021, 11:39 PM

15 points

1 comment6 min readLW link

(2009) Shane Legg—Funding safe AGI

Tomás B.Jul 17, 2021, 4:46 PM

36 points

2 comments1 min readLW link

(www.vetta.org)

[Question] Equivalent of Information Theory but for Computation?

J BostockJul 17, 2021, 9:38 AM

5 points

27 comments1 min readLW link

A Models-centric Approach to Corrigible Alignment

J BostockJul 17, 2021, 5:27 PM

2 points

0 comments6 min readLW link

A model of decision-making in the brain (the short version)

Steven ByrnesJul 18, 2021, 2:39 PM

20 points

0 comments3 min readLW link

[Question] Any taxonomies of conscious experience?

JohnDavidBustardJul 18, 2021, 6:28 PM

7 points

10 comments1 min readLW link

[Question] Work on Bayesian fitting of AI trends of performance?

JsevillamolJul 19, 2021, 6:45 PM

3 points

0 comments1 min readLW link

Some thoughts on David Roodman’s GWP model and its relation to AI timelines

Tom DavidsonJul 19, 2021, 10:59 PM

30 points

1 comment8 min readLW link

In search of benevolence (or: what should you get Clippy for Christmas?)

Joe CarlsmithJul 20, 2021, 1:12 AM

20 points

0 comments33 min readLW link

Entropic boundary conditions towards safe artificial superintelligence

Santiago Nunez-CorralesJul 20, 2021, 10:15 PM

3 points

0 comments2 min readLW link

(www.tandfonline.com)

Reward splintering for AI design

Stuart_ArmstrongJul 21, 2021, 4:13 PM

30 points

1 comment8 min readLW link

Re-Define Intent Alignment?

abramdemskiJul 22, 2021, 7:00 PM

27 points

33 comments4 min readLW link

[AN #157]: Measuring misalignment in the technology underlying Copilot

Rohin ShahJul 23, 2021, 5:20 PM

28 points

18 comments7 min readLW link

(mailchi.mp)

Examples of human-level AI running unaligned.

df fdJul 23, 2021, 8:49 AM

−3 points

0 comments2 min readLW link

(sortale.substack.com)

AXRP Episode 10 - AI’s Future and Impacts with Katja Grace

DanielFilanJul 23, 2021, 10:10 PM

34 points

2 comments76 min readLW link

Wanted: Foom-scared alignment research partner

Icarus GallagherJul 26, 2021, 7:23 PM

40 points

5 comments1 min readLW link

Refactoring Alignment (attempt #2)

abramdemskiJul 26, 2021, 8:12 PM

46 points

17 comments8 min readLW link

[Question] How much compute was used to train DeepMind’s generally capable agents?

Daniel KokotajloJul 29, 2021, 11:34 AM

32 points

11 comments1 min readLW link

[Question] Did they or didn’t they learn tool use?

Daniel KokotajloJul 29, 2021, 1:26 PM

16 points

8 comments1 min readLW link

[AN #158]: Should we be optimistic about generalization?

Rohin ShahJul 29, 2021, 5:20 PM

19 points

0 comments8 min readLW link

(mailchi.mp)

[Question] Very Unnatural Tasks?

OrfeasJul 31, 2021, 9:22 PM

4 points

5 comments1 min readLW link

[Question] Is iterated amplification really more powerful than imitation?

ChantielAug 2, 2021, 11:20 PM

5 points

0 comments2 min readLW link

What does GPT-3 understand? Symbol grounding and Chinese rooms

Stuart_ArmstrongAug 3, 2021, 1:14 PM

40 points

15 comments12 min readLW link

Garrabrant and Shah on human modeling in AGI

Rob BensingerAug 4, 2021, 4:35 AM

57 points

10 comments47 min readLW link

Value loading in the human brain: a worked example

Steven ByrnesAug 4, 2021, 5:20 PM

45 points

2 comments8 min readLW link

[AN #159]: Building agents that know how to experiment, by training on procedurally generated games

Rohin ShahAug 4, 2021, 5:10 PM

18 points

4 comments14 min readLW link

(mailchi.mp)

[Question] How many parameters do self-driving-car neural nets have?

Daniel KokotajloAug 6, 2021, 11:24 AM

9 points

3 comments1 min readLW link

Rage Against The MOOChine

BoraskoAug 7, 2021, 5:57 PM

20 points

12 comments7 min readLW link

Applications for Deconfusing Goal-Directedness

adamShimiAug 8, 2021, 1:05 PM

36 points

3 comments5 min readLW link 1 review

Instrumental Convergence: Power as Rademacher Complexity

Past AccountAug 12, 2021, 4:02 PM

6 points

0 comments3 min readLW link

A new definition of “optimizer”

ChantielAug 9, 2021, 1:42 PM

5 points

0 comments7 min readLW link

Goal-Directedness and Behavior, Redux

adamShimiAug 9, 2021, 2:26 PM

14 points

4 comments2 min readLW link

Automating Auditing: An ambitious concrete technical research proposal

evhubAug 11, 2021, 8:32 PM

77 points

9 comments14 min readLW link 1 review

Some criteria for sandwiching projects

dmzAug 12, 2021, 3:40 AM

18 points

1 comment4 min readLW link

Power-seeking for successive choices

adamShimiAug 12, 2021, 8:37 PM

11 points

9 comments4 min readLW link

[AN #160]: Building AIs that learn and think like people

Rohin ShahAug 13, 2021, 5:10 PM

28 points

6 comments10 min readLW link

(mailchi.mp)

[Question] How would the Scaling Hypothesis change things?

Aryeh EnglanderAug 13, 2021, 3:42 PM

4 points

4 comments1 min readLW link

A review of “Agents and Devices”

adamShimiAug 13, 2021, 8:42 AM

10 points

0 comments4 min readLW link

Approaches to gradient hacking

adamShimiAug 14, 2021, 3:16 PM

16 points

8 comments8 min readLW link

[Question] What are some open exposition problems in AI?

Sai Sasank YAug 16, 2021, 3:05 PM

4 points

2 comments1 min readLW link

Thinking about AI relationally

TekhneMakreAug 16, 2021, 10:03 PM

5 points

0 comments2 min readLW link

Finite Factored Sets: Polynomials and Probability

Scott GarrabrantAug 17, 2021, 9:53 PM

21 points

2 comments8 min readLW link

How DeepMind’s Generally Capable Agents Were Trained

1a3ornAug 20, 2021, 6:52 PM

87 points

6 comments19 min readLW link

[AN #161]: Creating generalizable reward functions for multiple tasks by learning a model of functional similarity

Rohin ShahAug 20, 2021, 5:20 PM

15 points

0 comments9 min readLW link

(mailchi.mp)

Implication of AI timelines on planning and solutions

JJ HepburnAug 21, 2021, 5:12 AM

18 points

5 comments2 min readLW link

Autoregressive Propaganda

lsusrAug 22, 2021, 2:18 AM

25 points

3 comments3 min readLW link

AI Risk for Epistemic Minimalists

Alex FlintAug 22, 2021, 3:39 PM

57 points

12 comments13 min readLW link 1 review

The Codex Skeptic FAQ

Michaël TrazziAug 24, 2021, 4:01 PM

49 points

24 comments2 min readLW link

How to turn money into AI safety?

Charlie SteinerAug 25, 2021, 10:49 AM

66 points

26 comments8 min readLW link

Introduction to Reducing Goodhart

Charlie SteinerAug 26, 2021, 6:38 PM

40 points

10 comments4 min readLW link

Could you have stopped Chernobyl?

Carlos RamirezAug 27, 2021, 1:48 AM

29 points

17 comments8 min readLW link

[AN #162]: Foundation models: a paradigm shift within AI

Rohin ShahAug 27, 2021, 5:20 PM

21 points

0 comments8 min readLW link

(mailchi.mp)

A short introduction to machine learning

Richard_NgoAug 30, 2021, 2:31 PM

67 points

0 comments8 min readLW link

[Question] What could small scale disasters from AI look like?

CharlesDAug 31, 2021, 3:52 PM

14 points

8 comments1 min readLW link

NIST AI Risk Management Framework request for information (RFI)

Aryeh EnglanderSep 1, 2021, 12:15 AM

15 points

0 comments2 min readLW link

Reward splintering as reverse of interpretability

Stuart_ArmstrongAug 31, 2021, 10:27 PM

10 points

0 comments1 min readLW link

What are biases, anyway? Multiple type signatures

Stuart_ArmstrongAug 31, 2021, 9:16 PM

11 points

0 comments3 min readLW link

Finite Factored Sets: Applications

Scott GarrabrantAug 31, 2021, 9:19 PM

27 points

1 comment10 min readLW link

Finite Factored Sets: Inferring Time

Scott GarrabrantAug 31, 2021, 9:18 PM

17 points

5 comments4 min readLW link

US Military Global Information Dominance Experiments

NunoSempereSep 1, 2021, 1:34 PM

25 points

0 comments4 min readLW link

(www.defense.gov)

Competent Preferences

Charlie SteinerSep 2, 2021, 2:26 PM

27 points

2 comments6 min readLW link

Formalizing Objections against Surrogate Goals

VojtaKovarikSep 2, 2021, 4:24 PM

5 points

23 comments20 min readLW link

[Question] Is there a name for the theory that “There will be fast takeoff in real-world capabilities because almost everything is AGI-complete”?

David Scott Krueger (formerly: capybaralet)Sep 2, 2021, 11:00 PM

31 points

8 comments1 min readLW link

Thoughts on gradient hacking

Richard_NgoSep 3, 2021, 1:02 PM

33 points

12 comments4 min readLW link

Why the technological singularity by AGI may never happen

hippkeSep 3, 2021, 2:19 PM

5 points

14 comments1 min readLW link

All Possible Views About Humanity’s Future Are Wild

HoldenKarnofskySep 3, 2021, 8:19 PM

140 points

40 comments8 min readLW link 1 review

The Most Important Century: Sequence Introduction

HoldenKarnofskySep 3, 2021, 8:19 PM

68 points

5 comments4 min readLW link 1 review

[Question] Are there substantial research efforts towards aligning narrow AIs?

RossinSep 4, 2021, 6:40 PM

11 points

4 comments2 min readLW link

Multi-Agent Inverse Reinforcement Learning: Suboptimal Demonstrations and Alternative Solution Concepts

sage_bergersonSep 7, 2021, 4:11 PM

5 points

0 comments1 min readLW link

Bayeswatch 7: Wildfire

lsusrSep 8, 2021, 5:35 AM

47 points

6 comments3 min readLW link

[AN #163]: Using finite factored sets for causal and temporal inference

Rohin ShahSep 8, 2021, 5:20 PM

38 points

0 comments10 min readLW link

(mailchi.mp)

Gradient descent is not just more efficient genetic algorithms

leogaoSep 8, 2021, 4:23 PM

54 points

14 comments1 min readLW link

Sam Altman Q&A Notes—Aftermath

p.b.Sep 8, 2021, 8:20 AM

45 points

35 comments2 min readLW link

[Question] Does blockchain technology offer potential solutions to some AI alignment problems?

pilordSep 9, 2021, 4:51 PM

−4 points

8 comments2 min readLW link

Countably Factored Spaces

DiffractorSep 9, 2021, 4:24 AM

47 points

3 comments18 min readLW link

The alignment problem in different capability regimes

BuckSep 9, 2021, 7:46 PM

87 points

12 comments5 min readLW link

GPT-X, DALL-E, and our Multimodal Future [video series]

bakztfutureSep 9, 2021, 11:05 PM

0 points

1 comment1 min readLW link

(youtube.com)

Bayeswatch 8: Antimatter

lsusrSep 10, 2021, 5:01 AM

29 points

6 comments3 min readLW link

Measurement, Optimization, and Take-off Speed

jsteinhardtSep 10, 2021, 7:30 PM

47 points

4 comments13 min readLW link

Bayeswatch 9: Zombies

lsusrSep 11, 2021, 5:57 AM

41 points

15 comments3 min readLW link

[Question] Is MIRI’s reading list up to date?

Aryeh EnglanderSep 11, 2021, 6:56 PM

25 points

5 comments1 min readLW link

Soldiers, Scouts, and Albatrosses.

JanSep 12, 2021, 10:36 AM

5 points

0 comments1 min readLW link

(universalprior.substack.com)

GPT-Augmented Blogging

lsusrSep 14, 2021, 11:55 AM

52 points

18 comments13 min readLW link

[AN #164]: How well can language models write code?

Rohin ShahSep 15, 2021, 5:20 PM

13 points

7 comments9 min readLW link

(mailchi.mp)

I wanted to interview Eliezer Yudkowsky but he’s busy so I simulated him instead

lsusrSep 16, 2021, 7:34 AM

110 points

33 comments5 min readLW link

Economic AI Safety

jsteinhardtSep 16, 2021, 8:50 PM

35 points

3 comments5 min readLW link

Jitters No Evidence of Stupidity in RL

1a3ornSep 16, 2021, 10:43 PM

82 points

18 comments3 min readLW link

Immobile AI makes a move: anti-wireheading, ontology change, and model splintering

Stuart_ArmstrongSep 17, 2021, 3:24 PM

32 points

3 comments2 min readLW link

Great Power Conflict

Zach Stein-PerlmanSep 17, 2021, 3:00 PM

11 points

7 comments4 min readLW link

The theory-practice gap

BuckSep 17, 2021, 10:51 PM

133 points

14 comments6 min readLW link

[Book Review] “The Alignment Problem” by Brian Christian

lsusrSep 20, 2021, 6:36 AM

70 points

16 comments6 min readLW link

AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism

Stuart_ArmstrongSep 20, 2021, 11:56 AM

14 points

4 comments3 min readLW link

Sigmoids behaving badly: arXiv paper

Stuart_ArmstrongSep 20, 2021, 10:29 AM

24 points

1 comment1 min readLW link

[Question] How much should you be willing to pay for an AGI?

Logan ZoellnerSep 20, 2021, 11:51 AM

11 points

5 comments1 min readLW link

Announcing the Vitalik Buterin Fellowships in AI Existential Safety!

DanielFilanSep 21, 2021, 12:33 AM

64 points

2 comments1 min readLW link

(grants.futureoflife.org)

Redwood Research’s current project

BuckSep 21, 2021, 11:30 PM

143 points

29 comments15 min readLW link

[Question] What are good models of collusion in AI?

EconomicModelSep 22, 2021, 3:16 PM

7 points

1 comment1 min readLW link

[AN #165]: When large models are more likely to lie

Rohin ShahSep 22, 2021, 5:30 PM

23 points

0 comments8 min readLW link

(mailchi.mp)

Neural net / decision tree hybrids: a potential path toward bridging the interpretability gap

Nathan Helm-BurgerSep 23, 2021, 12:38 AM

21 points

2 comments12 min readLW link

What is Compute? - Transformative AI and Compute [1/4]

lennartSep 23, 2021, 4:25 PM

24 points

8 comments19 min readLW link

Forecasting Transformative AI, Part 1: What Kind of AI?

HoldenKarnofskySep 24, 2021, 12:46 AM

17 points

17 comments9 min readLW link

Pathways: Google’s AGI

Lê Nguyên HoangSep 25, 2021, 7:02 AM

44 points

5 comments1 min readLW link

Cognitive Biases in Large Language Models

JanSep 25, 2021, 8:59 PM

17 points

3 comments12 min readLW link

(universalprior.substack.com)

Transformative AI and Compute [Summary]

lennartSep 26, 2021, 11:41 AM

13 points

0 comments9 min readLW link

Beyond fire alarms: freeing the groupstruck

KatjaGraceSep 26, 2021, 9:30 AM

81 points

15 comments54 min readLW link

(worldspiritsockpuppet.com)

[Question] Any writeups on GPT agency?

OzyrusSep 26, 2021, 10:55 PM

4 points

6 comments1 min readLW link

AI takeoff story: a continuation of progress by other means

Edouard HarrisSep 27, 2021, 3:55 PM

75 points

13 comments10 min readLW link

A Confused Chemist’s Review of AlphaFold 2

J BostockSep 27, 2021, 11:10 AM

23 points

4 comments5 min readLW link

[Question] Collection of arguments to expect (outer and inner) alignment failure?

Sam ClarkeSep 28, 2021, 4:55 PM

20 points

10 comments1 min readLW link

Brain-inspired AGI and the “lifetime anchor”

Steven ByrnesSep 29, 2021, 1:09 PM

64 points

16 comments13 min readLW link

[Question] What Heuristics Do You Use to Think About Alignment Topics?

Logan RiggsSep 29, 2021, 2:31 AM

5 points

3 comments1 min readLW link

Bayeswatch 10: Spyware

lsusrSep 29, 2021, 7:01 AM

97 points

7 comments4 min readLW link

Unsolved ML Safety Problems

jsteinhardtSep 29, 2021, 4:00 PM

58 points

2 comments3 min readLW link

(bounded-regret.ghost.io)

Some Existing Selection Theorems

johnswentworthSep 30, 2021, 4:13 PM

48 points

2 comments4 min readLW link

Forecasting Compute—Transformative AI and Compute [2/4]

lennartOct 2, 2021, 3:54 PM

17 points

0 comments19 min readLW link

Nuclear Espionage and AI Governance

GuiveOct 4, 2021, 11:04 PM

26 points

5 comments24 min readLW link

Modelling and Understanding SGD

J BostockOct 5, 2021, 1:41 PM

8 points

0 comments3 min readLW link

Force neural nets to use models, then detect these

Stuart_ArmstrongOct 5, 2021, 11:31 AM

17 points

8 comments2 min readLW link

[Question] Is GPT-3 already sample-efficient?

Daniel KokotajloOct 6, 2021, 1:38 PM

36 points

32 comments1 min readLW link

Preferences from (real and hypothetical) psychology papers

Stuart_ArmstrongOct 6, 2021, 9:06 AM

15 points

0 comments2 min readLW link

Automated Fact Checking: A Look at the Field

HoagyOct 6, 2021, 11:52 PM

12 points

0 comments8 min readLW link

Safety-capabilities tradeoff dials are inevitable in AGI

Steven ByrnesOct 7, 2021, 7:03 PM

57 points

4 comments3 min readLW link

Bayeswatch 11: Parabellum

lsusrOct 9, 2021, 7:08 AM

32 points

12 comments2 min readLW link

Steelman arguments against the idea that AGI is inevitable and will arrive soon

RomanSOct 9, 2021, 6:22 AM

19 points

13 comments4 min readLW link

Intelligence or Evolution?

Ramana KumarOct 9, 2021, 5:14 PM

50 points

15 comments3 min readLW link

Bayeswatch 12: The Singularity War

lsusrOct 10, 2021, 1:04 AM

32 points

6 comments2 min readLW link

The Extrapolation Problem

lsusrOct 10, 2021, 5:11 AM

25 points

8 comments2 min readLW link

The evaluation function of an AI is not its aim

Yair HalberstadtOct 10, 2021, 2:52 PM

13 points

5 comments3 min readLW link

On Solving Problems Before They Appear: The Weird Epistemologies of Alignment

adamShimiOct 11, 2021, 8:20 AM

97 points

11 comments15 min readLW link

Bayeswatch 13: Spaceship

lsusrOct 12, 2021, 9:35 PM

51 points

4 comments1 min readLW link

Compute Governance and Conclusions—Transformative AI and Compute [3/4]

lennartOct 14, 2021, 8:23 AM

13 points

0 comments5 min readLW link

Classical symbol grounding and causal graphs

Stuart_ArmstrongOct 14, 2021, 6:04 PM

22 points

2 comments5 min readLW link

NLP Position Paper: When Combatting Hype, Proceed with Caution

Sam BowmanOct 15, 2021, 8:57 PM

46 points

15 comments1 min readLW link

[Question] Memetic hazards of AGI architecture posts

OzyrusOct 16, 2021, 4:10 PM

9 points

12 comments1 min readLW link

[Prediction] We are in an Algorithmic Overhang, Part 2

lsusrOct 17, 2021, 7:48 AM

20 points

29 comments2 min readLW link

Epistemic Strategies of Selection Theorems

adamShimiOct 18, 2021, 8:57 AM

32 points

1 comment12 min readLW link

On The Risks of Emergent Behavior in Foundation Models

jsteinhardtOct 18, 2021, 8:00 PM

30 points

0 comments3 min readLW link

(bounded-regret.ghost.io)

Beyond the human training distribution: would the AI CEO create almost-illegal teddies?

Stuart_ArmstrongOct 18, 2021, 9:10 PM

36 points

2 comments3 min readLW link

[AN #167]: Concrete ML safety problems and their relevance to x-risk

Rohin ShahOct 20, 2021, 5:10 PM

19 points

4 comments9 min readLW link

(mailchi.mp)

Boring machine learning is where it’s at

George3d6Oct 20, 2021, 11:23 AM

28 points

16 comments3 min readLW link

(cerebralab.com)

AGI Safety Fundamentals curriculum and application

Richard_NgoOct 20, 2021, 9:44 PM

67 points

0 comments8 min readLW link

(docs.google.com)

Epistemic Strategies of Safety-Capabilities Tradeoffs

adamShimiOct 22, 2021, 8:22 AM

5 points

0 comments6 min readLW link

General alignment plus human values, or alignment via human values?

Stuart_ArmstrongOct 22, 2021, 10:11 AM

45 points

27 comments3 min readLW link

Naive self-supervised approaches to truthful AI

ryan_greenblattOct 23, 2021, 1:03 PM

9 points

4 comments2 min readLW link

My ML Scaling bibliography

gwernOct 23, 2021, 2:41 PM

35 points

9 comments1 min readLW link

(www.gwern.net)

Selfishness, preference falsification, and AI alignment

jessicataOct 28, 2021, 12:16 AM

52 points

29 comments13 min readLW link

(unstableontology.com)

[AN #168]: Four technical topics for which Open Phil is soliciting grant proposals

Rohin ShahOct 28, 2021, 5:20 PM

15 points

0 comments9 min readLW link

(mailchi.mp)

Forecasting progress in language models

Matthew Barnett and Metaculus

Oct 28, 2021, 8:40 PM

54 points

5 comments11 min readLW link

(www.metaculus.com)

Request for proposals for projects in AI alignment that work with deep learning systems

abergal and Nick_Beckstead

Oct 29, 2021, 7:26 AM

87 points

0 comments5 min readLW link

Interpretability

abergal and Nick_Beckstead

Oct 29, 2021, 7:28 AM

59 points

13 comments12 min readLW link

Truthful and honest AI

abergal, Nick_Beckstead and Owain_Evans

Oct 29, 2021, 7:28 AM

41 points

1 comment13 min readLW link

Measuring and forecasting risks

abergal, Nick_Beckstead and jsteinhardt

Oct 29, 2021, 7:27 AM

20 points

0 comments12 min readLW link

Techniques for enhancing human feedback

abergal, Ajeya Cotra and Nick_Beckstead

Oct 29, 2021, 7:27 AM

22 points

0 comments2 min readLW link

Stuart Russell and Melanie Mitchell on Munk Debates

Alex FlintOct 29, 2021, 7:13 PM

29 points

3 comments3 min readLW link

True Stories of Algorithmic Improvement

johnswentworthOct 29, 2021, 8:57 PM

91 points

7 comments5 min readLW link

Must true AI sleep?

YimbyGeorgeOct 30, 2021, 4:47 PM

0 points

1 comment1 min readLW link

Nate Soares on the Ultimate Newcomb’s Problem

Rob BensingerOct 31, 2021, 7:42 PM

56 points

20 comments1 min readLW link

Models Modeling Models

Charlie SteinerNov 2, 2021, 7:08 AM

20 points

5 comments10 min readLW link

[Question] What’s the difference between newer Atari-playing AI and the older Deepmind one (from 2014)?

RaemonNov 2, 2021, 11:36 PM

27 points

8 comments1 min readLW link

Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22]

habryka and Buck

Nov 3, 2021, 6:22 PM

95 points

4 comments1 min readLW link

[External Event] 2022 IEEE International Conference on Assured Autonomy (ICAA) - submission deadline extended

Aryeh EnglanderNov 5, 2021, 3:29 PM

13 points

0 comments3 min readLW link

Y2K: Successful Practice for AI Alignment

DarmaniNov 5, 2021, 6:09 AM

47 points

5 comments6 min readLW link

Some Remarks on Regulator Theorems No One Asked For

Past AccountNov 5, 2021, 7:33 PM

19 points

1 comment4 min readLW link

How should we compare neural network representations?

jsteinhardtNov 5, 2021, 10:10 PM

24 points

0 comments3 min readLW link

(bounded-regret.ghost.io)

Drug addicts and deceptively aligned agents—a comparative analysis

JanNov 5, 2021, 9:42 PM

41 points

2 comments12 min readLW link

(universalprior.substack.com)

Comments on OpenPhil’s Interpretability RFP

paulfchristianoNov 5, 2021, 10:36 PM

84 points

5 comments7 min readLW link

How do we become confident in the safety of a machine learning system?

evhubNov 8, 2021, 10:49 PM

92 points

2 comments32 min readLW link

[Question] What exactly is GPT-3’s base objective?

Daniel KokotajloNov 10, 2021, 12:57 AM

60 points

15 comments2 min readLW link

Relaxation-Based Search, From Everyday Life To Unfamiliar Territory

johnswentworthNov 10, 2021, 9:47 PM

57 points

3 comments8 min readLW link

Using blinders to help you see things for what they are

Adam ZernerNov 11, 2021, 7:07 AM

13 points

2 comments2 min readLW link

AGI is at least as far away as Nuclear Fusion.

Logan ZoellnerNov 11, 2021, 9:33 PM

0 points

8 comments1 min readLW link

Measuring and Forecasting Risks from AI

jsteinhardtNov 12, 2021, 2:30 AM

24 points

0 comments3 min readLW link

(bounded-regret.ghost.io)

Why I’m excited about Redwood Research’s current project

paulfchristianoNov 12, 2021, 7:26 PM

112 points

6 comments7 min readLW link

A Defense of Functional Decision Theory

HeighnNov 12, 2021, 8:59 PM

21 points

120 comments10 min readLW link

Comments on Carlsmith’s “Is power-seeking AI an existential risk?”

So8resNov 13, 2021, 4:29 AM

137 points

13 comments40 min readLW link

[Question] What’s the likelihood of only sub exponential growth for AGI?

M. Y. ZuoNov 13, 2021, 10:46 PM

5 points

22 comments1 min readLW link

My current uncertainties regarding AI, alignment, and the end of the world

dominicqNov 14, 2021, 2:08 PM

2 points

3 comments2 min readLW link

My understanding of the alignment problem

danieldeweyNov 15, 2021, 6:13 PM

43 points

3 comments3 min readLW link

“Summarizing Books with Human Feedback” (recursive GPT-3)

gwernNov 15, 2021, 5:41 PM

24 points

4 comments1 min readLW link

(openai.com)

Quantilizer ≡ Optimizer with a Bounded Amount of Output

itaibn0Nov 16, 2021, 1:03 AM

10 points

4 comments2 min readLW link

Two Stupid AI Alignment Ideas

aphyerNov 16, 2021, 4:13 PM

24 points

3 comments4 min readLW link

[Question] What are the mutual benefits of AGI-human collaboration that would otherwise be unobtainable?

M. Y. ZuoNov 17, 2021, 3:09 AM

1 point

4 comments1 min readLW link

Applications for AI Safety Camp 2022 Now Open!

adamShimiNov 17, 2021, 9:42 PM

47 points

3 comments1 min readLW link

Ngo and Yudkowsky on AI capability gains

Eliezer Yudkowsky and Richard_Ngo

Nov 18, 2021, 10:19 PM

129 points

61 comments39 min readLW link

“Acquisition of Chess Knowledge in AlphaZero”: probing AZ over time

jsdNov 18, 2021, 11:24 PM

11 points

9 comments1 min readLW link

(arxiv.org)

How To Get Into Independent Research On Alignment/Agency

johnswentworthNov 19, 2021, 12:00 AM

314 points

33 comments13 min readLW link

Goodhart: Endgame

Charlie SteinerNov 19, 2021, 1:26 AM

23 points

3 comments8 min readLW link

More detailed proposal for measuring alignment of current models

Beth BarnesNov 20, 2021, 12:03 AM

31 points

0 comments8 min readLW link

From language to ethics by automated reasoning

Michele CampoloNov 21, 2021, 3:16 PM

4 points

4 comments6 min readLW link

Morally underdefined situations can be deadly

Stuart_ArmstrongNov 22, 2021, 2:48 PM

17 points

8 comments2 min readLW link

Yudkowsky and Christiano discuss “Takeoff Speeds”

Eliezer YudkowskyNov 22, 2021, 7:35 PM

191 points

181 comments60 min readLW link 1 review

Potential Alignment mental tool: Keeping track of the types

Donald HobsonNov 22, 2021, 8:05 PM

28 points

1 comment2 min readLW link

Formalizing Policy-Modification Corrigibility

TurnTroutDec 3, 2021, 1:31 AM

23 points

6 comments6 min readLW link

[AN #169]: Collaborating with humans without human data

Rohin ShahNov 24, 2021, 6:30 PM

33 points

0 comments8 min readLW link

(mailchi.mp)

Christiano, Cotra, and Yudkowsky on AI progress

Eliezer Yudkowsky and Ajeya Cotra

Nov 25, 2021, 4:45 PM

117 points

95 comments68 min readLW link

Latacora might be of interest to some AI Safety organizations

NunoSempereNov 25, 2021, 11:57 PM

14 points

10 comments1 min readLW link

(www.latacora.com)

Solve Corrigibility Week

Logan RiggsNov 28, 2021, 5:00 PM

39 points

21 comments1 min readLW link

TTS audio of “Ngo and Yudkowsky on alignment difficulty”

Quintin PopeNov 28, 2021, 6:11 PM

4 points

3 comments1 min readLW link

Redwood Research is hiring for several roles

Jack R and billzito

Nov 29, 2021, 12:16 AM

44 points

0 comments1 min readLW link

Compute Research Questions and Metrics—Transformative AI and Compute [4/4]

lennartNov 28, 2021, 10:49 PM

6 points

0 comments16 min readLW link

Comments on Allan Dafoe on AI Governance

Alex FlintNov 29, 2021, 4:16 PM

13 points

0 comments7 min readLW link

Soares, Tallinn, and Yudkowsky discuss AGI cognition

So8res, Eliezer Yudkowsky and jaan

Nov 29, 2021, 7:26 PM

118 points

35 comments40 min readLW link

Self-studying to develop an inside-view model of AI alignment; co-studiers welcome!

Vael GatesNov 30, 2021, 9:25 AM

13 points

0 comments4 min readLW link

Machine Agents, Hybrid Superintelligences, and The Loss of Human Control (Chapter 1)

Justin BullockNov 30, 2021, 5:35 PM

4 points

0 comments8 min readLW link

AXRP Episode 12 - AI Existential Risk with Paul Christiano

DanielFilanDec 2, 2021, 2:20 AM

36 points

0 comments125 min readLW link

Morality is Scary

Wei DaiDec 2, 2021, 6:35 AM

175 points

125 comments4 min readLW link

Sydney AI Safety Fellowship

Chris_LeongDec 2, 2021, 7:34 AM

22 points

0 comments2 min readLW link

$100/$50 rewards for good references

Stuart_ArmstrongDec 3, 2021, 4:55 PM

20 points

5 comments1 min readLW link

[Question] Does the Structure of an algorithm matter for AI Risk and/or consciousness?

Logan ZoellnerDec 3, 2021, 6:31 PM

7 points

5 comments1 min readLW link

[Linkpost] A General Language Assistant as a Laboratory for Alignment

Quintin PopeDec 3, 2021, 7:42 PM

37 points

2 comments2 min readLW link

Agency: What it is and why it matters

Daniel KokotajloDec 4, 2021, 9:32 PM

25 points

2 comments2 min readLW link

[Question] Are limited-horizon agents a good heuristic for the off-switch problem?

Yonadav ShavitDec 5, 2021, 7:27 PM

5 points

19 comments1 min readLW link

Introduction to inaccessible information

Ryan KiddDec 9, 2021, 1:28 AM

27 points

6 comments8 min readLW link

More Christiano, Cotra, and Yudkowsky on AI progress

Eliezer Yudkowsky and Ajeya Cotra

Dec 6, 2021, 8:33 PM

85 points

30 comments40 min readLW link

Exterminating humans might be on the to-do list of a Friendly AI

RomanSDec 7, 2021, 2:15 PM

5 points

8 comments2 min readLW link

Interviews on Improving the AI Safety Pipeline

Chris_LeongDec 7, 2021, 12:03 PM

55 points

16 comments17 min readLW link

Let’s buy out Cyc, for use in AGI interpretability systems?

Steven ByrnesDec 7, 2021, 8:46 PM

47 points

10 comments2 min readLW link

[AN #170]: Analyzing the argument for risk from power-seeking AI

Rohin ShahDec 8, 2021, 6:10 PM

21 points

1 comment7 min readLW link

(mailchi.mp)

[MLSN #2]: Adversarial Training

Dan_HDec 9, 2021, 5:16 PM

26 points

0 comments3 min readLW link

Supervised learning and self-modeling: What’s “superhuman?”

Charlie SteinerDec 9, 2021, 12:44 PM

12 points

1 comment8 min readLW link

Some abstract, non-technical reasons to be non-maximally-pessimistic about AI alignment

Rob BensingerDec 12, 2021, 2:08 AM

66 points

37 comments7 min readLW link

Transforming myopic optimization to ordinary optimization—Do we want to seek convergence for myopic optimization problems?

tailcalledDec 11, 2021, 8:38 PM

12 points

1 comment5 min readLW link

Redwood’s Technique-Focused Epistemic Strategy

adamShimiDec 12, 2021, 4:36 PM

48 points

1 comment7 min readLW link

[Question] [Resolved] Who else prefers “AI alignment” to “AI safety?”

Evan_GaensbauerDec 13, 2021, 12:35 AM

5 points

8 comments1 min readLW link

Hard-Coding Neural Computation

MadHatterDec 13, 2021, 4:35 AM

32 points

8 comments27 min readLW link

Solving Interpretability Week

Logan RiggsDec 13, 2021, 5:09 PM

11 points

5 comments1 min readLW link

Understanding and controlling auto-induced distributional shift

L Rudolf LDec 13, 2021, 2:59 PM

26 points

3 comments16 min readLW link

Language Model Alignment Research Internships

Ethan PerezDec 13, 2021, 7:53 PM

68 points

1 comment1 min readLW link

Enabling More Feedback for AI Safety Researchers

frances_lorenzDec 13, 2021, 8:10 PM

17 points

0 comments3 min readLW link

ARC’s first technical report: Eliciting Latent Knowledge

paulfchristiano, Mark Xu and Ajeya Cotra

Dec 14, 2021, 8:09 PM

212 points

88 comments1 min readLW link

(docs.google.com)

Interlude: Agents as Automobiles

Daniel KokotajloDec 14, 2021, 6:49 PM

25 points

6 comments5 min readLW link

ARC is hiring!

paulfchristiano and Mark Xu

Dec 14, 2021, 8:09 PM

62 points

2 comments1 min readLW link

Ngo’s view on alignment difficulty

Richard_Ngo and Eliezer Yudkowsky

Dec 14, 2021, 9:34 PM

63 points

7 comments17 min readLW link

The Natural Abstraction Hypothesis: Implications and Evidence

CallumMcDougallDec 14, 2021, 11:14 PM

30 points

8 comments19 min readLW link

Elicitation for Modeling Transformative AI Risks

DavidmanheimDec 16, 2021, 3:24 PM

30 points

2 comments9 min readLW link

Some motivations to gradient hack

peterbarnettDec 17, 2021, 3:06 AM

8 points

0 comments6 min readLW link

Introducing the Principles of Intelligent Behaviour in Biological and Social Systems (PIBBSS) Fellowship

adamShimiDec 18, 2021, 3:23 PM

51 points

4 comments10 min readLW link

[Question] Important ML systems from before 2012?

JsevillamolDec 18, 2021, 12:12 PM

12 points

5 comments1 min readLW link

[Extended Deadline: Jan 23rd] Announcing the PIBBSS Summer Research Fellowship

Nora_AmmannDec 18, 2021, 4:56 PM

6 points

1 comment1 min readLW link

Exploring Decision Theories With Counterfactuals and Dynamic Agent Self-Pointers

JoshuaOSHickmanDec 18, 2021, 9:50 PM

2 points

0 comments4 min readLW link

Don’t Influence the Influencers!

lhcDec 19, 2021, 9:02 AM

14 points

2 comments10 min readLW link

SGD Understood through Probability Current

J BostockDec 19, 2021, 11:26 PM

23 points

1 comment5 min readLW link

Worst-case thinking in AI alignment

BuckDec 23, 2021, 1:29 AM

139 points

15 comments6 min readLW link

2021 AI Alignment Literature Review and Charity Comparison

LarksDec 23, 2021, 2:06 PM

164 points

26 comments73 min readLW link

Reply to Eliezer on Biological Anchors

HoldenKarnofskyDec 23, 2021, 4:15 PM

146 points

46 comments15 min readLW link

Risks from AI persuasion

Beth BarnesDec 24, 2021, 1:48 AM

68 points

15 comments31 min readLW link

Understanding the tensor product formulation in Transformer Circuits

Tom LieberumDec 24, 2021, 6:05 PM

16 points

2 comments3 min readLW link

Mechanistic Interpretability for the MLP Layers (rough early thoughts)

MadHatterDec 24, 2021, 7:24 AM

11 points

2 comments1 min readLW link

(www.youtube.com)

My Overview of the AI Alignment Landscape: Threat Models

Neel NandaDec 25, 2021, 11:07 PM

50 points

4 comments28 min readLW link

Reinforcement Learning Study Group

Kay KozaronekDec 26, 2021, 11:11 PM

20 points

9 comments1 min readLW link

AI Fire Alarm Scenarios

PeterMcCluskeyDec 28, 2021, 2:20 AM

10 points

0 comments6 min readLW link

(www.bayesianinvestor.com)

Reverse-engineering using interpretability

Beth BarnesDec 29, 2021, 11:21 PM

21 points

1 comment5 min readLW link

Counterexamples to some ELK proposals

paulfchristianoDec 31, 2021, 5:05 PM

50 points

10 comments7 min readLW link

We Choose To Align AI

johnswentworthJan 1, 2022, 8:06 PM

259 points

15 comments3 min readLW link

Why don’t we just, like, try and build safe AGI?

Sun

Jan 1, 2022, 11:24 PM

0 points

4 comments1 min readLW link

[Question] Tag for AI alignment?

Alex_AltairJan 2, 2022, 6:55 PM

7 points

6 comments1 min readLW link

How an alien theory of mind might be unlearnable

Stuart_ArmstrongJan 3, 2022, 11:16 AM

26 points

35 comments5 min readLW link

Shadows Of The Coming Race (1879)

CapybasiliskJan 3, 2022, 3:55 PM

49 points

4 comments7 min readLW link

Apply for research internships at ARC!

paulfchristianoJan 3, 2022, 8:26 PM

61 points

0 comments1 min readLW link

Promising posts on AF that have fallen through the cracks

Evan R. MurphyJan 4, 2022, 3:39 PM

33 points

6 comments2 min readLW link

You can’t understand human agency without understanding amoeba agency

ShmiJan 6, 2022, 4:42 AM

19 points

36 comments1 min readLW link

Satisf-AI: A Route to Reducing Risks From AI

harsimonyJan 6, 2022, 2:34 AM

4 points

1 comment4 min readLW link

(harsimony.wordpress.com)

Importance of foresight evaluations within ELK

Jonathan UesatoJan 6, 2022, 3:34 PM

25 points

1 comment10 min readLW link

Goal-directedness: my baseline beliefs

Morgan_RogersJan 8, 2022, 1:09 PM

21 points

3 comments3 min readLW link

The Unreasonable Feasibility Of Playing Chess Under The Influence

Jan

Jan 12, 2022, 11:09 PM

29 points

17 comments13 min readLW link

(universalprior.substack.com)

New year, new research agenda post

Charlie SteinerJan 12, 2022, 5:58 PM

29 points

4 comments16 min readLW link

Value extrapolation partially resolves symbol grounding

Stuart_ArmstrongJan 12, 2022, 4:30 PM

24 points

10 comments1 min readLW link

2020 Review Article

VaniverJan 14, 2022, 4:58 AM

74 points

3 comments7 min readLW link

The Greedy Doctor Problem… turns out to be relevant to the ELK problem?

Jan

Jan 14, 2022, 11:58 AM

33 points

10 comments14 min readLW link

(universalprior.substack.com)

PIBBSS Fellowship: Bounty for Referrals & Deadline Extension

Anna GajdovaJan 17, 2022, 4:23 PM

7 points

0 comments1 min readLW link

Different way classifiers can be diverse

Stuart_ArmstrongJan 17, 2022, 4:30 PM

10 points

5 comments2 min readLW link

Scalar reward is not enough for aligned AGI

Peter VamplewJan 17, 2022, 9:02 PM

15 points

3 comments11 min readLW link

Challenges with Breaking into MIRI-Style Research

Chris_LeongJan 17, 2022, 9:23 AM

72 points

15 comments3 min readLW link

Thought Experiments Provide a Third Anchor

jsteinhardtJan 18, 2022, 4:00 PM

44 points

20 comments4 min readLW link

(bounded-regret.ghost.io)

Anchor Weights for ML

jsteinhardtJan 20, 2022, 4:20 PM

17 points

2 comments2 min readLW link

(bounded-regret.ghost.io)

Estimating training compute of Deep Learning models

lennart, Jsevillamol, Marius Hobbhahn, Tamay Besiroglu and anson.ho

Jan 20, 2022, 4:12 PM

37 points

4 comments1 min readLW link

Sharing Powerful AI Models

apcJan 21, 2022, 11:57 AM

6 points

4 comments1 min readLW link

[AN #171]: Disagreements between alignment “optimists” and “pessimists”

Rohin ShahJan 21, 2022, 6:30 PM

32 points

1 comment7 min readLW link

(mailchi.mp)

A one-question Turing test for GPT-3

Paul Crowley and rosiecam

Jan 22, 2022, 6:17 PM

84 points

23 comments5 min readLW link

ML Systems Will Have Weird Failure Modes

jsteinhardtJan 26, 2022, 1:40 AM

54 points

8 comments6 min readLW link

(bounded-regret.ghost.io)

Search Is All You Need

blake8086Jan 25, 2022, 11:13 PM

33 points

13 comments3 min readLW link

Aligned AI Needs Slack

ShmiJan 26, 2022, 9:29 AM

23 points

10 comments1 min readLW link

Empirical Findings Generalize Surprisingly Far

jsteinhardtFeb 1, 2022, 10:30 PM

46 points

0 comments6 min readLW link

(bounded-regret.ghost.io)

OpenAI Solves (Some) Formal Math Olympiad Problems

Michaël TrazziFeb 2, 2022, 9:49 PM

77 points

26 comments2 min readLW link

Observed patterns around major technological advancements

Richard Korzekwa Feb 3, 2022, 12:30 AM

45 points

15 comments11 min readLW link

(aiimpacts.org)

Paperclippers, s-risks, hope

superads91Feb 4, 2022, 7:03 PM

13 points

17 comments1 min readLW link

AI Writeup Part 1

SNlFeb 4, 2022, 9:16 PM

8 points

1 comment18 min readLW link

Alignment versus AI Alignment

Alex FlintFeb 4, 2022, 10:59 PM

87 points

15 comments22 min readLW link

Capability Phase Transition Examples

gwernFeb 8, 2022, 3:32 AM

39 points

1 comment1 min readLW link

(www.reddit.com)

A broad basin of attraction around human values?

Wei DaiApr 12, 2022, 5:15 AM

105 points

16 comments2 min readLW link

Appendix: More Is Different In Other Domains

jsteinhardtFeb 8, 2022, 4:00 PM

12 points

1 comment4 min readLW link

(bounded-regret.ghost.io)

[Intro to brain-like-AGI safety] 2. “Learning from scratch” in the brain

Steven ByrnesFeb 2, 2022, 1:22 PM

43 points

12 comments25 min readLW link

Better impossibility result for unbounded utilities

paulfchristianoFeb 9, 2022, 6:10 AM

29 points

24 comments5 min readLW link

EleutherAI’s GPT-NeoX-20B release

leogaoFeb 10, 2022, 6:56 AM

30 points

3 comments1 min readLW link

(eaidata.bmk.sh)

Inferring utility functions from locally non-transitive preferences

Jan

Feb 10, 2022, 10:33 AM

28 points

15 comments8 min readLW link

(universalprior.substack.com)

A summary of aligning narrowly superhuman models

guguFeb 10, 2022, 6:26 PM

8 points

0 comments8 min readLW link

Idea: build alignment dataset for very capable models

Quintin PopeFeb 12, 2022, 7:30 PM

9 points

2 comments3 min readLW link

Goal-directedness: exploring explanations

Morgan_RogersFeb 14, 2022, 4:20 PM

13 points

3 comments18 min readLW link

Is ELK enough? Diamond, Matrix and Child AI

adamShimiFeb 15, 2022, 2:29 AM

17 points

10 comments4 min readLW link

What Does The Natural Abstraction Framework Say About ELK?

johnswentworthFeb 15, 2022, 2:27 AM

34 points

0 comments6 min readLW link

Some Hacky ELK Ideas

johnswentworthFeb 15, 2022, 2:27 AM

34 points

8 comments5 min readLW link

How harmful are improvements in AI? + Poll

tilmanr and Marius Hobbhahn

Feb 15, 2022, 6:16 PM

15 points

4 comments8 min readLW link

Becoming Stronger as Epistemologist: Introduction

adamShimiFeb 15, 2022, 6:15 AM

29 points

2 comments4 min readLW link

REPL’s: a type signature for agents

scottviteriFeb 15, 2022, 10:57 PM

23 points

5 comments2 min readLW link

REPL’s and ELK

scottviteriFeb 17, 2022, 1:14 AM

9 points

4 comments1 min readLW link

[Link] Eric Schmidt’s new AI2050 Fund

Aryeh EnglanderFeb 16, 2022, 9:21 PM

32 points

3 comments2 min readLW link

Alignment researchers, how useful is extra compute for you?

Lauro LangoscoFeb 19, 2022, 3:35 PM

7 points

4 comments1 min readLW link

[Question] 2 (naive?) ideas for alignment

Jonathan MoregårdFeb 20, 2022, 7:01 PM

3 points

1 comment1 min readLW link

The Big Picture Of Alignment (Talk Part 1)

johnswentworthFeb 21, 2022, 5:49 AM

98 points

35 comments1 min readLW link

(www.youtube.com)

[Question] Favorite / most obscure research on understanding DNNs?

Vivek HebbarFeb 21, 2022, 5:49 AM

16 points

1 comment1 min readLW link

Two Challenges for ELK

derek shillerFeb 21, 2022, 5:49 AM

7 points

0 comments4 min readLW link

[Question] Do any AI alignment orgs hire remotely?

RobertMFeb 21, 2022, 10:33 PM

24 points

9 comments2 min readLW link

More GPT-3 and symbol grounding

Stuart_ArmstrongFeb 23, 2022, 6:30 PM

21 points

7 comments3 min readLW link

Transformer inductive biases & RASP

Vivek HebbarFeb 24, 2022, 12:42 AM

15 points

4 comments1 min readLW link

(proceedings.mlr.press)

A comment on Ajeya Cotra’s draft report on AI timelines

Matthew BarnettFeb 24, 2022, 12:41 AM

69 points

13 comments7 min readLW link

The Big Picture Of Alignment (Talk Part 2)

johnswentworthFeb 25, 2022, 2:53 AM

33 points

12 comments1 min readLW link

(www.youtube.com)

Trust-maximizing AGI

Jan and Karl von Wendt

Feb 25, 2022, 3:13 PM

7 points

26 comments9 min readLW link

(universalprior.substack.com)

IMO challenge bet with Eliezer

paulfchristianoFeb 26, 2022, 4:50 AM

162 points

25 comments3 min readLW link

New Speaker Series on AI Alignment Starting March 3

Zechen ZhangFeb 26, 2022, 7:31 PM

7 points

1 comment1 min readLW link

How I Formed My Own Views About AI Safety

Neel NandaFeb 27, 2022, 6:50 PM

64 points

6 comments13 min readLW link

(www.neelnanda.io)

Shah and Yudkowsky on alignment failures

Rohin Shah and Eliezer Yudkowsky

Feb 28, 2022, 7:18 PM

83 points

38 comments91 min readLW link

ELK Thought Dump

abramdemskiFeb 28, 2022, 6:46 PM

58 points

18 comments17 min readLW link

Late 2021 MIRI Conversations: AMA / Discussion

Rob BensingerFeb 28, 2022, 8:03 PM

119 points

208 comments1 min readLW link

[Question] What are the causality effects of an agents presence in a reinforcement learning environment

Jonas KgomoMar 1, 2022, 9:57 PM

0 points

2 comments1 min readLW link

Musings on the Speed Prior

evhubMar 2, 2022, 4:04 AM

19 points

4 comments10 min readLW link

AI Performance on Human Tasks

Asher EllisMar 3, 2022, 8:13 PM

58 points

3 comments21 min readLW link

Introducing myself: Henry Lieberman, MIT CSAIL, whycantwe.org

Henry A LiebermanMar 3, 2022, 11:42 PM

−2 points

9 comments1 min readLW link

Preserving and continuing alignment research through a severe global catastrophe

A_donorMar 6, 2022, 6:43 PM

36 points

11 comments5 min readLW link

Why work at AI Impacts?

KatjaMar 6, 2022, 10:10 PM

50 points

7 comments13 min readLW link

(aiimpacts.org)

Personal imitation software

FlaglandbaseMar 7, 2022, 7:55 AM

6 points

6 comments1 min readLW link

[MLSN #3]: NeurIPS Safety Paper Roundup

Dan H

Mar 8, 2022, 3:17 PM

45 points

0 comments4 min readLW link

ELK prize results

paulfchristiano and Mark Xu

Mar 9, 2022, 12:01 AM

130 points

50 comments21 min readLW link

[Question] Non-coercive motivation for alignment research?

Jonathan MoregårdMar 8, 2022, 8:50 PM

1 point

0 comments1 min readLW link

On presenting the case for AI risk

Aryeh EnglanderMar 9, 2022, 1:41 AM

54 points

18 comments4 min readLW link

Ask AI companies about what they are doing for AI safety?

micMar 9, 2022, 3:14 PM

50 points

0 comments2 min readLW link

Deriving Our World From Small Datasets

CapybasiliskMar 9, 2022, 12:34 AM

5 points

4 comments2 min readLW link

Value extrapolation, concept extrapolation, model splintering

Stuart_ArmstrongMar 8, 2022, 10:50 PM

14 points

1 comment2 min readLW link

The Proof of Doom

johnlawrenceaspdenMar 9, 2022, 7:37 PM

27 points

18 comments3 min readLW link

A Rephrasing Of and Footnote To An Embedded Agency Proposal

JoshuaOSHickmanMar 9, 2022, 6:13 PM

5 points

0 comments5 min readLW link

ELK Sub—Note-taking in internal rollouts

HoagyMar 9, 2022, 5:23 PM

6 points

0 comments5 min readLW link

[Question] Are there any impossibility theorems for strong and safe AI?

David JohnstonMar 11, 2022, 1:41 AM

5 points

3 comments1 min readLW link

Compute Trends — Comparison to OpenAI’s AI and Compute

lennart, Jsevillamol, Pablo Villalobos, Marius Hobbhahn, Tamay Besiroglu and anson.ho

Mar 12, 2022, 6:09 PM

23 points

3 comments3 min readLW link

ELK contest submission: route understanding through the human ontology

Vika, Ramana Kumar and Vikrant Varma

Mar 14, 2022, 9:42 PM

21 points

2 comments2 min readLW link

Dual use of artificial-intelligence-powered drug discovery

VaniverMar 15, 2022, 2:52 AM

91 points

15 comments1 min readLW link

(www.nature.com)

[Intro to brain-like-AGI safety] 8. Takeaways from neuro 1/2: On AGI development

Steven ByrnesMar 16, 2022, 1:59 PM

41 points

2 comments15 min readLW link

Some (potentially) fundable AI Safety Ideas

Logan RiggsMar 16, 2022, 12:48 PM

21 points

5 comments5 min readLW link

What do paradigm shifts look like?

leogaoMar 16, 2022, 7:17 PM

15 points

2 comments1 min readLW link

[Question] What is the equivalent of the “do” operator for finite factored sets?

Chris van MerwijkMar 17, 2022, 8:05 AM

8 points

2 comments1 min readLW link

[Question] What to do after inventing AGI?

elephantcrewMar 18, 2022, 10:30 PM

9 points

4 comments1 min readLW link

Goal-directedness: imperfect reasoning, limited knowledge and inaccurate beliefs

Morgan_RogersMar 19, 2022, 5:28 PM

4 points

1 comment21 min readLW link

Wargaming AGI Development

ryan_bMar 19, 2022, 5:59 PM

36 points

13 comments5 min readLW link

Exploring Finite Factored Sets with some toy examples

Thomas KehrenbergMar 19, 2022, 10:08 PM

36 points

1 comment9 min readLW link

(tm.kehrenberg.net)

Natural Value Learning

Chris van MerwijkMar 20, 2022, 12:44 PM

7 points

10 comments4 min readLW link

Why will an AGI be rational?

azsantoskMar 21, 2022, 9:54 PM

4 points

8 comments2 min readLW link

We cannot directly choose an AGI’s utility function

azsantoskMar 21, 2022, 10:08 PM

12 points

18 comments3 min readLW link

Progress Report 1: interpretability experiments & learning, testing compression hypotheses

Nathan Helm-BurgerMar 22, 2022, 8:12 PM

11 points

0 comments2 min readLW link

Lessons After a Couple Months of Trying to Do ML Research

RowanWangMar 22, 2022, 11:45 PM

68 points

8 comments6 min readLW link

Job Offering: Help Communicate Infrabayesianism

abramdemski, Vanessa Kosoy and Diffractor

Mar 23, 2022, 6:35 PM

135 points

21 comments1 min readLW link

A survey of tool use and workflows in alignment research

Logan Riggs, Jan, janus and jacquesthibs

Mar 23, 2022, 11:44 PM

43 points

5 comments1 min readLW link

Why Agent Foundations? An Overly Abstract Explanation

johnswentworthMar 25, 2022, 11:17 PM

247 points

54 comments8 min readLW link

[ASoT] Observations about ELK

leogaoMar 26, 2022, 12:42 AM

30 points

0 comments3 min readLW link

[Question] When people ask for your P(doom), do you give them your inside view or your betting odds?

Vivek HebbarMar 26, 2022, 11:08 PM

11 points

12 comments1 min readLW link

Compute Governance: The Role of Commodity Hardware

Jan

Mar 26, 2022, 10:08 AM

14 points

7 comments7 min readLW link

(universalprior.substack.com)

Agency and Coherence

David UdellMar 26, 2022, 7:25 PM

23 points

2 comments3 min readLW link

[ASoT] Some ways ELK could still be solvable in practice

leogaoMar 27, 2022, 1:15 AM

26 points

1 comment2 min readLW link

[Question] Your specific attitudes towards AI safety

Esben KranMar 27, 2022, 10:33 PM

8 points

22 comments1 min readLW link

[ASoT] Searching for consequentialist structure

leogaoMar 27, 2022, 7:09 PM

25 points

2 comments4 min readLW link

Vaniver’s ELK Submission

VaniverMar 28, 2022, 9:14 PM

10 points

0 comments7 min readLW link

Towards a better circuit prior: Improving on ELK state-of-the-art

evhubMar 29, 2022, 1:56 AM

19 points

0 comments16 min readLW link

Strategies for differential divulgation of key ideas in AI capability

azsantoskMar 29, 2022, 3:22 AM

8 points

0 comments6 min readLW link

[ASoT] Some thoughts about deceptive mesaoptimization

leogaoMar 28, 2022, 9:14 PM

24 points

5 comments7 min readLW link

[Question] What would make you confident that AGI has been achieved?

YitzMar 29, 2022, 11:02 PM

17 points

6 comments1 min readLW link

Progress Report 2

Nathan Helm-BurgerMar 30, 2022, 2:29 AM

4 points

1 comment1 min readLW link

[ASoT] Some thoughts about LM monologue limitations and ELK

leogaoMar 30, 2022, 2:26 PM

10 points

0 comments2 min readLW link

Procedurally evaluating factual accuracy: a request for research

Jacob_HiltonMar 30, 2022, 4:37 PM

24 points

2 comments6 min readLW link

No, EDT Did Not Get It Right All Along: Why the Coin Flip Creation Problem Is Irrelevant

HeighnMar 30, 2022, 6:41 PM

6 points

6 comments3 min readLW link

ELK Computational Complexity: Three Levels of Difficulty

abramdemskiMar 30, 2022, 8:56 PM

46 points

9 comments7 min readLW link

[Link] Training Compute-Optimal Large Language Models

nostalgebraistMar 31, 2022, 6:01 PM

50 points

23 comments1 min readLW link

(arxiv.org)

Newcomb’s problem is just a standard time consistency problem

basil.halperinMar 31, 2022, 5:32 PM

12 points

6 comments12 min readLW link

The Calculus of Newcomb’s Problem

HeighnApr 1, 2022, 2:41 PM

3 points

6 comments2 min readLW link

New Scaling Laws for Large Language Models

1a3ornApr 1, 2022, 8:41 PM

223 points

21 comments5 min readLW link

Interacting with a Boxed AI

aphyerApr 1, 2022, 10:42 PM

11 points

19 comments4 min readLW link

Optimality is the tiger, and agents are its teeth

VeedracApr 2, 2022, 12:46 AM

197 points

31 comments16 min readLW link

[Question] How can a layman contribute to AI Alignment efforts, given shorter timeline/doomier scenarios?

AprilSRApr 2, 2022, 4:34 AM

13 points

5 comments1 min readLW link

AI Governance across Slow/Fast Takeoff and Easy/Hard Alignment spectra

DavidmanheimApr 3, 2022, 7:45 AM

27 points

6 comments3 min readLW link

[Question] What are some ways in which we can die with more dignity?

Chris_LeongApr 3, 2022, 5:32 AM

14 points

19 comments1 min readLW link

[Question] Should we push for banning making hiring decisions based on AI?

ChristianKlApr 3, 2022, 7:46 PM

10 points

6 comments1 min readLW link

Bayeswatch 9.5: Rest & Relaxation

lsusrApr 4, 2022, 1:13 AM

24 points

1 comment2 min readLW link

Bayeswatch 6.5: Therapy

lsusrApr 4, 2022, 1:20 AM

15 points

0 comments1 min readLW link

Theories of Modularity in the Biological Literature

CallumMcDougall, Avery and Lucius Bushnaq

Apr 4, 2022, 12:48 PM

47 points

13 comments7 min readLW link

Google’s new 540 billion parameter language model

Matthew BarnettApr 4, 2022, 5:49 PM

108 points

83 comments1 min readLW link

(storage.googleapis.com)

Call For Distillers

johnswentworthApr 4, 2022, 6:25 PM

192 points

42 comments3 min readLW link

Is the scaling race finally on?

p.b.Apr 4, 2022, 7:53 PM

24 points

0 comments2 min readLW link

Yudkowsky Contra Christiano on AI Takeoff Speeds [Linkpost]

aogApr 5, 2022, 2:09 AM

18 points

0 comments11 min readLW link

[Cross-post] Half baked ideas: defining and measuring Artificial Intelligence system effectiveness

David JohnstonApr 5, 2022, 12:29 AM

2 points

0 comments7 min readLW link

[Question] Why is Toby Ord’s likelihood of human extinction due to AI so low?

ChristianKlApr 5, 2022, 12:16 PM

8 points

9 comments1 min readLW link

Non-programmers intro to AI for programmers

DustinApr 5, 2022, 6:12 PM

6 points

0 comments2 min readLW link

What Would A Fight Between Humanity And AGI Look Like?

johnswentworthApr 5, 2022, 8:03 PM

79 points

22 comments3 min readLW link

Supervise Process, not Outcomes

stuhlmueller and jungofthewon

Apr 5, 2022, 10:18 PM

119 points

8 comments10 min readLW link

AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

DanielFilanApr 5, 2022, 11:10 PM

23 points

9 comments52 min readLW link

[Question] What’s the problem with having an AI align itself?

FinalFormal2Apr 6, 2022, 12:59 AM

0 points

3 comments1 min readLW link

What if we stopped making GPUs for a bit?

MrPointyApr 5, 2022, 11:02 PM

−3 points

2 comments1 min readLW link

Don’t die with dignity; instead play to your outs

Jeffrey LadishApr 6, 2022, 7:53 AM

243 points

58 comments5 min readLW link

What I Was Thinking About Before Alignment

johnswentworthApr 6, 2022, 4:08 PM

77 points

8 comments5 min readLW link

[Link] A minimal viable product for alignment

janleikeApr 6, 2022, 3:38 PM

51 points

38 comments1 min readLW link

[Link] Why I’m excited about AI-assisted human feedback

janleikeApr 6, 2022, 3:37 PM

29 points

0 comments1 min readLW link

Testing PaLM prompts on GPT3

YitzApr 6, 2022, 5:21 AM

103 points

15 comments8 min readLW link

[ASoT] Some thoughts about imperfect world modeling

leogaoApr 7, 2022, 3:42 PM

7 points

0 comments4 min readLW link

Truthfulness, standards and credibility

Joe CollmanApr 7, 2022, 10:31 AM

12 points

2 comments32 min readLW link

What if “friendly/unfriendly” GAI isn’t a thing?

homunqApr 7, 2022, 4:54 PM

−1 points

4 comments2 min readLW link

Productive Mistakes, Not Perfect Answers

adamShimiApr 7, 2022, 4:41 PM

95 points

11 comments6 min readLW link

Believable near-term AI disaster

DagonApr 7, 2022, 6:20 PM

8 points

2 comments2 min readLW link

How BoMAI Might fail

Donald HobsonApr 7, 2022, 3:32 PM

11 points

3 comments2 min readLW link

DeepMind: The Podcast—Excerpts on AGI

WilliamKielyApr 7, 2022, 10:09 PM

75 points

10 comments5 min readLW link

AI Alignment and Recognition

Chris_LeongApr 8, 2022, 5:39 AM

7 points

2 comments1 min readLW link

Reverse (intent) alignment may allow for safer Oracles

azsantoskApr 8, 2022, 2:48 AM

4 points

0 comments4 min readLW link

AIs should learn human preferences, not biases

Stuart_ArmstrongApr 8, 2022, 1:45 PM

10 points

1 comment1 min readLW link

[Question] Is there a possibility that the upcoming scaling of data in language models causes A.G.I.?

ArtMiApr 8, 2022, 6:56 AM

2 points

0 comments1 min readLW link

Different perspectives on concept extrapolation

Stuart_ArmstrongApr 8, 2022, 10:42 AM

42 points

7 comments5 min readLW link

[RETRACTED] It’s time for EA leadership to pull the short-timelines fire alarm.

Not RelevantApr 8, 2022, 4:07 PM

112 points

165 comments4 min readLW link

Convincing All Capability Researchers

Logan RiggsApr 8, 2022, 5:40 PM

120 points

70 comments3 min readLW link

Language Model Tools for Alignment Research

Logan RiggsApr 8, 2022, 5:32 PM

27 points

0 comments2 min readLW link

[Question] What would the creation of aligned AGI look like for us?

PerhapsApr 8, 2022, 6:05 PM

3 points

4 comments1 min readLW link

Takeaways From 3 Years Working In Machine Learning

George3d6Apr 8, 2022, 5:14 PM

34 points

10 comments11 min readLW link

(www.epistem.ink)

[Question] Can AI systems have extremely impressive outputs and also not need to be aligned because they aren’t general enough or something?

WilliamKielyApr 9, 2022, 6:03 AM

6 points

3 comments1 min readLW link

Why Instrumental Goals are not a big AI Safety Problem

Jonathan PaulsonApr 9, 2022, 12:10 AM

0 points

9 comments3 min readLW link

Emergent Ventures/Schmidt (new grantor for individual researchers)

gwernApr 9, 2022, 2:41 PM

21 points

6 comments1 min readLW link

(marginalrevolution.com)

Strategies for keeping AIs narrow in the short term

RossinApr 9, 2022, 4:42 PM

9 points

3 comments3 min readLW link

A concrete bet offer to those with short AI timelines

Matthew Barnett and Tamay

Apr 9, 2022, 9:41 PM

195 points

104 comments4 min readLW link

Finally Entering Alignment

Ulisse MiniApr 10, 2022, 5:01 PM

75 points

8 comments2 min readLW link

[Question] Does non-access to outputs prevent recursive self-improvement?

Gunnar_ZarnckeApr 10, 2022, 6:37 PM

14 points

0 comments1 min readLW link

[Question] Convince me that humanity is as doomed by AGI as Yudkowsky et al., seems to believe

YitzApr 10, 2022, 9:02 PM

91 points

142 comments2 min readLW link

[Question] Could we set a resolution/stopper for the upper bound of the utility function of an AI?

FinalFormal2Apr 11, 2022, 3:10 AM

−5 points

2 comments1 min readLW link

What can people not smart/technical enough for AI research/AI risk work do to reduce AI-risk/maximize AI safety? (which is most people?)

Alex K. Chen (parrot)Apr 11, 2022, 2:05 PM

7 points

3 comments3 min readLW link

We should stop being so confident that AI coordination is unlikely

trevorApr 11, 2022, 10:27 PM

14 points

7 comments1 min readLW link

The Regulatory Option: A response to near 0% survival odds

Matthew LowensteinApr 11, 2022, 10:00 PM

45 points

21 comments6 min readLW link

[Question] How can I determine that Elicit is not some weak AGI’s attempt at taking over the world ?

Lucie PhilipponApr 12, 2022, 12:54 AM

5 points

3 comments1 min readLW link

[Question] Three questions about mesa-optimizers

Eric NeymanApr 12, 2022, 2:58 AM

23 points

5 comments3 min readLW link

A Small Negative Result on Debate

Sam BowmanApr 12, 2022, 6:19 PM

42 points

11 comments1 min readLW link

The Peerless

Tamsin LeakeApr 13, 2022, 1:07 AM

18 points

2 comments1 min readLW link

(carado.moe)

Convincing People of Alignment with Street Epistemology

Logan RiggsApr 12, 2022, 11:43 PM

54 points

4 comments3 min readLW link

[Question] “Fragility of Value” vs. LLMs

Not RelevantApr 13, 2022, 2:02 AM

32 points

32 comments1 min readLW link

How dath ilan coordinates around solving alignment

Thomas KwaApr 13, 2022, 4:22 AM

46 points

37 comments5 min readLW link

[Question] What’s a good probability distribution family (e.g. “log-normal”) to use for AGI timelines?

David Scott Krueger (formerly: capybaralet)Apr 13, 2022, 4:45 AM

9 points

12 comments1 min readLW link

Takeoff speeds have a huge effect on what it means to work on AI x-risk

BuckApr 13, 2022, 5:38 PM

117 points

25 comments2 min readLW link

Design, Implement and Verify

rwallaceApr 13, 2022, 6:14 PM

32 points

13 comments4 min readLW link

[Question] What to include in a guest lecture on existential risks from AI?

Aryeh EnglanderApr 13, 2022, 5:03 PM

20 points

9 comments1 min readLW link

A Quick Guide to Confronting Doom

RubyApr 13, 2022, 7:30 PM

224 points

36 comments2 min readLW link

Exploring toy neural nets under node removal. Section 1.

Donald HobsonApr 13, 2022, 11:30 PM

12 points

7 comments8 min readLW link

[Question] Unchangeable Code possible ?

AntonTimmerApr 14, 2022, 11:17 AM

7 points

9 comments1 min readLW link

How to become an AI safety researcher

peterbarnettApr 15, 2022, 11:41 AM

19 points

0 comments14 min readLW link

Early 2022 Paper Round-up

jsteinhardtApr 14, 2022, 8:50 PM

80 points

4 comments3 min readLW link

(bounded-regret.ghost.io)

[Question] Can someone explain to me why MIRI is so pessimistic of our chances of survival?

iamthouthouartiApr 14, 2022, 8:28 PM

10 points

7 comments1 min readLW link

Pivotal acts from Math AIs

azsantoskApr 15, 2022, 12:25 AM

10 points

4 comments5 min readLW link

Refine: An Incubator for Conceptual Alignment Research Bets

adamShimiApr 15, 2022, 8:57 AM

123 points

13 comments4 min readLW link

My least favorite thing

sudoApr 14, 2022, 10:33 PM

41 points

30 comments3 min readLW link

[Question] Constraining narrow AI in a corporate setting

MaximumLibertyApr 15, 2022, 10:36 PM

28 points

4 comments1 min readLW link

Pop Culture Alignment Research and Taxes

JanApr 16, 2022, 3:45 PM

16 points

14 comments11 min readLW link

(universalprior.substack.com)

Org announcement: [AC]RC

Vivek HebbarApr 17, 2022, 5:24 PM

79 points

12 comments1 min readLW link

Code Generation as an AI risk setting

Not RelevantApr 17, 2022, 10:27 PM

91 points

16 comments2 min readLW link

Mental Health and the Alignment Problem: A Compilation of Resources

Chris ScammellApr 18, 2022, 6:36 PM

139 points

7 comments17 min readLW link

Is “Control” of a Superintelligence Possible?

Mahdi ComplexApr 18, 2022, 4:03 PM

9 points

14 comments1 min readLW link

[Closed] Hiring a mathematician to work on the learning-theoretic AI alignment agenda

Vanessa KosoyApr 19, 2022, 6:44 AM

84 points

21 comments2 min readLW link

[Question] The two missing core reasons why aligning at-least-partially superhuman AGI is hard

Joel BurgetApr 19, 2022, 5:15 PM

7 points

2 comments1 min readLW link

[Question] How does the world look like 10 years after we have deployed an aligned AGI?

mukashiApr 19, 2022, 11:34 AM

4 points

3 comments1 min readLW link

[Question] Clarification on Definition of AGI

stanislawApr 19, 2022, 12:41 PM

0 points

1 comment1 min readLW link

[Question] What’s the Relationship Between “Human Values” and the Brain’s Reward System?

intersticeApr 19, 2022, 5:15 AM

36 points

16 comments1 min readLW link

Deceptive Agents are a Good Way to Do Things

David UdellApr 19, 2022, 6:04 PM

15 points

0 comments1 min readLW link

The Scale Problem in AI

tailcalledApr 19, 2022, 5:46 PM

22 points

17 comments3 min readLW link

Concept extrapolation: key posts

Stuart_ArmstrongApr 19, 2022, 10:01 AM

12 points

2 comments1 min readLW link

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Andrew_CritchApr 19, 2022, 8:25 PM

96 points

56 comments7 min readLW link

GPT-3 and concept extrapolation

Stuart_ArmstrongApr 20, 2022, 10:39 AM

19 points

28 comments1 min readLW link

[Intro to brain-like-AGI safety] 12. Two paths forward: “Controlled AGI” and “Social-instinct AGI”

Steven ByrnesApr 20, 2022, 12:58 PM

33 points

10 comments16 min readLW link

Preregistration: Air Conditioner Test

johnswentworthApr 21, 2022, 7:48 PM

109 points

64 comments9 min readLW link

[Question] Choice := Anthropics uncertainty? And potential implications for agency

Antoine de ScorrailleApr 21, 2022, 4:38 PM

5 points

1 comment1 min readLW link

Understanding the Merging of Opinions with Increasing Information theorem

ViktoriaMalyasovaApr 21, 2022, 2:13 PM

13 points

1 comment5 min readLW link

Early 2022 Paper Round-up (Part 2)

jsteinhardtApr 21, 2022, 11:40 PM

10 points

0 comments5 min readLW link

(bounded-regret.ghost.io)

[Question] What are the numbers in mind for the super-short AGI timelines so many long-termists are alarmed about?

Evan_GaensbauerApr 21, 2022, 11:32 PM

22 points

14 comments1 min readLW link

AI Will Multiply

harsimonyApr 22, 2022, 4:33 AM

13 points

4 comments1 min readLW link

(harsimony.wordpress.com)

Humanity as an entity: An alternative to Coherent Extrapolated Volition

Victor NovikovApr 22, 2022, 12:48 PM

0 points

2 comments4 min readLW link

[ASoT] Consequentialist models as a superset of mesaoptimizers

leogaoApr 23, 2022, 5:57 PM

36 points

2 comments4 min readLW link

Skilling-up in ML Engineering for Alignment: request for comments

CallumMcDougall and Jamie Bernardi

Apr 23, 2022, 3:11 PM

19 points

0 comments1 min readLW link

[Question] Wanting to change what you want

MithrandirApr 23, 2022, 4:23 AM

−1 points

1 comment1 min readLW link

Progress Report 5: tying it together

Nathan Helm-BurgerApr 23, 2022, 9:07 PM

10 points

0 comments2 min readLW link

Calling for Student Submissions: AI Safety Distillation Contest

ArisApr 24, 2022, 1:53 AM

48 points

15 comments4 min readLW link

Examining Evolution as an Upper Bound for AGI Timelines

meanderingmooseApr 24, 2022, 7:08 PM

5 points

1 comment9 min readLW link

AI safety raising awareness resources bleg

iivonenApr 24, 2022, 5:13 PM

6 points

1 comment1 min readLW link

Intuitions about solving hard problems

Richard_NgoApr 25, 2022, 3:29 PM

92 points

23 comments6 min readLW link

[Request for Distillation] Coherence of Distributed Decisions With Different Inputs Implies Conditioning

johnswentworthApr 25, 2022, 5:01 PM

22 points

14 comments2 min readLW link

dalle2 comments

nostalgebraistApr 26, 2022, 5:30 AM

183 points

13 comments13 min readLW link

(nostalgebraist.tumblr.com)

Make a neural network in ~10 minutes

Arjun YadavApr 26, 2022, 5:24 AM

8 points

0 comments4 min readLW link

(arjunyadav.net)

Law-Following AI 1: Sequence Introduction and Structure

CullenApr 27, 2022, 5:26 PM

16 points

10 comments9 min readLW link

Law-Following AI 2: Intent Alignment + Superintelligence → Lawless AI (By Default)

CullenApr 27, 2022, 5:27 PM

5 points

2 comments6 min readLW link

Law-Following AI 3: Lawless AI Agents Undermine Stabilizing Agreements

CullenApr 27, 2022, 5:30 PM

2 points

2 comments3 min readLW link

If you’re very optimistic about ELK then you should be optimistic about outer alignment

Sam MarksApr 27, 2022, 7:30 PM

17 points

8 comments3 min readLW link

AI Alternative Futures: Scenario Mapping Artificial Intelligence Risk—Request for Participation (Closed)

KakiliApr 27, 2022, 10:07 PM

10 points

2 comments8 min readLW link

The Speed + Simplicity Prior is probably anti-deceptive

Yonadav ShavitApr 27, 2022, 7:30 PM

30 points

29 comments12 min readLW link

Slides: Potential Risks From Advanced AI

Aryeh EnglanderApr 28, 2022, 2:15 AM

7 points

0 comments1 min readLW link

How Might an Alignment Attractor Look like?

ShmiApr 28, 2022, 6:46 AM

47 points

15 comments2 min readLW link

Naive comments on AGIlignment

EricfApr 28, 2022, 1:08 AM

2 points

4 comments1 min readLW link

[Question] Is alignment possible?

ShayApr 28, 2022, 9:18 PM

0 points

5 comments1 min readLW link

Learning the smooth prior

Geoffrey Irving, Rohin Shah and evhub

Apr 29, 2022, 9:10 PM

31 points

0 comments12 min readLW link

[Linkpost] New multi-modal Deepmind model fusing Chinchilla with images and videos

p.b.Apr 30, 2022, 3:47 AM

53 points

18 comments1 min readLW link

Note-Taking without Hidden Messages

HoagyApr 30, 2022, 11:15 AM

7 points

1 comment4 min readLW link

[Question] Why hasn’t deep learning generated significant economic value yet?

Alex_AltairApr 30, 2022, 8:27 PM

112 points

95 comments2 min readLW link

What is the solution to the Alignment problem?

AlgonApr 30, 2022, 11:19 PM

24 points

2 comments1 min readLW link

[Linkpost] Value extraction via language model abduction

Paul BricmanMay 1, 2022, 7:11 PM

4 points

3 comments1 min readLW link

(paulbricman.com)

ELK shaving

Miss Aligned AIMay 1, 2022, 9:05 PM

6 points

1 comment1 min readLW link

So has AI conquered Bridge ?

Ponder StibbonsMay 2, 2022, 3:01 PM

16 points

2 comments14 min readLW link

Information security considerations for AI and the long term future

Jeffrey Ladish and lennart

May 2, 2022, 8:54 PM

74 points

6 comments10 min readLW link

Is evolutionary influence the mesa objective that we’re interested in?

David JohnstonMay 3, 2022, 1:18 AM

3 points

2 comments5 min readLW link

Various Alignment Strategies (and how likely they are to work)

Logan ZoellnerMay 3, 2022, 4:54 PM

73 points

34 comments11 min readLW link

Introducing the ML Safety Scholars Program

Dan H, TW123, Mantas Mazeika, ozhang, Sidney Hough and Kevin Liu

May 4, 2022, 4:01 PM

73 points

2 comments3 min readLW link

Frankenstein: A Modern AGI

SableMay 5, 2022, 4:16 PM

9 points

10 comments9 min readLW link

[Question] What is bias in alignment terms?

Jonas KgomoMay 4, 2022, 9:35 PM

0 points

2 comments1 min readLW link

Ethan Caballero on Private Scaling Progress

Michaël TrazziMay 5, 2022, 6:32 PM

62 points

1 comment2 min readLW link

(theinsideview.github.io)

Apply to the second iteration of the ML for Alignment Bootcamp (MLAB 2) in Berkeley [Aug 15 - Fri Sept 2]

BuckMay 6, 2022, 4:23 AM

68 points

0 comments6 min readLW link

The case for becoming a black-box investigator of language models

BuckMay 6, 2022, 2:35 PM

118 points

19 comments3 min readLW link

Getting GPT-3 to predict Metaculus questions

MathiasKBMay 6, 2022, 6:01 AM

68 points

8 comments2 min readLW link

But What’s Your New Alignment Insight, out of a Future-Textbook Paragraph?

David UdellMay 7, 2022, 3:10 AM

24 points

18 comments5 min readLW link

Video and Transcript of Presentation on Existential Risk from Power-Seeking AI

Joe CarlsmithMay 8, 2022, 3:50 AM

20 points

1 comment29 min readLW link

A Bird’s Eye View of the ML Field [Pragmatic AI Safety #2]

Dan H and TW123

May 9, 2022, 5:18 PM

126 points

5 comments35 min readLW link

Introduction to Pragmatic AI Safety [Pragmatic AI Safety #1]

Dan H and TW123

May 9, 2022, 5:06 PM

70 points

1 comment6 min readLW link

Jobs: Help scale up LM alignment research at NYU

Sam BowmanMay 9, 2022, 2:12 PM

60 points

1 comment1 min readLW link

When is AI safety research harmful?

NathanBarnardMay 9, 2022, 6:19 PM

2 points

0 comments8 min readLW link

AI Alignment YouTube Playlists

jacquesthibs and remember

May 9, 2022, 9:33 PM

29 points

4 comments1 min readLW link

Examining Armstrong’s category of generalized models

Morgan_RogersMay 10, 2022, 9:07 AM

14 points

0 comments7 min readLW link

An Inside View of AI Alignment

Ansh RadhakrishnanMay 11, 2022, 2:16 AM

31 points

2 comments2 min readLW link

[Question] What are your recommendations for technical AI alignment podcasts?

Evan_GaensbauerMay 11, 2022, 9:52 PM

5 points

4 comments1 min readLW link

Deepmind’s Gato: Generalist Agent

Daniel KokotajloMay 12, 2022, 4:01 PM

164 points

61 comments1 min readLW link

“A Generalist Agent”: New DeepMind Publication

1a3ornMay 12, 2022, 3:30 PM

79 points

43 comments1 min readLW link

A tentative dialogue with a Friendly-boxed-super-AGI on brain uploads

Ramiro P.May 12, 2022, 7:40 PM

1 point

12 comments4 min readLW link

Positive outcomes under an unaligned AGI takeover

YitzMay 12, 2022, 7:45 AM

19 points

12 comments3 min readLW link

The Last Paperclip

Logan ZoellnerMay 12, 2022, 7:25 PM

57 points

15 comments17 min readLW link

RLHF

Ansh RadhakrishnanMay 12, 2022, 9:18 PM

16 points

5 comments5 min readLW link

[Question] What to do when starting a business in an imminent-AGI world?

ryan_bMay 12, 2022, 9:07 PM

25 points

7 comments1 min readLW link

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Rohin Shah and Geoffrey Irving

May 13, 2022, 12:17 PM

145 points

35 comments9 min readLW link

“Tech company singularities”, and steering them to reduce x-risk

Andrew_CritchMay 13, 2022, 5:24 PM

73 points

12 comments4 min readLW link

Against Time in Agent Models

johnswentworthMay 13, 2022, 7:55 PM

50 points

12 comments3 min readLW link

Frame for Take-Off Speeds to inform compute governance & scaling alignment

Logan RiggsMay 13, 2022, 10:23 PM

15 points

2 comments2 min readLW link

Alignment as Constraints

Logan RiggsMay 13, 2022, 10:07 PM

10 points

0 comments2 min readLW link

Fermi estimation of the impact you might have working on AI safety

Fabien RogerMay 13, 2022, 5:49 PM

6 points

0 comments1 min readLW link

An observation about Hubinger et al.’s framework for learned optimization

carboniferous_umbraculum May 13, 2022, 4:20 PM

33 points

9 comments8 min readLW link

Thoughts on AI Safety Camp

Charlie SteinerMay 13, 2022, 7:16 AM

24 points

7 comments7 min readLW link

Clarifying the confusion around inner alignment

Rauno ArikeMay 13, 2022, 11:05 PM

27 points

0 comments11 min readLW link

[Link post] Promising Paths to Alignment—Connor Leahy | Talk

frances_lorenzMay 14, 2022, 4:01 PM

34 points

0 comments1 min readLW link

The AI Countdown Clock

River LewisMay 15, 2022, 6:37 PM

40 points

27 comments2 min readLW link

(heytraveler.substack.com)

Surviving Automation In The 21st Century—Part 1

George3d6May 15, 2022, 7:16 PM

27 points

17 comments8 min readLW link

(www.epistem.ink)

Why I’m Optimistic About Near-Term AI Risk

harsimonyMay 15, 2022, 11:05 PM

57 points

28 comments1 min readLW link

Optimization at a Distance

johnswentworthMay 16, 2022, 5:58 PM

78 points

13 comments4 min readLW link

[Question] To what extent is your AGI timeline bimodal or otherwise “bumpy”?

jchanMay 16, 2022, 5:42 PM

13 points

2 comments1 min readLW link

Proxy misspecification and the capabilities vs. value learning race

Sam MarksMay 16, 2022, 6:58 PM

19 points

1 comment4 min readLW link

How to invest in expectation of AGI?

JakobovskiMay 17, 2022, 11:03 AM

3 points

4 comments1 min readLW link

[Intro to brain-like-AGI safety] 15. Conclusion: Open problems, how to help, AMA

Steven ByrnesMay 17, 2022, 3:11 PM

81 points

11 comments14 min readLW link

Actionable-guidance and roadmap recommendations for the NIST AI Risk Management Framework

Dan H and Tony Barrett

May 17, 2022, 3:26 PM

25 points

0 comments3 min readLW link

What are the possible trajectories of an AGI/ASI world?

JakobovskiMay 17, 2022, 1:28 PM

0 points

2 comments1 min readLW link

Maxent and Abstractions: Current Best Arguments

johnswentworthMay 18, 2022, 7:54 PM

34 points

2 comments3 min readLW link

How to get into AI safety research

Stuart_ArmstrongMay 18, 2022, 6:05 PM

44 points

7 comments1 min readLW link

A bridge to Dath Ilan? Improved governance on the critical path to AI alignment.

Jackson WagnerMay 18, 2022, 3:51 PM

23 points

0 comments11 min readLW link

We have achieved Noob Gains in AI

phdeadMay 18, 2022, 8:56 PM

114 points

21 comments7 min readLW link

[Question] Why does gradient descent always work on neural networks?

MichaelDickensMay 20, 2022, 9:13 PM

15 points

11 comments1 min readLW link

How RL Agents Behave When Their Actions Are Modified? [Distillation post]

PabloAMCMay 20, 2022, 6:47 PM

21 points

0 comments8 min readLW link

Over-digitalization: A Prelude to Analogia (Chapter 6)

Justin BullockMay 20, 2022, 4:39 PM

3 points

0 comments13 min readLW link

Clarifying what ELK is trying to achieve

Towards_KeeperhoodMay 21, 2022, 7:34 AM

7 points

0 comments5 min readLW link

[Short version] Information Loss --> Basin flatness

Vivek HebbarMay 21, 2022, 12:59 PM

11 points

0 comments1 min readLW link

Information Loss --> Basin flatness

Vivek HebbarMay 21, 2022, 12:58 PM

47 points

31 comments7 min readLW link

What kinds of algorithms do multi-human imitators learn?

Chris van Merwijk and Joar Skalse

May 22, 2022, 2:27 PM

20 points

0 comments3 min readLW link

Are human imitators superhuman models with explicit constraints on capabilities?

Chris van MerwijkMay 22, 2022, 12:46 PM

41 points

3 comments1 min readLW link

Adversarial attacks and optimal control

JanMay 22, 2022, 6:22 PM

16 points

7 comments8 min readLW link

(universalprior.substack.com)

CNN feature visualization in 50 lines of code

StefanHexMay 26, 2022, 11:02 AM

17 points

4 comments5 min readLW link

[Question] [Alignment] Is there a census on who’s working on what?

CedarMay 23, 2022, 3:33 PM

23 points

6 comments1 min readLW link

AXRP Episode 15 - Natural Abstractions with John Wentworth

DanielFilanMay 23, 2022, 5:40 AM

32 points

1 comment57 min readLW link

Why I’m Worried About AI

peterbarnettMay 23, 2022, 9:13 PM

21 points

2 comments12 min readLW link

Complex Systems for AI Safety [Pragmatic AI Safety #3]

Dan H and TW123

May 24, 2022, 12:00 AM

49 points

2 comments21 min readLW link

The No Free Lunch theorems and their Razor

Adrià Garriga-alonsoMay 24, 2022, 6:40 AM

47 points

3 comments9 min readLW link

Google’s Imagen uses larger text encoder

Ben LivengoodMay 24, 2022, 9:55 PM

27 points

2 comments1 min readLW link

autonomy: the missing AGI ingredient?

nostalgebraistMay 25, 2022, 12:33 AM

61 points

13 comments6 min readLW link

Paper: Teaching GPT3 to express uncertainty in words

Owain_EvansMay 31, 2022, 1:27 PM

96 points

7 comments4 min readLW link

Croesus, Cerberus, and the magpies: a gentle introduction to Eliciting Latent Knowledge

Alexandre VariengienMay 27, 2022, 5:58 PM

14 points

0 comments16 min readLW link

[Question] How much white collar work could be automated using existing ML models?

AMMay 26, 2022, 8:09 AM

25 points

4 comments1 min readLW link

The Pointers Problem—Distilled

Nina PanicksseryMay 26, 2022, 10:44 PM

9 points

0 comments2 min readLW link

Iterated Distillation-Amplification, Gato, and Proto-AGI [Re-Explained]

Gabe MMay 27, 2022, 5:42 AM

21 points

4 comments6 min readLW link

Bootstrapping Language Models

harsimonyMay 27, 2022, 7:43 PM

7 points

5 comments2 min readLW link

Understanding Selection Theorems

adamkMay 28, 2022, 1:49 AM

35 points

3 comments7 min readLW link

[Question] What have been the major “triumphs” in the field of AI over the last ten years?

lcMay 28, 2022, 7:49 PM

35 points

10 comments1 min readLW link

[Question] Bayesian Persuasion?

Karthik TadepalliMay 28, 2022, 5:52 PM

8 points

2 comments1 min readLW link

Distributed Decisions

johnswentworthMay 29, 2022, 2:43 AM

65 points

4 comments6 min readLW link

The Problem With The Current State of AGI Definitions

YitzMay 29, 2022, 1:58 PM

40 points

22 comments8 min readLW link

Functional Analysis Reading Group

Ulisse MiniMay 28, 2022, 2:40 AM

4 points

0 comments1 min readLW link

[Question] Impact of “‘Let’s think step by step’ is all you need”?

yrimonJul 24, 2022, 8:59 PM

20 points

2 comments1 min readLW link

Perform Tractable Research While Avoiding Capabilities Externalities [Pragmatic AI Safety #4]

Dan H and TW123

May 30, 2022, 8:25 PM

43 points

3 comments25 min readLW link

[Question] What is the state of Chinese AI research?

RatiosMay 31, 2022, 10:05 AM

34 points

17 comments1 min readLW link

The Brain That Builds Itself

JanMay 31, 2022, 9:42 AM

55 points

6 comments8 min readLW link

(universalprior.substack.com)

Machines vs. Memes 2: Memetically-Motivated Model Extensions

naterushMay 31, 2022, 10:03 PM

4 points

0 comments4 min readLW link

Machines vs Memes Part 3: Imitation and Memes

ceru23Jun 1, 2022, 1:36 PM

5 points

0 comments7 min readLW link

Paradigms of AI alignment: components and enablers

VikaJun 2, 2022, 6:19 AM

48 points

4 comments8 min readLW link

The Bio Anchors Forecast

Ansh RadhakrishnanJun 2, 2022, 1:32 AM

12 points

0 comments3 min readLW link

[MLSN #4]: Many New Interpretability Papers, Virtual Logit Matching, Rationalization Helps Robustness

Dan HJun 3, 2022, 1:20 AM

18 points

0 comments4 min readLW link

The prototypical catastrophic AI action is getting root access to its datacenter

BuckJun 2, 2022, 11:46 PM

142 points

10 comments2 min readLW link

Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing

BuckJun 2, 2022, 11:48 PM

33 points

0 comments3 min readLW link

Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

johnswentworthJun 4, 2022, 5:41 AM

118 points

52 comments2 min readLW link

How to pursue a career in technical AI alignment

Charlie Rogers-SmithJun 4, 2022, 9:11 PM

63 points

0 comments39 min readLW link

Noisy environment regulate utility maximizers

Niclas KupperJun 5, 2022, 6:48 PM

4 points

0 comments7 min readLW link

Why agents are powerful

Daniel KokotajloJun 6, 2022, 1:37 AM

35 points

7 comments7 min readLW link

Why do some people try to make AGI?

TekhneMakreJun 6, 2022, 9:14 AM

14 points

7 comments3 min readLW link

Some ideas for follow-up projects to Redwood Research’s recent paper

JanBJun 6, 2022, 1:29 PM

10 points

0 comments7 min readLW link

Reading the ethicists 2: Hunting for AI alignment papers

Charlie SteinerJun 6, 2022, 3:49 PM

21 points

1 comment7 min readLW link

DALL-E 2 - Unofficial Natural Language Image Editing, Art Critique Survey

bakztfutureJun 6, 2022, 6:27 PM

0 points

0 comments1 min readLW link

(bakztfuture.substack.com)

Thinking about Broad Classes of Utility-like Functions

J BostockJun 7, 2022, 2:05 PM

7 points

0 comments4 min readLW link

Thoughts on Formalizing Composition

Tom LieberumJun 7, 2022, 7:51 AM

13 points

0 comments7 min readLW link

“Pivotal Acts” means something specific

RaemonJun 7, 2022, 9:56 PM

114 points

23 comments2 min readLW link

Why I don’t believe in doom

mukashiJun 7, 2022, 11:49 PM

6 points

30 comments4 min readLW link

[Question] Has anyone actually tried to convince Terry Tao or other top mathematicians to work on alignment?

P.Jun 8, 2022, 10:26 PM

52 points

49 comments4 min readLW link

Today in AI Risk History: The Terminator (1984 film) was released.

ImpassionataJun 9, 2022, 1:32 AM

−3 points

6 comments1 min readLW link

There’s probably a tradeoff between AI capability and safety, and we should act like it

David JohnstonJun 9, 2022, 12:17 AM

3 points

3 comments1 min readLW link

AI Could Defeat All Of Us Combined

HoldenKarnofskyJun 9, 2022, 3:50 PM

168 points

29 comments17 min readLW link

(www.cold-takes.com)

[Question] If there was a millennium equivalent prize for AI alignment, what would the problems be?

Yair HalberstadtJun 9, 2022, 4:56 PM

17 points

4 comments1 min readLW link

[Linkpost & Discussion] AI Trained on 4Chan Becomes ‘Hate Speech Machine’ [and outperforms GPT-3 on TruthfulQA Benchmark?!]

YitzJun 9, 2022, 10:59 AM

16 points

5 comments2 min readLW link

(www.vice.com)

If no near-term alignment strategy, research should aim for the long-term

harsimonyJun 9, 2022, 7:10 PM

7 points

1 comment1 min readLW link

How Do Selection Theorems Relate To Interpretability?

johnswentworthJun 9, 2022, 7:39 PM

57 points

14 comments3 min readLW link

Bureaucracy of AIs

Logan ZoellnerJun 9, 2022, 11:03 PM

11 points

6 comments14 min readLW link

Tao, Kontsevich & others on HLAI in Math

intersticeJun 10, 2022, 2:25 AM

41 points

5 comments2 min readLW link

(www.youtube.com)

Open Problems in AI X-Risk [PAIS #5]

Dan H and TW123

Jun 10, 2022, 2:08 AM

50 points

3 comments36 min readLW link

[Question] why assume AGIs will optimize for fixed goals?

nostalgebraistJun 10, 2022, 1:28 AM

119 points

52 comments4 min readLW link

Progress Report 6: get the tool working

Nathan Helm-BurgerJun 10, 2022, 11:18 AM

4 points

0 comments2 min readLW link

Another plausible scenario of AI risk: AI builds military infrastructure while collaborating with humans, defects later.

avturchinJun 10, 2022, 5:24 PM

10 points

2 comments1 min readLW link

[Question] Is AI Alignment Impossible?

HeighnJun 10, 2022, 10:08 AM

3 points

3 comments1 min readLW link

How dangerous is human-level AI?

Alex_AltairJun 10, 2022, 5:38 PM

21 points

4 comments8 min readLW link

[linkpost] The final AI benchmark: BIG-bench

RomanSJun 10, 2022, 8:53 AM

30 points

19 comments1 min readLW link

[Question] Could Patent-Trolling delay AI timelines?

Pablo RepettoJun 10, 2022, 2:53 AM

1 point

3 comments1 min readLW link

How fast can we perform a forward pass?

jsteinhardtJun 10, 2022, 11:30 PM

53 points

9 comments15 min readLW link

(bounded-regret.ghost.io)

Steganography and the CycleGAN—alignment failure case study

Jan CzechowskiJun 11, 2022, 9:41 AM

28 points

0 comments4 min readLW link

AGI Safety Communications Initiative

inesJun 11, 2022, 5:34 PM

7 points

0 comments1 min readLW link

[Question] How much stupider than humans can AI be and still kill us all through sheer numbers and resource access?

ShmiJun 12, 2022, 1:01 AM

11 points

12 comments1 min readLW link

A claim that Google’s LaMDA is sentient

Ben LivengoodJun 12, 2022, 4:18 AM

31 points

134 comments1 min readLW link

Let’s not name specific AI labs in an adversarial context

acylhalideJun 12, 2022, 5:38 PM

8 points

17 comments1 min readLW link

[Question] How much does cybersecurity reduce AI risk?

DarmaniJun 12, 2022, 10:13 PM

34 points

23 comments1 min readLW link

[Question] How are compute assets distributed in the world?

Chris van MerwijkJun 12, 2022, 10:13 PM

29 points

7 comments1 min readLW link

The beautiful magical enchanted golden Dall-e Mini is underrated

p.b.Jun 13, 2022, 7:58 AM

14 points

0 comments1 min readLW link

Why so little AI risk on rationalist-adjacent blogs?

Grant DemareeJun 13, 2022, 6:31 AM

46 points

23 comments8 min readLW link

[Question] What’s the “This AI is of moral concern.” fire alarm?

Quintin PopeJun 13, 2022, 8:05 AM

37 points

56 comments2 min readLW link

On A List of Lethalities

ZviJun 13, 2022, 12:30 PM

154 points

48 comments54 min readLW link

(thezvi.wordpress.com)

[Question] Can you MRI a deep learning model?

Yair HalberstadtJun 13, 2022, 1:43 PM

3 points

3 comments1 min readLW link

What are some smaller-but-concrete challenges related to AI safety that are impacting people today?

nonzerosumJun 13, 2022, 5:36 PM

3 points

2 comments1 min readLW link

Continuity Assumptions

Jan_KulveitJun 13, 2022, 9:31 PM

26 points

13 comments4 min readLW link

Crypto-fed Computation

aaguirreJun 13, 2022, 9:20 PM

22 points

7 comments7 min readLW link

Blake Richards on Why he is Skeptical of Existential Risk from AI

Michaël TrazziJun 14, 2022, 7:09 PM

41 points

12 comments4 min readLW link

(theinsideview.ai)

I applied for a MIRI job in 2020. Here’s what happened next.

ViktoriaMalyasovaJun 15, 2022, 7:37 PM

78 points

17 comments7 min readLW link

[Question] What are all the AI Alignment and AI Safety Communication Hubs?

Gunnar_ZarnckeJun 15, 2022, 4:16 PM

25 points

5 comments1 min readLW link

[Question] Has there been any work on attempting to use Pascal’s Mugging to make an AGI behave?

Chris_LeongJun 15, 2022, 8:33 AM

7 points

17 comments1 min readLW link

Will vague “AI sentience” concerns do more for AI safety than anything else we might do?

Aryeh EnglanderJun 14, 2022, 11:53 PM

12 points

1 comment1 min readLW link

“Brain enthusiasts” in AI Safety

Jan and Samuel Nellessen

Jun 18, 2022, 9:59 AM

57 points

5 comments10 min readLW link

(universalprior.substack.com)

FYI: I’m working on a book about the threat of AGI/ASI for a general audience. I hope it will be of value to the cause and the community

Darren McKeeJun 15, 2022, 6:08 PM

40 points

17 comments2 min readLW link

A central AI alignment problem: capabilities generalization, and the sharp left turn

So8resJun 15, 2022, 1:10 PM

253 points

48 comments10 min readLW link

AI Risk, as Seen on Snapchat

dkirmaniJun 16, 2022, 7:31 PM

23 points

8 comments1 min readLW link

Humans are very reliable agents

alyssavanceJun 16, 2022, 10:02 PM

248 points

35 comments3 min readLW link

A possible AI-inoculation due to early “robot uprising”

ShmiJun 16, 2022, 9:21 PM

16 points

2 comments1 min readLW link

A transparency and interpretability tech tree

evhubJun 16, 2022, 11:44 PM

136 points

10 comments19 min readLW link

Value extrapolation vs Wireheading

Stuart_ArmstrongJun 17, 2022, 3:02 PM

16 points

1 comment1 min readLW link

#SAT with Tensor Networks

Adam JermynJun 17, 2022, 1:20 PM

4 points

0 comments2 min readLW link

wrapper-minds are the enemy

nostalgebraistJun 17, 2022, 1:58 AM

92 points

36 comments8 min readLW link

[Question] Is there an unified way to make sense of ai failure modes?

walking_mushroomJun 17, 2022, 6:00 PM

3 points

1 comment1 min readLW link

Quantifying General Intelligence

JasonBrownJun 17, 2022, 9:57 PM

9 points

6 comments13 min readLW link

Pivotal outcomes and pivotal processes

Andrew_CritchJun 17, 2022, 11:43 PM

79 points

32 comments4 min readLW link

Scott Aaronson is joining OpenAI to work on AI safety

peterbarnettJun 18, 2022, 4:06 AM

117 points

31 comments1 min readLW link

(scottaaronson.blog)

Can DALL-E understand simple geometry?

Isaac KingJun 18, 2022, 4:37 AM

25 points

2 comments1 min readLW link

Specific problems with specific animal comparisons for AI policy

trevorJun 19, 2022, 1:27 AM

3 points

1 comment2 min readLW link

Agent level parallelism

Johannes C. MayerJun 18, 2022, 8:56 PM

6 points

5 comments1 min readLW link

[Link-post] On Deference and Yudkowsky’s AI Risk Estimates

bmgJun 19, 2022, 5:25 PM

27 points

7 comments1 min readLW link

Where I agree and disagree with Eliezer

paulfchristianoJun 19, 2022, 7:15 PM

777 points

205 comments20 min readLW link

Let’s See You Write That Corrigibility Tag

Eliezer YudkowskyJun 19, 2022, 9:11 PM

109 points

67 comments1 min readLW link

Are we there yet?

theflowerpotJun 20, 2022, 11:19 AM

2 points

2 comments1 min readLW link

On corrigibility and its basin

Donald HobsonJun 20, 2022, 4:33 PM

16 points

3 comments2 min readLW link

Parable: The Bomb that doesn’t Explode

Lone PineJun 20, 2022, 4:41 PM

14 points

5 comments2 min readLW link

Key Papers in Language Model Safety

aogJun 20, 2022, 3:00 PM

37 points

1 comment22 min readLW link

Survey re AIS/LTism office in NYC

RyanCareyJun 20, 2022, 7:21 PM

7 points

0 comments1 min readLW link

An AI defense-offense symmetry thesis

Chris van MerwijkJun 20, 2022, 10:01 AM

10 points

9 comments3 min readLW link

[Question] How easy/fast is it for a AGI to hack computers/a human brain?

Noosphere89Jun 21, 2022, 12:34 AM

0 points

1 comment1 min readLW link

A Toy Model of Gradient Hacking

Oam PatelJun 20, 2022, 10:01 PM

25 points

7 comments4 min readLW link

Debating Whether AI is Conscious Is A Distraction from Real Problems

sidhe_theyJun 21, 2022, 4:56 PM

4 points

10 comments1 min readLW link

(techpolicy.press)

The inordinately slow spread of good AGI conversations in ML

Rob BensingerJun 21, 2022, 4:09 PM

160 points

66 comments8 min readLW link

[Question] What is the difference between AI misalignment and bad programming?

puzzleGuzzleJun 21, 2022, 9:52 PM

6 points

2 comments1 min readLW link

Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

elspoodJun 21, 2022, 11:55 PM

331 points

40 comments7 min readLW link

A Quick List of Some Problems in AI Alignment As A Field

Nicholas / Heather KrossJun 21, 2022, 11:23 PM

74 points

12 comments6 min readLW link

(www.thinkingmuchbetter.com)

Confusion about neuroscience/cognitive science as a danger for AI Alignment

Samuel NellessenJun 22, 2022, 5:59 PM

2 points

1 comment3 min readLW link

(snellessen.com)

Air Conditioner Test Results & Discussion

johnswentworthJun 22, 2022, 10:26 PM

80 points

38 comments6 min readLW link

Loose thoughts on AGI risk

YitzJun 23, 2022, 1:02 AM

7 points

3 comments1 min readLW link

[Question] What’s the contingency plan if we get AGI tomorrow?

YitzJun 23, 2022, 3:10 AM

61 points

24 comments1 min readLW link

[Question] What are the best “policy” approaches in worlds where alignment is difficult?

LHAJun 23, 2022, 1:53 AM

1 point

0 comments1 min readLW link

[Question] Is CIRL a promising agenda?

Chris_LeongJun 23, 2022, 5:12 PM

25 points

12 comments1 min readLW link

Half-baked AI Safety ideas thread

Aryeh EnglanderJun 23, 2022, 4:11 PM

58 points

60 comments1 min readLW link

20 Critiques of AI Safety That I Found on Twitter

dkirmaniJun 23, 2022, 7:23 PM

21 points

16 comments1 min readLW link

Linkpost: Robin Hanson—Why Not Wait On AI Risk?

Yair HalberstadtJun 24, 2022, 2:23 PM

41 points

14 comments1 min readLW link

(www.overcomingbias.com)

Raphaël Millière on Generalization and Scaling Maximalism

Michaël TrazziJun 24, 2022, 6:18 PM

21 points

2 comments4 min readLW link

(theinsideview.ai)

[Question] Do alignment concerns extend to powerful non-AI agents?

OzyrusJun 24, 2022, 6:26 PM

21 points

13 comments1 min readLW link

Dependencies for AGI pessimism

YitzJun 24, 2022, 10:25 PM

6 points

4 comments1 min readLW link

What if the best path for a person who wants to work on AGI alignment is to join Facebook or Google?

dbaschJun 24, 2022, 9:23 PM

2 points

3 comments1 min readLW link

[Link] Adversarially trained neural representations may already be as robust as corresponding biological neural representations

Gunnar_ZarnckeJun 24, 2022, 8:51 PM

35 points

9 comments1 min readLW link

AI-Written Critiques Help Humans Notice Flaws

paulfchristianoJun 25, 2022, 5:22 PM

133 points

5 comments3 min readLW link

(openai.com)

[LQ] Some Thoughts on Messaging Around AI Risk

DragonGodJun 25, 2022, 1:53 PM

5 points

3 comments6 min readLW link

[Question] Should any human enslave an AGI system?

AlignmentMirrorJun 25, 2022, 7:35 PM

−13 points

44 comments1 min readLW link

The Basics of AGI Policy (Flowchart)

trevorJun 26, 2022, 2:01 AM

18 points

8 comments2 min readLW link

Slow motion videos as AI risk intuition pumps

Andrew_CritchJun 14, 2022, 7:31 PM

209 points

36 comments2 min readLW link

Robin Hanson asks “Why Not Wait On AI Risk?”

Gunnar_ZarnckeJun 26, 2022, 11:32 PM

22 points

4 comments1 min readLW link

(www.overcomingbias.com)

Epistemic modesty and how I think about AI risk

Aryeh EnglanderJun 27, 2022, 6:47 PM

22 points

4 comments4 min readLW link

Announcing the Inverse Scaling Prize ($250k Prize Pool)

Ethan Perez, Ian McKenzie and Sam Bowman

Jun 27, 2022, 3:58 PM

166 points

14 comments7 min readLW link

Scott Aaronson and Steven Pinker Debate AI Scaling

LironJun 28, 2022, 4:04 PM

37 points

10 comments1 min readLW link

(scottaaronson.blog)

Four reasons I find AI safety emotionally compelling

KatWoods and AmberDawn

Jun 28, 2022, 2:10 PM

38 points

3 comments4 min readLW link

Some alternative AI safety research projects

Michele CampoloJun 28, 2022, 2:09 PM

9 points

0 comments3 min readLW link

Assessing AlephAlphas Multimodal Model

p.b.Jun 28, 2022, 9:28 AM

30 points

5 comments3 min readLW link

Kurzgesagt – The Last Human (Youtube)

habrykaJun 29, 2022, 3:28 AM

54 points

7 comments1 min readLW link

(www.youtube.com)

Can We Align AI by Having It Learn Human Preferences? I’m Scared (summary of last third of Human Compatible)

apollonianbluesJun 29, 2022, 4:09 AM

19 points

3 comments6 min readLW link

Looking back on my alignment PhD

TurnTroutJul 1, 2022, 3:19 AM

287 points

60 comments11 min readLW link

Will Capabilities Generalise More?

Ramana KumarJun 29, 2022, 5:12 PM

109 points

38 comments4 min readLW link

Gradient hacking: definitions and examples

Richard_NgoJun 29, 2022, 9:35 PM

24 points

1 comment5 min readLW link

[Question] Correcting human error vs doing exactly what you’re told—is there literature on this in context of general system design?

Jan CzechowskiJun 29, 2022, 9:30 PM

6 points

0 comments1 min readLW link

Most Functions Have Undesirable Global Extrema

En KepeigJun 30, 2022, 5:10 PM

8 points

5 comments3 min readLW link

$500 bounty for alignment contest ideas

Orpheus16Jun 30, 2022, 1:56 AM

29 points

5 comments2 min readLW link

Quick survey on AI alignment resources

frances_lorenzJun 30, 2022, 7:09 PM

14 points

0 comments1 min readLW link

[Linkpost] Solving Quantitative Reasoning Problems with Language Models

YitzJun 30, 2022, 6:58 PM

76 points

15 comments2 min readLW link

(storage.googleapis.com)

GPT-3 Catching Fish in Morse Code

Megan KinnimentJun 30, 2022, 9:22 PM

110 points

27 comments8 min readLW link

Selection processes for subagents

Ryan KiddJun 30, 2022, 11:57 PM

33 points

2 comments9 min readLW link

AI safety university groups: a promising opportunity to reduce existential risk

micJul 1, 2022, 3:59 AM

13 points

0 comments11 min readLW link

Safetywashing

Adam SchollJul 1, 2022, 11:56 AM

212 points

17 comments1 min readLW link

[Question] AGI alignment with what?

AlignmentMirrorJul 1, 2022, 10:22 AM

6 points

10 comments1 min readLW link

What Is The True Name of Modularity?

CallumMcDougall, Lucius Bushnaq and Avery

Jul 1, 2022, 2:55 PM

21 points

10 comments12 min readLW link

AXRP Episode 16 - Preparing for Debate AI with Geoffrey Irving

DanielFilanJul 1, 2022, 10:20 PM

14 points

0 comments37 min readLW link

Agenty AGI – How Tempting?

PeterMcCluskeyJul 1, 2022, 11:40 PM

21 points

3 comments5 min readLW link

(www.bayesianinvestor.com)

[Linkpost] Existential Risk Analysis in Empirical Research Papers

Dan HJul 2, 2022, 12:09 AM

40 points

0 comments1 min readLW link

(arxiv.org)

Minerva

AlgonJul 1, 2022, 8:06 PM

35 points

6 comments2 min readLW link

(ai.googleblog.com)

Could an AI Alignment Sandbox be useful?

Michael SoareverixJul 2, 2022, 5:06 AM

2 points

1 comment1 min readLW link

Goal-directedness: tackling complexity

Morgan_RogersJul 2, 2022, 1:51 PM

8 points

0 comments38 min readLW link

[Question] Which one of these two academic routes should I take to end up in AI Safety?

Martín SotoJul 3, 2022, 1:05 AM

5 points

2 comments1 min readLW link

Wonder and The Golden AI Rule

JeffreyKJul 3, 2022, 6:21 PM

0 points

4 comments6 min readLW link

Decision theory and dynamic inconsistency

paulfchristianoJul 3, 2022, 10:20 PM

66 points

33 comments10 min readLW link

(sideways-view.com)

AI Forecasting: One Year In

jsteinhardtJul 4, 2022, 5:10 AM

131 points

12 comments6 min readLW link

(bounded-regret.ghost.io)

Remaking EfficientZero (as best I can)

HoagyJul 4, 2022, 11:03 AM

34 points

9 comments22 min readLW link

Please help us communicate AI xrisk. It could save the world.

otto.bartenJul 4, 2022, 9:47 PM

4 points

7 comments2 min readLW link

Benchmark for successful concept extrapolation/avoiding goal misgeneralization

Stuart_ArmstrongJul 4, 2022, 8:48 PM

80 points

12 comments4 min readLW link

Anthropic’s SoLU (Softmax Linear Unit)

Joel BurgetJul 4, 2022, 6:38 PM

15 points

1 comment4 min readLW link

(transformer-circuits.pub)

[AN #172] Sorry for the long hiatus!

Rohin ShahJul 5, 2022, 6:20 AM

54 points

0 comments3 min readLW link

(mailchi.mp)

Principles for Alignment/Agency Projects

johnswentworthJul 7, 2022, 2:07 AM

115 points

20 comments4 min readLW link

Race Along Rashomon Ridge

Stephen Fowler, Peter S. Park and MichaelEinhorn

Jul 7, 2022, 3:20 AM

49 points

15 comments8 min readLW link

Confusions in My Model of AI Risk

peterbarnettJul 7, 2022, 1:05 AM

21 points

9 comments5 min readLW link

Safety considerations for online generative modeling

Sam MarksJul 7, 2022, 6:31 PM

41 points

9 comments14 min readLW link

Reinforcement Learner Wireheading

Nate ShowellJul 8, 2022, 5:32 AM

8 points

2 comments4 min readLW link

MATS Models

johnswentworthJul 9, 2022, 12:14 AM

84 points

5 comments16 min readLW link

Train first VS prune first in neural networks.

Donald HobsonJul 9, 2022, 3:53 PM

20 points

5 comments2 min readLW link

Research Notes: What are we aligning for?

Shoshannah TekofskyJul 8, 2022, 10:13 PM

19 points

8 comments2 min readLW link

Report from a civilizational observer on Earth

owencbJul 9, 2022, 5:26 PM

49 points

12 comments6 min readLW link

Visualizing Neural networks, how to blame the bias

Donald HobsonJul 9, 2022, 3:52 PM

7 points

1 comment6 min readLW link

Comment on “Propositions Concerning Digital Minds and Society”

Zack_M_DavisJul 10, 2022, 5:48 AM

95 points

12 comments8 min readLW link

Hessian and Basin volume

Vivek HebbarJul 10, 2022, 6:59 AM

33 points

9 comments4 min readLW link

Checksum Sensor Alignment

lsusrJul 11, 2022, 3:31 AM

12 points

2 comments1 min readLW link

The Alignment Problem

lsusrJul 11, 2022, 3:03 AM

45 points

20 comments3 min readLW link

[Question] How do AI timelines affect how you live your life?

Quadratic ReciprocityJul 11, 2022, 1:54 PM

77 points

47 comments1 min readLW link

Three Minimum Pivotal Acts Possible by Narrow AI

Michael SoareverixJul 12, 2022, 9:51 AM

0 points

4 comments2 min readLW link

On how various plans miss the hard bits of the alignment challenge

So8resJul 12, 2022, 2:49 AM

258 points

81 comments29 min readLW link

[Question] What is wrong with this approach to corrigibility?

Rafael CosmanJul 12, 2022, 10:55 PM

7 points

8 comments1 min readLW link

MIRI Conversations: Technology Forecasting & Gradualism (Distillation)

CallumMcDougallJul 13, 2022, 3:55 PM

31 points

1 comment20 min readLW link

[Question] Which AI Safety research agendas are the most promising?

Chris_LeongJul 13, 2022, 7:54 AM

27 points

6 comments1 min readLW link

Deep learning curriculum for large language model alignment

Jacob_HiltonJul 13, 2022, 9:58 PM

53 points

3 comments1 min readLW link

(github.com)

Artificial Sandwiching: When can we test scalable alignment protocols without humans?

Sam BowmanJul 13, 2022, 9:14 PM

40 points

6 comments5 min readLW link

[Question] How to impress students with recent advances in ML?

Charbel-RaphaëlJul 14, 2022, 12:03 AM

12 points

2 comments1 min readLW link

Circumventing interpretability: How to defeat mind-readers

Lee SharkeyJul 14, 2022, 4:59 PM

94 points

8 comments36 min readLW link

Musings on the Human Objective Function

Michael SoareverixJul 15, 2022, 7:13 AM

3 points

0 comments3 min readLW link

Peter Singer’s first published piece on AI

FaiJul 15, 2022, 6:18 AM

20 points

5 comments1 min readLW link

(link.springer.com)

Notes on Learning the Prior

carboniferous_umbraculum Jul 15, 2022, 5:28 PM

21 points

2 comments25 min readLW link

Proposed Orthogonality Theses #2-5

rjbgJul 14, 2022, 10:59 PM

6 points

0 comments2 min readLW link

A story about a duplicitous API

LiLiLiJul 15, 2022, 6:26 PM

2 points

0 comments1 min readLW link

Safety Implications of LeCun’s path to machine intelligence

Ivan VendrovJul 15, 2022, 9:47 PM

89 points

16 comments6 min readLW link

QNR Prospects

PeterMcCluskeyJul 16, 2022, 2:03 AM

38 points

3 comments8 min readLW link

(www.bayesianinvestor.com)

All AGI safety questions welcome (especially basic ones) [July 2022]

plex and Robert Miles

Jul 16, 2022, 12:57 PM

84 points

130 comments3 min readLW link

Alignment as Game Design

Shoshannah TekofskyJul 16, 2022, 10:36 PM

11 points

7 comments2 min readLW link

Why I Think Abrupt AI Takeoff

lincolnquirkJul 17, 2022, 5:04 PM

14 points

6 comments1 min readLW link

Why you might expect homogeneous take-off: evidence from ML research

Andrei AlexandruJul 17, 2022, 8:31 PM

24 points

0 comments10 min readLW link

What should you change in response to an “emergency”? And AI risk

AnnaSalamonJul 18, 2022, 1:11 AM

303 points

60 comments6 min readLW link

Quantilizers and Generative Models

Adam JermynJul 18, 2022, 4:32 PM

24 points

5 comments4 min readLW link

Training goals for large language models

Johannes TreutleinJul 18, 2022, 7:09 AM

26 points

5 comments19 min readLW link

Machine Learning Model Sizes and the Parameter Gap [abridged]

Pablo VillalobosJul 18, 2022, 4:51 PM

20 points

0 comments1 min readLW link

(epochai.org)

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya CotraJul 18, 2022, 7:06 PM

310 points

89 comments84 min readLW link

At what point will we know if Eliezer’s predictions are right or wrong?

anonymous123456Jul 18, 2022, 10:06 PM

5 points

6 comments1 min readLW link

A daily routine I do for my AI safety research work

scasperJul 19, 2022, 9:58 PM

15 points

7 comments1 min readLW link

Pitfalls with Proofs

scasperJul 19, 2022, 10:21 PM

19 points

21 comments8 min readLW link

Which singularity schools plus the no singularity school was right?

Noosphere89Jul 23, 2022, 3:16 PM

9 points

27 comments9 min readLW link

Defining Optimization in a Deeper Way Part 3

J BostockJul 20, 2022, 10:06 PM

8 points

0 comments2 min readLW link

[AN #173] Recent language model results from DeepMind

Rohin ShahJul 21, 2022, 2:30 AM

37 points

9 comments8 min readLW link

(mailchi.mp)

[Question] How much to optimize for the short-timelines scenario?

SoerenMindJul 21, 2022, 10:47 AM

19 points

3 comments1 min readLW link

Making DALL-E Count

DirectedEvolutionJul 22, 2022, 9:11 AM

23 points

12 comments4 min readLW link

Conditioning Generative Models with Restrictions

Adam JermynJul 21, 2022, 8:33 PM

16 points

4 comments8 min readLW link

General alignment properties

TurnTroutAug 8, 2022, 11:40 PM

46 points

2 comments1 min readLW link

Which values are stable under ontology shifts?

Richard_NgoJul 23, 2022, 2:40 AM

68 points

47 comments3 min readLW link

(thinkingcomplete.blogspot.com)

Trying out Prompt Engineering on TruthfulQA

Megan KinnimentJul 23, 2022, 2:04 AM

10 points

0 comments8 min readLW link

Symbolic distillation, Diffusion, Entropy, Replicators, Agents, oh my (a mid-low quality thinking out loud post)

the gears to ascensionJul 23, 2022, 9:13 PM

2 points

2 comments6 min readLW link

Eavesdropping on Aliens: A Data Decoding Challenge

anonymousaisafetyJul 24, 2022, 4:35 AM

44 points

9 comments4 min readLW link

How much should we worry about mesa-optimization challenges?

sudoJul 25, 2022, 3:56 AM

4 points

13 comments2 min readLW link

[Question] Does agent foundations cover all future ML systems?

Jonas HallgrenJul 25, 2022, 1:17 AM

2 points

0 comments1 min readLW link

[Question] How optimistic should we be about AI figuring out how to interpret itself?

oh54321Jul 25, 2022, 10:09 PM

3 points

1 comment1 min readLW link

Active Inference as a formalisation of instrumental convergence

Roman LeventovJul 26, 2022, 5:55 PM

6 points

2 comments3 min readLW link

(direct.mit.edu)

«Boundaries» Sequence (Index Post)

Andrew_CritchJul 26, 2022, 7:12 PM

23 points

1 comment1 min readLW link

Moral strategies at different capability levels

Richard_NgoJul 27, 2022, 6:50 PM

95 points

14 comments5 min readLW link

(thinkingcomplete.blogspot.com)

Principles of Privacy for Alignment Research

johnswentworthJul 27, 2022, 7:53 PM

68 points

30 comments7 min readLW link

Seeking beta readers who are ignorant of biology but knowledgeable about AI safety

Holly_ElmoreJul 27, 2022, 11:02 PM

10 points

6 comments1 min readLW link

Defining Optimization in a Deeper Way Part 4

J BostockJul 28, 2022, 5:02 PM

7 points

0 comments5 min readLW link

Announcing the AI Safety Field Building Hub, a new effort to provide AISFB projects, mentorship, and funding

Vael GatesJul 28, 2022, 9:29 PM

49 points

3 comments6 min readLW link

Distillation Contest—Results and Recap

ArisJul 29, 2022, 5:40 PM

33 points

0 comments7 min readLW link

Abstracting The Hardness of Alignment: Unbounded Atomic Optimization

adamShimiJul 29, 2022, 6:59 PM

62 points

3 comments16 min readLW link

How transparency changed over time

ViktoriaMalyasovaJul 30, 2022, 4:36 AM

21 points

0 comments6 min readLW link

Translating between Latent Spaces

JamesH, Jeremy Gillen and NickyP

Jul 30, 2022, 3:25 AM

20 points

1 comment8 min readLW link

AGI-level reasoner will appear sooner than an agent; what the humanity will do with this reasoner is critical

Roman LeventovJul 30, 2022, 8:56 PM

24 points

10 comments1 min readLW link

chinchilla’s wild implications

nostalgebraistJul 31, 2022, 1:18 AM

366 points

114 comments11 min readLW link

Technical AI Alignment Study Group

Eric KAug 1, 2022, 6:33 PM

5 points

0 comments1 min readLW link

[Question] Which intro-to-AI-risk text would you recommend to...

SherrinfordAug 1, 2022, 9:36 AM

12 points

1 comment1 min readLW link

Two-year update on my personal AI timelines

Ajeya CotraAug 2, 2022, 11:07 PM

287 points

60 comments16 min readLW link

What are the Red Flags for Neural Network Suffering? - Seeds of Science call for reviewers

rogersbaconAug 2, 2022, 10:37 PM

24 points

5 comments1 min readLW link

Precursor checking for deceptive alignment

evhubAug 3, 2022, 10:56 PM

18 points

0 comments14 min readLW link

Survey: What (de)motivates you about AI risk?

Daniel_FriedrichAug 3, 2022, 7:17 PM

1 point

0 comments1 min readLW link

(forms.gle)

High Reliability Orgs, and AI Companies

RaemonAug 4, 2022, 5:45 AM

73 points

6 comments12 min readLW link

Interpretability isn’t Free

Joel BurgetAug 4, 2022, 3:02 PM

10 points

1 comment2 min readLW link

[Question] AI alignment: Would a lazy self-preservation instinct be sufficient?

BrainFrogAug 4, 2022, 5:53 PM

−1 points

4 comments1 min readLW link

[Question] What drives progress, theory or application?

lberglundAug 5, 2022, 1:14 AM

5 points

1 comment1 min readLW link

The Pragmascope Idea

johnswentworthAug 4, 2022, 9:52 PM

55 points

19 comments3 min readLW link

$20K In Bounties for AI Safety Public Materials

Dan H, TW123 and ozhang

Aug 5, 2022, 2:52 AM

68 points

7 comments6 min readLW link

Rant on Problem Factorization for Alignment

johnswentworthAug 5, 2022, 7:23 PM

73 points

48 comments6 min readLW link

Announcing the Introduction to ML Safety course

Dan H, TW123 and ozhang

Aug 6, 2022, 2:46 AM

69 points

6 comments7 min readLW link

Why I Am Skeptical of AI Regulation as an X-Risk Mitigation Strategy

A RayAug 6, 2022, 5:46 AM

31 points

14 comments2 min readLW link

My advice on finding your own path

A RayAug 6, 2022, 4:57 AM

34 points

3 comments3 min readLW link

A Deceptively Simple Argument in favor of Problem Factorization

Logan ZoellnerAug 6, 2022, 5:32 PM

3 points

4 comments1 min readLW link

[Question] Can we get full audio for Eliezer’s conversation with Sam Harris?

JakubKAug 7, 2022, 8:35 PM

30 points

8 comments1 min readLW link

How Deadly Will Roughly-Human-Level AGI Be?

David UdellAug 8, 2022, 1:59 AM

12 points

6 comments1 min readLW link

Broad Basins and Data Compression

Jeremy Gillen, Stephen Fowler and Thomas Larsen

Aug 8, 2022, 8:33 PM

29 points

6 comments7 min readLW link

Encultured AI Pre-planning, Part 1: Enabling New Benchmarks

Andrew_Critch and Nick Hay

Aug 8, 2022, 10:44 PM

62 points

2 comments6 min readLW link

Encultured AI, Part 1 Appendix: Relevant Research Examples

Andrew_Critch and Nick Hay

Aug 8, 2022, 10:44 PM

11 points

1 comment7 min readLW link

Disagreements about Alignment: Why, and how, we should try to solve them

ojorgensenAug 9, 2022, 6:49 PM

8 points

1 comment16 min readLW link

[Question] Many Gods refutation and Instrumental Goals. (Proper one)

aditya malikAug 9, 2022, 11:59 AM

0 points

15 comments1 min readLW link

[Question] Is it possible to find venture capital for AI research org with strong safety focus?

AnonResearchAug 9, 2022, 4:12 PM

6 points

1 comment1 min readLW link

Using GPT-3 to augment human intelligence

Henrik KarlssonAug 10, 2022, 3:54 PM

48 points

7 comments18 min readLW link

(escapingflatland.substack.com)

Emergent Abilities of Large Language Models [Linkpost]

aogAug 10, 2022, 6:02 PM

25 points

2 comments1 min readLW link

(arxiv.org)

How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)

Peter S. Park, NickyP and Stephen Fowler

Aug 10, 2022, 6:14 PM

26 points

30 comments11 min readLW link

The alignment problem from a deep learning perspective

Richard_NgoAug 10, 2022, 10:46 PM

93 points

13 comments27 min readLW link

How much alignment data will we need in the long run?

Jacob_HiltonAug 10, 2022, 9:39 PM

34 points

15 comments4 min readLW link

Thoughts on the good regulator theorem

JonasMossAug 11, 2022, 12:08 PM

8 points

0 comments4 min readLW link

Language models seem to be much better than humans at next-token prediction

Buck, Fabien Roger and LawrenceC

Aug 11, 2022, 5:45 PM

164 points

56 comments13 min readLW link

[Question] Seriously, what goes wrong with “reward the agent when it makes you smile”?

TurnTroutAug 11, 2022, 10:22 PM

76 points

41 comments2 min readLW link

Dissected boxed AI

Nathan1123Aug 12, 2022, 2:37 AM

−8 points

2 comments1 min readLW link

Steelmining via Analogy

Paul BricmanAug 13, 2022, 9:59 AM

24 points

0 comments2 min readLW link

(paulbricman.com)

Refining the Sharp Left Turn threat model, part 1: claims and mechanisms

Vika, Vikrant Varma, Ramana Kumar and Mary Phuong

Aug 12, 2022, 3:17 PM

71 points

3 comments3 min readLW link

(vkrakovna.wordpress.com)

Oversight Misses 100% of Thoughts The AI Does Not Think

johnswentworthAug 12, 2022, 4:30 PM

85 points

49 comments1 min readLW link

Timelines explanation post part 1 of ?

Nathan Helm-BurgerAug 12, 2022, 4:13 PM

10 points

1 comment2 min readLW link

A little playing around with Blenderbot3

Nathan Helm-BurgerAug 12, 2022, 4:06 PM

9 points

0 comments1 min readLW link

DeepMind alignment team opinions on AGI ruin arguments

VikaAug 12, 2022, 9:06 PM

364 points

34 comments14 min readLW link

the Insulated Goal-Program idea

Tamsin LeakeAug 13, 2022, 9:57 AM

39 points

3 comments2 min readLW link

(carado.moe)

goal-program bricks

Tamsin LeakeAug 13, 2022, 10:08 AM

27 points

2 comments2 min readLW link

(carado.moe)

How I think about alignment

Linda LinseforsAug 13, 2022, 10:01 AM

30 points

11 comments5 min readLW link

Refine’s First Blog Post Day

adamShimiAug 13, 2022, 10:23 AM

55 points

3 comments1 min readLW link

Shapes of Mind and Pluralism in Alignment

adamShimiAug 13, 2022, 10:01 AM

30 points

1 comment2 min readLW link

An extended rocket alignment analogy

rememberAug 13, 2022, 6:22 PM

25 points

3 comments4 min readLW link

Cultivating Valiance

Shoshannah TekofskyAug 13, 2022, 6:47 PM

35 points

4 comments4 min readLW link

Evolution is a bad analogy for AGI: inner alignment

Quintin PopeAug 13, 2022, 10:15 PM

52 points

6 comments8 min readLW link

A brief note on Simplicity Bias

carboniferous_umbraculum Aug 14, 2022, 2:05 AM

16 points

0 comments4 min readLW link

Seeking Interns/RAs for Mechanistic Interpretability Projects

Neel NandaAug 15, 2022, 7:11 AM

61 points

0 comments2 min readLW link

Extreme Security

lcAug 15, 2022, 12:11 PM

39 points

4 comments5 min readLW link

On Preference Manipulation in Reward Learning Processes

Felix HofstätterAug 15, 2022, 7:32 PM

8 points

0 comments4 min readLW link

Limits of Asking ELK if Models are Deceptive

Oam PatelAug 15, 2022, 8:44 PM

6 points

2 comments4 min readLW link

What Makes an Idea Understandable? On Architecturally and Culturally Natural Ideas.

NickyP, Peter S. Park and Stephen Fowler

Aug 16, 2022, 2:09 AM

17 points

2 comments16 min readLW link

Deception as the optimal: mesa-optimizers and inner alignment

Eleni AngelouAug 16, 2022, 4:49 AM

10 points

0 comments5 min readLW link

Understanding differences between humans and intelligence-in-general to build safe AGI

Florian_DietzAug 16, 2022, 8:27 AM

7 points

8 comments1 min readLW link

Autonomy as taking responsibility for reference maintenance

Ramana KumarAug 17, 2022, 12:50 PM

52 points

3 comments5 min readLW link

Thoughts on ‘List of Lethalities’

Alex Lawsen Aug 17, 2022, 6:33 PM

25 points

0 comments10 min readLW link

Human Mimicry Mainly Works When We’re Already Close

johnswentworthAug 17, 2022, 6:41 PM

68 points

16 comments5 min readLW link

The Core of the Alignment Problem is...

Thomas Larsen, Jeremy Gillen and JamesH

Aug 17, 2022, 8:07 PM

58 points

10 comments9 min readLW link

Concrete Advice for Forming Inside Views on AI Safety

Neel NandaAug 17, 2022, 10:02 PM

18 points

6 comments10 min readLW link

Announcing Encultured AI: Building a Video Game

Andrew_Critch and Nick Hay

Aug 18, 2022, 2:16 AM

103 points

26 comments4 min readLW link

Announcing the Distillation for Alignment Practicum (DAP)

Jonas Hallgren and CallumMcDougall

Aug 18, 2022, 7:50 PM

21 points

3 comments3 min readLW link

Alignment’s phlogiston

Eleni AngelouAug 18, 2022, 10:27 PM

10 points

2 comments2 min readLW link

[Question] Are language models close to the superhuman level in philosophy?

Roman LeventovAug 19, 2022, 4:43 AM

5 points

2 comments2 min readLW link

How to do theoretical research, a personal perspective

Mark XuAug 19, 2022, 7:41 PM

84 points

4 comments15 min readLW link

Refine’s Second Blog Post Day

adamShimiAug 20, 2022, 1:01 PM

19 points

0 comments1 min readLW link

No One-Size-Fit-All Epistemic Strategy

adamShimiAug 20, 2022, 12:56 PM

23 points

1 comment2 min readLW link

Reducing Goodhart: Announcement, Executive Summary

Charlie SteinerAug 20, 2022, 9:49 AM

14 points

0 comments1 min readLW link

Pivotal acts using an unaligned AGI?

Simon FischerAug 21, 2022, 5:13 PM

26 points

3 comments8 min readLW link

Beyond Hyperanthropomorphism

PointlessOneAug 21, 2022, 5:55 PM

3 points

17 comments1 min readLW link

(studio.ribbonfarm.com)

AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler

DanielFilanAug 21, 2022, 11:50 PM

16 points

0 comments34 min readLW link

[Question] What if we solve AI Safety but no one cares

142857Aug 22, 2022, 5:38 AM

18 points

5 comments1 min readLW link

Finding Goals in the World Model

Jeremy Gillen, JamesH and Thomas Larsen

Aug 22, 2022, 6:06 PM

55 points

8 comments13 min readLW link

[Question] AI Box Experiment: Are people still interested?

DoubleAug 31, 2022, 3:04 AM

31 points

13 comments1 min readLW link

Stable Diffusion has been released

P.Aug 22, 2022, 7:42 PM

15 points

7 comments1 min readLW link

(stability.ai)

Discussion on utilizing AI for alignment

eliflandAug 23, 2022, 2:36 AM

16 points

3 comments1 min readLW link

(www.foxy-scout.com)

It Looks Like You’re Trying To Take Over The Narrative

George3d6Aug 24, 2022, 1:36 PM

2 points

20 comments9 min readLW link

(www.epistem.ink)

Thoughts about OOD alignment

CatneeAug 24, 2022, 3:31 PM

11 points

10 comments2 min readLW link

Vingean Agency

abramdemskiAug 24, 2022, 8:08 PM

57 points

13 comments3 min readLW link

Interspecies diplomacy as a potentially productive lens on AGI alignment

Shariq HashmeAug 24, 2022, 5:59 PM

5 points

1 comment2 min readLW link

OpenAI’s Alignment Plans

dkirmaniAug 24, 2022, 7:39 PM

60 points

17 comments5 min readLW link

(openai.com)

What Makes A Good Measurement Device?

johnswentworthAug 24, 2022, 10:45 PM

35 points

7 comments2 min readLW link

Evaluating OpenAI’s alignment plans using training stories

ojorgensenAug 25, 2022, 4:12 PM

3 points

0 comments5 min readLW link

A Test for Language Model Consciousness

Ethan PerezAug 25, 2022, 7:41 PM

18 points

14 comments10 min readLW link

Seeking Student Submissions: Edit Your Source Code Contest

ArisAug 26, 2022, 2:08 AM

28 points

5 comments2 min readLW link

Basin broadness depends on the size and number of orthogonal features

CallumMcDougall, Avery and Lucius Bushnaq

Aug 27, 2022, 5:29 PM

34 points

21 comments6 min readLW link

Sufficiently many Godzillas as an alignment strategy

142857Aug 28, 2022, 12:08 AM

8 points

3 comments1 min readLW link

Artificial Moral Advisors: A New Perspective from Moral Psychology

David GrossAug 28, 2022, 4:37 PM

25 points

1 comment1 min readLW link

(dl.acm.org)

First thing AI will do when it takes over is get fission going

visiaxAug 28, 2022, 5:56 AM

−2 points

0 comments1 min readLW link

Robert Long On Why Artificial Sentience Might Matter

Michaël TrazziAug 28, 2022, 5:30 PM

26 points

5 comments5 min readLW link

(theinsideview.ai)

How Do AI Timelines Affect Existential Risk?

Stephen McAleeseAug 29, 2022, 4:57 PM

7 points

9 comments23 min readLW link

[Question] What is the best critique of AI existential risk arguments?

joshcAug 30, 2022, 2:18 AM

5 points

10 comments1 min readLW link

Can We Align a Self-Improving AGI?

Peter S. ParkAug 30, 2022, 12:14 AM

8 points

5 comments11 min readLW link

LessWrong’s prediction on apocalypse due to AGI (Aug 2022)

LetUsTalkAug 29, 2022, 6:46 PM

7 points

13 comments1 min readLW link

[Question] How can I reconcile the two most likely requirements for humanities near-term survival.

Erlja Jkdf.Aug 29, 2022, 6:46 PM

1 point

6 comments1 min readLW link

How likely is deceptive alignment?

evhubAug 30, 2022, 7:34 PM

72 points

21 comments60 min readLW link

Inner Alignment via Superpowers

JamesH, Thomas Larsen and Jeremy Gillen

Aug 30, 2022, 8:01 PM

37 points

13 comments4 min readLW link

Three scenarios of pseudo-alignment

Eleni AngelouSep 3, 2022, 12:47 PM

9 points

0 comments3 min readLW link

New 80,000 Hours problem profile on existential risks from AI

Benjamin HiltonAug 31, 2022, 5:36 PM

28 points

7 comments7 min readLW link

(80000hours.org)

Survey of NLP Researchers: NLP is contributing to AGI progress; major catastrophe plausible

Sam BowmanAug 31, 2022, 1:39 AM

89 points

6 comments2 min readLW link

Infra-Exercises, Part 1

Diffractor, Jack Parker and Connall Garrod

Sep 1, 2022, 5:06 AM

49 points

9 comments1 min readLW link

Alignment is hard. Communicating that, might be harder

Eleni AngelouSep 1, 2022, 4:57 PM

7 points

8 comments3 min readLW link

A Survey of Foundational Methods in Inverse Reinforcement Learning

adamkSep 1, 2022, 6:21 PM

16 points

0 comments12 min readLW link

AI Safety and Neighboring Communities: A Quick-Start Guide, as of Summer 2022

Sam BowmanSep 1, 2022, 7:15 PM

74 points

2 comments7 min readLW link

A Richly Interactive AGI Alignment Chart

lisperatiSep 2, 2022, 12:44 AM

14 points

6 comments1 min readLW link

Replacement for PONR concept

Daniel KokotajloSep 2, 2022, 12:09 AM

44 points

6 comments2 min readLW link

AI coordination needs clear wins

evhubSep 1, 2022, 11:41 PM

134 points

15 comments2 min readLW link

Simulators

janusSep 2, 2022, 12:45 PM

472 points

103 comments44 min readLW link

(generative.ink)

Laziness in AI

Richard HenageSep 2, 2022, 5:04 PM

11 points

5 comments1 min readLW link

Agency engineering: is AI-alignment “to human intent” enough?

catubcSep 2, 2022, 6:14 PM

9 points

10 comments6 min readLW link

Sticky goals: a concrete experiment for understanding deceptive alignment

evhubSep 2, 2022, 9:57 PM

35 points

13 comments3 min readLW link

[Question] Request for Alignment Research Project Recommendations

Rauno ArikeSep 3, 2022, 3:29 PM

10 points

2 comments1 min readLW link

Bugs or Features?

qbolecSep 3, 2022, 7:04 AM

69 points

9 comments2 min readLW link

Private alignment research sharing and coordination

porbySep 4, 2022, 12:01 AM

54 points

10 comments5 min readLW link

AXRP Episode 18 - Concept Extrapolation with Stuart Armstrong

DanielFilanSep 3, 2022, 11:12 PM

10 points

1 comment39 min readLW link

[Question] Help me find a good Hackathon subject

Charbel-RaphaëlSep 4, 2022, 8:40 AM

6 points

18 comments1 min readLW link

How To Know What the AI Knows—An ELK Distillation

Fabien RogerSep 4, 2022, 12:46 AM

5 points

0 comments5 min readLW link

AI Governance Needs Technical Work

MauSep 5, 2022, 10:28 PM

39 points

1 comment9 min readLW link

Community Building for Graduate Students: A Targeted Approach

Neil CrawfordSep 6, 2022, 5:17 PM

6 points

0 comments3 min readLW link

program searches

Tamsin LeakeSep 5, 2022, 8:04 PM

21 points

2 comments2 min readLW link

(carado.moe)

Alex Lawsen On Forecasting AI Progress

Michaël TrazziSep 6, 2022, 9:32 AM

18 points

0 comments2 min readLW link

(theinsideview.ai)

It’s (not) how you use it

Eleni AngelouSep 7, 2022, 5:15 PM

8 points

1 comment2 min readLW link

AI-assisted list of ten concrete alignment things to do right now

lemonhopeSep 7, 2022, 8:38 AM

8 points

5 comments4 min readLW link

Progress Report 7: making GPT go hurrdurr instead of brrrrrrr

Nathan Helm-BurgerSep 7, 2022, 3:28 AM

21 points

0 comments4 min readLW link

Is there a list of projects to get started with Interpretability?

Franziska FischerSep 7, 2022, 4:27 AM

8 points

2 comments1 min readLW link

Understanding and avoiding value drift

TurnTroutSep 9, 2022, 4:16 AM

40 points

9 comments6 min readLW link

Linkpost: Github Copilot productivity experiment

Daniel KokotajloSep 8, 2022, 4:41 AM

88 points

4 comments1 min readLW link

(github.blog)

Thoughts on AGI consciousness / sentience

Steven ByrnesSep 8, 2022, 4:40 PM

37 points

37 comments6 min readLW link

What Should AI Owe To Us? Accountable and Aligned AI Systems via Contractualist AI Alignment

xuanSep 8, 2022, 3:04 PM

30 points

15 comments25 min readLW link

A rough idea for solving ELK: An approach for training generalist agents like GATO to make plans and describe them to humans clearly and honestly.

Michael SoareverixSep 8, 2022, 3:20 PM

2 points

2 comments2 min readLW link

Dath Ilan’s Views on Stopgap Corrigibility

David UdellSep 22, 2022, 4:16 PM

50 points

17 comments13 min readLW link

(www.glowfic.com)

Most People Start With The Same Few Bad Ideas

johnswentworthSep 9, 2022, 12:29 AM

161 points

30 comments3 min readLW link

Oversight Leagues: The Training Game as a Feature

Paul BricmanSep 9, 2022, 10:08 AM

20 points

6 comments10 min readLW link

AI alignment with humans… but with which humans?

geoffreymillerSep 9, 2022, 6:21 PM

11 points

33 comments3 min readLW link

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth BarnesSep 9, 2022, 10:46 PM

94 points

7 comments10 min readLW link

Swap and Scale

Stephen FowlerSep 9, 2022, 10:41 PM

17 points

3 comments1 min readLW link

AlexaTM − 20 Billion Parameter Model With Impressive Performance

MrThinkSep 9, 2022, 9:46 PM

5 points

0 comments1 min readLW link

[Fun][Link] Alignment SMBC Comic

Gunnar_ZarnckeSep 9, 2022, 9:38 PM

7 points

2 comments1 min readLW link

(www.smbc-comics.com)

Path dependence in ML inductive biases

Vivek Hebbar and evhub

Sep 10, 2022, 1:38 AM

43 points

13 comments10 min readLW link

ethics and anthropics of homomorphically encrypted computations

Tamsin LeakeSep 9, 2022, 10:49 AM

43 points

49 comments3 min readLW link

(carado.moe)

Join ASAP! (AI Safety Accountability Programme) 🚀

CallumMcDougallSep 10, 2022, 11:15 AM

19 points

0 comments3 min readLW link

AI Safety field-building projects I’d like to see

Orpheus16Sep 11, 2022, 11:43 PM

44 points

7 comments6 min readLW link

[Question] Why do People Think Intelligence Will be “Easy”?

DragonGodSep 12, 2022, 5:32 PM

15 points

32 comments2 min readLW link

Black Box Investigation Research Hackathon

Esben Kran and Jonas Hallgren

Sep 12, 2022, 7:20 AM

9 points

4 comments2 min readLW link

Argument against 20% GDP growth from AI within 10 years [Linkpost]

aogSep 12, 2022, 4:08 AM

58 points

21 comments5 min readLW link

(twitter.com)

Ideological Inference Engines: Making Deontology Differentiable*

Paul BricmanSep 12, 2022, 12:00 PM

6 points

0 comments14 min readLW link

Deep Q-Networks Explained

Jay BaileySep 13, 2022, 12:01 PM

37 points

4 comments22 min readLW link

Git Re-Basin: Merging Models modulo Permutation Symmetries [Linkpost]

aogSep 14, 2022, 8:55 AM

21 points

0 comments2 min readLW link

(arxiv.org)

Some ideas for epistles to the AI ethicists

Charlie SteinerSep 14, 2022, 9:07 AM

19 points

0 comments4 min readLW link

The problem with the media presentation of “believing in AI”

Roman LeventovSep 14, 2022, 9:05 PM

3 points

0 comments1 min readLW link

When is intent alignment sufficient or necessary to reduce AGI conflict?

JesseClifton, Sammy Martin and Anthony DiGiovanni

Sep 14, 2022, 7:39 PM

32 points

0 comments9 min readLW link

When would AGIs engage in conflict?

JesseClifton, Sammy Martin and Anthony DiGiovanni

Sep 14, 2022, 7:38 PM

37 points

3 comments13 min readLW link

Responding to ‘Beyond Hyperanthropomorphism’

ukc10014Sep 14, 2022, 8:37 PM

8 points

0 comments16 min readLW link

How should DeepMind’s Chinchilla revise our AI forecasts?

Cleo NardoSep 15, 2022, 5:54 PM

34 points

12 comments13 min readLW link

Rational Animations’ Script Writing Contest

WriterSep 15, 2022, 4:56 PM

22 points

1 comment3 min readLW link

Representational Tethers: Tying AI Latents To Human Ones

Paul BricmanSep 16, 2022, 2:45 PM

30 points

0 comments16 min readLW link

[Question] Why are we sure that AI will “want” something?

ShmiSep 16, 2022, 8:35 PM

31 points

58 comments1 min readLW link

Refine Blogpost Day #3: The shortforms I did write

Alexander Gietelink OldenzielSep 16, 2022, 9:03 PM

23 points

0 comments1 min readLW link

Takeaways from our robust injury classifier project [Redwood Research]

dmzSep 17, 2022, 3:55 AM

135 points

9 comments6 min readLW link

Refine’s Third Blog Post Day/Week

adamShimiSep 17, 2022, 5:03 PM

18 points

0 comments1 min readLW link

There is no royal road to alignment

Eleni AngelouSep 18, 2022, 3:33 AM

4 points

2 comments3 min readLW link

Prize and fast track to alignment research at ALTER

Vanessa KosoySep 17, 2022, 4:58 PM

65 points

4 comments3 min readLW link

[Question] Updates on FLI’s Value Aligment Map?

T431Sep 17, 2022, 10:27 PM

17 points

4 comments1 min readLW link

Apply for mentorship in AI Safety field-building

Orpheus16Sep 17, 2022, 7:06 PM

9 points

0 comments1 min readLW link

(forum.effectivealtruism.org)

Sparse trinary weighted RNNs as a path to better language model interpretability

Am8ryllisSep 17, 2022, 7:48 PM

19 points

13 comments3 min readLW link

Podcasts on surveys, slower AI, AI arguments, etc

KatjaGraceSep 18, 2022, 7:30 AM

13 points

0 comments1 min readLW link

(worldspiritsockpuppet.com)

Inner alignment: what are we pointing at?

lemonhopeSep 18, 2022, 11:09 AM

7 points

2 comments1 min readLW link

The Inter-Agent Facet of AI Alignment

Michael OesterleSep 18, 2022, 8:39 PM

12 points

1 comment5 min readLW link

Quintin’s alignment papers roundup—week 2

Quintin PopeSep 19, 2022, 1:41 PM

60 points

2 comments10 min readLW link

Safety timelines: How long will it take to solve alignment?

Esben Kran, JonathanRystroem and Steinthal

Sep 19, 2022, 12:53 PM

35 points

7 comments6 min readLW link

(forum.effectivealtruism.org)

Prize idea: Transmit MIRI and Eliezer’s worldviews

eliflandSep 19, 2022, 9:21 PM

45 points

18 comments2 min readLW link

A noob goes to the SERI MATS presentations

Lowell DenningsSep 19, 2022, 5:35 PM

26 points

0 comments5 min readLW link

How to make your CPU as fast as a GPU—Advances in Sparsity w/ Nir Shavit

the gears to ascensionSep 20, 2022, 3:48 AM

0 points

0 comments27 min readLW link

(www.youtube.com)

Towards deconfusing wireheading and reward maximization

leogaoSep 21, 2022, 12:36 AM

69 points

7 comments4 min readLW link

Here Be AGI Dragons

Eris DiscordiaSep 21, 2022, 10:28 PM

−2 points

0 comments5 min readLW link

Announcing AISIC 2022 - the AI Safety Israel Conference, October 19-20

DavidmanheimSep 21, 2022, 7:32 PM

13 points

0 comments1 min readLW link

AI Risk Intro 2: Solving The Problem

CallumMcDougall and L Rudolf L

Sep 22, 2022, 1:55 PM

13 points

0 comments27 min readLW link

[Question] AI career

ondragonSep 22, 2022, 3:48 AM

2 points

0 comments1 min readLW link

Shahar Avin On How To Regulate Advanced AI Systems

Michaël TrazziSep 23, 2022, 3:46 PM

31 points

0 comments4 min readLW link

(theinsideview.ai)

The heterogeneity of human value types: Implications for AI alignment

geoffreymillerSep 23, 2022, 5:03 PM

10 points

2 comments10 min readLW link

Intelligence as a Platform

Robert KennedySep 23, 2022, 5:51 AM

10 points

5 comments3 min readLW link

Interpreting Neural Networks through the Polytope Lens

Sid Black, Lee Sharkey, Connor Leahy, beren, CRG, merizian, Eric Winsor and Dan Braun

Sep 23, 2022, 5:58 PM

123 points

26 comments33 min readLW link

Under what circumstances have governments cancelled AI-type systems?

David GrossSep 23, 2022, 9:11 PM

7 points

1 comment1 min readLW link

(www.carnegieuktrust.org.uk)

[Question] I’m planning to start creating more write-ups summarizing my thoughts on various issues, mostly related to AI existential safety. What do you want to hear my nuanced takes on?

David Scott Krueger (formerly: capybaralet)Sep 24, 2022, 12:38 PM

9 points

10 comments1 min readLW link

[Question] Why Do AI researchers Rate the Probability of Doom So Low?

AorouSep 24, 2022, 2:33 AM

7 points

6 comments3 min readLW link

AI coöperation is more possible than you think

423175Sep 24, 2022, 9:26 PM

6 points

0 comments2 min readLW link

An Unexpected GPT-3 Decision in a Simple Gamble

casualphysicsenjoyerSep 25, 2022, 4:46 PM

8 points

4 comments1 min readLW link

Prioritizing the Arts in response to AI automation

CaseySep 25, 2022, 2:25 AM

18 points

11 comments2 min readLW link

Planning capacity and daemons

lemonhopeSep 26, 2022, 12:15 AM

2 points

0 comments5 min readLW link

Recall and Regurgitation in GPT2

Megan KinnimentOct 3, 2022, 7:35 PM

33 points

1 comment26 min readLW link

[MLSN #5]: Prize Compilation

Dan HSep 26, 2022, 9:55 PM

14 points

1 comment2 min readLW link

Loss of Alignment is not the High-Order Bit for AI Risk

yieldthoughtSep 26, 2022, 9:16 PM

14 points

20 comments2 min readLW link

Inverse Scaling Prize: Round 1 Winners

Ethan Perez and Ian McKenzie

Sep 26, 2022, 7:57 PM

88 points

16 comments4 min readLW link

(irmckenzie.co.uk)

[Question] Does the existence of shared human values imply alignment is “easy”?

MorpheusSep 26, 2022, 6:01 PM

7 points

14 comments1 min readLW link

Why we’re not founding a human-data-for-alignment org

L Rudolf L and Matt Putz

Sep 27, 2022, 8:14 PM

80 points

5 comments29 min readLW link

(forum.effectivealtruism.org)

Be Not Afraid

Alex BeymanSep 27, 2022, 10:04 PM

8 points

0 comments6 min readLW link

Strange Loops—Self-Reference from Number Theory to AI

ojorgensenSep 28, 2022, 2:10 PM

9 points

5 comments18 min readLW link

AI Safety Endgame Stories

Ivan VendrovSep 28, 2022, 4:58 PM

27 points

11 comments11 min readLW link

Estimating the Current and Future Number of AI Safety Researchers

Stephen McAleeseSep 28, 2022, 9:11 PM

24 points

11 comments9 min readLW link

(forum.effectivealtruism.org)

Clarifying the Agent-Like Structure Problem

johnswentworthSep 29, 2022, 9:28 PM

53 points

14 comments6 min readLW link

Emergency learning

Stuart_ArmstrongJan 28, 2017, 10:05 AM

13 points

10 comments4 min readLW link

EAG DC: Meta-Bottlenecks in Preventing AI Doom

Joseph BloomSep 30, 2022, 5:53 PM

5 points

0 comments1 min readLW link

Interesting papers: formally verifying DNNs

the gears to ascensionSep 30, 2022, 8:49 AM

13 points

0 comments3 min readLW link

linkpost: loss basin visualization

Nathan Helm-BurgerSep 30, 2022, 3:42 AM

14 points

1 comment1 min readLW link

Four usages of “loss” in AI

TurnTroutOct 2, 2022, 12:52 AM

42 points

18 comments5 min readLW link

Announcing the AI Safety Nudge Competition to Help Beat Procrastination

Marc CarauleanuOct 1, 2022, 1:49 AM

10 points

0 comments1 min readLW link

Google could build a conscious AI in three months

derek shillerOct 1, 2022, 1:24 PM

9 points

18 comments1 min readLW link

AGI by 2050 probability less than 1%

fuminOct 1, 2022, 7:45 PM

−10 points

4 comments9 min readLW link

(docs.google.com)

[Question] Do anthropic considerations undercut the evolution anchor from the Bio Anchors report?

Ege ErdilOct 1, 2022, 8:02 PM

20 points

13 comments2 min readLW link

A review of the Bio-Anchors report

jylin04Oct 3, 2022, 10:27 AM

45 points

4 comments1 min readLW link

(docs.google.com)

Data for IRL: What is needed to learn human values?

Jan WehnerOct 3, 2022, 9:23 AM

18 points

6 comments12 min readLW link

my current outlook on AI risk mitigation

Tamsin LeakeOct 3, 2022, 8:06 PM

58 points

4 comments11 min readLW link

(carado.moe)

No free lunch theorem is irrelevant

CatneeOct 4, 2022, 12:21 AM

12 points

7 comments1 min readLW link

Paper+Summary: OMNIGROK: GROKKING BEYOND ALGORITHMIC DATA

Marius HobbhahnOct 4, 2022, 7:22 AM

44 points

11 comments1 min readLW link

(arxiv.org)

How are you dealing with ontology identification?

Erik JennerOct 4, 2022, 11:28 PM

33 points

10 comments3 min readLW link

Reflection Mechanisms as an Alignment target: A follow-up survey

Marius Hobbhahn, elandgre and Beth Barnes

Oct 5, 2022, 2:03 PM

13 points

2 comments7 min readLW link

Tracking Compute Stocks and Flows: Case Studies?

CullenOct 5, 2022, 5:57 PM

11 points

5 comments1 min readLW link

Charitable Reads of Anti-AGI-X-Risk Arguments, Part 1

sstichOct 5, 2022, 5:03 AM

3 points

4 comments3 min readLW link

Neural Tangent Kernel Distillation

Thomas Larsen and Jeremy Gillen

Oct 5, 2022, 6:11 PM

68 points

20 comments8 min readLW link

More Recent Progress in the Theory of Neural Networks

jylin04Oct 6, 2022, 4:57 PM

78 points

6 comments4 min readLW link

Analysing a 2036 Takeover Scenario

ukc10014Oct 6, 2022, 8:48 PM

8 points

2 comments27 min readLW link

Warning Shots Probably Wouldn’t Change The Picture Much

So8resOct 6, 2022, 5:15 AM

111 points

40 comments2 min readLW link

Alignment Might Never Be Solved, By Humans or AI

intersticeOct 7, 2022, 4:14 PM

30 points

6 comments3 min readLW link

linkpost: neuro-symbolic hybrid ai

Nathan Helm-BurgerOct 6, 2022, 9:52 PM

16 points

0 comments1 min readLW link

(youtu.be)

Polysemanticity and Capacity in Neural Networks

Buck, Adam Jermyn and Kshitij Sachan

Oct 7, 2022, 5:51 PM

78 points

9 comments3 min readLW link

[Question] Deliberate practice for research?

Alex_AltairOct 8, 2022, 3:45 AM

16 points

2 comments1 min readLW link

[Question] How many GPUs does NVIDIA make?

leogaoOct 8, 2022, 5:54 PM

27 points

2 comments1 min readLW link

SERI MATS Program—Winter 2022 Cohort

Ryan Kidd, Victor Warlop and Christian Smith

Oct 8, 2022, 7:09 PM

71 points

12 comments4 min readLW link

[Question] Toy alignment problem: Social Nework KPI design

qbolecOct 8, 2022, 10:14 PM

7 points

1 comment1 min readLW link

My tentative interpretability research agenda—topology matching.

Maxwell ClarkeOct 8, 2022, 10:14 PM

10 points

2 comments4 min readLW link

[Question] AI Risk Microdynamics Survey

FroolowOct 9, 2022, 8:04 PM

3 points

0 comments1 min readLW link

Possible miracles

Orpheus16 and Thomas Larsen

Oct 9, 2022, 6:17 PM

60 points

33 comments8 min readLW link

The Lebowski Theorem — Charitable Reads of Anti-AGI-X-Risk Arguments, Part 2

sstichOct 8, 2022, 10:39 PM

1 point

10 comments7 min readLW link

Embedding AI into AR goggles

aixarOct 9, 2022, 8:08 PM

−12 points

0 comments1 min readLW link

Cataloguing Priors in Theory and Practice

Paul BricmanOct 13, 2022, 12:36 PM

13 points

8 comments7 min readLW link

Results from the language model hackathon

Esben KranOct 10, 2022, 8:29 AM

21 points

1 comment4 min readLW link

Don’t expect AGI anytime soon

cveresOct 10, 2022, 10:38 PM

−14 points

6 comments1 min readLW link

Disentangling inner alignment failures

Erik JennerOct 10, 2022, 6:50 PM

14 points

5 comments4 min readLW link

Anonymous advice: If you want to reduce AI risk, should you take roles that advance AI capabilities?

Benjamin HiltonOct 11, 2022, 2:16 PM

54 points

10 comments1 min readLW link

Prettified AI Safety Game Cards

abramdemskiOct 11, 2022, 7:35 PM

46 points

6 comments1 min readLW link

Power-Seeking AI and Existential Risk

Antonio FrancaOct 11, 2022, 10:50 PM

5 points

0 comments9 min readLW link

Alignment 201 curriculum

Richard_NgoOct 12, 2022, 6:03 PM

102 points

3 comments1 min readLW link

(www.agisafetyfundamentals.com)

Article Review: Google’s AlphaTensor

Robert_AIZIOct 12, 2022, 6:04 PM

8 points

2 comments10 min readLW link

[Question] Previous Work on Recreating Neural Network Input from Intermediate Layer Activations

bglassOct 12, 2022, 7:28 PM

1 point

3 comments1 min readLW link

You are better at math (and alignment) than you think

trevorOct 13, 2022, 3:07 AM

37 points

7 comments22 min readLW link

(www.lesswrong.com)

Counterarguments to the basic AI x-risk case

KatjaGraceOct 14, 2022, 1:00 PM

336 points

122 comments34 min readLW link

(aiimpacts.org)

Another problem with AI confinement: ordinary CPUs can work as radio transmitters

RomanSOct 14, 2022, 8:28 AM

34 points

1 comment1 min readLW link

(news.softpedia.com)

“AGI soon, but Narrow works Better”

AnthonyRepettoOct 14, 2022, 9:35 PM

1 point

9 comments2 min readLW link

[Question] Best resource to go from “typical smart tech-savvy person” to “person who gets AGI risk urgency”?

LironOct 15, 2022, 10:26 PM

14 points

8 comments1 min readLW link

[Question] Questions about the alignment problem

GG10Oct 17, 2022, 1:42 AM

−5 points

13 comments3 min readLW link

[Question] Creating superintelligence without AGI

AntbOct 17, 2022, 7:01 PM

7 points

3 comments1 min readLW link

AI Safety Ideas: An Open AI Safety Research Platform

Esben KranOct 17, 2022, 5:01 PM

24 points

0 comments1 min readLW link

Is GPT-N bounded by human capacities? No.

Cleo NardoOct 17, 2022, 11:26 PM

5 points

4 comments2 min readLW link

A pragmatic metric for Artificial General Intelligence

lorepieriOct 17, 2022, 10:07 PM

6 points

0 comments1 min readLW link

(lorenzopieri.com)

Is GitHub Copilot in legal trouble?

tcelferactOct 18, 2022, 4:19 PM

34 points

2 comments1 min readLW link

Metaculus is building a team dedicated to AI forecasting

ChristianWilliamsOct 18, 2022, 4:08 PM

3 points

0 comments1 min readLW link

[Question] Where can I find solution to the exercises of AGISF?

Charbel-RaphaëlOct 18, 2022, 2:11 PM

7 points

0 comments1 min readLW link

A conversation about Katja’s counterarguments to AI risk

Matthew Barnett, Ege Erdil and Brangus Brangus

Oct 18, 2022, 6:40 PM

43 points

9 comments33 min readLW link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

Neel NandaOct 18, 2022, 9:08 PM

66 points

5 comments12 min readLW link

(www.neelnanda.io)

Distilled Representations Research Agenda

Hoagy and mishajw

Oct 18, 2022, 8:59 PM

15 points

2 comments8 min readLW link

[Question] Should we push for requiring AI training data to be licensed?

ChristianKlOct 19, 2022, 5:49 PM

38 points

32 comments1 min readLW link

Hacker-AI and Digital Ghosts – Pre-AGI

Erland WittkotterOct 19, 2022, 3:33 PM

9 points

7 comments8 min readLW link

Scaling Laws for Reward Model Overoptimization

leogao, John Schulman and Jacob_Hilton

Oct 20, 2022, 12:20 AM

86 points

11 comments1 min readLW link

(arxiv.org)

The heritability of human values: A behavior genetic critique of Shard Theory

geoffreymillerOct 20, 2022, 3:51 PM

63 points

58 comments21 min readLW link

aisafety.community—A living document of AI safety communities

zeshen and plex

Oct 28, 2022, 5:50 PM

52 points

22 comments1 min readLW link

Trajectories to 2036

ukc10014Oct 20, 2022, 8:23 PM

1 point

1 comment14 min readLW link

Intelligent behaviour across systems, scales and substrates

Nora_AmmannOct 21, 2022, 5:09 PM

11 points

0 comments10 min readLW link

A framework and open questions for game theoretic shard modeling

Garrett BakerOct 21, 2022, 9:40 PM

11 points

4 comments4 min readLW link

[Question] The Last Year - is there an existing novel about the last year before AI doom?

Luca PetrolatiOct 22, 2022, 8:44 PM

4 points

4 comments1 min readLW link

Empowerment is (almost) All We Need

jacob_cannellOct 23, 2022, 9:48 PM

36 points

43 comments17 min readLW link

The optimal timing of spending on AGI safety work; why we should probably be spending more now

Tristan CookOct 24, 2022, 5:42 PM

62 points

0 comments1 min readLW link

A Barebones Guide to Mechanistic Interpretability Prerequisites

Neel NandaOct 24, 2022, 8:45 PM

62 points

8 comments3 min readLW link

(neelnanda.io)

Consider trying Vivek Hebbar’s alignment exercises

Orpheus16Oct 24, 2022, 7:46 PM

36 points

1 comment4 min readLW link

POWERplay: An open-source toolchain to study AI power-seeking

Edouard HarrisOct 24, 2022, 8:03 PM

22 points

0 comments1 min readLW link

(github.com)

What does it take to defend the world against out-of-control AGIs?

Steven ByrnesOct 25, 2022, 2:47 PM

141 points

31 comments30 min readLW link

Mechanism Design for AI Safety—Reading Group Curriculum

Rubi J. HudsonOct 25, 2022, 3:54 AM

7 points

1 comment1 min readLW link

Maps and Blueprint; the Two Sides of the Alignment Equation

Nora_AmmannOct 25, 2022, 4:29 PM

21 points

1 comment5 min readLW link

A Walkthrough of A Mathematical Framework for Transformer Circuits

Neel NandaOct 25, 2022, 8:24 PM

49 points

5 comments1 min readLW link

(www.youtube.com)

Paper: In-context Reinforcement Learning with Algorithm Distillation [Deepmind]

LawrenceCOct 26, 2022, 6:45 PM

28 points

5 comments1 min readLW link

(arxiv.org)

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

maxnadeau, Xander Davies, Buck and Nate Thomas

Oct 27, 2022, 1:32 AM

134 points

14 comments12 min readLW link

You won’t solve alignment without agent foundations

Mikhail SaminNov 6, 2022, 8:07 AM

21 points

3 comments8 min readLW link

AI & ML Safety Updates W43

Esben Kran and Steinthal

Oct 28, 2022, 1:18 PM

9 points

3 comments3 min readLW link

Prizes for ML Safety Benchmark Ideas

joshcOct 28, 2022, 2:51 AM

36 points

3 comments1 min readLW link

Me (Steve Byrnes) on the “Brain Inspired” podcast

Steven ByrnesOct 30, 2022, 7:15 PM

26 points

1 comment1 min readLW link

(braininspired.co)

Join the interpretability research hackathon

Esben KranOct 28, 2022, 4:26 PM

15 points

0 comments1 min readLW link

Instrumental ignoring AI, Dumb but not useless.

Donald HobsonOct 30, 2022, 4:55 PM

7 points

6 comments2 min readLW link

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets

Andrew_CritchOct 30, 2022, 6:31 AM

58 points

13 comments15 min readLW link

[Book] Interpretable Machine Learning: A Guide for Making Black Box Models Explainable

Esben KranOct 31, 2022, 11:38 AM

19 points

1 comment1 min readLW link

(christophm.github.io)

“Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability

David Scott Krueger (formerly: capybaralet)Oct 31, 2022, 9:26 PM

47 points

25 comments2 min readLW link

ML Safety Scholars Summer 2022 Retrospective

TW123Nov 1, 2022, 3:09 AM

29 points

0 comments1 min readLW link

What sorts of systems can be deceptive?

Andrei AlexandruOct 31, 2022, 10:00 PM

14 points

0 comments7 min readLW link

All AGI Safety questions welcome (especially basic ones) [~monthly thread]

Robert MilesNov 1, 2022, 11:23 PM

67 points

100 comments2 min readLW link

Real-Time Research Recording: Can a Transformer Re-Derive Positional Info?

Neel NandaNov 1, 2022, 11:56 PM

68 points

14 comments1 min readLW link

(youtu.be)

On the correspondence between AI-misalignment and cognitive dissonance using a behavioral economics model

Stijn BruersNov 1, 2022, 5:39 PM

4 points

0 comments6 min readLW link

WFW?: Opportunity and Theory of Impact

DavidCorfieldNov 2, 2022, 1:24 AM

1 point

0 comments1 min readLW link

AI Safety Needs Great Product Builders

goodgravyNov 2, 2022, 11:33 AM

14 points

2 comments1 min readLW link

A Mystery About High Dimensional Concept Encoding

Fabien RogerNov 3, 2022, 5:05 PM

46 points

13 comments7 min readLW link

Ethan Caballero on Broken Neural Scaling Laws, Deception, and Recursive Self Improvement

Michaël Trazzi and Ethan Caballero

Nov 4, 2022, 6:09 PM

14 points

11 comments5 min readLW link

(theinsideview.ai)

Can we predict the abilities of future AI? MLAISU W44

Esben Kran and Steinthal

Nov 4, 2022, 3:19 PM

10 points

0 comments3 min readLW link

(newsletter.apartresearch.com)

My summary of “Pragmatic AI Safety”

Eleni AngelouNov 5, 2022, 12:54 PM

2 points

0 comments5 min readLW link

Review of the Challenge

SD MarlowNov 5, 2022, 6:38 AM

−14 points

5 comments2 min readLW link

How to store human values on a computer

Oliver SiegelNov 5, 2022, 7:17 PM

−12 points

17 comments1 min readLW link

Should AI focus on problem-solving or strategic planning? Why not both?

Oliver SiegelNov 5, 2022, 7:17 PM

−12 points

3 comments1 min readLW link

Instead of technical research, more people should focus on buying time

Orpheus16, OliviaJ and Thomas Larsen

Nov 5, 2022, 8:43 PM

80 points

51 comments14 min readLW link

[Question] Is there some kind of backlog or delay for data center AI?

trevorNov 7, 2022, 8:18 AM

5 points

2 comments1 min readLW link

A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Neel NandaNov 7, 2022, 10:39 PM

29 points

15 comments3 min readLW link

(youtu.be)

How could we know that an AGI system will have good consequences?

So8resNov 7, 2022, 10:42 PM

86 points

24 comments5 min readLW link

People care about each other even though they have imperfect motivational pointers?

TurnTroutNov 8, 2022, 6:15 PM

32 points

25 comments7 min readLW link

[ASoT] Thoughts on GPT-N

Ulisse MiniNov 8, 2022, 7:14 AM

8 points

0 comments1 min readLW link

Inverse scaling can become U-shaped

Edouard HarrisNov 8, 2022, 7:04 PM

27 points

15 comments1 min readLW link

(arxiv.org)

Counterfactability

Scott GarrabrantNov 7, 2022, 5:39 AM

36 points

4 comments11 min readLW link

Takeaways from a survey on AI alignment resources

DanielFilanNov 5, 2022, 11:40 PM

73 points

9 comments6 min readLW link

(danielfilan.com)

[ASoT] Instrumental convergence is useful

Ulisse MiniNov 9, 2022, 8:20 PM

5 points

9 comments1 min readLW link

Mesatranslation and Metatranslation

jdpNov 9, 2022, 6:46 PM

23 points

4 comments11 min readLW link

The Interpretability Playground

Esben KranNov 10, 2022, 5:15 PM

8 points

0 comments1 min readLW link

(alignmentjam.com)

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

TurnTroutNov 29, 2022, 6:23 AM

55 points

27 comments15 min readLW link

[Question] What are some low-cost outside-the-box ways to do/fund alignment research?

trevorNov 11, 2022, 5:25 AM

10 points

0 comments1 min readLW link

Instrumental convergence is what makes general intelligence possible

tailcalledNov 11, 2022, 4:38 PM

72 points

11 comments4 min readLW link

A short critique of Vanessa Kosoy’s PreDCA

Martín SotoNov 13, 2022, 4:00 PM

25 points

8 comments4 min readLW link

[Question] Why don’t we have self driving cars yet?

Linda LinseforsNov 14, 2022, 12:19 PM

21 points

16 comments1 min readLW link

Winners of the AI Safety Nudge Competition

Marc CarauleanuNov 15, 2022, 1:06 AM

4 points

0 comments1 min readLW link

[Question] Will nanotech/biotech be what leads to AI doom?

tailcalledNov 15, 2022, 5:38 PM

4 points

8 comments2 min readLW link

[Question] What is our current best infohazard policy for AGI (safety) research?

Roman LeventovNov 15, 2022, 10:33 PM

12 points

2 comments1 min readLW link

Disagreement with bio anchors that lead to shorter timelines

Marius HobbhahnNov 16, 2022, 2:40 PM

72 points

16 comments7 min readLW link

Current themes in mechanistic interpretability research

Lee Sharkey, Sid Black and beren

Nov 16, 2022, 2:14 PM

82 points

3 comments12 min readLW link

[Question] Is there some reason LLMs haven’t seen broader use?

tailcalledNov 16, 2022, 8:04 PM

25 points

27 comments1 min readLW link

AI Forecasting Research Ideas

JsevillamolNov 17, 2022, 5:37 PM

21 points

2 comments1 min readLW link

Results from the interpretability hackathon

Esben Kran and Neel Nanda

Nov 17, 2022, 2:51 PM

80 points

0 comments6 min readLW link

Don’t design agents which exploit adversarial inputs

TurnTrout and Garrett Baker

Nov 18, 2022, 1:48 AM

60 points

61 comments12 min readLW link

AI Ethics != Ai Safety

DentinNov 18, 2022, 3:02 AM

2 points

0 comments1 min readLW link

Update to Mysteries of mode collapse: text-davinci-002 not RLHF

janusNov 19, 2022, 11:51 PM

69 points

8 comments2 min readLW link

Limits to the Controllability of AGI

Roman_Yampolskiy, Remmelt Ellen and Karl von Wendt

Nov 20, 2022, 7:18 PM

10 points

2 comments9 min readLW link

[ASoT] Reflectivity in Narrow AI

Ulisse MiniNov 21, 2022, 12:51 AM

6 points

1 comment1 min readLW link

Here’s the exit.

ValentineNov 21, 2022, 6:07 PM

85 points

138 comments10 min readLW link

Clarifying wireheading terminology

leogaoNov 24, 2022, 4:53 AM

53 points

6 comments1 min readLW link

A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2

Neel NandaNov 22, 2022, 5:12 PM

20 points

0 comments1 min readLW link

(www.youtube.com)

Announcing AI safety Mentors and Mentees

Marius HobbhahnNov 23, 2022, 3:21 PM

54 points

7 comments10 min readLW link

My take on Jacob Cannell’s take on AGI safety

Steven ByrnesNov 28, 2022, 2:01 PM

61 points

13 comments30 min readLW link

Don’t align agents to evaluations of plans

TurnTroutNov 26, 2022, 9:16 PM

37 points

46 comments18 min readLW link

[Question] Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility “real”

joraineNov 24, 2022, 5:08 AM

25 points

11 comments1 min readLW link

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Vika, Vikrant Varma, Ramana Kumar and Rohin Shah

Nov 25, 2022, 2:36 PM

36 points

4 comments6 min readLW link

(vkrakovna.wordpress.com)

Podcast: Shoshannah Tekofsky on skilling up in AI safety, visiting Berkeley, and developing novel research ideas

Orpheus16Nov 25, 2022, 8:47 PM

37 points

2 comments9 min readLW link

Mechanistic anomaly detection and ELK

paulfchristianoNov 25, 2022, 6:50 PM

121 points

17 comments21 min readLW link

(ai-alignment.com)

The First Filter

adamShimi and Gabriel Alfour

Nov 26, 2022, 7:37 PM

55 points

5 comments1 min readLW link

Discussing how to align Transformative AI if it’s developed very soon

elifland and CharlotteS

Nov 28, 2022, 4:17 PM

36 points

2 comments30 min readLW link

On the Diplomacy AI

ZviNov 28, 2022, 1:20 PM

119 points

29 comments11 min readLW link

(thezvi.wordpress.com)

Why Would AI “Aim” To Defeat Humanity?

HoldenKarnofskyNov 29, 2022, 7:30 PM

68 points

9 comments33 min readLW link

(www.cold-takes.com)

Distinguishing test from training

So8resNov 29, 2022, 9:41 PM

65 points

10 comments6 min readLW link

[Question] Do any of the AI Risk evaluations focus on humans as the risk?

jmhNov 30, 2022, 3:09 AM

10 points

8 comments1 min readLW link

Apply to attend winter AI alignment workshops (Dec 28-30 & Jan 3-5) near Berkeley

Orpheus16, OliviaJ and Thomas Larsen

Dec 1, 2022, 8:46 PM

25 points

1 comment1 min readLW link

Theories of impact for Science of Deep Learning

Marius HobbhahnDec 1, 2022, 2:39 PM

16 points

0 comments11 min readLW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTroutDec 2, 2022, 2:43 AM

96 points

18 comments53 min readLW link

The Plan - 2022 Update

johnswentworthDec 1, 2022, 8:43 PM

211 points

33 comments8 min readLW link

Finding gliders in the game of life

paulfchristianoDec 1, 2022, 8:40 PM

91 points

7 comments16 min readLW link

(ai-alignment.com)

Take 1: We’re not going to reverse-engineer the AI.

Charlie SteinerDec 1, 2022, 10:41 PM

38 points

4 comments4 min readLW link

Understanding goals in complex systems

Johannes C. MayerDec 1, 2022, 11:49 PM

9 points

0 comments1 min readLW link

(www.youtube.com)

Mastering Stratego (Deepmind)

svemirskiDec 2, 2022, 2:21 AM

6 points

0 comments1 min readLW link

(www.deepmind.com)

Jailbreaking ChatGPT on Release Day

ZviDec 2, 2022, 1:10 PM

237 points

74 comments6 min readLW link

(thezvi.wordpress.com)

[Question] Did I just catch GPTchat doing something unexpectedly insightful?

trevorDec 2, 2022, 7:48 AM

9 points

0 comments1 min readLW link

Take 2: Building tools to help build FAI is a legitimate strategy, but it’s dual-use.

Charlie SteinerDec 3, 2022, 12:54 AM

16 points

1 comment2 min readLW link

Causal scrubbing: results on induction heads

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

Dec 3, 2022, 12:59 AM

32 points

0 comments17 min readLW link

Logical induction for software engineers

Alex FlintDec 3, 2022, 7:55 PM

124 points

2 comments27 min readLW link

ChatGPT is surprisingly and uncanningly good at pretending to be sentient

Victor NovikovDec 3, 2022, 2:47 PM

17 points

11 comments18 min readLW link

Monthly Shorts 11/22

CelerDec 5, 2022, 7:30 AM

8 points

0 comments3 min readLW link

(keller.substack.com)

Take 4: One problem with natural abstractions is there’s too many of them.

Charlie SteinerDec 5, 2022, 10:39 AM

34 points

4 comments1 min readLW link

The No Free Lunch theorem for dummies

Steven ByrnesDec 5, 2022, 9:46 PM

28 points

16 comments3 min readLW link

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleikeDec 5, 2022, 10:51 PM

93 points

13 comments1 min readLW link

(aligned.substack.com)

Updating my AI timelines

Matthew BarnettDec 5, 2022, 8:46 PM

134 points

40 comments2 min readLW link

ChatGPT and Ideological Turing Test

ViliamDec 5, 2022, 9:45 PM

41 points

1 comment1 min readLW link

Verification Is Not Easier Than Generation In General

johnswentworthDec 6, 2022, 5:20 AM

56 points

23 comments1 min readLW link

[Question] What are the major underlying divisions in AI safety?

Chris LeongDec 6, 2022, 3:28 AM

5 points

2 comments1 min readLW link

Take 5: Another problem for natural abstractions is laziness.

Charlie SteinerDec 6, 2022, 7:00 AM

30 points

4 comments3 min readLW link

Mesa-Optimizers via Grokking

orthonormalDec 6, 2022, 8:05 PM

35 points

4 comments6 min readLW link

[Question] How do finite factored sets compare with phase space?

Alex_AltairDec 6, 2022, 8:05 PM

14 points

1 comment1 min readLW link

Using GPT-Eliezer against ChatGPT Jailbreaking

Stuart_Armstrong and rgorman

Dec 6, 2022, 7:54 PM

159 points

77 comments9 min readLW link

Take 6: CAIS is actually Orwellian.

Charlie SteinerDec 7, 2022, 1:50 PM

14 points

5 comments2 min readLW link

[Question] Looking for ideas of public assets (stocks, funds, ETFs) that I can invest in to have a chance at profiting from the mass adoption and commercialization of AI technology

AnnapurnaDec 7, 2022, 10:35 PM

15 points

9 comments1 min readLW link

You should consider launching an AI startup

joshcDec 8, 2022, 12:28 AM

5 points

16 comments4 min readLW link

Machine Learning Consent

jefftkDec 8, 2022, 3:50 AM

38 points

14 comments3 min readLW link

(www.jefftk.com)

Relevant to natural abstractions: Euclidean Symmetry Equivariant Machine Learning—Overview, Applications, and Open Questions

the gears to ascensionDec 8, 2022, 6:01 PM

7 points

0 comments1 min readLW link

(youtu.be)

AI Safety Seems Hard to Measure

HoldenKarnofskyDec 8, 2022, 7:50 PM

68 points

5 comments14 min readLW link

(www.cold-takes.com)

[Question] How is the “sharp left turn defined”?

Chris_LeongDec 9, 2022, 12:04 AM

13 points

3 comments1 min readLW link

Linkpost for a generalist algorithmic learner: capable of carrying out sorting, shortest paths, string matching, convex hull finding in one network

lovetheusersDec 9, 2022, 12:02 AM

7 points

1 comment1 min readLW link

(twitter.com)

Timelines ARE relevant to alignment research (timelines 2 of ?)

Nathan Helm-BurgerAug 24, 2022, 12:19 AM

11 points

5 comments6 min readLW link

Prosaic misalignment from the Solomonoff Predictor

Cleo NardoDec 9, 2022, 5:53 PM

11 points

0 comments5 min readLW link

[Question] Does a LLM have a utility function?

DagonDec 9, 2022, 5:19 PM

16 points

6 comments1 min readLW link

ML Safety at NeurIPS & Paradigmatic AI Safety? MLAISU W49

Esben Kran and Steinthal

Dec 9, 2022, 10:38 AM

14 points

0 comments4 min readLW link

(newsletter.apartresearch.com)

Take 8: Queer the inner/outer alignment dichotomy.

Charlie SteinerDec 9, 2022, 5:46 PM

26 points

2 comments2 min readLW link

My thoughts on OpenAI’s Alignment plan

Donald HobsonDec 10, 2022, 10:35 AM

20 points

0 comments6 min readLW link

[ASoT] Natural abstractions and AlphaZero

Ulisse MiniDec 10, 2022, 5:53 PM

31 points

1 comment1 min readLW link

(arxiv.org)

[Question] How promising are legal avenues to restrict AI training data?

thehalliardDec 10, 2022, 4:31 PM

9 points

2 comments1 min readLW link

Consider using reversible automata for alignment research

Alex_AltairDec 11, 2022, 1:00 AM

81 points

29 comments2 min readLW link

[fiction] Our Final Hour

Mati_RoyDec 11, 2022, 5:49 AM

16 points

5 comments3 min readLW link

A crisis for online communication: bots and bot users will overrun the Internet?

Mitchell_PorterDec 11, 2022, 9:11 PM

23 points

11 comments1 min readLW link

Reframing inner alignment

davidadDec 11, 2022, 1:53 PM

47 points

13 comments4 min readLW link

Side-channels: input versus output

davidadDec 12, 2022, 12:32 PM

35 points

9 comments2 min readLW link

Psychological Disorders and Problems

adamShimi and Gabriel Alfour

Dec 12, 2022, 6:15 PM

35 points

5 comments1 min readLW link

Prodding ChatGPT to solve a basic algebra problem

ShmiDec 12, 2022, 4:09 AM

14 points

6 comments1 min readLW link

(twitter.com)

A brainteaser for language models

Adam ScherlisDec 12, 2022, 2:43 AM

46 points

3 comments2 min readLW link

Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.

Charlie SteinerDec 12, 2022, 11:51 AM

36 points

14 comments2 min readLW link

12 career-related questions that may (or may not) be helpful for people interested in alignment research

Orpheus16Dec 12, 2022, 10:36 PM

18 points

0 comments2 min readLW link

Finite Factored Sets in Pictures

Magdalena WacheDec 11, 2022, 6:49 PM

149 points

31 comments12 min readLW link

Concept extrapolation for hypothesis generation

Stuart_Armstrong, Patrick Leask and rgorman

Dec 12, 2022, 10:09 PM

20 points

2 comments3 min readLW link

Take 10: Fine-tuning with RLHF is aesthetically unsatisfying.

Charlie SteinerDec 13, 2022, 7:04 AM

30 points

3 comments2 min readLW link

AI alignment is distinct from its near-term applications

paulfchristianoDec 13, 2022, 7:10 AM

233 points

5 comments2 min readLW link

(ai-alignment.com)

Okay, I feel it now

g1Dec 13, 2022, 11:01 AM

84 points

14 comments1 min readLW link

What Does It Mean to Align AI With Human Values?

AlgonDec 13, 2022, 4:56 PM

8 points

3 comments1 min readLW link

(www.quantamagazine.org)

[Question] Is the ChatGPT-simulated Linux virtual machine real?

KenoubiDec 13, 2022, 3:41 PM

18 points

7 comments1 min readLW link

[Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey, Dan Braun and beren

Dec 13, 2022, 3:41 PM

80 points

10 comments22 min readLW link

Existential AI Safety is NOT separate from near-term applications

scasperDec 13, 2022, 2:47 PM

37 points

16 comments3 min readLW link

My AGI safety research—2022 review, ’23 plans

Steven ByrnesDec 14, 2022, 3:15 PM

34 points

6 comments6 min readLW link

Trying to disambiguate different questions about whether RLHF is “good”

BuckDec 14, 2022, 4:03 AM

92 points

40 comments7 min readLW link

Predicting GPU performance

Marius Hobbhahn and Tamay

Dec 14, 2022, 4:27 PM

59 points

24 comments1 min readLW link

(epochai.org)

[Question] Is the AI timeline too short to have children?

YorethDec 14, 2022, 6:32 PM

33 points

20 comments1 min readLW link

«Boundaries», Part 3b: Alignment problems in terms of boundaries

Andrew_CritchDec 14, 2022, 10:34 PM

49 points

2 comments13 min readLW link

[Question] Is Paul Christiano still as optimistic about Approval-Directed Agents as he was in 2018?

Chris_LeongDec 14, 2022, 11:28 PM

8 points

0 comments1 min readLW link

Aligning alignment with performance

Marv KDec 14, 2022, 10:19 PM

2 points

0 comments2 min readLW link

AI Neorealism: a threat model & success criterion for existential safety

davidadDec 15, 2022, 1:42 PM

39 points

0 comments3 min readLW link

The next decades might be wild

Marius HobbhahnDec 15, 2022, 4:10 PM

157 points

27 comments41 min readLW link

High-level hopes for AI alignment

HoldenKarnofskyDec 15, 2022, 6:00 PM

42 points

3 comments19 min readLW link

(www.cold-takes.com)

[Question] How is ARC planning to use ELK?

jacquesthibsDec 15, 2022, 8:11 PM

23 points

5 comments1 min readLW link

AI overhangs depend on whether algorithms, compute and data are substitutes or complements

NathanBarnardDec 16, 2022, 2:23 AM

2 points

0 comments3 min readLW link

Paper: Transformers learn in-context by gradient descent

LawrenceCDec 16, 2022, 11:10 AM

26 points

11 comments2 min readLW link

(arxiv.org)

How important are accurate AI timelines for the optimal spending schedule on AI risk interventions?

Tristan CookDec 16, 2022, 4:05 PM

27 points

2 comments1 min readLW link

Will Machines Ever Rule the World? MLAISU W50

Esben KranDec 16, 2022, 11:03 AM

12 points

7 comments4 min readLW link

(newsletter.apartresearch.com)

Can we efficiently explain model behaviors?

paulfchristianoDec 16, 2022, 7:40 PM

63 points

0 comments9 min readLW link

(ai-alignment.com)

[Question] College Selection Advice for Technical Alignment

TempCollegeAskDec 16, 2022, 5:11 PM

11 points

8 comments1 min readLW link

Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceCDec 16, 2022, 10:12 PM

60 points

10 comments1 min readLW link

(www.anthropic.com)

Positive values seem more robust and lasting than prohibitions

TurnTroutDec 17, 2022, 9:43 PM

42 points

12 comments2 min readLW link

Take 11: “Aligning language models” should be weirder.

Charlie SteinerDec 18, 2022, 2:14 PM

29 points

0 comments2 min readLW link

Why I think that teaching philosophy is high impact

Eleni AngelouDec 19, 2022, 3:11 AM

5 points

0 comments2 min readLW link

Event [Berkeley]: Alignment Collaborator Speed-Meeting

AlexMennen and Carson Jones

Dec 19, 2022, 2:24 AM

18 points

2 comments1 min readLW link

The ‘Old AI’: Lessons for AI governance from early electricity regulation

Sam Clarke and Di Cooke

Dec 19, 2022, 2:42 AM

7 points

0 comments13 min readLW link

Note on algorithms with multiple trained components

Steven ByrnesDec 20, 2022, 5:08 PM

19 points

4 comments2 min readLW link

Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend)

RemmeltDec 19, 2022, 12:02 PM

8 points

6 comments31 min readLW link

Next Level Seinfeld

ZviDec 19, 2022, 1:30 PM

45 points

6 comments1 min readLW link

(thezvi.wordpress.com)

Solution to The Alignment Problem

AlgonDec 19, 2022, 8:12 PM

10 points

0 comments2 min readLW link

Shard Theory in Nine Theses: a Distillation and Critical Appraisal

LawrenceCDec 19, 2022, 10:52 PM

80 points

14 comments17 min readLW link

The “Minimal Latents” Approach to Natural Abstractions

johnswentworthDec 20, 2022, 1:22 AM

41 points

14 comments12 min readLW link

Take 12: RLHF’s use is evidence that orgs will jam RL at real-world problems.

Charlie SteinerDec 20, 2022, 5:01 AM

23 points

0 comments3 min readLW link

[link, 2019] AI paradigm: interactive learning from unlabeled instructions

the gears to ascensionDec 20, 2022, 6:45 AM

2 points

0 comments2 min readLW link

(jgrizou.github.io)

Discovering Language Model Behaviors with Model-Written Evaluations

evhub and Ethan Perez

Dec 20, 2022, 8:08 PM

45 points

6 comments1 min readLW link

(www.anthropic.com)

Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic

Orpheus16Dec 20, 2022, 9:39 PM

14 points

2 comments11 min readLW link

Google Search loses to ChatGPT fair and square

ShmiDec 21, 2022, 8:11 AM

12 points

6 comments1 min readLW link

(www.surgehq.ai)

A Comprehensive Mechanistic Interpretability Explainer & Glossary

Neel NandaDec 21, 2022, 12:35 PM

40 points

0 comments2 min readLW link

(neelnanda.io)

Price’s equation for neural networks

tailcalledDec 21, 2022, 1:09 PM

22 points

3 comments2 min readLW link

[Question] [DISC] Are Values Robust?

DragonGodDec 21, 2022, 1:00 AM

12 points

5 comments2 min readLW link

Metaphor.systems

the gears to ascensionDec 21, 2022, 9:31 PM

9 points

2 comments1 min readLW link

(metaphor.systems)

The Human’s Hidden Utility Function (Maybe)

lukeprogJan 23, 2012, 7:39 PM

64 points

90 comments3 min readLW link

Using vector fields to visualise preferences and make them consistent

MichaelA and JustinShovelain

Jan 28, 2020, 7:44 PM

41 points

32 comments11 min readLW link

[Article review] Artificial Intelligence, Values, and Alignment

MichaelAMar 9, 2020, 12:42 PM

13 points

5 comments10 min readLW link

Clarifying some key hypotheses in AI alignment

Ben Cottier and Rohin Shah

Aug 15, 2019, 9:29 PM

78 points

12 comments9 min readLW link

Failures in technology forecasting? A reply to Ord and Yudkowsky

MichaelAMay 8, 2020, 12:41 PM

44 points

19 comments11 min readLW link

[Link and commentary] The Offense-Defense Balance of Scientific Knowledge: Does Publishing AI Research Reduce Misuse?

MichaelAFeb 16, 2020, 7:56 PM

24 points

4 comments3 min readLW link

How can Interpretability help Alignment?

RobertKirk, Tomáš Gavenčiak and axioman

May 23, 2020, 4:16 PM

37 points

3 comments9 min readLW link

A Problem With Patternism

B JacobsMay 19, 2020, 8:16 PM

5 points

52 comments1 min readLW link

Goal-directedness is behavioral, not structural

adamShimiJun 8, 2020, 11:05 PM

6 points

12 comments3 min readLW link

Learning Deep Learning: Joining data science research as a mathematician

magfrumpOct 19, 2017, 7:14 PM

10 points

4 comments3 min readLW link

Will AI undergo discontinuous progress?

Sammy MartinFeb 21, 2020, 10:16 PM

26 points

21 comments20 min readLW link

The Value Definition Problem

Sammy MartinNov 18, 2019, 7:56 PM

14 points

6 comments11 min readLW link

Life at Three Tails of the Bell Curve

lsusrJun 27, 2020, 8:49 AM

63 points

10 comments4 min readLW link

How do takeoff speeds affect the probability of bad outcomes from AGI?

KRJun 29, 2020, 10:06 PM

15 points

2 comments8 min readLW link

AI Benefits Post 2: How AI Benefits Differs from AI Alignment & AI for Good

CullenJun 29, 2020, 5:00 PM

8 points

7 comments2 min readLW link

Null-boxing Newcomb’s Problem

YitzJul 13, 2020, 4:32 PM

33 points

10 comments4 min readLW link

No nonsense version of the “racial algorithm bias”

Yuxi_LiuJul 13, 2019, 3:39 PM

115 points

20 comments2 min readLW link

Education 2.0 — A brand new education system

aryanJul 15, 2020, 10:09 AM

−8 points

3 comments6 min readLW link

What it means to optimise

Neel NandaJul 25, 2020, 9:40 AM

5 points

0 comments8 min readLW link

(www.neelnanda.io)

[Question] Where are people thinking and talking about global coordination for AI safety?

Wei DaiMay 22, 2019, 6:24 AM

103 points

22 comments1 min readLW link

The strategy-stealing assumption

paulfchristianoSep 16, 2019, 3:23 PM

72 points

46 comments12 min readLW link 3 reviews

Conversation with Paul Christiano

abergalSep 11, 2019, 11:20 PM

44 points

6 comments30 min readLW link

(aiimpacts.org)

Transcription of Eliezer’s January 2010 video Q&A

curiousepicNov 14, 2011, 5:02 PM

112 points

9 comments56 min readLW link

Resources for AI Alignment Cartography

GyrodiotApr 4, 2020, 2:20 PM

45 points

8 comments9 min readLW link

Thoughts on Ben Garfinkel’s “How sure are we about this AI stuff?”

David Scott Krueger (formerly: capybaralet)Feb 6, 2019, 7:09 PM

25 points

17 comments1 min readLW link

Announcement: AI alignment prize round 2 winners and next round

cousin_itApr 16, 2018, 3:08 AM

64 points

29 comments2 min readLW link

Announcement: AI alignment prize round 3 winners and next round

cousin_itJul 15, 2018, 7:40 AM

93 points

7 comments1 min readLW link

Security Mindset and the Logistic Success Curve

Eliezer YudkowskyNov 26, 2017, 3:58 PM

76 points

48 comments20 min readLW link

Arbital scrape

emmabJun 6, 2019, 11:11 PM

89 points

23 comments1 min readLW link

The Strangest Thing An AI Could Tell You

Eliezer YudkowskyJul 15, 2009, 2:27 AM

116 points

605 comments2 min readLW link

Self-fulfilling correlations

PhilGoetzAug 26, 2010, 9:07 PM

144 points

50 comments3 min readLW link

Zoom In: An Introduction to Circuits

evhubMar 10, 2020, 7:36 PM

84 points

11 comments2 min readLW link

(distill.pub)

Should ethicists be inside or outside a profession?

Eliezer YudkowskyDec 12, 2018, 1:40 AM

87 points

6 comments9 min readLW link

Implicit extortion

paulfchristianoApr 13, 2018, 4:33 PM

29 points

16 comments6 min readLW link

(ai-alignment.com)

Bayesian Judo

Eliezer YudkowskyJul 31, 2007, 5:53 AM

87 points

108 comments1 min readLW link

Announcing AlignmentForum.org Beta

RaemonJul 10, 2018, 8:19 PM

67 points

35 comments2 min readLW link

Announcing the Alignment Newsletter

Rohin ShahApr 9, 2018, 9:16 PM

29 points

3 comments1 min readLW link

Helen Toner on China, CSET, and AI

Rob BensingerApr 21, 2019, 4:10 AM

68 points

3 comments7 min readLW link

(rationallyspeakingpodcast.org)

A simple environment for showing mesa misalignment

Matthew BarnettSep 26, 2019, 4:44 AM

70 points

9 comments2 min readLW link

The E-Coli Test for AI Alignment

johnswentworthDec 16, 2018, 8:10 AM

69 points

24 comments1 min readLW link

Recent Progress in the Theory of Neural Networks

intersticeDec 4, 2019, 11:11 PM

76 points

9 comments9 min readLW link

The Art of the Artificial: Insights from ‘Artificial Intelligence: A Modern Approach’

TurnTroutMar 25, 2018, 6:55 AM

31 points

8 comments15 min readLW link

Heading off a near-term AGI arms race

lincolnquirkAug 22, 2012, 2:23 PM

10 points

70 comments1 min readLW link

Outperforming the human Atari benchmark

VaniverMar 31, 2020, 7:33 PM

58 points

5 comments1 min readLW link

(deepmind.com)

Conversational Presentation of Why Automation is Different This Time

ryan_bJan 17, 2018, 10:11 PM

33 points

26 comments1 min readLW link

A rant against robots

Lê Nguyên HoangJan 14, 2020, 10:03 PM

64 points

7 comments5 min readLW link

Clarifying “AI Alignment”

paulfchristianoNov 15, 2018, 2:41 PM

64 points

82 comments3 min readLW link 2 reviews

Tiling Agents for Self-Modifying AI (OPFAI #2)

Eliezer YudkowskyJun 6, 2013, 8:24 PM

84 points

259 comments3 min readLW link

EDT solves 5 and 10 with conditional oracles

jessicataSep 30, 2018, 7:57 AM

59 points

8 comments13 min readLW link

AGI and Friendly AI in the dominant AI textbook

lukeprogMar 11, 2011, 4:12 AM

73 points

27 comments3 min readLW link

Tabooing ‘Agent’ for Prosaic Alignment

Hjalmar_WijkAug 23, 2019, 2:55 AM

54 points

10 comments6 min readLW link

Is this what FAI outreach success looks like?

Charlie SteinerMar 9, 2018, 1:12 PM

17 points

3 comments1 min readLW link

(www.youtube.com)

Aligning a toy model of optimization

paulfchristianoJun 28, 2019, 8:23 PM

52 points

26 comments3 min readLW link

DeepMind article: AI Safety Gridworlds

scarcegreengrassNov 30, 2017, 4:13 PM

24 points

5 comments1 min readLW link

(deepmind.com)

Botworld: a cellular automaton for studying self-modifying agents embedded in their environment

So8resApr 12, 2014, 12:56 AM

78 points

55 comments7 min readLW link

“UDT2” and “against UD+ASSA”

Wei DaiMay 12, 2019, 4:18 AM

50 points

7 comments7 min readLW link

Using lying to detect human values

Stuart_ArmstrongMar 15, 2018, 11:37 AM

19 points

6 comments1 min readLW link

Another AI Winter?

PeterMcCluskeyDec 25, 2019, 12:58 AM

47 points

14 comments4 min readLW link

(www.bayesianinvestor.com)

Modeling AGI Safety Frameworks with Causal Influence Diagrams

Ramana KumarJun 21, 2019, 12:50 PM

43 points

6 comments1 min readLW link

(arxiv.org)

The Urgent Meta-Ethics of Friendly Artificial Intelligence

lukeprogFeb 1, 2011, 2:15 PM

76 points

252 comments1 min readLW link

Henry Kissinger: AI Could Mean the End of Human History

ESRogsMay 15, 2018, 8:11 PM

17 points

12 comments1 min readLW link

(www.theatlantic.com)

Self-confirming predictions can be arbitrarily bad

Stuart_ArmstrongMay 3, 2019, 11:34 AM

46 points

11 comments5 min readLW link

A Visualization of Nick Bostrom’s Superintelligence

[deleted]Jul 23, 2014, 12:24 AM

62 points

28 comments3 min readLW link

[Question] What are the most plausible “AI Safety warning shot” scenarios?

Daniel KokotajloMar 26, 2020, 8:59 PM

35 points

16 comments1 min readLW link

AGI in a vulnerable world

AI Impacts and abergal

Mar 26, 2020, 12:10 AM

42 points

21 comments1 min readLW link

(aiimpacts.org)

Three Kinds of Competitiveness

Daniel KokotajloMar 31, 2020, 1:00 AM

36 points

18 comments5 min readLW link

Biological humans and the rising tide of AI

cousin_itJan 29, 2018, 4:04 PM

22 points

23 comments1 min readLW link

HLAI 2018 Field Report

Gordon Seidoh WorleyAug 29, 2018, 12:11 AM

48 points

12 comments5 min readLW link

Magical Categories

Eliezer YudkowskyAug 24, 2008, 7:51 PM

65 points

133 comments9 min readLW link

Alignment as Translation

johnswentworthMar 19, 2020, 9:40 PM

62 points

39 comments4 min readLW link

Resolving human values, completely and adequately

Stuart_ArmstrongMar 30, 2018, 3:35 AM

32 points

30 comments12 min readLW link

Will transparency help catch deception? Perhaps not

Matthew BarnettNov 4, 2019, 8:52 PM

43 points

5 comments7 min readLW link

A dilemma for prosaic AI alignment

Daniel KokotajloDec 17, 2019, 10:11 PM

40 points

30 comments3 min readLW link

[1911.08265] Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | Arxiv

DragonGodNov 21, 2019, 1:18 AM

52 points

4 comments1 min readLW link

(arxiv.org)

Glenn Beck discusses the Singularity, cites SI researchers

BrihaspatiJun 12, 2012, 4:45 PM

73 points

183 comments10 min readLW link

Siren worlds and the perils of over-optimised search

Stuart_ArmstrongApr 7, 2014, 11:00 AM

73 points

417 comments7 min readLW link

Human-Aligned AI Summer School: A Summary

Michaël TrazziAug 11, 2018, 8:11 AM

39 points

5 comments4 min readLW link

Top 9+2 myths about AI risk

Stuart_ArmstrongJun 29, 2015, 8:41 PM

68 points

45 comments2 min readLW link

Learning biases and rewards simultaneously

Rohin ShahJul 6, 2019, 1:45 AM

41 points

3 comments4 min readLW link

Looking for AI Safety Experts to Provide High Level Guidance for RAISE

OferMay 6, 2018, 2:06 AM

17 points

5 comments1 min readLW link

[Question] How much funding and researchers were in AI, and AI Safety, in 2018?

RaemonMar 3, 2019, 9:46 PM

41 points

11 comments1 min readLW link

Deep learning—deeper flaws?

Richard_NgoSep 24, 2018, 6:40 PM

39 points

17 comments4 min readLW link

(thinkingcomplete.blogspot.com)

A model of UDT with a concrete prior over logical statements

BenyaAug 28, 2012, 9:45 PM

62 points

24 comments4 min readLW link

Malign generalization without internal search

Matthew BarnettJan 12, 2020, 6:03 PM

43 points

12 comments4 min readLW link

Announcing the second AI Safety Camp

LachouetteJun 11, 2018, 6:59 PM

34 points

0 comments1 min readLW link

Vaniver’s View on Factored Cognition

VaniverAug 23, 2019, 2:54 AM

48 points

4 comments8 min readLW link

Detached Lever Fallacy

Eliezer YudkowskyJul 31, 2008, 6:57 PM

70 points

41 comments7 min readLW link

When to use quantilization

RyanCareyFeb 5, 2019, 5:17 PM

65 points

5 comments4 min readLW link

The first AI Safety Camp & onwards

RemmeltJun 7, 2018, 8:13 PM

45 points

0 comments8 min readLW link

Learning preferences by looking at the world

Rohin ShahFeb 12, 2019, 10:25 PM

43 points

10 comments7 min readLW link

(bair.berkeley.edu)

Selling Nonapples

Eliezer YudkowskyNov 13, 2008, 8:10 PM

71 points

78 comments7 min readLW link

The AI Alignment Problem Has Already Been Solved(?) Once

SquirrelInHellApr 22, 2017, 1:24 PM

50 points

45 comments4 min readLW link

(squirrelinhell.blogspot.com)

Trace README

johnswentworthMar 11, 2020, 9:08 PM

35 points

1 comment8 min readLW link

[Link] Computer improves its Civilization II gameplay by reading the manual

Kaj_SotalaJul 13, 2011, 12:00 PM

49 points

5 comments4 min readLW link

Idea: Open Access AI Safety Journal

Gordon Seidoh WorleyMar 23, 2018, 6:27 PM

28 points

11 comments1 min readLW link

Another take on agent foundations: formalizing zero-shot reasoning

zhukeepaJul 1, 2018, 6:12 AM

59 points

20 comments12 min readLW link

Logical Updatelessness as a Robust Delegation Problem

Scott GarrabrantOct 27, 2017, 9:16 PM

30 points

2 comments2 min readLW link

Some thoughts after reading Artificial Intelligence: A Modern Approach

swift_spiralMar 19, 2019, 11:39 PM

38 points

4 comments2 min readLW link

AI safety without goal-directed behavior

Rohin ShahJan 7, 2019, 7:48 AM

65 points

15 comments4 min readLW link

No Universally Compelling Arguments

Eliezer YudkowskyJun 26, 2008, 8:29 AM

62 points

57 comments5 min readLW link

What AI Safety Researchers Have Written About the Nature of Human Values

avturchinJan 16, 2019, 1:59 PM

50 points

3 comments15 min readLW link

Disambiguating “alignment” and related notions

David Scott Krueger (formerly: capybaralet)Jun 5, 2018, 3:35 PM

22 points

21 comments2 min readLW link

Inductive biases stick around

evhubDec 18, 2019, 7:52 PM

63 points

14 comments3 min readLW link

Bill Gates: problem of strong AI with conflicting goals “very worthy of study and time”

Paul CrowleyJan 22, 2015, 8:21 PM

73 points

18 comments1 min readLW link

So You Want to Save the World

lukeprogJan 1, 2012, 7:39 AM

54 points

149 comments12 min readLW link

Metaphilosophical competence can’t be disentangled from alignment

zhukeepaApr 1, 2018, 12:38 AM

32 points

39 comments3 min readLW link

Some Thoughts on Metaphilosophy

Wei DaiFeb 10, 2019, 12:28 AM

62 points

27 comments4 min readLW link

Reasons compute may not drive AI capabilities growth

Tristan HDec 19, 2018, 10:13 PM

42 points

10 comments8 min readLW link

Distance Functions are Hard

Grue_SlinkyAug 13, 2019, 5:33 PM

31 points

19 comments6 min readLW link

Takeaways from safety by default interviews

AI Impacts and abergal

Apr 3, 2020, 5:20 PM

28 points

2 comments13 min readLW link

(aiimpacts.org)

Bridge Collapse: Reductionism as Engineering Problem

Rob BensingerFeb 18, 2014, 10:03 PM

78 points

62 comments15 min readLW link

Probability as Minimal Map

johnswentworthSep 1, 2019, 7:19 PM

49 points

10 comments5 min readLW link

Policy Alignment

abramdemskiJun 30, 2018, 12:24 AM

50 points

25 comments8 min readLW link

Stable Pointers to Value: An Agent Embedded in Its Own Utility Function

abramdemskiAug 17, 2017, 12:22 AM

15 points

9 comments5 min readLW link

Stable Pointers to Value II: Environmental Goals

abramdemskiFeb 9, 2018, 6:03 AM

18 points

2 comments4 min readLW link

The Argument from Philosophical Difficulty

Wei DaiFeb 10, 2019, 12:28 AM

54 points

31 comments1 min readLW link

human psycholinguists: a critical appraisal

nostalgebraistDec 31, 2019, 12:20 AM

174 points

59 comments16 min readLW link 2 reviews

(nostalgebraist.tumblr.com)

My take on agent foundations: formalizing metaphilosophical competence

zhukeepaApr 1, 2018, 6:33 AM

20 points

6 comments1 min readLW link

Critique my Model: The EV of AGI to Selfish Individuals

ozziegooenApr 8, 2018, 8:04 PM

19 points

9 comments4 min readLW link

AI Safety Debate and Its Applications

VojtaKovarikJul 23, 2019, 10:31 PM

36 points

5 comments12 min readLW link

TAISU 2019 Field Report

Gordon Seidoh WorleyOct 15, 2019, 1:09 AM

36 points

5 comments5 min readLW link

Human-AI Collaboration

Rohin ShahOct 22, 2019, 6:32 AM

42 points

7 comments2 min readLW link

(bair.berkeley.edu)

Analyzing the Problem GPT-3 is Trying to Solve

adamShimiAug 6, 2020, 9:58 PM

16 points

2 comments4 min readLW link

[LINK] Speed superintelligence?

Stuart_ArmstrongAug 14, 2014, 3:57 PM

53 points

20 comments1 min readLW link

A big Singularity-themed Hollywood movie out in April offers many opportunities to talk about AI risk

chaosmageJan 7, 2014, 5:48 PM

49 points

85 comments1 min readLW link

New paper: (When) is Truth-telling Favored in AI debate?

VojtaKovarikDec 26, 2019, 7:59 PM

32 points

7 comments5 min readLW link

(medium.com)

Artificial Addition

Eliezer YudkowskyNov 20, 2007, 7:58 AM

68 points

129 comments6 min readLW link

Exploring safe exploration

evhubJan 6, 2020, 9:07 PM

37 points

8 comments3 min readLW link

‘Dumb’ AI observes and manipulates controllers

Stuart_ArmstrongJan 13, 2015, 1:35 PM

52 points

19 comments2 min readLW link

AI Reading Group Thoughts (1/?): The Mandate of Heaven

AlicornAug 10, 2018, 12:24 AM

45 points

18 comments4 min readLW link

AI Reading Group Thoughts (2/?): Reconstructive Psychosurgery

AlicornSep 25, 2018, 4:25 AM

27 points

6 comments3 min readLW link

(notes on) Policy Desiderata for Superintelligent AI: A Vector Field Approach

Ben PaceFeb 4, 2019, 10:08 PM

43 points

5 comments7 min readLW link

AI Governance: A Research Agenda

habrykaSep 5, 2018, 6:00 PM

25 points

3 comments1 min readLW link

(www.fhi.ox.ac.uk)

Global online debate on the governance of AI

CarolineJJan 5, 2018, 3:31 PM

8 points

5 comments1 min readLW link

[AN #61] AI policy and governance, from two people in the field

Rohin ShahAug 5, 2019, 5:00 PM

12 points

2 comments9 min readLW link

(mailchi.mp)

2019 AI Alignment Literature Review and Charity Comparison

LarksDec 19, 2019, 3:00 AM

130 points

18 comments62 min readLW link

[Question] What’s wrong with these analogies for understanding Informed Oversight and IDA?

Wei DaiMar 20, 2019, 9:11 AM

35 points

3 comments1 min readLW link

The Alignment Newsletter #1: 04/09/18

Rohin ShahApr 9, 2018, 4:00 PM

12 points

3 comments4 min readLW link

The Alignment Newsletter #2: 04/16/18

Rohin ShahApr 16, 2018, 4:00 PM

8 points

0 comments5 min readLW link

The Alignment Newsletter #3: 04/23/18

Rohin ShahApr 23, 2018, 4:00 PM

9 points

0 comments6 min readLW link

The Alignment Newsletter #4: 04/30/18

Rohin ShahApr 30, 2018, 4:00 PM

8 points

0 comments3 min readLW link

The Alignment Newsletter #5: 05/07/18

Rohin ShahMay 7, 2018, 4:00 PM

8 points

0 comments7 min readLW link

The Alignment Newsletter #6: 05/14/18

Rohin ShahMay 14, 2018, 4:00 PM

8 points

0 comments2 min readLW link

The Alignment Newsletter #7: 05/21/18

Rohin ShahMay 21, 2018, 4:00 PM

8 points

0 comments5 min readLW link

The Alignment Newsletter #8: 05/28/18

Rohin ShahMay 28, 2018, 4:00 PM

8 points

0 comments6 min readLW link

The Alignment Newsletter #9: 06/04/18

Rohin ShahJun 4, 2018, 4:00 PM

8 points

0 comments2 min readLW link

The Alignment Newsletter #10: 06/11/18

Rohin ShahJun 11, 2018, 4:00 PM

16 points

0 comments9 min readLW link

The Alignment Newsletter #11: 06/18/18

Rohin ShahJun 18, 2018, 4:00 PM

8 points

0 comments10 min readLW link

The Alignment Newsletter #12: 06/25/18

Rohin ShahJun 25, 2018, 4:00 PM

15 points

0 comments3 min readLW link

Alignment Newsletter #13: 07/02/18

Rohin ShahJul 2, 2018, 4:10 PM

70 points

12 comments8 min readLW link

(mailchi.mp)

Alignment Newsletter #14

Rohin ShahJul 9, 2018, 4:20 PM

14 points

0 comments9 min readLW link

(mailchi.mp)

Alignment Newsletter #15: 07/16/18

Rohin ShahJul 16, 2018, 4:10 PM

42 points

0 comments15 min readLW link

(mailchi.mp)

Alignment Newsletter #17

Rohin ShahJul 30, 2018, 4:10 PM

32 points

0 comments13 min readLW link

(mailchi.mp)

Alignment Newsletter #18

Rohin ShahAug 6, 2018, 4:00 PM

17 points

0 comments10 min readLW link

(mailchi.mp)

Alignment Newsletter #19

Rohin ShahAug 14, 2018, 2:10 AM

18 points

0 comments13 min readLW link

(mailchi.mp)

Alignment Newsletter #20

Rohin ShahAug 20, 2018, 4:00 PM

12 points

2 comments6 min readLW link

(mailchi.mp)

Alignment Newsletter #21

Rohin ShahAug 27, 2018, 4:20 PM

25 points

0 comments7 min readLW link

(mailchi.mp)

Alignment Newsletter #22

Rohin ShahSep 3, 2018, 4:10 PM

18 points

0 comments6 min readLW link

(mailchi.mp)

Alignment Newsletter #23

Rohin ShahSep 10, 2018, 5:10 PM

16 points

0 comments7 min readLW link

(mailchi.mp)

Alignment Newsletter #24

Rohin ShahSep 17, 2018, 4:20 PM

10 points

6 comments12 min readLW link

(mailchi.mp)

Alignment Newsletter #25

Rohin ShahSep 24, 2018, 4:10 PM

18 points

3 comments9 min readLW link

(mailchi.mp)

Alignment Newsletter #26

Rohin ShahOct 2, 2018, 4:10 PM

13 points

0 comments7 min readLW link

(mailchi.mp)

Alignment Newsletter #27

Rohin ShahOct 9, 2018, 1:10 AM

16 points

0 comments9 min readLW link

(mailchi.mp)

Alignment Newsletter #28

Rohin ShahOct 15, 2018, 9:20 PM

11 points

0 comments8 min readLW link

(mailchi.mp)

Alignment Newsletter #29

Rohin ShahOct 22, 2018, 4:20 PM

15 points

0 comments9 min readLW link

(mailchi.mp)

Alignment Newsletter #30

Rohin ShahOct 29, 2018, 4:10 PM

29 points

2 comments6 min readLW link

(mailchi.mp)

Alignment Newsletter #31

Rohin ShahNov 5, 2018, 11:50 PM

17 points

0 comments12 min readLW link

(mailchi.mp)

Alignment Newsletter #32

Rohin ShahNov 12, 2018, 5:20 PM

18 points

0 comments12 min readLW link

(mailchi.mp)

Alignment Newsletter #33

Rohin ShahNov 19, 2018, 5:20 PM

23 points

0 comments9 min readLW link

(mailchi.mp)

Alignment Newsletter #34

Rohin ShahNov 26, 2018, 11:10 PM

24 points

0 comments10 min readLW link

(mailchi.mp)

Alignment Newsletter #35

Rohin ShahDec 4, 2018, 1:10 AM

15 points

0 comments6 min readLW link

(mailchi.mp)

Alignment Newsletter #37

Rohin ShahDec 17, 2018, 7:10 PM

25 points

4 comments10 min readLW link

(mailchi.mp)

Alignment Newsletter #38

Rohin ShahDec 25, 2018, 4:10 PM

9 points

0 comments8 min readLW link

(mailchi.mp)

Alignment Newsletter #39

Rohin ShahJan 1, 2019, 8:10 AM

32 points

2 comments5 min readLW link

(mailchi.mp)

Alignment Newsletter #40

Rohin ShahJan 8, 2019, 8:10 PM

21 points

2 comments5 min readLW link

(mailchi.mp)

Alignment Newsletter #41

Rohin ShahJan 17, 2019, 8:10 AM

22 points

6 comments10 min readLW link

(mailchi.mp)

Alignment Newsletter #42

Rohin ShahJan 22, 2019, 2:00 AM

20 points

1 comment10 min readLW link

(mailchi.mp)

Alignment Newsletter #43

Rohin ShahJan 29, 2019, 9:10 PM

14 points

2 comments13 min readLW link

(mailchi.mp)

Alignment Newsletter #44

Rohin ShahFeb 6, 2019, 8:30 AM

18 points

0 comments9 min readLW link

(mailchi.mp)

Alignment Newsletter #45

Rohin ShahFeb 14, 2019, 2:10 AM

25 points

2 comments8 min readLW link

(mailchi.mp)

Alignment Newsletter #46

Rohin ShahFeb 22, 2019, 12:10 AM

12 points

0 comments9 min readLW link

(mailchi.mp)

Alignment Newsletter #48

Rohin ShahMar 11, 2019, 9:10 PM

29 points

14 comments9 min readLW link

(mailchi.mp)

Alignment Newsletter #49

Rohin ShahMar 20, 2019, 4:20 AM

23 points

1 comment11 min readLW link

(mailchi.mp)

Alignment Newsletter #50

Rohin ShahMar 28, 2019, 6:10 PM

15 points

2 comments10 min readLW link

(mailchi.mp)

Alignment Newsletter #51

Rohin ShahApr 3, 2019, 4:10 AM

25 points

2 comments15 min readLW link

(mailchi.mp)

Alignment Newsletter #52

Rohin ShahApr 6, 2019, 1:20 AM

19 points

1 comment8 min readLW link

(mailchi.mp)

Alignment Newsletter One Year Retrospective

Rohin ShahApr 10, 2019, 6:58 AM

93 points

31 comments21 min readLW link

Alignment Newsletter #53

Rohin ShahApr 18, 2019, 5:20 PM

20 points

0 comments8 min readLW link

(mailchi.mp)

[AN #54] Boxing a finite-horizon AI system to keep it unambitious

Rohin ShahApr 28, 2019, 5:20 AM

20 points

0 comments8 min readLW link

(mailchi.mp)

[AN #55] Regulatory markets and international standards as a means of ensuring beneficial AI

Rohin ShahMay 5, 2019, 2:20 AM

17 points

2 comments8 min readLW link

(mailchi.mp)

[AN #56] Should ML researchers stop running experiments before making hypotheses?

Rohin ShahMay 21, 2019, 2:20 AM

21 points

8 comments9 min readLW link

(mailchi.mp)

[AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming

Rohin ShahJun 5, 2019, 11:20 PM

26 points

15 comments7 min readLW link

(mailchi.mp)

[AN #58] Mesa optimization: what it is, and why we should care

Rohin ShahJun 24, 2019, 4:10 PM

54 points

9 comments8 min readLW link

(mailchi.mp)

[AN #59] How arguments for AI risk have changed over time

Rohin ShahJul 8, 2019, 5:20 PM

43 points

4 comments7 min readLW link

(mailchi.mp)

[AN #60] A new AI challenge: Minecraft agents that assist human players in creative mode

Rohin ShahJul 22, 2019, 5:00 PM

23 points

6 comments9 min readLW link

(mailchi.mp)

[AN #62] Are adversarial examples caused by real but imperceptible features?

Rohin ShahAug 22, 2019, 5:10 PM

27 points

10 comments9 min readLW link

(mailchi.mp)

[AN #63] How architecture search, meta learning, and environment design could lead to general intelligence

Rohin ShahSep 10, 2019, 7:10 PM

21 points

12 comments8 min readLW link

(mailchi.mp)

[AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning

Rohin ShahSep 16, 2019, 5:10 PM

11 points

8 comments7 min readLW link

(mailchi.mp)

[AN #65]: Learning useful skills by watching humans “play”

Rohin ShahSep 23, 2019, 5:30 PM

11 points

0 comments9 min readLW link

(mailchi.mp)

[AN #66]: Decomposing robustness into capability robustness and alignment robustness

Rohin ShahSep 30, 2019, 6:00 PM

12 points

1 comment7 min readLW link

(mailchi.mp)

[AN #67]: Creating environments in which to study inner alignment failures

Rohin ShahOct 7, 2019, 5:10 PM

17 points

0 comments8 min readLW link

(mailchi.mp)

[AN #68]: The attainable utility theory of impact

Rohin ShahOct 14, 2019, 5:00 PM

17 points

0 comments8 min readLW link

(mailchi.mp)

[AN #69] Stuart Russell’s new book on why we need to replace the standard model of AI

Rohin ShahOct 19, 2019, 12:30 AM

60 points

12 comments15 min readLW link

(mailchi.mp)

[AN #70]: Agents that help humans who are still learning about their own preferences

Rohin ShahOct 23, 2019, 5:10 PM

16 points

0 comments9 min readLW link

(mailchi.mp)

[AN #71]: Avoiding reward tampering through current-RF optimization

Rohin ShahOct 30, 2019, 5:10 PM

12 points

0 comments7 min readLW link

(mailchi.mp)

[AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety

Rohin ShahNov 6, 2019, 6:10 PM

26 points

4 comments10 min readLW link

(mailchi.mp)

[AN #73]: Detecting catastrophic failures by learning how agents tend to break

Rohin ShahNov 13, 2019, 6:10 PM

11 points

0 comments7 min readLW link

(mailchi.mp)

[AN #74]: Separating beneficial AI into competence, alignment, and coping with impacts

Rohin ShahNov 20, 2019, 6:20 PM

19 points

0 comments7 min readLW link

(mailchi.mp)

[AN #75]: Solving Atari and Go with learned game models, and thoughts from a MIRI employee

Rohin ShahNov 27, 2019, 6:10 PM

38 points

1 comment10 min readLW link

(mailchi.mp)

[AN #76]: How dataset size affects robustness, and benchmarking safe exploration by measuring constraint violations

Rohin ShahDec 4, 2019, 6:10 PM

14 points

6 comments9 min readLW link

(mailchi.mp)

[AN #77]: Double descent: a unification of statistical theory and modern ML practice

Rohin ShahDec 18, 2019, 6:30 PM

21 points

4 comments14 min readLW link

(mailchi.mp)

[AN #78] Formalizing power and instrumental convergence, and the end-of-year AI safety charity comparison

Rohin ShahDec 26, 2019, 1:10 AM

26 points

10 comments9 min readLW link

(mailchi.mp)

[AN #79]: Recursive reward modeling as an alignment technique integrated with deep RL

Rohin ShahJan 1, 2020, 6:00 PM

13 points

0 comments12 min readLW link

(mailchi.mp)

[AN #81]: Universality as a potential solution to conceptual difficulties in intent alignment

Rohin ShahJan 8, 2020, 6:00 PM

31 points

4 comments11 min readLW link

(mailchi.mp)

[AN #82]: How OpenAI Five distributed their training computation

Rohin ShahJan 15, 2020, 6:20 PM

19 points

0 comments8 min readLW link

(mailchi.mp)

[AN #83]: Sample-efficient deep learning with ReMixMatch

Rohin ShahJan 22, 2020, 6:10 PM

15 points

4 comments11 min readLW link

(mailchi.mp)

[AN #84] Reviewing AI alignment work in 2018-19

Rohin ShahJan 29, 2020, 6:30 PM

23 points

0 comments6 min readLW link

(mailchi.mp)

[AN #85]: The normative questions we should be asking for AI alignment, and a surprisingly good chatbot

Rohin ShahFeb 5, 2020, 6:20 PM

14 points

2 comments7 min readLW link

(mailchi.mp)

[AN #86]: Improving debate and factored cognition through human experiments

Rohin ShahFeb 12, 2020, 6:10 PM

14 points

0 comments9 min readLW link

(mailchi.mp)

[AN #87]: What might happen as deep learning scales even further?

Rohin ShahFeb 19, 2020, 6:20 PM

28 points

0 comments4 min readLW link

(mailchi.mp)

[AN #88]: How the principal-agent literature relates to AI risk

Rohin ShahFeb 27, 2020, 9:10 AM

18 points

0 comments9 min readLW link

(mailchi.mp)

[AN #89]: A unifying formalism for preference learning algorithms

Rohin ShahMar 4, 2020, 6:20 PM

16 points

0 comments9 min readLW link

(mailchi.mp)

[AN #90]: How search landscapes can contain self-reinforcing feedback loops

Rohin ShahMar 11, 2020, 5:30 PM

11 points

6 comments8 min readLW link

(mailchi.mp)

[AN #91]: Concepts, implementations, problems, and a benchmark for impact measurement

Rohin ShahMar 18, 2020, 5:10 PM

15 points

10 comments13 min readLW link

(mailchi.mp)

[AN #92]: Learning good representations with contrastive predictive coding

Rohin ShahMar 25, 2020, 5:20 PM

18 points

1 comment10 min readLW link

(mailchi.mp)

[AN #93]: The Precipice we’re standing at, and how we can back away from it

Rohin ShahApr 1, 2020, 5:10 PM

24 points

0 comments7 min readLW link

(mailchi.mp)

Forecasting AI Progress: A Research Agenda

rossg and axioman

Aug 10, 2020, 1:04 AM

39 points

4 comments1 min readLW link

The Steering Problem

paulfchristianoNov 13, 2018, 5:14 PM

43 points

12 comments7 min readLW link

Will humans build goal-directed agents?

Rohin ShahJan 5, 2019, 1:33 AM

51 points

43 comments5 min readLW link

Prosaic AI alignment

paulfchristianoNov 20, 2018, 1:56 PM

40 points

10 comments8 min readLW link

David Chalmers’ “The Singularity: A Philosophical Analysis”

lukeprogJan 29, 2011, 2:52 AM

55 points

203 comments4 min readLW link

[Talk] Paul Christiano on his alignment taxonomy

jpSep 27, 2019, 6:37 PM

31 points

1 comment1 min readLW link

(www.youtube.com)

Dreams of AI Design

Eliezer YudkowskyAug 27, 2008, 4:04 AM

26 points

61 comments5 min readLW link

Qualitative Strategies of Friendliness

Eliezer YudkowskyAug 30, 2008, 2:12 AM

30 points

56 comments12 min readLW link

Oracles, sequence predictors, and self-confirming predictions

Stuart_ArmstrongMay 3, 2019, 2:09 PM

22 points

0 comments3 min readLW link

Self-confirming prophecies, and simplified Oracle designs

Stuart_ArmstrongJun 28, 2019, 9:57 AM

6 points

1 comment5 min readLW link

Investment idea: basket of tech stocks weighted towards AI

ioannesAug 12, 2020, 9:30 PM

14 points

7 comments3 min readLW link

Conceptual issues in AI safety: the paradigmatic gap

vedevazzJun 24, 2018, 3:09 PM

33 points

0 comments1 min readLW link

(www.foldl.me)

Disagreement with Paul: alignment induction

Stuart_ArmstrongSep 10, 2018, 1:54 PM

31 points

6 comments1 min readLW link

Largest open collection quotes about AI

teradimichJul 12, 2019, 5:18 PM

35 points

2 comments3 min readLW link

(drive.google.com)

S.E.A.R.L.E’s COBOL room

Stuart_ArmstrongFeb 1, 2013, 8:29 PM

52 points

36 comments2 min readLW link

Introducing Corrigibility (an FAI research subfield)

So8resOct 20, 2014, 9:09 PM

52 points

28 comments3 min readLW link

NES-game playing AI [video link and AI-boxing-related comment]

Dr_ManhattanApr 12, 2013, 1:11 PM

42 points

22 comments1 min readLW link

On unfixably unsafe AGI architectures

Steven ByrnesFeb 19, 2020, 9:16 PM

33 points

8 comments5 min readLW link

To contribute to AI safety, consider doing AI research

VikaJan 16, 2016, 8:42 PM

39 points

39 comments2 min readLW link

Ghosts in the Machine

Eliezer YudkowskyJun 17, 2008, 11:29 PM

54 points

30 comments4 min readLW link

Technical AGI safety research outside AI

Richard_NgoOct 18, 2019, 3:00 PM

43 points

3 comments3 min readLW link

Deciphering China’s AI Dream

Qiaochu_YuanMar 18, 2018, 3:26 AM

12 points

2 comments1 min readLW link

(www.fhi.ox.ac.uk)

Above-Average AI Scientists

Eliezer YudkowskySep 28, 2008, 11:04 AM

57 points

97 comments8 min readLW link

The Nature of Logic

Eliezer YudkowskyNov 15, 2008, 6:20 AM

37 points

12 comments10 min readLW link

Oracle paper

Stuart_ArmstrongDec 13, 2017, 2:59 PM

12 points

7 comments1 min readLW link

AI Alignment Writing Day Roundup #1

Ben PaceAug 30, 2019, 1:26 AM

32 points

12 comments1 min readLW link

Notes on the Safety in Artificial Intelligence conference

UmamiSalamiJul 1, 2016, 12:36 AM

40 points

15 comments13 min readLW link

Reinterpreting “AI and Compute”

habrykaDec 25, 2018, 9:12 PM

30 points

10 comments1 min readLW link

(aiimpacts.org)

AI Safety Prerequisites Course: Revamp and New Lessons

philip_bFeb 3, 2019, 9:04 PM

24 points

5 comments1 min readLW link

An angle of attack on Open Problem #1

BenyaAug 18, 2012, 12:08 PM

47 points

85 comments7 min readLW link

Evaluating the feasibility of SI’s plan

JoshuaFoxJan 10, 2013, 8:17 AM

38 points

188 comments4 min readLW link

Only humans can have human values

PhilGoetzApr 26, 2010, 6:57 PM

51 points

161 comments17 min readLW link

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

DragonGodDec 6, 2017, 6:01 AM

13 points

4 comments1 min readLW link

(arxiv.org)

Cake, or death!

Stuart_ArmstrongOct 25, 2012, 10:33 AM

46 points

13 comments4 min readLW link

Self-regulation of safety in AI research

Gordon Seidoh WorleyFeb 25, 2018, 11:17 PM

12 points

6 comments2 min readLW link

How safe “safe” AI development?

Gordon Seidoh WorleyFeb 28, 2018, 11:21 PM

9 points

1 comment1 min readLW link

Stanford Intro to AI course to be taught for free online

Psy-KoshJul 30, 2011, 4:22 PM

38 points

39 comments1 min readLW link

Bayesian Utility: Representing Preference by Probability Measures

Vladimir_NesovJul 27, 2009, 2:28 PM

45 points

37 comments2 min readLW link

Gains from trade: Slug versus Galaxy—how much would I give up to control you?

Stuart_ArmstrongJul 23, 2013, 7:06 PM

55 points

67 comments7 min readLW link

Defeating Mundane Holocausts With Robots

lsparrishMay 30, 2011, 10:34 PM

34 points

28 comments2 min readLW link

Assuming we’ve solved X, could we do Y...

Stuart_ArmstrongDec 11, 2018, 6:13 PM

31 points

16 comments2 min readLW link

The Stamp Collector

So8resMay 1, 2015, 11:11 PM

45 points

14 comments6 min readLW link

Saving the world in 80 days: Prologue

Logan RiggsMay 9, 2018, 9:16 PM

12 points

16 comments2 min readLW link

Project Proposal: Considerations for trading off capabilities and safety impacts of AI research

David Scott Krueger (formerly: capybaralet)Aug 6, 2019, 10:22 PM

25 points

11 comments2 min readLW link

AI Safety Prerequisites Course: Basic abstract representations of computation

RAISEMar 13, 2019, 7:38 PM

28 points

2 comments1 min readLW link

What I Think, If Not Why

Eliezer YudkowskyDec 11, 2008, 5:41 PM

41 points

103 comments4 min readLW link

RFC: Philosophical Conservatism in AI Alignment Research

Gordon Seidoh WorleyMay 15, 2018, 3:29 AM

17 points

13 comments1 min readLW link

Predicted AI alignment event/meeting calendar

rmoehnAug 14, 2019, 7:14 AM

29 points

14 comments1 min readLW link

Simplified preferences needed; simplified preferences sufficient

Stuart_ArmstrongMar 5, 2019, 7:39 PM

29 points

6 comments3 min readLW link

Reward function learning: the value function

Stuart_ArmstrongApr 24, 2018, 4:29 PM

9 points

0 comments11 min readLW link

Reward function learning: the learning process

Stuart_ArmstrongApr 24, 2018, 12:56 PM

6 points

11 comments8 min readLW link

Utility versus Reward function: partial equivalence

Stuart_ArmstrongApr 13, 2018, 2:58 PM

17 points

5 comments5 min readLW link

Full toy model for preference learning

Stuart_ArmstrongOct 16, 2019, 11:06 AM

20 points

2 comments12 min readLW link

New(ish) AI control ideas

Stuart_ArmstrongOct 31, 2017, 12:52 PM

0 points

0 comments4 min readLW link

Rigging is a form of wireheading

Stuart_ArmstrongMay 3, 2018, 12:50 PM

11 points

2 comments1 min readLW link

The reward engineering problem

paulfchristianoJan 16, 2019, 6:47 PM

26 points

3 comments7 min readLW link

AI cooperation in practice

cousin_itJul 30, 2010, 4:21 PM

37 points

166 comments1 min readLW link

Examples of AI’s behaving badly

Stuart_ArmstrongJul 16, 2015, 10:01 AM

41 points

37 comments1 min readLW link

Controlling Constant Programs

Vladimir_NesovSep 5, 2010, 1:45 PM

34 points

33 comments5 min readLW link

Autism, Watson, the Turing test, and General Intelligence

Stuart_ArmstrongSep 24, 2013, 11:00 AM

11 points

22 comments1 min readLW link

Pessimism About Unknown Unknowns Inspires Conservatism

michaelcohenFeb 3, 2020, 2:48 PM

31 points

2 comments5 min readLW link

The National Security Commission on Artificial Intelligence Wants You (to submit essays and articles on the future of government AI policy)

quanticleJul 18, 2019, 5:21 PM

30 points

0 comments1 min readLW link

(warontherocks.com)

Systems Engineering and the META Program

ryan_bDec 20, 2018, 8:19 PM

30 points

3 comments1 min readLW link

Human errors, human values

PhilGoetzApr 9, 2011, 2:50 AM

45 points

138 comments1 min readLW link

ISO: Name of Problem

johnswentworthJul 24, 2018, 5:15 PM

28 points

15 comments1 min readLW link

Muehlhauser-Goertzel Dialogue, Part 1

lukeprogMar 16, 2012, 5:12 PM

42 points

161 comments33 min readLW link

Specification gaming examples in AI

Samuel RødalNov 10, 2018, 12:00 PM

24 points

6 comments1 min readLW link

(docs.google.com)

Superintelligence Reading Group—Section 1: Past Developments and Present Capabilities

KatjaGraceSep 16, 2014, 1:00 AM

43 points

233 comments7 min readLW link

[Question] What are the differences between all the iterative/recursive approaches to AI alignment?

riceissaSep 21, 2019, 2:09 AM

30 points

14 comments2 min readLW link

Algorithmic Similarity

LukasMAug 23, 2019, 4:39 PM

27 points

10 comments11 min readLW link

Directions and desiderata for AI alignment

paulfchristianoJan 13, 2019, 7:47 AM

47 points

1 comment14 min readLW link

Friendly AI Research and Taskification

multifoliateroseDec 14, 2010, 6:30 AM

30 points

47 comments5 min readLW link

Against easy superintelligence: the unforeseen friction argument

Stuart_ArmstrongJul 10, 2013, 1:47 PM

39 points

48 comments5 min readLW link

[Question] Why are the people who could be doing safety research, but aren’t, doing something else?

Adam SchollAug 29, 2019, 8:51 AM

27 points

19 comments1 min readLW link

TV’s “Elementary” Tackles Friendly AI and X-Risk—“Bella” (Possible Spoilers)

pjebyNov 22, 2014, 7:51 PM

48 points

18 comments2 min readLW link

Universality Unwrapped

adamShimiAug 21, 2020, 6:53 PM

28 points

2 comments18 min readLW link

AI Risk and Opportunity: Humanity’s Efforts So Far

lukeprogMar 21, 2012, 2:49 AM

53 points

49 comments23 min readLW link

Learning with catastrophes

paulfchristianoJan 23, 2019, 3:01 AM

27 points

9 comments4 min readLW link

[Question] Degree of duplication and coordination in projects that examine computing prices, AI progress, and related topics?

riceissaApr 23, 2019, 12:27 PM

26 points

1 comment2 min readLW link

Solving the AI Race Finalists

Gordon Seidoh WorleyJul 19, 2018, 9:04 PM

24 points

0 comments1 min readLW link

(medium.com)

An Agent is a Worldline in Tegmark V

komponistoJul 12, 2018, 5:12 AM

24 points

12 comments2 min readLW link

Towards formalizing universality

paulfchristianoJan 13, 2019, 8:39 PM

27 points

19 comments18 min readLW link

Conceptual Analysis for AI Alignment

David Scott Krueger (formerly: capybaralet)Dec 30, 2018, 12:46 AM

26 points

3 comments2 min readLW link

Gwern’s “Why Tool AIs Want to Be Agent AIs: The Power of Agency”

habrykaMay 5, 2019, 5:11 AM

26 points

3 comments1 min readLW link

(www.gwern.net)

[Question] Why not tool AI?

smitheeJan 19, 2019, 10:18 PM

19 points

10 comments1 min readLW link

Superintelligence 16: Tool AIs

KatjaGraceDec 30, 2014, 2:00 AM

12 points

37 comments7 min readLW link

Thinking of tool AIs

Michele CampoloNov 20, 2019, 9:47 PM

6 points

2 comments4 min readLW link

Reply to Holden on ‘Tool AI’

Eliezer YudkowskyJun 12, 2012, 6:00 PM

152 points

357 comments17 min readLW link

Reply to Holden on The Singularity Institute

lukeprogJul 10, 2012, 11:20 PM

69 points

215 comments26 min readLW link

Levels of AI Self-Improvement

avturchinApr 29, 2018, 11:45 AM

11 points

0 comments39 min readLW link

AI: requirements for pernicious policies

Stuart_ArmstrongJul 17, 2015, 2:18 PM

11 points

3 comments3 min readLW link

Tools want to become agents

Stuart_ArmstrongJul 4, 2014, 10:12 AM

24 points

81 comments1 min readLW link

Superintelligence reading group

KatjaGraceAug 31, 2014, 2:59 PM

31 points

2 comments2 min readLW link

Superintelligence Reading Group 2: Forecasting AI

KatjaGraceSep 23, 2014, 1:00 AM

17 points

109 comments11 min readLW link

Superintelligence Reading Group 3: AI and Uploads

KatjaGraceSep 30, 2014, 1:00 AM

17 points

139 comments6 min readLW link

SRG 4: Biological Cognition, BCIs, Organizations

KatjaGraceOct 7, 2014, 1:00 AM

14 points

139 comments5 min readLW link

Superintelligence 5: Forms of Superintelligence

KatjaGraceOct 14, 2014, 1:00 AM

22 points

114 comments5 min readLW link

Superintelligence 6: Intelligence explosion kinetics

KatjaGraceOct 21, 2014, 1:00 AM

15 points

68 comments8 min readLW link

Superintelligence 7: Decisive strategic advantage

KatjaGraceOct 28, 2014, 1:01 AM

18 points

60 comments6 min readLW link

Superintelligence 8: Cognitive superpowers

KatjaGraceNov 4, 2014, 2:01 AM

14 points

96 comments6 min readLW link

Superintelligence 9: The orthogonality of intelligence and goals

KatjaGraceNov 11, 2014, 2:00 AM

13 points

80 comments7 min readLW link

Superintelligence 10: Instrumentally convergent goals

KatjaGraceNov 18, 2014, 2:00 AM

13 points

33 comments5 min readLW link

Superintelligence 11: The treacherous turn

KatjaGraceNov 25, 2014, 2:00 AM

16 points

50 comments6 min readLW link

Superintelligence 12: Malignant failure modes

KatjaGraceDec 2, 2014, 2:02 AM

15 points

51 comments5 min readLW link

Superintelligence 13: Capability control methods

KatjaGraceDec 9, 2014, 2:00 AM

14 points

48 comments6 min readLW link

Superintelligence 14: Motivation selection methods

KatjaGraceDec 16, 2014, 2:00 AM

9 points

28 comments5 min readLW link

Superintelligence 15: Oracles, genies and sovereigns

KatjaGraceDec 23, 2014, 2:01 AM

11 points

30 comments7 min readLW link

Superintelligence 17: Multipolar scenarios

KatjaGraceJan 6, 2015, 6:44 AM

9 points

38 comments6 min readLW link

Superintelligence 18: Life in an algorithmic economy

KatjaGraceJan 13, 2015, 2:00 AM

10 points

52 comments6 min readLW link

Superintelligence 19: Post-transition formation of a singleton

KatjaGraceJan 20, 2015, 2:00 AM

12 points

35 comments7 min readLW link

Superintelligence 20: The value-loading problem

KatjaGraceJan 27, 2015, 2:00 AM

8 points

21 comments6 min readLW link

Superintelligence 21: Value learning

KatjaGraceFeb 3, 2015, 2:01 AM

12 points

33 comments4 min readLW link

Superintelligence 22: Emulation modulation and institutional design

KatjaGraceFeb 10, 2015, 2:06 AM

13 points

11 comments6 min readLW link

Superintelligence 23: Coherent extrapolated volition

KatjaGraceFeb 17, 2015, 2:00 AM

11 points

97 comments7 min readLW link

Superintelligence 24: Morality models and “do what I mean”

KatjaGraceFeb 24, 2015, 2:00 AM

13 points

47 comments6 min readLW link

Objections to Coherent Extrapolated Volition

XiXiDuNov 22, 2011, 10:32 AM

12 points

56 comments3 min readLW link

CEV: coherence versus extrapolation

Stuart_ArmstrongSep 22, 2014, 11:24 AM

21 points

17 comments2 min readLW link

What if AI doesn’t quite go FOOM?

Mass_DriverJun 20, 2010, 12:03 AM

16 points

191 comments5 min readLW link

Superintelligence 25: Components list for acquiring values

KatjaGraceMar 3, 2015, 2:01 AM

11 points

12 comments8 min readLW link

Superintelligence 26: Science and technology strategy

KatjaGraceMar 10, 2015, 1:43 AM

14 points

21 comments6 min readLW link

Superintelligence 27: Pathways and enablers

KatjaGraceMar 17, 2015, 1:00 AM

15 points

21 comments8 min readLW link

Superintelligence 28: Collaboration

KatjaGraceMar 24, 2015, 1:29 AM

13 points

21 comments6 min readLW link

Superintelligence 29: Crunch time

KatjaGraceMar 31, 2015, 4:24 AM

14 points

27 comments6 min readLW link

Universal agents and utility functions

AnjaNov 14, 2012, 4:05 AM

43 points

38 comments6 min readLW link

Looking for remote writing partners (for AI alignment research)

rmoehnOct 1, 2019, 2:16 AM

23 points

4 comments2 min readLW link

Self-Supervised Learning and AGI Safety

Steven ByrnesAug 7, 2019, 2:21 PM

29 points

9 comments12 min readLW link

Which of these five AI alignment research projects ideas are no good?

rmoehnAug 8, 2019, 7:17 AM

25 points

13 comments1 min readLW link

Understanding understanding

mthqAug 23, 2019, 6:10 PM

24 points

1 comment2 min readLW link

Evaluating Existing Approaches to AGI Alignment

Gordon Seidoh WorleyMar 27, 2018, 7:57 PM

12 points

0 comments4 min readLW link

(mapandterritory.org)

CEV: a utilitarian critique

PabloJan 26, 2013, 4:12 PM

32 points

94 comments5 min readLW link

Vingean Reflection: Reliable Reasoning for Self-Improving Agents

So8resJan 15, 2015, 10:47 PM

37 points

5 comments9 min readLW link

Slide deck: Introduction to AI Safety

Aryeh EnglanderJan 29, 2020, 3:57 PM

22 points

0 comments1 min readLW link

(drive.google.com)

The Self-Unaware AI Oracle

Steven ByrnesJul 22, 2019, 7:04 PM

21 points

38 comments8 min readLW link

May Gwern.net newsletter (w/GPT-3 commentary)

gwernJun 2, 2020, 3:40 PM

32 points

7 comments1 min readLW link

(www.gwern.net)

Build a Causal Decision Theorist

michaelcohenSep 23, 2019, 8:43 PM

1 point

14 comments4 min readLW link

A trick for Safer GPT-N

RaziedAug 23, 2020, 12:39 AM

7 points

1 comment2 min readLW link

Introduction To The Infra-Bayesianism Sequence

Diffractor and Vanessa Kosoy

Aug 26, 2020, 8:31 PM

104 points

64 comments14 min readLW link 2 reviews

Model splintering: moving from one imperfect model to another

Stuart_ArmstrongAug 27, 2020, 11:53 AM

74 points

10 comments33 min readLW link

Algorithmic Progress in Six Domains

lukeprogAug 3, 2013, 2:29 AM

38 points

32 comments1 min readLW link

[Question] What are some good examples of incorrigibility?

RyanCareyApr 28, 2019, 12:22 AM

23 points

17 comments1 min readLW link

Safely and usefully spectating on AIs optimizing over toy worlds

AlexMennenJul 31, 2018, 6:30 PM

24 points

16 comments2 min readLW link

Updates and additions to “Embedded Agency”

Rob Bensinger and abramdemski

Aug 29, 2020, 4:22 AM

73 points

1 comment3 min readLW link

[LINK] Terrorists target AI researchers

RobertLumleySep 15, 2011, 2:22 PM

32 points

35 comments1 min readLW link

Analysing: Dangerous messages from future UFAI via Oracles

Stuart_ArmstrongNov 22, 2019, 2:17 PM

22 points

16 comments4 min readLW link

Exploring Botworld

So8resApr 30, 2014, 10:29 PM

34 points

2 comments6 min readLW link

interpreting GPT: the logit lens

nostalgebraistAug 31, 2020, 2:47 AM

158 points

32 comments11 min readLW link

From GPT to AGI

ChristianKlAug 31, 2020, 1:28 PM

6 points

7 comments1 min readLW link

Logical or Connectionist AI?

Eliezer YudkowskyNov 17, 2008, 8:03 AM

39 points

26 comments9 min readLW link

Artificial Intelligence and Life Sciences (Why Big Data is not enough to capture biological systems?)

HansNaujJan 15, 2020, 1:59 AM

6 points

3 comments6 min readLW link

The Case against Killer Robots (link)

D_AlexNov 20, 2012, 7:47 AM

12 points

25 comments1 min readLW link

Near-Term Risk: Killer Robots a Threat to Freedom and Democracy

EpiphanyJun 14, 2013, 6:28 AM

15 points

105 comments2 min readLW link

Muehlhauser-Wang Dialogue

lukeprogApr 22, 2012, 10:40 PM

34 points

288 comments12 min readLW link

Google may be trying to take over the world

[deleted]Jan 27, 2014, 9:33 AM

33 points

133 comments1 min readLW link

Gwern about centaurs: there is no chance that any useful man+machine combination will work together for more than 10 years, as humans soon will be only a liability

avturchinDec 15, 2018, 9:32 PM

31 points

4 comments1 min readLW link

(www.reddit.com)

Q&A with Abram Demski on risks from AI

XiXiDuJan 17, 2012, 9:43 AM

33 points

71 comments9 min readLW link

Q&A with experts on risks from AI #2

XiXiDuJan 9, 2012, 7:40 PM

22 points

29 comments7 min readLW link

Let the AI teach you how to flirt

DirectedEvolutionSep 17, 2020, 7:04 PM

47 points

11 comments2 min readLW link

Online AI Safety Discussion Day

Linda LinseforsOct 8, 2020, 12:11 PM

5 points

0 comments1 min readLW link

New(ish) AI control ideas

Stuart_ArmstrongMar 5, 2015, 5:03 PM

34 points

14 comments3 min readLW link

Not Taking Over the World

Eliezer YudkowskyDec 15, 2008, 10:18 PM

35 points

97 comments4 min readLW link

Naturalistic trust among AIs: The parable of the thesis advisor’s theorem

BenyaDec 15, 2013, 8:32 AM

36 points

20 comments6 min readLW link

The Solomonoff Prior is Malign

Mark XuOct 14, 2020, 1:33 AM

148 points

52 comments16 min readLW link 3 reviews

Twenty-three AI alignment research project definitions

rmoehnFeb 3, 2020, 10:21 PM

23 points

0 comments6 min readLW link

When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors

Stuart_ArmstrongDec 19, 2019, 1:55 PM

24 points

18 comments7 min readLW link

[Question] As a Washed Up Former Data Scientist and Machine Learning Researcher What Direction Should I Go In Now?

DarklightOct 19, 2020, 8:13 PM

13 points

7 comments3 min readLW link

Artificial Mysterious Intelligence

Eliezer YudkowskyDec 7, 2008, 8:05 PM

29 points

24 comments5 min readLW link

A Premature Word on AI

Eliezer YudkowskyMay 31, 2008, 5:48 PM

26 points

69 comments8 min readLW link

Let’s reimplement EURISKO!

cousin_itJun 11, 2009, 4:28 PM

23 points

162 comments1 min readLW link

Corrigibility thoughts III: manipulating versus deceiving

Stuart_ArmstrongJan 18, 2017, 3:57 PM

3 points

0 comments1 min readLW link

[Question] [Meta] Do you want AIS Webinars?

Linda LinseforsMar 21, 2020, 4:01 PM

18 points

7 comments1 min readLW link

New article from Oren Etzioni

Aryeh EnglanderFeb 25, 2020, 3:25 PM

19 points

19 comments2 min readLW link

Singletons Rule OK

Eliezer YudkowskyNov 30, 2008, 4:45 PM

20 points

47 comments5 min readLW link

“On the Impossibility of Supersized Machines”

crmflynnMar 31, 2017, 11:32 PM

24 points

4 comments1 min readLW link

(philpapers.org)

Nonsentient Optimizers

Eliezer YudkowskyDec 27, 2008, 2:32 AM

34 points

48 comments6 min readLW link

Building Something Smarter

Eliezer YudkowskyNov 2, 2008, 5:00 PM

22 points

57 comments4 min readLW link

Let’s Read: an essay on AI Theology

Yuxi_LiuJul 4, 2019, 7:50 AM

22 points

9 comments7 min readLW link

Wanted: Python open source volunteers

Eliezer YudkowskyMar 11, 2009, 4:59 AM

16 points

13 comments1 min readLW link

Equilibrium and prior selection problems in multipolar deployment

JesseCliftonApr 2, 2020, 8:06 PM

20 points

11 comments11 min readLW link

[Question] The Simulation Epiphany Problem

Koen.HoltmanOct 31, 2019, 10:12 PM

15 points

13 comments4 min readLW link

Changing accepted public opinion and Skynet

RokoMay 22, 2009, 11:05 AM

17 points

71 comments2 min readLW link

Introducing CADIE

MBlumeApr 1, 2009, 7:32 AM

0 points

8 comments1 min readLW link

Deepmind Plans for Rat-Level AI

moridinamaelAug 18, 2016, 4:26 PM

34 points

9 comments1 min readLW link

“Robot scientists can think for themselves”

CronoDASApr 2, 2009, 9:16 PM

−1 points

11 comments1 min readLW link

Automating reasoning about the future at Ought

jungofthewonNov 9, 2020, 9:51 PM

17 points

0 comments1 min readLW link

(ought.org)

Neural program synthesis is a dangerous technology

syllogismJan 12, 2018, 4:19 PM

10 points

6 comments2 min readLW link

New, Brief Popular-Level Introduction to AI Risks and Superintelligence

LyleNJan 23, 2015, 3:43 PM

33 points

3 comments1 min readLW link

In the beginning, Dartmouth created the AI and the hype

Stuart_ArmstrongJan 24, 2013, 4:49 PM

33 points

22 comments1 min readLW link

Fundamental Philosophical Problems Inherent in AI discourse

AlexSadlerSep 16, 2018, 9:03 PM

23 points

1 comment17 min readLW link

Research Priorities for Artificial Intelligence: An Open Letter

jimrandomhJan 11, 2015, 7:52 PM

38 points

11 comments1 min readLW link

[Question] How can I help research Friendly AI?

avichapmanJul 9, 2019, 12:15 AM

22 points

3 comments1 min readLW link

FAI Research Constraints and AGI Side Effects

JustinShovelainJun 3, 2015, 7:25 PM

26 points

59 comments7 min readLW link

[Question] How to deal with a misleading conference talk about AI risk?

rmoehnJun 27, 2019, 9:04 PM

21 points

13 comments4 min readLW link

Implications of Quantum Computing for Artificial Intelligence Alignment Research

Jsevillamol and PabloAMC

Aug 22, 2019, 10:33 AM

24 points

3 comments13 min readLW link

[Question] How can labour productivity growth be an indicator of automation?

PolytoposNov 16, 2020, 9:16 PM

2 points

5 comments1 min readLW link

[Question] Should I do it?

MrLightNov 19, 2020, 1:08 AM

−3 points

16 comments2 min readLW link

My intellectual influences

Richard_NgoNov 22, 2020, 6:00 PM

92 points

1 comment5 min readLW link

(thinkingcomplete.blogspot.com)

Delegated agents in practice: How companies might end up selling AI services that act on behalf of consumers and coalitions, and what this implies for safety research

RemmeltNov 26, 2020, 11:17 AM

7 points

5 comments4 min readLW link

SETI Predictions

hippkeNov 30, 2020, 8:09 PM

23 points

8 comments1 min readLW link

What happens when your beliefs fully propagate

AlexeiFeb 14, 2012, 7:53 AM

29 points

79 comments7 min readLW link

Interactive exploration of LessWrong and other large collections of documents

vpetukhov and FriendlyOwl

Dec 20, 2020, 7:06 PM

49 points

9 comments10 min readLW link

[Question] Will AGI have “human” flaws?

Agustinus TheodorusDec 23, 2020, 3:43 AM

1 point

2 comments1 min readLW link

Optimum number of single points of failure

Douglas_ReayMar 14, 2018, 1:30 PM

7 points

4 comments4 min readLW link

Don’t put all your eggs in one basket

Douglas_ReayMar 15, 2018, 8:07 AM

5 points

0 comments7 min readLW link

Defect or Cooperate

Douglas_ReayMar 16, 2018, 2:12 PM

4 points

5 comments6 min readLW link

Environments for killing AIs

Douglas_ReayMar 17, 2018, 3:23 PM

3 points

1 comment9 min readLW link

The advantage of not being open-ended

Douglas_ReayMar 18, 2018, 1:50 PM

7 points

2 comments6 min readLW link

Metamorphosis

Douglas_ReayApr 12, 2018, 9:53 PM

2 points

0 comments4 min readLW link

Believable Promises

Douglas_ReayApr 16, 2018, 4:17 PM

5 points

0 comments5 min readLW link

Trustworthy Computing

Douglas_ReayApr 10, 2018, 7:55 AM

9 points

1 comment6 min readLW link

Edge of the Cliff

akaTricksterJan 5, 2021, 5:21 PM

1 point

0 comments5 min readLW link

[Question] How is reinforcement learning possible in non-sentient agents?

SomeoneKindJan 5, 2021, 8:57 PM

3 points

5 comments1 min readLW link

AI Alignment Using Reverse Simulation

Sven NilsenJan 12, 2021, 8:48 PM

1 point

0 comments1 min readLW link

A toy model of the control problem

Stuart_ArmstrongSep 16, 2015, 2:59 PM

36 points

24 comments3 min readLW link

On the nature of purpose

Nora_AmmannJan 22, 2021, 8:30 AM

28 points

15 comments9 min readLW link

Learning Normativity: Language

BunthutFeb 5, 2021, 10:26 PM

14 points

4 comments8 min readLW link

Singularity&phase transition-2. A priori probability and ways to check.

Valentin2026Feb 8, 2021, 2:21 AM

1 point

0 comments3 min readLW link

Nonperson Predicates

Eliezer YudkowskyDec 27, 2008, 1:47 AM

52 points

176 comments6 min readLW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjrFeb 12, 2021, 7:55 AM

15 points

0 comments26 min readLW link

2021-03-01 National Library of Medicine Presentation: “Atlas of AI: Mapping the social and economic forces behind AI”

IrenicTruthFeb 17, 2021, 6:23 PM

1 point

0 comments2 min readLW link

Chaotic era: avoid or survive?

Valentin2026Feb 22, 2021, 1:34 AM

3 points

3 comments2 min readLW link

Suffering-Focused Ethics in the Infinite Universe. How can we redeem ourselves if Multiverse Immortality is real and subjective death is impossible.

Szymon KucharskiFeb 24, 2021, 9:02 PM

−3 points

4 comments70 min readLW link

AIDungeon 3.1

Yair HalberstadtMar 1, 2021, 5:56 AM

2 points

0 comments2 min readLW link

Physicalism implies experience never dies. So what am I going to experience after it does?

Szymon KucharskiMar 14, 2021, 2:45 PM

−2 points

1 comment30 min readLW link

An Antropic Argument for Post-singularity Antinatalism

monkaapMar 16, 2021, 5:40 PM

3 points

4 comments3 min readLW link

[Question] Is a Self-Iterating AGI Vulnerable to Thompson-style Trojans?

sxaeMar 25, 2021, 2:46 PM

15 points

7 comments3 min readLW link

AI oracles on blockchain

CaravaggioApr 6, 2021, 8:13 PM

5 points

0 comments3 min readLW link

What if AGI is near?

Wulky WilkinsenApr 14, 2021, 12:05 AM

11 points

5 comments1 min readLW link

Review of “Why AI is Harder Than We Think”

electroswingApr 30, 2021, 6:14 PM

40 points

10 comments8 min readLW link

Thoughts on the Alignment Implications of Scaling Language Models

leogaoJun 2, 2021, 9:32 PM

79 points

11 comments17 min readLW link

[Question] Suppose $1 billion is given to AI Safety. How should it be spent?

hunterglennMay 15, 2021, 11:24 PM

23 points

2 comments1 min readLW link

Controlling Intelligent Agents The Only Way We Know How: Ideal Bureaucratic Structure (IBS)

Justin BullockMay 24, 2021, 12:53 PM

11 points

11 comments6 min readLW link

Curated conversations with brilliant rationalists

spencergMay 28, 2021, 2:23 PM

153 points

18 comments6 min readLW link

Security Mindset and Ordinary Paranoia

Eliezer YudkowskyNov 25, 2017, 5:53 PM

98 points

24 comments29 min readLW link

The Anti-Carter Basilisk

Jon GilbertMay 26, 2021, 10:56 PM

0 points

0 comments2 min readLW link

Parameter counts in Machine Learning

Jsevillamol and Pablo Villalobos

Jun 19, 2021, 4:04 PM

47 points

16 comments7 min readLW link

Irrational Modesty

Tomás B.Jun 20, 2021, 7:38 PM

132 points

7 comments1 min readLW link

[Question] Thoughts on a “Sequences Inspired” PhD Topic

goose000Jun 17, 2021, 8:36 PM

7 points

2 comments2 min readLW link

Some alternatives to “Friendly AI”

lukeprogJun 15, 2014, 7:53 PM

30 points

44 comments2 min readLW link

Intelligence without Consciousness

Andrew VlahosJul 7, 2021, 5:27 AM

13 points

5 comments1 min readLW link

[Question] What would it look like if it looked like AGI was very near?

Tomás B.Jul 12, 2021, 3:22 PM

52 points

25 comments1 min readLW link

Is the argument that AI is an xrisk valid?

MACannonJul 19, 2021, 1:20 PM

5 points

62 comments1 min readLW link

(onlinelibrary.wiley.com)

[Question] Jaynesian interpretation—How does “estimating probabilities” make sense?

Haziq MuhammadJul 21, 2021, 9:36 PM

4 points

40 comments1 min readLW link

The biological intelligence explosion

Rob LucasJul 25, 2021, 1:08 PM

8 points

6 comments4 min readLW link

[Question] Do Bayesians like Bayesian model Averaging?

Haziq MuhammadAug 2, 2021, 12:24 PM

4 points

13 comments1 min readLW link

[Question] Question about Test-sets and Bayesian machine learning

Haziq MuhammadAug 9, 2021, 5:16 PM

2 points

8 comments1 min readLW link

[Question] Halpern’s paper—A refutation of Cox’s theorem?

Haziq MuhammadAug 11, 2021, 9:25 AM

11 points

7 comments1 min readLW link

New GPT-3 competitor

Quintin PopeAug 12, 2021, 7:05 AM

32 points

10 comments1 min readLW link

[Question] Jaynes-Cox Probability: Are plausibilities objective?

Haziq MuhammadAug 12, 2021, 2:23 PM

9 points

17 comments1 min readLW link

A gentle apocalypse

pchvykovAug 16, 2021, 5:03 AM

3 points

5 comments3 min readLW link

[Question] Is it worth making a database for moral predictions?

Jonas HallgrenAug 16, 2021, 2:51 PM

1 point

0 comments2 min readLW link

Cynical explanations of FAI critics (including myself)

Wei DaiAug 13, 2012, 9:19 PM

31 points

49 comments1 min readLW link

[Question] Has Van Horn fixed Cox’s theorem?

Haziq MuhammadAug 29, 2021, 6:36 PM

9 points

1 comment1 min readLW link

The Governance Problem and the “Pretty Good” X-Risk

Zach Stein-PerlmanAug 29, 2021, 6:00 PM

5 points

2 comments11 min readLW link

Limits of and to (artificial) Intelligence

MoritzGAug 25, 2019, 10:16 PM

1 point

3 comments7 min readLW link

Grokking the Intentional Stance

jbkjrAug 31, 2021, 3:49 PM

41 points

20 comments20 min readLW link

Intelligence, Fast and Slow

Mateusz MazurkiewiczSep 1, 2021, 7:52 PM

−3 points

2 comments2 min readLW link

[Question] Is LessWrong dead without Cox’s theorem?

Haziq MuhammadSep 4, 2021, 5:45 AM

−2 points

88 comments1 min readLW link

Alignment via manually implementing the utility function

ChantielSep 7, 2021, 8:20 PM

1 point

6 comments2 min readLW link

Pivot!

Carlos RamirezSep 12, 2021, 8:39 PM

−19 points

5 comments1 min readLW link

The Metaethics and Normative Ethics of AGI Value Alignment: Many Questions, Some Implications

Eleos Arete CitriniSep 16, 2021, 4:13 PM

6 points

0 comments8 min readLW link

Why will AI be dangerous?

LegionnaireFeb 4, 2022, 11:41 PM

37 points

14 comments1 min readLW link

Occam’s Razor and the Universal Prior

Peter ChatainOct 3, 2021, 3:23 AM

22 points

5 comments21 min readLW link

We’re Redwood Research, we do applied alignment research, AMA

Nate ThomasOct 6, 2021, 5:51 AM

56 points

3 comments2 min readLW link

(forum.effectivealtruism.org)

[LINK] Wait But Why—The AI Revolution Part 2

Adam ZernerFeb 4, 2015, 4:02 PM

27 points

88 comments1 min readLW link

Slate Star Codex Notes on the Asilomar Conference on Beneficial AI

Gunnar_ZarnckeFeb 7, 2017, 12:14 PM

24 points

8 comments1 min readLW link

(slatestarcodex.com)

Three Approaches to “Friendliness”

Wei DaiJul 17, 2013, 7:46 AM

32 points

86 comments3 min readLW link

P₂B: Plan to P₂B Better

Ramana Kumar and Daniel Kokotajlo

Oct 24, 2021, 3:21 PM

33 points

14 comments6 min readLW link

A Roadmap to a Post-Scarcity Economy

lorepieriOct 30, 2021, 9:04 AM

3 points

3 comments1 min readLW link

What is the link between altruism and intelligence?

Ruralvisitor83Nov 3, 2021, 11:59 PM

3 points

13 comments1 min readLW link

Modeling the impact of safety agendas

Ben CottierNov 5, 2021, 7:46 PM

51 points

6 comments10 min readLW link

[Question] Does anyone know what Marvin Minsky is talking about here?

delton137Nov 19, 2021, 12:56 AM

1 point

6 comments3 min readLW link

Integrating Three Models of (Human) Cognition

jbkjrNov 23, 2021, 1:06 AM

29 points

4 comments32 min readLW link

[Question] I currently translate AGI-related texts to Russian. Is that useful?

TapataktNov 27, 2021, 5:51 PM

29 points

7 comments1 min readLW link

Question/Issue with the 5/10 Problem

acgtNov 29, 2021, 10:45 AM

6 points

3 comments3 min readLW link

Can solipsism be disproven?

nx2059Dec 4, 2021, 8:24 AM

−2 points

5 comments2 min readLW link

[Question] Misc. questions about EfficientZero

Daniel KokotajloDec 4, 2021, 7:45 PM

51 points

17 comments1 min readLW link

Framing approaches to alignment and the hard problem of AI cognition

ryan_greenblattDec 15, 2021, 7:06 PM

8 points

15 comments27 min readLW link

HIRING: Inform and shape a new project on AI safety at Partnership on AI

madhu_likaDec 7, 2021, 7:37 PM

1 point

0 comments1 min readLW link

What role should evolutionary analogies play in understanding AI takeoff speeds?

anson.hoDec 11, 2021, 1:19 AM

14 points

0 comments42 min readLW link

Motivations, Natural Selection, and Curriculum Engineering

Oliver SourbutDec 16, 2021, 1:07 AM

16 points

0 comments42 min readLW link

Emergent modularity and safety

Richard_NgoOct 21, 2021, 1:54 AM

31 points

15 comments3 min readLW link

Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI

bayesian_kittenDec 16, 2021, 10:41 PM

22 points

10 comments21 min readLW link

Universality and the “Filter”

maggiehayesDec 16, 2021, 12:47 AM

10 points

3 comments11 min readLW link

[Question] Can you prove that 0 = 1?

purplelightFeb 4, 2022, 9:31 PM

−10 points

4 comments1 min readLW link

Expectations Influence Reality (and AI)

purplelightFeb 4, 2022, 9:31 PM

0 points

3 comments7 min readLW link

[Question] What questions do you have about doing work on AI safety?

peterbarnettDec 21, 2021, 4:36 PM

13 points

8 comments1 min readLW link

Reviews of “Is power-seeking AI an existential risk?”

Joe CarlsmithDec 16, 2021, 8:48 PM

76 points

20 comments1 min readLW link

Eliciting Latent Knowledge Via Hypothetical Sensors

John_MaxwellDec 30, 2021, 3:53 PM

38 points

2 comments6 min readLW link

Lateral Thinking (AI safety HPMOR fanfic)

SlytherinsMonsterJan 2, 2022, 11:50 PM

75 points

9 comments5 min readLW link

SONN : What’s Next ?

D𝜋Jan 9, 2022, 8:15 AM

−17 points

3 comments1 min readLW link

An Open Philanthropy grant proposal: Causal representation learning of human preferences

PabloAMCJan 11, 2022, 11:28 AM

19 points

6 comments8 min readLW link

Action: Help expand funding for AI Safety by coordinating on NSF response

Evan R. MurphyJan 19, 2022, 10:47 PM

23 points

8 comments3 min readLW link

Emotions = Reward Functions

jpyykkoJan 20, 2022, 6:46 PM

16 points

10 comments5 min readLW link

[Question] Is AI Alignment a pseudoscience?

mocny-chlapikJan 23, 2022, 10:32 AM

21 points

41 comments1 min readLW link

Deconfusing Deception

J BostockJan 29, 2022, 4:43 PM

26 points

6 comments2 min readLW link

Revisiting Brave New World Revisited (Chapter 3)

Justin BullockFeb 1, 2022, 5:17 PM

5 points

0 comments10 min readLW link

[Question] Do mesa-optimization problems correlate with low-slack?

sudoFeb 4, 2022, 9:11 PM

1 point

1 comment1 min readLW link

Can the laws of physics/nature prevent hell?

superads91Feb 6, 2022, 8:39 PM

−7 points

10 comments2 min readLW link

Ngo and Yudkowsky on scientific reasoning and pivotal acts

Eliezer Yudkowsky and Richard_Ngo

Feb 21, 2022, 8:54 PM

51 points

13 comments35 min readLW link

Better a Brave New World than a dead one

YitzFeb 25, 2022, 11:11 PM

8 points

5 comments4 min readLW link

Being an individual alignment grantmaker

A_donorFeb 28, 2022, 8:02 PM

64 points

5 comments2 min readLW link

How to develop safe superintelligence

martillopartMar 1, 2022, 9:57 PM

−5 points

3 comments13 min readLW link

Deep Dives: My Advice for Pursuing Work in Research

scasperMar 11, 2022, 5:56 PM

21 points

2 comments3 min readLW link

One possible approach to develop the best possible general learning algorithm

martillopartMar 14, 2022, 7:24 PM

3 points

0 comments7 min readLW link

[Question] Our time in history as evidence for simulation theory?

Garrett GarzonieMar 18, 2022, 3:35 AM

3 points

2 comments1 min readLW link

The weakest arguments for and against human level AI

Stuart_ArmstrongAug 15, 2012, 11:04 AM

22 points

34 comments1 min readLW link

Christiano and Yudkowsky on AI predictions and human intelligence

Eliezer YudkowskyFeb 23, 2022, 9:34 PM

69 points

35 comments42 min readLW link

Even more curated conversations with brilliant rationalists

spencergMar 21, 2022, 11:49 PM

57 points

0 comments15 min readLW link

Manhattan project for aligned AI

Chris van MerwijkMar 27, 2022, 11:41 AM

34 points

6 comments2 min readLW link

Gears-Level Mental Models of Transformer Interpretability

RowanWangMar 29, 2022, 8:09 PM

56 points

4 comments6 min readLW link

Meta wants to use AI to write Wikipedia articles; I am Nervous™

YitzMar 30, 2022, 7:05 PM

14 points

12 comments1 min readLW link

[Question] If AGI were coming in a year, what should we do?

MichaelStJulesApr 1, 2022, 12:41 AM

20 points

16 comments1 min readLW link

On Agent Incentives to Manipulate Human Feedback in Multi-Agent Reward Learning Scenarios

Francis Rhys WardApr 3, 2022, 6:20 PM

27 points

11 comments8 min readLW link

[Question] How to write a LW sequence to learn a topic?

PabloAMCApr 3, 2022, 8:09 PM

3 points

2 comments1 min readLW link

Save Humanity! Breed Sapient Octopuses!

Yair HalberstadtApr 5, 2022, 6:39 PM

54 points

17 comments1 min readLW link

What Should We Optimize—A Conversation

Johannes C. MayerApr 7, 2022, 3:47 AM

1 point

0 comments14 min readLW link

The Explanatory Gap of AI

David ValdmanApr 7, 2022, 6:28 PM

1 point

0 comments4 min readLW link

Progress report 3: clustering transformer neurons

Nathan Helm-BurgerApr 5, 2022, 11:13 PM

5 points

0 comments2 min readLW link

Godshatter Versus Legibility: A Fundamentally Different Approach To AI Alignment

LukeOnlineApr 9, 2022, 9:43 PM

11 points

14 comments7 min readLW link

Is Fisherian Runaway Gradient Hacking?

Ryan KiddApr 10, 2022, 1:47 PM

15 points

7 comments4 min readLW link

The Glitch And Notes On Digital Beings

GhvstApr 11, 2022, 7:46 PM

−4 points

0 comments2 min readLW link

(ghvsted.com)

Post-history is written by the martyrs

VeedracApr 11, 2022, 3:45 PM

37 points

2 comments19 min readLW link

(www.royalroad.com)

An AI-in-a-box success model

azsantoskApr 11, 2022, 10:28 PM

16 points

1 comment10 min readLW link

Rationalist Should Win. Not Dying with Dignity and Funding WBE.

CitizenTenApr 12, 2022, 2:14 AM

23 points

15 comments5 min readLW link

Reward model hacking as a challenge for reward learning

Erik JennerApr 12, 2022, 9:39 AM

25 points

1 comment9 min readLW link

Is technical AI alignment research a net positive?

cranberry_bearApr 12, 2022, 1:07 PM

4 points

2 comments2 min readLW link

Another list of theories of impact for interpretability

Beth BarnesApr 13, 2022, 1:29 PM

32 points

1 comment5 min readLW link

Some reasons why a predictor wants to be a consequentialist

Lauro LangoscoApr 15, 2022, 3:02 PM

23 points

16 comments5 min readLW link

Redwood Research is hiring for several roles (Operations and Technical)

Jessica W and billzito

Apr 14, 2022, 4:57 PM

29 points

0 comments1 min readLW link

[Question] Convince me that humanity isn’t doomed by AGI

YitzApr 15, 2022, 5:26 PM

60 points

53 comments1 min readLW link

Another argument that you will let the AI out of the box

Garrett BakerApr 19, 2022, 9:54 PM

8 points

16 comments2 min readLW link

For every choice of AGI difficulty, conditioning on gradual take-off implies shorter timelines.

Francis Rhys WardApr 21, 2022, 7:44 AM

29 points

13 comments3 min readLW link

Reflections on My Own Missing Mood

Lone PineApr 21, 2022, 4:19 PM

51 points

25 comments5 min readLW link

Key questions about artificial sentience: an opinionated guide

RobboApr 25, 2022, 12:09 PM

45 points

31 comments18 min readLW link

[Question] What is being improved in recursive self improvement?

Lone PineApr 25, 2022, 6:30 PM

7 points

7 comments1 min readLW link

Why Copilot Accelerates Timelines

Michaël TrazziApr 26, 2022, 10:06 PM

35 points

14 comments7 min readLW link

[Question] Is it desirable for the first AGI to be conscious?

Charbel-RaphaëlMay 1, 2022, 9:29 PM

5 points

12 comments1 min readLW link

[Question] What Was Your Best / Most Successful DALL-E 2 Prompt?

EvidentialMay 4, 2022, 3:16 AM

1 point

0 comments1 min readLW link

Negotiating Up and Down the Simulation Hierarchy: Why We Might Survive the Unaligned Singularity

David UdellMay 4, 2022, 4:21 AM

24 points

16 comments2 min readLW link

High-stakes alignment via adversarial training [Redwood Research report]

dmz, LawrenceC and Nate Thomas

May 5, 2022, 12:59 AM

136 points

29 comments9 min readLW link

Deriving Conditional Expected Utility from Pareto-Efficient Decisions

Thomas KwaMay 5, 2022, 3:21 AM

23 points

1 comment6 min readLW link

Transcripts of interviews with AI researchers

Vael GatesMay 9, 2022, 5:57 AM

160 points

8 comments2 min readLW link

Agency As a Natural Abstraction

Thane RuthenisMay 13, 2022, 6:02 PM

55 points

9 comments13 min readLW link

Predicting the Elections with Deep Learning—Part 1 - Results

Quentin ChenevierMay 14, 2022, 12:54 PM

0 points

0 comments1 min readLW link

On saving one’s world

Rob BensingerMay 17, 2022, 7:53 PM

190 points

5 comments1 min readLW link

In defence of flailing

acylhalideJun 18, 2022, 5:26 AM

10 points

14 comments4 min readLW link

Reshaping the AI Industry

Thane RuthenisMay 29, 2022, 10:54 PM

143 points

34 comments21 min readLW link

Science for the Possible World

Zechen ZhangMay 23, 2022, 2:01 PM

7 points

0 comments3 min readLW link

Synthetic Media and The Future of Film

ifalphaMay 24, 2022, 5:54 AM

35 points

13 comments8 min readLW link

Explaining inner alignment to myself

Jeremy GillenMay 24, 2022, 11:10 PM

9 points

2 comments10 min readLW link

A discussion of the paper, “Large Language Models are Zero-Shot Reasoners”

HiroSakurabaMay 26, 2022, 3:55 PM

7 points

0 comments4 min readLW link

On inner and outer alignment, and their confusion

Nina PanicksseryMay 26, 2022, 9:56 PM

6 points

7 comments4 min readLW link

RL with KL penalties is better seen as Bayesian inference

Tomek Korbak and Ethan Perez

May 25, 2022, 9:23 AM

90 points

15 comments12 min readLW link

Bits of Optimization Can Only Be Lost Over A Distance

johnswentworthMay 23, 2022, 6:55 PM

26 points

15 comments2 min readLW link

Gradations of Agency

Daniel KokotajloMay 23, 2022, 1:10 AM

40 points

6 comments5 min readLW link

Utilitarianism

C S SRUTHIMay 28, 2022, 7:35 PM

0 points

1 comment1 min readLW link

Distilled—AGI Safety from First Principles

Harrison GMay 29, 2022, 12:57 AM

8 points

1 comment14 min readLW link

Multiple AIs in boxes, evaluating each other’s alignment

Moebius314May 29, 2022, 8:36 AM

7 points

0 comments14 min readLW link

The impact you might have working on AI safety

Fabien RogerMay 29, 2022, 4:31 PM

5 points

1 comment4 min readLW link

My SERI MATS Application

Daniel PalekaMay 30, 2022, 2:04 AM

16 points

0 comments8 min readLW link

[Question] A terrifying variant of Boltzmann’s brains problem

Zeruel017May 30, 2022, 8:08 PM

5 points

12 comments4 min readLW link

The Reverse Basilisk

Dunning K.May 30, 2022, 11:10 PM

15 points

23 comments2 min readLW link

The Hard Intelligence Hypothesis and Its Bearing on Succession Induced Foom

DragonGodMay 31, 2022, 7:04 PM

10 points

7 comments4 min readLW link

Machines vs Memes Part 1: AI Alignment and Memetics

Harriet FarlowMay 31, 2022, 10:03 PM

16 points

0 comments6 min readLW link

[Question] What will happen when an all-reaching AGI starts attempting to fix human character flaws?

Michael BrightJun 1, 2022, 6:45 PM

1 point

6 comments1 min readLW link

New cooperation mechanism—quadratic funding without a matching pool

Filip SondejJun 5, 2022, 1:55 PM

11 points

0 comments5 min readLW link

Miriam Yevick on why both symbols and networks are necessary for artificial minds

Bill BenzonJun 6, 2022, 8:34 AM

1 point

0 comments4 min readLW link

Six Dimensions of Operational Adequacy in AGI Projects

Eliezer YudkowskyMay 30, 2022, 5:00 PM

270 points

65 comments13 min readLW link

Grokking “Forecasting TAI with biological anchors”

anson.hoJun 6, 2022, 6:58 PM

34 points

0 comments14 min readLW link

Who models the models that model models? An exploration of GPT-3’s in-context model fitting ability

LovreJun 7, 2022, 7:37 PM

112 points

14 comments9 min readLW link

Pitching an Alignment Softball

mu_(negative)Jun 7, 2022, 4:10 AM

47 points

13 comments10 min readLW link

[Question] Confused Thoughts on AI Afterlife (seriously)

EpiritoJun 7, 2022, 2:37 PM

−6 points

6 comments1 min readLW link

Transformer Research Questions from Stained Glass Windows

StefanHexJun 8, 2022, 12:38 PM

4 points

0 comments2 min readLW link

Eliciting Latent Knowledge (ELK) - Distillation/Summary

Marius HobbhahnJun 8, 2022, 1:18 PM

49 points

2 comments21 min readLW link

Towards Gears-Level Understanding of Agency

Thane RuthenisJun 16, 2022, 10:00 PM

24 points

4 comments18 min readLW link

Vael Gates: Risks from Advanced AI (June 2022)

Vael GatesJun 14, 2022, 12:54 AM

38 points

2 comments30 min readLW link

Exploring Mild Behaviour in Embedded Agents

Megan KinnimentJun 27, 2022, 6:56 PM

21 points

3 comments18 min readLW link

Operationalizing two tasks in Gary Marcus’s AGI challenge

Bill BenzonJun 9, 2022, 6:31 PM

10 points

3 comments8 min readLW link

A plausible story about AI risk.

DeLesley HutchinsJun 10, 2022, 2:08 AM

14 points

1 comment4 min readLW link

I No Longer Believe Intelligence to be “Magical”

DragonGodJun 10, 2022, 8:58 AM

31 points

34 comments6 min readLW link

[Question] Why don’t you introduce really impressive people you personally know to AI alignment (more often)?

VerdenJun 11, 2022, 3:59 PM

33 points

15 comments1 min readLW link

Godzilla Strategies

johnswentworthJun 11, 2022, 3:44 PM

151 points

65 comments3 min readLW link

Intuitive Explanation of AIXI

Thomas LarsenJun 12, 2022, 9:41 PM

13 points

0 comments5 min readLW link

Training Trace Priors

Adam JermynJun 13, 2022, 2:22 PM

12 points

17 comments4 min readLW link

Why multi-agent safety is important

Akbir KhanJun 14, 2022, 9:23 AM

8 points

2 comments10 min readLW link

Contra EY: Can AGI destroy us without trial & error?

nsokolskyJun 13, 2022, 6:26 PM

124 points

76 comments15 min readLW link

A Modest Pivotal Act

anonymousaisafetyJun 13, 2022, 7:24 PM

−15 points

1 comment5 min readLW link

OpenAI: GPT-based LLMs show ability to discriminate between its own wrong answers, but inability to explain how/why it makes that discrimination, even as model scales

Aditya JainJun 13, 2022, 11:33 PM

14 points

5 comments1 min readLW link

(openai.com)

Resources I send to AI researchers about AI safety

Vael GatesJun 14, 2022, 2:24 AM

62 points

12 comments10 min readLW link

Investigating causal understanding in LLMs

Marius Hobbhahn and Tom Lieberum

Jun 14, 2022, 1:57 PM

28 points

4 comments13 min readLW link

[Question] How Do You Quantify [Physics Interfacing] Real World Capabilities?

DragonGodJun 14, 2022, 2:49 PM

17 points

1 comment4 min readLW link

Cryptographic Life: How to transcend in a sub-lightspeed world via Homomorphic encryption

GololJun 14, 2022, 7:22 PM

1 point

0 comments3 min readLW link

Alignment Risk Doesn’t Require Superintelligence

JustisMillsJun 15, 2022, 3:12 AM

35 points

4 comments2 min readLW link

Multigate Priors

Adam JermynJun 15, 2022, 7:30 PM

4 points

0 comments3 min readLW link

Infohazards and inferential distances

acylhalideJun 16, 2022, 7:59 AM

8 points

0 comments6 min readLW link

Apply to the Machine Learning For Good bootcamp in France

Alexandre VariengienJun 17, 2022, 7:32 AM

10 points

0 comments1 min readLW link

Adaptation Executors and the Telos Margin

PlinthistJun 20, 2022, 1:06 PM

2 points

8 comments5 min readLW link

Causal confusion as an argument against the scaling hypothesis

RobertKirk and David Scott Krueger (formerly: capybaralet)

Jun 20, 2022, 10:54 AM

83 points

30 comments18 min readLW link

[Question] What is the most probable AI?

Zeruel017Jun 20, 2022, 11:26 PM

−2 points

0 comments3 min readLW link

Reflection Mechanisms as an Alignment target: A survey

Marius Hobbhahn, elandgre and Beth Barnes

Jun 22, 2022, 3:05 PM

28 points

1 comment14 min readLW link

The Limits of Automation

milkandcigarettesJun 23, 2022, 6:03 PM

5 points

1 comment5 min readLW link

(milkandcigarettes.com)

Conversation with Eliezer: What do you want the system to do?

Orpheus16Jun 25, 2022, 5:36 PM

112 points

38 comments2 min readLW link

[Yann Lecun] A Path Towards Autonomous Machine Intelligence

DragonGodJun 27, 2022, 7:24 PM

38 points

12 comments1 min readLW link

(openreview.net)

Yann LeCun, A Path Towards Autonomous Machine Intelligence [link]

Bill BenzonJun 27, 2022, 11:29 PM

5 points

1 comment1 min readLW link

Doom doubts—is inner alignment a likely problem?

CrissmanJun 28, 2022, 12:42 PM

6 points

7 comments1 min readLW link

What success looks like

Marius Hobbhahn, MaxRa, JasperGeh and Yannick_Muehlhaeuser

Jun 28, 2022, 2:38 PM

19 points

4 comments1 min readLW link

(forum.effectivealtruism.org)

Latent Adversarial Training

Adam JermynJun 29, 2022, 8:04 PM

24 points

9 comments5 min readLW link

Hedonistic Isotopes:

TrozxzrJun 30, 2022, 4:49 PM

1 point

0 comments1 min readLW link

[Question] What about transhumans and beyond?

AlignmentMirrorJul 2, 2022, 1:58 PM

7 points

6 comments1 min readLW link

New US Senate Bill on X-Risk Mitigation [Linkpost]

Evan R. MurphyJul 4, 2022, 1:25 AM

35 points

12 comments1 min readLW link

(www.hsgac.senate.gov)

When is it appropriate to use statistical models and probabilities for decision making?

Younes KamelJul 5, 2022, 12:34 PM

10 points

7 comments4 min readLW link

(youneskamel.substack.com)

How humanity would respond to slow takeoff, with takeaways from the entire COVID-19 pandemic

Noosphere89Jul 6, 2022, 5:52 PM

4 points

1 comment2 min readLW link

Four Societal Interventions to Improve our AGI Position

Rafael CosmanJul 6, 2022, 6:32 PM

−6 points

2 comments6 min readLW link

(rafaelcosman.com)

Deep neural networks are not opaque.

jem-mosigJul 6, 2022, 6:03 PM

22 points

14 comments3 min readLW link

Cooperation with and between AGI’s

PeterMcCluskeyJul 7, 2022, 4:45 PM

10 points

3 comments10 min readLW link

(www.bayesianinvestor.com)

Making it harder for an AGI to “trick” us, with STVs

Tor Økland BarstadJul 9, 2022, 2:42 PM

14 points

5 comments22 min readLW link

Grouped Loss may disfavor discontinuous capabilities

Adam JermynJul 9, 2022, 5:22 PM

14 points

2 comments4 min readLW link

We are now at the point of deepfake job interviews

trevorJul 10, 2022, 3:37 AM

6 points

0 comments1 min readLW link

(www.businessinsider.com)

Acceptability Verification: A Research Agenda

David Udell and evhub

Jul 12, 2022, 8:11 PM

43 points

0 comments1 min readLW link

(docs.google.com)

Finding Skeletons on Rashomon Ridge

David Udell, Peter S. Park and NickyP

Jul 24, 2022, 10:31 PM

30 points

2 comments7 min readLW link

A note about differential technological development

So8resJul 15, 2022, 4:46 AM

178 points

31 comments6 min readLW link

How Interpretability can be Impactful

Connall GarrodJul 18, 2022, 12:06 AM

18 points

0 comments37 min readLW link

AI Hiroshima (Does A Vivid Example Of Destruction Forestall Apocalypse?)

SableJul 18, 2022, 12:06 PM

4 points

4 comments2 min readLW link

Bounded complexity of solving ELK and its implications

Rubi J. HudsonJul 19, 2022, 6:56 AM

10 points

4 comments18 min readLW link

Abram Demski’s ELK thoughts and proposal—distillation

Rubi J. HudsonJul 19, 2022, 6:57 AM

15 points

4 comments16 min readLW link

Help ARC evaluate capabilities of current language models (still need people)

Beth BarnesJul 19, 2022, 4:55 AM

94 points

6 comments2 min readLW link

A Critique of AI Alignment Pessimism

ExCephJul 19, 2022, 2:28 AM

8 points

1 comment9 min readLW link

Modelling Deception

Garrett BakerJul 18, 2022, 9:21 PM

15 points

0 comments7 min readLW link

Enlightenment Values in a Vulnerable World

Maxwell TabarrokJul 20, 2022, 7:52 PM

15 points

6 comments31 min readLW link

(maximumprogress.substack.com)

AI Safety Cheatsheet / Quick Reference

Zohar JacksonJul 20, 2022, 9:39 AM

3 points

0 comments1 min readLW link

(github.com)

Countering arguments against working on AI safety

Rauno ArikeJul 20, 2022, 6:23 PM

6 points

2 comments7 min readLW link

Why AGI Timeline Research/Discourse Might Be Overrated

Noosphere89Jul 20, 2022, 8:26 PM

5 points

0 comments1 min readLW link

(forum.effectivealtruism.org)

Connor Leahy on Dying with Dignity, EleutherAI and Conjecture

Michaël TrazziJul 22, 2022, 6:44 PM

176 points

29 comments14 min readLW link

(theinsideview.ai)

Brainstorm of things that could force an AI team to burn their lead

So8resJul 24, 2022, 11:58 PM

103 points

4 comments13 min readLW link

Alignment being impossible might be better than it being really difficult

Martín SotoJul 25, 2022, 11:57 PM

12 points

2 comments2 min readLW link

AI ethics vs AI alignment

Wei DaiJul 26, 2022, 1:08 PM

4 points

1 comment1 min readLW link

NeurIPS ML Safety Workshop 2022

Dan HJul 26, 2022, 3:28 PM

72 points

2 comments1 min readLW link

(neurips2022.mlsafety.org)

Quantum Advantage in Learning from Experiments

Dennis TowneJul 27, 2022, 3:49 PM

5 points

5 comments1 min readLW link

(ai.googleblog.com)

AGI ruin scenarios are likely (and disjunctive)

So8resJul 27, 2022, 3:21 AM

148 points

37 comments6 min readLW link

A Quick Note on AI Scaling Asymptotes

alyssavanceMay 25, 2022, 2:55 AM

43 points

6 comments1 min readLW link

[Question] How likely do you think worse-than-extinction type fates to be?

span1Aug 1, 2022, 4:08 AM

3 points

3 comments1 min readLW link

[Question] I want to donate some money (not much, just what I can afford) to AGI Alignment research, to whatever organization has the best chance of making sure that AGI goes well and doesn’t kill us all. What are my best options, where can I make the most difference per dollar?

lumenwritesAug 2, 2022, 12:08 PM

15 points

9 comments1 min readLW link

Law-Following AI 4: Don’t Rely on Vicarious Liability

CullenAug 2, 2022, 11:26 PM

5 points

2 comments3 min readLW link

Externalized reasoning oversight: a research direction for language model alignment

tameraAug 3, 2022, 12:03 PM

103 points

22 comments6 min readLW link

Transformer language models are doing something more general

NumendilAug 3, 2022, 9:13 PM

44 points

6 comments2 min readLW link

Three pillars for avoiding AGI catastrophe: Technical alignment, deployment decisions, and coordination

LintzAAug 3, 2022, 11:15 PM

17 points

0 comments12 min readLW link

Surprised by ELK report’s counterexample to Debate, IDA

Evan R. MurphyAug 4, 2022, 2:12 AM

18 points

0 comments5 min readLW link

Bias towards simple functions; application to alignment?

DavidHolmesAug 18, 2022, 4:15 PM

3 points

7 comments2 min readLW link

What do ML researchers think about AI in 2022?

KatjaGraceAug 4, 2022, 3:40 PM

217 points

33 comments3 min readLW link

(aiimpacts.org)

Deontology and Tool AI

Nathan1123Aug 5, 2022, 5:20 AM

4 points

5 comments6 min readLW link

Bridging Expected Utility Maximization and Optimization

Daniel HerrmannAug 5, 2022, 8:18 AM

23 points

5 comments14 min readLW link

Counterfactuals are Confusing because of an Ontological Shift

Chris_LeongAug 5, 2022, 7:03 PM

17 points

35 comments2 min readLW link

A Data limited future

Donald HobsonAug 6, 2022, 2:56 PM

52 points

25 comments2 min readLW link

A Community for Understanding Consciousness: Raising r/MathPie

NavjotツAug 7, 2022, 8:17 AM

−12 points

0 comments3 min readLW link

(www.reddit.com)

Complexity No Bar to AI (Or, why Computational Complexity matters less than you think for real life problems)

Noosphere89Aug 7, 2022, 7:55 PM

17 points

14 comments3 min readLW link

(www.gwern.net)

A sufficiently paranoid paperclip maximizer

RomanSAug 8, 2022, 11:17 AM

17 points

10 comments2 min readLW link

Steganography in Chain of Thought Reasoning

A RayAug 8, 2022, 3:47 AM

49 points

13 comments6 min readLW link

Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

johnswentworthAug 8, 2022, 6:05 PM

111 points

8 comments3 min readLW link

How (not) to choose a research project

Garrett Baker, CatGoddess and Johannes C. Mayer

Aug 9, 2022, 12:26 AM

76 points

11 comments7 min readLW link

Team Shard Status Report

David UdellAug 9, 2022, 5:33 AM

38 points

8 comments3 min readLW link

[Question] How would two superintelligent AIs interact, if they are unaligned with each other?

Nathan1123Aug 9, 2022, 6:58 PM

4 points

6 comments1 min readLW link

The Host Minds of HBO’s Westworld.

NerretAug 12, 2022, 6:53 PM

1 point

0 comments3 min readLW link

Anti-squatted AI x-risk domains index

plexAug 12, 2022, 12:01 PM

50 points

3 comments1 min readLW link

The Dumbest Possible Gets There First

ArtaxerxesAug 13, 2022, 10:20 AM

35 points

7 comments2 min readLW link

[Question] The OpenAI playground for GPT-3 is a terrible interface. Is there any great local (or web) app for exploring/learning with language models?

avivAug 13, 2022, 4:34 PM

2 points

1 comment1 min readLW link

I missed the crux of the alignment problem the whole time

zeshenAug 13, 2022, 10:11 AM

53 points

7 comments3 min readLW link

An Uncanny Prison

Nathan1123Aug 13, 2022, 9:40 PM

3 points

3 comments2 min readLW link

[Question] What is the probability that a superintelligent, sentient AGI is actually infeasible?

Nathan1123Aug 14, 2022, 10:41 PM

−3 points

6 comments1 min readLW link

Reinforcement Learning Goal Misgeneralization: Can we guess what kind of goals are selected by default?

StefanHex and Julian_R

Oct 25, 2022, 8:48 PM

9 points

1 comment4 min readLW link

What’s General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?

johnswentworthAug 15, 2022, 10:48 PM

103 points

15 comments10 min readLW link

Discovering Agents

zac_kentonAug 18, 2022, 5:33 PM

56 points

8 comments6 min readLW link

Interpretability Tools Are an Attack Channel

Thane RuthenisAug 17, 2022, 6:47 PM

42 points

22 comments1 min readLW link

Conditioning, Prompts, and Fine-Tuning

Adam JermynAug 17, 2022, 8:52 PM

32 points

9 comments4 min readLW link

Debate AI and the Decision to Release an AI

Chris_LeongJan 17, 2019, 2:36 PM

9 points

18 comments3 min readLW link

What’s the Least Impressive Thing GPT-4 Won’t be Able to Do

AlgonAug 20, 2022, 7:48 PM

75 points

80 comments1 min readLW link

The Alignment Problem Needs More Positive Fiction

NetcentricaAug 21, 2022, 10:01 PM

4 points

2 comments5 min readLW link

AI alignment as “navigating the space of intelligent behaviour”

Nora_AmmannAug 23, 2022, 1:28 PM

18 points

0 comments6 min readLW link

AGI Timelines Are Mostly Not Strategically Relevant To Alignment

johnswentworthAug 23, 2022, 8:15 PM

44 points

35 comments1 min readLW link

[Question] Would you ask a genie to give you the solution to alignment?

sudoAug 24, 2022, 1:29 AM

6 points

1 comment1 min readLW link

Ethan Perez on the Inverse Scaling Prize, Language Feedback and Red Teaming

Michaël TrazziAug 24, 2022, 4:35 PM

25 points

0 comments3 min readLW link

(theinsideview.ai)

Preparing for the apocalypse might help prevent it

OcracokeAug 25, 2022, 12:18 AM

1 point

1 comment1 min readLW link

Your posts should be on arXiv

JanBAug 25, 2022, 10:35 AM

136 points

39 comments3 min readLW link

The Solomonoff prior is malign. It’s not a big deal.

Charlie SteinerAug 25, 2022, 8:25 AM

38 points

9 comments7 min readLW link

AI strategy nearcasting

HoldenKarnofskyAug 25, 2022, 5:26 PM

79 points

3 comments9 min readLW link

Common misconceptions about OpenAI

Jacob_HiltonAug 25, 2022, 2:02 PM

226 points

138 comments5 min readLW link

AI Risk in Terms of Unstable Nuclear Software

Thane RuthenisAug 26, 2022, 6:49 PM

29 points

1 comment6 min readLW link

What’s the Most Impressive Thing That GPT-4 Could Plausibly Do?

bayesedAug 26, 2022, 3:34 PM

23 points

24 comments1 min readLW link

Taking the parameters which seem to matter and rotating them until they don’t

Garrett BakerAug 26, 2022, 6:26 PM

117 points

48 comments1 min readLW link

Annual AGI Benchmarking Event

Lawrence PhillipsAug 27, 2022, 12:06 AM

24 points

3 comments2 min readLW link

(www.metaculus.com)

Is there a benefit in low capability AI Alignment research?

LettiAug 26, 2022, 11:51 PM

1 point

1 comment2 min readLW link

Help Understanding Preferences And Evil

NetcentricaAug 27, 2022, 3:42 AM

6 points

7 comments2 min readLW link

Solving Alignment by “solving” semantics

Q HomeAug 27, 2022, 4:17 AM

15 points

10 comments26 min readLW link

An Introduction to Current Theories of Consciousness

hohenheimAug 28, 2022, 5:55 PM

59 points

44 comments49 min readLW link

New Canada AI Safety & Governance community

Wyatt Tessari L'AlliéAug 29, 2022, 6:45 PM

21 points

0 comments1 min readLW link

Are Generative World Models a Mesa-Optimization Risk?

Thane RuthenisAug 29, 2022, 6:37 PM

12 points

2 comments3 min readLW link

How might we align transformative AI if it’s developed very soon?

HoldenKarnofskyAug 29, 2022, 3:42 PM

107 points

17 comments45 min readLW link

Worlds Where Iterative Design Fails

johnswentworthAug 30, 2022, 8:48 PM

144 points

26 comments10 min readLW link

[Question] How might we make better use of AI capabilities research for alignment purposes?

Jemal YoungAug 31, 2022, 4:19 AM

11 points

4 comments1 min readLW link

ML Model Attribution Challenge [Linkpost]

aogAug 30, 2022, 7:34 PM

11 points

0 comments1 min readLW link

(mlmac.io)

I Tripped and Became GPT! (And How This Updated My Timelines)

FrankophoneSep 1, 2022, 5:56 PM

31 points

0 comments4 min readLW link

[Question] Can someone explain to me why most researchers think alignment is probably something that is humanly tractable?

iamthouthouartiSep 3, 2022, 1:12 AM

32 points

11 comments1 min readLW link

An Update on Academia vs. Industry (one year into my faculty job)

David Scott Krueger (formerly: capybaralet)Sep 3, 2022, 8:43 PM

118 points

18 comments4 min readLW link

Framing AI Childhoods

David UdellSep 6, 2022, 11:40 PM

37 points

8 comments4 min readLW link

A Game About AI Alignment (& Meta-Ethics): What Are the Must Haves?

JonathanErhardtSep 5, 2022, 7:55 AM

18 points

13 comments2 min readLW link

Is training data going to be diluted by AI-generated content?

Hannes ThurnherrSep 7, 2022, 6:13 PM

10 points

7 comments1 min readLW link

Turning WhatsApp Chat Data into Prompt-Response Form for Fine-Tuning

casualphysicsenjoyerSep 8, 2022, 8:05 PM

1 point

0 comments1 min readLW link

[An email with a bunch of links I sent an experienced ML researcher interested in learning about Alignment / x-safety.]

David Scott Krueger (formerly: capybaralet)Sep 8, 2022, 10:28 PM

46 points

1 comment5 min readLW link

Monitoring for deceptive alignment

evhubSep 8, 2022, 11:07 PM

118 points

7 comments9 min readLW link

Samotsvety’s AI risk forecasts

eliflandSep 9, 2022, 4:01 AM

44 points

0 comments4 min readLW link

Ought will host a factored cognition “Lab Meeting”

jungofthewon and stuhlmueller

Sep 9, 2022, 11:46 PM

35 points

1 comment1 min readLW link

AI Risk Intro 1: Advanced AI Might Be Very Bad

CallumMcDougall and L Rudolf L

Sep 11, 2022, 10:57 AM

43 points

13 comments30 min readLW link

An investigation into when agents may be incentivized to manipulate our beliefs.

Felix HofstätterSep 13, 2022, 5:08 PM

15 points

0 comments14 min readLW link

Risk aversion and GPT-3

casualphysicsenjoyerSep 13, 2022, 8:50 PM

1 point

0 comments1 min readLW link

[Question] Would a Misaligned SSI Really Kill Us All?

DragonGodSep 14, 2022, 12:15 PM

6 points

7 comments6 min readLW link

[Question] Why Do People Think Humans Are Stupid?

DragonGodSep 14, 2022, 1:55 PM

21 points

39 comments3 min readLW link

Precise P(doom) isn’t very important for prioritization or strategy

harsimonySep 14, 2022, 5:19 PM

18 points

6 comments1 min readLW link

Coordinate-Free Interpretability Theory

johnswentworthSep 14, 2022, 11:33 PM

41 points

14 comments5 min readLW link

Capability and Agency as Cornerstones of AI risk — My current model

wilmSep 15, 2022, 8:25 AM

10 points

4 comments12 min readLW link

[Question] Are Human Brains Universal?

DragonGodSep 15, 2022, 3:15 PM

16 points

28 comments5 min readLW link

Should AI learn human values, human norms or something else?

Q HomeSep 17, 2022, 6:19 AM

5 points

2 comments4 min readLW link

The ELK Framing I’ve Used

sudoSep 19, 2022, 10:28 AM

4 points

1 comment1 min readLW link

[Question] If we have Human-level chatbots, won’t we end up being ruled by possible people?

Erlja Jkdf.Sep 20, 2022, 1:59 PM

5 points

13 comments1 min readLW link

Character alignment

p.b.Sep 20, 2022, 8:27 AM

22 points

0 comments2 min readLW link

Cryptocurrency Exploits Show the Importance of Proactive Policies for AI X-Risk

eSpencerSep 20, 2022, 5:53 PM

1 point

0 comments4 min readLW link

Doing oversight from the very start of training seems hard

peterbarnettSep 20, 2022, 5:21 PM

14 points

3 comments3 min readLW link

Trends in Training Dataset Sizes

Pablo VillalobosSep 21, 2022, 3:47 PM

24 points

2 comments5 min readLW link

(epochai.org)

Two reasons we might be closer to solving alignment than it seems

KatWoods and AmberDawn

Sep 24, 2022, 8:00 PM

56 points

9 comments4 min readLW link

Funding is All You Need: Getting into Grad School by Hacking the NSF GRFP Fellowship

hapaninSep 22, 2022, 9:39 PM

93 points

9 comments12 min readLW link

[Question] Papers to start getting into NLP-focused alignment research

FeraidoonSep 24, 2022, 11:53 PM

6 points

0 comments1 min readLW link

How to Study Unsafe AGI’s safely (and why we might have no choice)

PunoxysmMar 7, 2014, 7:24 AM

10 points

47 comments5 min readLW link

On Generality

Eris DiscordiaSep 26, 2022, 4:06 AM

2 points

0 comments5 min readLW link

Oren’s Field Guide of Bad AGI Outcomes

Eris DiscordiaSep 26, 2022, 4:06 AM

0 points

0 comments1 min readLW link

Summary of ML Safety Course

zeshenSep 27, 2022, 1:05 PM

6 points

0 comments6 min readLW link

My Thoughts on the ML Safety Course

zeshenSep 27, 2022, 1:15 PM

49 points

3 comments17 min readLW link

Reward IS the Optimization Target

CarnSep 28, 2022, 5:59 PM

−1 points

3 comments5 min readLW link

A Library and Tutorial for Factored Cognition with Language Models

stuhlmueller, Luke Stebbing, justin_dan and goodgravy

Sep 28, 2022, 6:15 PM

47 points

0 comments1 min readLW link

Will Values and Competition Decouple?

intersticeSep 28, 2022, 4:27 PM

15 points

11 comments17 min readLW link

Make-A-Video by Meta AI

P.Sep 29, 2022, 5:07 PM

9 points

4 comments1 min readLW link

(makeavideo.studio)

Open application to become an AI safety project mentor

Charbel-RaphaëlSep 29, 2022, 11:27 AM

7 points

0 comments1 min readLW link

(docs.google.com)

It matters when the first sharp left turn happens

Adam JermynSep 29, 2022, 8:12 PM

35 points

9 comments4 min readLW link

Eli’s review of “Is power-seeking AI an existential risk?”

eliflandSep 30, 2022, 12:21 PM

58 points

0 comments3 min readLW link

(docs.google.com)

[Question] Rank the following based on likelihood to nullify AI-risk

AorouSep 30, 2022, 11:15 AM

3 points

1 comment4 min readLW link

Distribution Shifts and The Importance of AI Safety

Leon LangSep 29, 2022, 10:38 PM

17 points

2 comments12 min readLW link

[Question] What Is the Idea Behind (Un-)Supervised Learning and Reinforcement Learning?

MorpheusSep 30, 2022, 4:48 PM

9 points

6 comments2 min readLW link

(Structural) Stability of Coupled Optimizers

Paul BricmanSep 30, 2022, 11:28 AM

25 points

0 comments10 min readLW link

Where I currently disagree with Ryan Greenblatt’s version of the ELK approach

So8resSep 29, 2022, 9:18 PM

63 points

7 comments5 min readLW link

Paper: Large Language Models Can Self-improve [Linkpost]

Evan R. MurphyOct 2, 2022, 1:29 AM

52 points

14 comments1 min readLW link

(openreview.net)

[Question] Is there a culture overhang?

Aleksi LiimatainenOct 3, 2022, 7:26 AM

18 points

4 comments1 min readLW link

Visualizing Learned Representations of Rice Disease

muhia_beeOct 3, 2022, 9:09 AM

7 points

0 comments4 min readLW link

(indecisive-sand-24a.notion.site)

If you want to learn technical AI safety, here’s a list of AI safety courses, reading lists, and resources

KatWoodsOct 3, 2022, 12:43 PM

12 points

3 comments1 min readLW link

Frontline of AGI Alignment

SD MarlowOct 4, 2022, 3:47 AM

−10 points

0 comments1 min readLW link

(robothouse.substack.com)

Humans aren’t fitness maximizers

So8resOct 4, 2022, 1:31 AM

52 points

45 comments5 min readLW link

Smoke without fire is scary

Adam JermynOct 4, 2022, 9:08 PM

49 points

22 comments4 min readLW link

CHAI, Assistance Games, And Fully-Updated Deference [Scott Alexander]

lberglundOct 4, 2022, 5:04 PM

21 points

1 comment17 min readLW link

(astralcodexten.substack.com)

Generative, Episodic Objectives for Safe AI

Michael GlassOct 5, 2022, 11:18 PM

11 points

3 comments8 min readLW link

[Linkpost] “Blueprint for an AI Bill of Rights”—Office of Science and Technology Policy, USA (2022)

T431Oct 5, 2022, 4:42 PM

8 points

4 comments2 min readLW link

(www.whitehouse.gov)

The Answer

Alex BeymanOct 5, 2022, 9:23 PM

−3 points

0 comments4 min readLW link

The probability that Artificial General Intelligence will be developed by 2043 is extremely low.

cveresOct 6, 2022, 6:05 PM

−14 points

8 comments1 min readLW link

The Shape of Things to Come

Alex BeymanOct 7, 2022, 4:11 PM

12 points

3 comments8 min readLW link

The Slow Reveal

Alex BeymanOct 9, 2022, 3:16 AM

3 points

0 comments24 min readLW link

What does it mean for an AGI to be ‘safe’?

So8resOct 7, 2022, 4:13 AM

72 points

32 comments3 min readLW link

Boolean Primitives for Coupled Optimizers

Paul BricmanOct 7, 2022, 6:02 PM

9 points

0 comments8 min readLW link

Analysis: US restricts GPU sales to China

aogOct 7, 2022, 6:38 PM

94 points

58 comments5 min readLW link

[Question] Broken Links for the Audio Version of 2021 MIRI Conversations

KriegerOct 8, 2022, 4:16 PM

1 point

1 comment1 min readLW link

Don’t leave your fingerprints on the future

So8resOct 8, 2022, 12:35 AM

93 points

32 comments5 min readLW link

Let’s talk about uncontrollable AI

Karl von WendtOct 9, 2022, 10:34 AM

12 points

6 comments3 min readLW link

Lessons learned from talking to >100 academics about AI safety

Marius HobbhahnOct 10, 2022, 1:16 PM

207 points

16 comments12 min readLW link

When reporting AI timelines, be clear who you’re (not) deferring to

Sam ClarkeOct 10, 2022, 2:24 PM

37 points

3 comments1 min readLW link

Natural Categories Update

Logan ZoellnerOct 10, 2022, 3:19 PM

29 points

6 comments2 min readLW link

Updates and Clarifications

SD MarlowOct 11, 2022, 5:34 AM

−5 points

1 comment1 min readLW link

My argument against AGI

cveresOct 12, 2022, 6:33 AM

3 points

5 comments1 min readLW link

Instrumental convergence in single-agent systems

Edouard Harris and simonsdsuo

Oct 12, 2022, 12:24 PM

27 points

4 comments8 min readLW link

(www.gladstone.ai)

A strange twist on the road to AGI

cveresOct 12, 2022, 11:27 PM

−8 points

0 comments1 min readLW link

Perfect Enemy

Alex BeymanOct 13, 2022, 8:23 AM

−2 points

0 comments46 min readLW link

A stubborn unbeliever finally gets the depth of the AI alignment problem

aelwoodOct 13, 2022, 3:16 PM

17 points

8 comments3 min readLW link

(pursuingreality.substack.com)

Misalignment-by-default in multi-agent systems

Edouard Harris and simonsdsuo

Oct 13, 2022, 3:38 PM

17 points

8 comments20 min readLW link

(www.gladstone.ai)

Niceness is unnatural

So8resOct 13, 2022, 1:30 AM

98 points

18 comments8 min readLW link

The Vitalik Buterin Fellowship in AI Existential Safety is open for applications!

Xin Chen, CynthiaOct 13, 2022, 6:32 PM

21 points

0 comments1 min readLW link

Greed Is the Root of This Evil

Thane RuthenisOct 13, 2022, 8:40 PM

21 points

4 comments8 min readLW link

Contra shard theory, in the context of the diamond maximizer problem

So8resOct 13, 2022, 11:51 PM

84 points

16 comments2 min readLW link

Anthropomorphic AI and Sandboxed Virtual Universes

jacob_cannellSep 3, 2010, 7:02 PM

4 points

124 comments5 min readLW link

Instrumental convergence: scale and physical interactions

Edouard Harris and simonsdsuo

Oct 14, 2022, 3:50 PM

15 points

0 comments17 min readLW link

(www.gladstone.ai)

Provably Honest—A First Step

Srijanak DeNov 5, 2022, 7:18 PM

10 points

2 comments8 min readLW link

They gave LLMs access to physics simulators

ryan_bOct 17, 2022, 9:21 PM

50 points

18 comments1 min readLW link

(arxiv.org)

Decision theory does not imply that we get to have nice things

So8resOct 18, 2022, 3:04 AM

142 points

53 comments26 min readLW link

[Question] How easy is it to supervise processes vs outcomes?

Noosphere89Oct 18, 2022, 5:48 PM

3 points

0 comments1 min readLW link

How To Make Prediction Markets Useful For Alignment Work

johnswentworthOct 18, 2022, 7:01 PM

86 points

18 comments2 min readLW link

The reward function is already how well you manipulate humans

KerryOct 19, 2022, 1:52 AM

20 points

9 comments2 min readLW link

Cooperators are more powerful than agents

Ivan VendrovOct 21, 2022, 8:02 PM

14 points

7 comments3 min readLW link

Logical Decision Theories: Our final failsafe?

Noosphere89Oct 25, 2022, 12:51 PM

−6 points

8 comments1 min readLW link

(www.lesswrong.com)

[Question] Simple question about corrigibility and values in AI.

jmhOct 22, 2022, 2:59 AM

6 points

1 comment1 min readLW link

Newsletter for Alignment Research: The ML Safety Updates

Esben KranOct 22, 2022, 4:17 PM

14 points

0 comments1 min readLW link

“Originality is nothing but judicious imitation”—Voltaire

VestoziaOct 23, 2022, 7:00 PM

0 points

0 comments13 min readLW link

AI researchers announce NeuroAI agenda

Cameron BergOct 24, 2022, 12:14 AM

37 points

12 comments6 min readLW link

(arxiv.org)

AGI in our lifetimes is wishful thinking

niknobleOct 24, 2022, 11:53 AM

−4 points

21 comments8 min readLW link

question-answer counterfactual intervals

Tamsin LeakeOct 24, 2022, 1:08 PM

8 points

0 comments4 min readLW link

(carado.moe)

Why some people believe in AGI, but I don’t.

cveresOct 26, 2022, 3:09 AM

−15 points

6 comments1 min readLW link

[Question] Is the Orthogonality Thesis true for humans?

Noosphere89Oct 27, 2022, 2:41 PM

12 points

18 comments1 min readLW link

Worldview iPeople—Future Fund’s AI Worldview Prize

Toni MUENDELOct 28, 2022, 1:53 AM

−22 points

4 comments9 min readLW link

Causal scrubbing: Appendix

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

Dec 3, 2022, 12:58 AM

16 points

0 comments20 min readLW link

Beyond Kolmogorov and Shannon

Alexander Gietelink Oldenziel and Adam Shai

Oct 25, 2022, 3:13 PM

60 points

14 comments5 min readLW link

Method of statements: an alternative to taboo

Q HomeNov 16, 2022, 10:57 AM

7 points

0 comments41 min readLW link

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

Dec 3, 2022, 12:58 AM

130 points

9 comments20 min readLW link

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

RowanWang, Alexandre Variengien, Arthur Conmy, Buck and jsteinhardt

Oct 28, 2022, 11:55 PM

86 points

5 comments9 min readLW link

(arxiv.org)

Causal scrubbing: results on a paren balance checker

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

Dec 3, 2022, 12:59 AM

26 points

0 comments30 min readLW link

AI as a Civilizational Risk Part 1/6: Historical Priors

PashaKamyshevOct 29, 2022, 9:59 PM

2 points

2 comments7 min readLW link

AI as a Civilizational Risk Part 2/6: Behavioral Modification

PashaKamyshevOct 30, 2022, 4:57 PM

9 points

0 comments10 min readLW link

AI as a Civilizational Risk Part 3/6: Anti-economy and Signal Pollution

PashaKamyshevOct 31, 2022, 5:03 PM

7 points

4 comments14 min readLW link

AI as a Civilizational Risk Part 4/6: Bioweapons and Philosophy of Modification

PashaKamyshevNov 1, 2022, 8:50 PM

7 points

1 comment8 min readLW link

AI as a Civilizational Risk Part 5/6: Relationship between C-risk and X-risk

PashaKamyshevNov 3, 2022, 2:19 AM

2 points

0 comments7 min readLW link

AI as a Civilizational Risk Part 6/6: What can be done

PashaKamyshevNov 3, 2022, 7:48 PM

2 points

3 comments4 min readLW link

Am I secretly excited for AI getting weird?

porbyOct 29, 2022, 10:16 PM

98 points

4 comments4 min readLW link

“Normal” is the equilibrium state of past optimization processes

Alex_AltairOct 30, 2022, 7:03 PM

77 points

5 comments5 min readLW link

love, not competition

Tamsin LeakeOct 30, 2022, 7:44 PM

31 points

20 comments1 min readLW link

(carado.moe)

My (naive) take on Risks from Learned Optimization

Artyom KarpovOct 31, 2022, 10:59 AM

7 points

0 comments5 min readLW link

Embedding safety in ML development

zeshenOct 31, 2022, 12:27 PM

24 points

1 comment18 min readLW link

Auditing games for high-level interpretability

Paul CologneseNov 1, 2022, 10:44 AM

28 points

1 comment7 min readLW link

publishing alignment research and infohazards

Tamsin LeakeOct 31, 2022, 6:02 PM

69 points

10 comments1 min readLW link

(carado.moe)

Caution when interpreting Deepmind’s In-context RL paper

Sam MarksNov 1, 2022, 2:42 AM

104 points

6 comments4 min readLW link

AGI and the future: Is a future with AGI and humans alive evidence that AGI is not a threat to our existence?

LetUsTalkNov 1, 2022, 7:37 AM

4 points

8 comments1 min readLW link

Threat Model Literature Review

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

Nov 1, 2022, 11:03 AM

55 points

4 comments25 min readLW link

Clarifying AI X-risk

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

Nov 1, 2022, 11:03 AM

102 points

23 comments4 min readLW link

a casual intro to AI doom and alignment

Tamsin LeakeNov 1, 2022, 4:38 PM

12 points

0 comments4 min readLW link

(carado.moe)

[Question] Which Issues in Conceptual Alignment have been Formalised or Observed (or not)?

ojorgensenNov 1, 2022, 10:32 PM

4 points

0 comments1 min readLW link

Questions about Value Lock-in, Paternalism, and Empowerment

Sam F. BrownNov 16, 2022, 3:33 PM

12 points

2 comments12 min readLW link

(sambrown.eu)

Why do we post our AI safety plans on the Internet?

Peter S. ParkNov 3, 2022, 4:02 PM

3 points

4 comments11 min readLW link

Mechanistic Interpretability as Reverse Engineering (follow-up to “cars and elephants”)

David Scott Krueger (formerly: capybaralet)Nov 3, 2022, 11:19 PM

28 points

3 comments1 min readLW link

[Question] Are alignment researchers devoting enough time to improving their research capacity?

Carson JonesNov 4, 2022, 12:58 AM

13 points

3 comments3 min readLW link

[Question] Don’t you think RLHF solves outer alignment?

Charbel-RaphaëlNov 4, 2022, 12:36 AM

2 points

19 comments1 min readLW link

A newcomer’s guide to the technical AI safety field

zeshenNov 4, 2022, 2:29 PM

30 points

1 comment10 min readLW link

Toy Models and Tegum Products

Adam JermynNov 4, 2022, 6:51 PM

27 points

7 comments5 min readLW link

For ELK truth is mostly a distraction

c.troutNov 4, 2022, 9:14 PM

32 points

0 comments21 min readLW link

Interpreting systems as solving POMDPs: a step towards a formal understanding of agency [paper link]

the gears to ascensionNov 5, 2022, 1:06 AM

12 points

2 comments1 min readLW link

(www.semanticscholar.org)

When can a mimic surprise you? Why generative models handle seemingly ill-posed problems

David JohnstonNov 5, 2022, 1:19 PM

8 points

4 comments16 min readLW link

The Slippery Slope from DALLE-2 to Deepfake Anarchy

scasperNov 5, 2022, 2:53 PM

16 points

9 comments11 min readLW link

[Question] Can we get around Godel’s Incompleteness theorems and Turing undecidable problems via infinite computers?

Noosphere89Nov 5, 2022, 6:01 PM

−10 points

12 comments1 min readLW link

Recommend HAIST resources for assessing the value of RLHF-related alignment research

Sam Marks and Xander Davies

Nov 5, 2022, 8:58 PM

26 points

9 comments3 min readLW link

[Question] Has anyone increased their AGI timelines?

Darren McKeeNov 6, 2022, 12:03 AM

38 points

13 comments1 min readLW link

Applying superintelligence without collusion

Eric DrexlerNov 8, 2022, 6:08 PM

88 points

57 comments4 min readLW link

A philosopher’s critique of RLHF

TW123Nov 7, 2022, 2:42 AM

55 points

8 comments2 min readLW link

4 Key Assumptions in AI Safety

PrometheusNov 7, 2022, 10:50 AM

20 points

5 comments7 min readLW link

Hacker-AI – Does it already exist?

Erland WittkotterNov 7, 2022, 2:01 PM

3 points

11 comments11 min readLW link

Loss of control of AI is not a likely source of AI x-risk

squekNov 7, 2022, 6:44 PM

−6 points

0 comments5 min readLW link

Mysteries of mode collapse

janusNov 8, 2022, 10:37 AM

213 points

35 comments14 min readLW link

Some advice on independent research

Marius HobbhahnNov 8, 2022, 2:46 PM

41 points

4 comments10 min readLW link

A first success story for Outer Alignment: InstructGPT

Noosphere89Nov 8, 2022, 10:52 PM

6 points

1 comment1 min readLW link

(openai.com)

A caveat to the Orthogonality Thesis

Wuschel SchulzNov 9, 2022, 3:06 PM

36 points

10 comments2 min readLW link

Trying to Make a Treacherous Mesa-Optimizer

MadHatterNov 9, 2022, 6:07 PM

87 points

13 comments4 min readLW link

(attentionspan.blog)

Is full self-driving an AGI-complete problem?

kraemahzNov 10, 2022, 2:04 AM

5 points

5 comments1 min readLW link

The harnessing of complexity

geduardoNov 10, 2022, 6:44 PM

6 points

2 comments3 min readLW link

[Question] Is there a demo of “You can’t fetch the coffee if you’re dead”?

Ram RachumNov 10, 2022, 6:41 PM

8 points

9 comments1 min readLW link

LessWrong Poll on AGI

Niclas KupperNov 10, 2022, 1:13 PM

12 points

6 comments1 min readLW link

Value Formation: An Overarching Model

Thane RuthenisNov 15, 2022, 5:16 PM

27 points

9 comments34 min readLW link

[simulation] 4chan user claiming to be the attorney hired by Google’s sentient chatbot LaMDA shares wild details of encounter

janusNov 10, 2022, 9:39 PM

11 points

1 comment13 min readLW link

(generative.ink)

Why I’m Working On Model Agnostic Interpretability

Jessica RumbelowNov 11, 2022, 9:24 AM

28 points

9 comments2 min readLW link

Are funding options for AI Safety threatened? W45

Steinthal, Esben Kran and Sabrina Zaki

Nov 11, 2022, 1:00 PM

7 points

0 comments3 min readLW link

(newsletter.apartresearch.com)

How likely are malign priors over objectives? [aborted WIP]

David JohnstonNov 11, 2022, 5:36 AM

−2 points

0 comments8 min readLW link

Is AI Gain-of-Function research a thing?

MadHatterNov 12, 2022, 2:33 AM

8 points

2 comments2 min readLW link

Vanessa Kosoy’s PreDCA, distilled

Martín SotoNov 12, 2022, 11:38 AM

16 points

17 comments5 min readLW link

fully aligned singleton as a solution to everything

Tamsin LeakeNov 12, 2022, 6:19 PM

6 points

2 comments2 min readLW link

(carado.moe)

Ways to buy time

Orpheus16, OliviaJ and Thomas Larsen

Nov 12, 2022, 7:31 PM

26 points

21 comments12 min readLW link

Characterizing Intrinsic Compositionality in Transformers with Tree Projections

Ulisse MiniNov 13, 2022, 9:46 AM

12 points

2 comments1 min readLW link

(arxiv.org)

I (with the help of a few more people) am planning to create an introduction to AI Safety that a smart teenager can understand. What am I missing?

TapataktNov 14, 2022, 4:12 PM

3 points

5 comments1 min readLW link

Will we run out of ML data? Evidence from projecting dataset size trends

Pablo VillalobosNov 14, 2022, 4:42 PM

74 points

12 comments2 min readLW link

(epochai.org)

The limited upside of interpretability

Peter S. ParkNov 15, 2022, 6:46 PM

13 points

11 comments1 min readLW link

[Question] Is the speed of training large models going to increase significantly in the near future due to Cerebras Andromeda?

Amal Nov 15, 2022, 10:50 PM

11 points

11 comments1 min readLW link

Unpacking “Shard Theory” as Hunch, Question, Theory, and Insight

Jacy Reese AnthisNov 16, 2022, 1:54 PM

29 points

9 comments2 min readLW link

The two conceptions of Active Inference: an intelligence architecture and a theory of agency

Roman LeventovNov 16, 2022, 9:30 AM

7 points

0 comments4 min readLW link

Engineering Monosemanticity in Toy Models

Adam Jermyn, evhub and Nicholas Schiefer

Nov 18, 2022, 1:43 AM

72 points

6 comments3 min readLW link

(arxiv.org)

[Question] Is there any policy for a fair treatment of AIs whose friendliness is in doubt?

nahojNov 18, 2022, 7:01 PM

15 points

9 comments1 min readLW link

The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Jessica RumbelowNov 17, 2022, 11:06 AM

26 points

2 comments2 min readLW link

Massive Scaling Should be Frowned Upon

harsimonyNov 17, 2022, 8:43 AM

7 points

6 comments5 min readLW link

How AI Fails Us: A non-technical view of the Alignment Problem

testingthewatersNov 18, 2022, 7:02 PM

7 points

0 comments2 min readLW link

(ethics.harvard.edu)

LLMs may capture key components of human agency

catubcNov 17, 2022, 8:14 PM

21 points

0 comments4 min readLW link

AGIs may value intrinsic rewards more than extrinsic ones

catubcNov 17, 2022, 9:49 PM

8 points

6 comments4 min readLW link

The economy as an analogy for advanced AI systems

rosehadshar and particlemania

Nov 15, 2022, 11:16 AM

26 points

0 comments5 min readLW link

Cognitive science and failed AI forecasts

Eleni AngelouNov 24, 2022, 9:02 PM

0 points

0 comments2 min readLW link

A Short Dialogue on the Meaning of Reward Functions

Leon Lang, Quintin Pope and peligrietzer

Nov 19, 2022, 9:04 PM

40 points

0 comments3 min readLW link

[Question] Updates on scaling laws for foundation models from ‘Transcending Scaling Laws with 0.1% Extra Compute’

Nick_GreigNov 18, 2022, 12:46 PM

15 points

2 comments1 min readLW link

Distillation of “How Likely Is Deceptive Alignment?”

NickGabsNov 18, 2022, 4:31 PM

20 points

3 comments10 min readLW link

The Disastrously Confident And Inaccurate AI

Sharat Jacob JacobNov 18, 2022, 7:06 PM

13 points

0 comments13 min readLW link

generalized wireheading

Tamsin LeakeNov 18, 2022, 8:18 PM

21 points

7 comments2 min readLW link

(carado.moe)

By Default, GPTs Think In Plain Sight

Fabien RogerNov 19, 2022, 7:15 PM

60 points

16 comments9 min readLW link

ARC paper: Formalizing the presumption of independence

Erik JennerNov 20, 2022, 1:22 AM

88 points

2 comments2 min readLW link

(arxiv.org)

Planes are still decades away from displacing most bird jobs

guzeyNov 25, 2022, 4:49 PM

156 points

13 comments3 min readLW link

Scott Aaronson on “Reform AI Alignment”

ShmiNov 20, 2022, 10:20 PM

39 points

17 comments1 min readLW link

(scottaaronson.blog)

How Should AIS Relate To Its Funders? W46

Steinthal, Esben Kran and Sabrina Zaki

Nov 21, 2022, 3:58 PM

6 points

1 comment3 min readLW link

(newsletter.apartresearch.com)

Benefits/Risks of Scott Aaronson’s Orthodox/Reform Framing for AI Alignment

JeremyyNov 21, 2022, 5:54 PM

2 points

1 comment1 min readLW link

[Hebbian Natural Abstractions] Introduction

Samuel Nellessen and Jan

Nov 21, 2022, 8:34 PM

34 points

3 comments4 min readLW link

(www.snellessen.com)

Miscellaneous First-Pass Alignment Thoughts

NickGabsNov 21, 2022, 9:23 PM

12 points

4 comments10 min readLW link

Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue)

Jacy Reese AnthisNov 22, 2022, 4:50 PM

95 points

64 comments1 min readLW link

(www.science.org)

Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility

Orpheus16 and OliviaJ

Nov 22, 2022, 10:19 PM

69 points

20 comments4 min readLW link

Brute-forcing the universe: a non-standard shot at diamond alignment

Martín SotoNov 22, 2022, 10:36 PM

6 points

0 comments20 min readLW link

Simulators, constraints, and goal agnosticism: porbynotes vol. 1

porbyNov 23, 2022, 4:22 AM

36 points

2 comments35 min readLW link

Sets of objectives for a multi-objective RL agent to optimize

Ben Smith and Roland Pihlakas

Nov 23, 2022, 6:49 AM

11 points

0 comments8 min readLW link

Human-level Diplomacy was my fire alarm

Lao MeinNov 23, 2022, 10:05 AM

51 points

15 comments3 min readLW link

Ex nihilo

Hopkins StanleyNov 23, 2022, 2:38 PM

1 point

0 comments1 min readLW link

Corrigibility Via Thought-Process Deference

Thane RuthenisNov 24, 2022, 5:06 PM

13 points

5 comments9 min readLW link

Conjecture: a retrospective after 8 months of work

Connor Leahy, Sid Black, Gabriel Alfour and Chris Scammell

Nov 23, 2022, 5:10 PM

183 points

9 comments8 min readLW link

Conjecture Second Hiring Round

Connor Leahy, Sid Black, Gabriel Alfour and Chris Scammell

Nov 23, 2022, 5:11 PM

85 points

0 comments1 min readLW link

Injecting some numbers into the AGI debate—by Boaz Barak

JsevillamolNov 23, 2022, 4:10 PM

12 points

0 comments3 min readLW link

(windowsontheory.org)

Human-level Full-Press Diplomacy (some bare facts).

Cleo NardoNov 22, 2022, 8:59 PM

50 points

7 comments3 min readLW link

When AI solves a game, focus on the game’s mechanics, not its theme.

Cleo NardoNov 23, 2022, 7:16 PM

81 points

7 comments2 min readLW link

[Question] What is the best source to explain short AI timelines to a skeptical person?

trevorNov 23, 2022, 5:19 AM

4 points

4 comments1 min readLW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy and Megan Kinniment

Dec 5, 2022, 8:28 PM

37 points

16 comments10 min readLW link

The man and the tool

pedroalvaradoNov 25, 2022, 7:51 PM

1 point

0 comments4 min readLW link

Gliders in Language Models

Alexandre VariengienNov 25, 2022, 12:38 AM

27 points

11 comments10 min readLW link

The AI Safety community has four main work groups, Strategy, Governance, Technical and Movement Building

peterslatteryNov 25, 2022, 3:45 AM

0 points

0 comments6 min readLW link

Using mechanistic interpretability to find in-distribution failure in toy transformers

Charlie GeorgeNov 28, 2022, 7:39 PM

6 points

0 comments4 min readLW link

Intuitions by ML researchers may get progressively worse concerning likely candidates for transformative AI

Viktor RehnbergNov 25, 2022, 3:49 PM

7 points

0 comments2 min readLW link

Guardian AI (Misaligned systems are all around us.)

Jessica RumbelowNov 25, 2022, 3:55 PM

15 points

6 comments2 min readLW link

Three Alignment Schemas & Their Problems

Shoshannah TekofskyNov 26, 2022, 4:25 AM

16 points

1 comment6 min readLW link

Reward Is Not Necessary: How To Create A Compositional Self-Preserving Agent For Life-Long Learning

CapybasiliskNov 27, 2022, 2:05 PM

3 points

0 comments1 min readLW link

(arxiv.org)

Review: LOVE in a simbox

PeterMcCluskeyNov 27, 2022, 5:41 PM

32 points

4 comments9 min readLW link

(bayesianinvestor.com)

Superintelligent AI is necessary for an amazing future, but far from sufficient

So8resOct 31, 2022, 9:16 PM

115 points

46 comments34 min readLW link

[Question] How to correct for multiplicity with AI-generated models?

Lao MeinNov 28, 2022, 3:51 AM

4 points

0 comments1 min readLW link

Is Constructor Theory a useful tool for AI alignment?

A.H.Nov 29, 2022, 12:35 PM

11 points

8 comments26 min readLW link

Multi-Component Learning and S-Curves

Adam Jermyn and Buck

Nov 30, 2022, 1:37 AM

57 points

24 comments7 min readLW link

Subsets and quotients in interpretability

Erik JennerDec 2, 2022, 11:13 PM

24 points

1 comment7 min readLW link

Neglected cause: automated fraud detection in academia through image analysis

Lao MeinNov 30, 2022, 5:52 AM

10 points

1 comment2 min readLW link

AGI Impossible due to Energy Constrains

TheKlausNov 30, 2022, 6:48 PM

−8 points

13 comments1 min readLW link

Master plan spec: needs audit (logic and cooperative AI)

QuinnNov 30, 2022, 6:10 AM

12 points

5 comments7 min readLW link

AI takeover tabletop RPG: “The Treacherous Turn”

Daniel KokotajloNov 30, 2022, 7:16 AM

51 points

3 comments1 min readLW link

Has AI gone too far?

Boston AndersonNov 30, 2022, 6:49 PM

−15 points

3 comments1 min readLW link

Seeking submissions for short AI-safety course proposals

SergioDec 1, 2022, 12:32 AM

3 points

0 comments1 min readLW link

Did ChatGPT just gaslight me?

TW123Dec 1, 2022, 5:41 AM

123 points

45 comments9 min readLW link

(equonc.substack.com)

Safe Development of Hacker-AI Countermeasures – What if we are too late?

Erland WittkotterDec 1, 2022, 7:59 AM

3 points

0 comments14 min readLW link

Research request (alignment strategy): Deep dive on “making AI solve alignment for us”

JanBDec 1, 2022, 2:55 PM

16 points

3 comments1 min readLW link

[LINK] - ChatGPT discussion

JanBDec 1, 2022, 3:04 PM

13 points

7 comments1 min readLW link

(openai.com)

ChatGPT: First Impressions

specbugDec 1, 2022, 4:36 PM

18 points

2 comments13 min readLW link

(sixeleven.in)

Re-Examining LayerNorm

Eric WinsorDec 1, 2022, 10:20 PM

100 points

8 comments5 min readLW link

Update on Harvard AI Safety Team and MIT AI Alignment

Xander Davies, Sam Marks, kaivu, tlevin, eleni, maxnadeau, Naomi Bashkansky and Oam Patel

Dec 2, 2022, 12:56 AM

56 points

4 comments8 min readLW link

Deconfusing Direct vs Amortised Optimization

berenDec 2, 2022, 11:30 AM

48 points

6 comments10 min readLW link

[ASoT] Finetuning, RL, and GPT’s world prior

JozdienDec 2, 2022, 4:33 PM

31 points

8 comments5 min readLW link

Takeoff speeds, the chimps analogy, and the Cultural Intelligence Hypothesis

NickGabsDec 2, 2022, 7:14 PM

14 points

2 comments4 min readLW link

Non-Technical Preparation for Hacker-AI and Cyberwar 2.0+

Erland WittkotterDec 19, 2022, 11:42 AM

2 points

0 comments25 min readLW link

Apply for the ML Upskilling Winter Camp in Cambridge, UK [2-10 Jan]

hannah wing-yeeDec 2, 2022, 8:45 PM

3 points

0 comments2 min readLW link

Research Principles for 6 Months of AI Alignment Studies

Shoshannah TekofskyDec 2, 2022, 10:55 PM

22 points

3 comments6 min readLW link

Chat GPT’s views on Metaphysics and Ethics

Cole KillianDec 3, 2022, 6:12 PM

5 points

3 comments1 min readLW link

(twitter.com)

[Question] Will the first AGI agent have been designed as an agent (in addition to an AGI)?

nahojDec 3, 2022, 8:32 PM

1 point

8 comments1 min readLW link

Could an AI be Religious?

mk54Dec 4, 2022, 5:00 AM

−12 points

14 comments1 min readLW link

ChatGPT seems overconfident to me

qbolecDec 4, 2022, 8:03 AM

19 points

3 comments16 min readLW link

AI can exploit safety plans posted on the Internet

Peter S. ParkDec 4, 2022, 12:17 PM

−19 points

4 comments1 min readLW link

Race to the Top: Benchmarks for AI Safety

Isabella DuanDec 4, 2022, 6:48 PM

12 points

2 comments1 min readLW link

Take 3: No indescribable heavenworlds.

Charlie SteinerDec 4, 2022, 2:48 AM

21 points

12 comments2 min readLW link

ChatGPT is settling the Chinese Room argument

averrosDec 4, 2022, 8:25 PM

−7 points

4 comments1 min readLW link

AGI as a Black Swan Event

Stephen McAleeseDec 4, 2022, 11:00 PM

8 points

8 comments7 min readLW link

Probably good projects for the AI safety ecosystem

Ryan KiddDec 5, 2022, 2:26 AM

73 points

15 comments2 min readLW link

A ChatGPT story about ChatGPT doom

SurfingOrcaDec 5, 2022, 5:40 AM

6 points

3 comments4 min readLW link

Aligned Behavior is not Evidence of Alignment Past a Certain Level of Intelligence

Ronny FernandezDec 5, 2022, 3:19 PM

19 points

5 comments7 min readLW link

Is the “Valley of Confused Abstractions” real?

jacquesthibsDec 5, 2022, 1:36 PM

15 points

9 comments2 min readLW link

Analysis of AI Safety surveys for field-building insights

Ash JafariDec 5, 2022, 7:21 PM

10 points

2 comments4 min readLW link

Testing Ways to Bypass ChatGPT’s Safety Features

Robert_AIZIDec 5, 2022, 6:50 PM

6 points

2 comments5 min readLW link

(aizi.substack.com)

ChatGPT on Spielberg’s A.I. and AI Alignment

Bill BenzonDec 5, 2022, 9:10 PM

5 points

0 comments4 min readLW link

Shh, don’t tell the AI it’s likely to be evil

naterushDec 6, 2022, 3:35 AM

19 points

9 comments1 min readLW link

Neural networks biased towards geometrically simple functions?

DavidHolmesDec 8, 2022, 4:16 PM

16 points

2 comments3 min readLW link

Things roll downhill

awenonianDec 6, 2022, 3:27 PM

19 points

0 comments1 min readLW link

ChatGPT and the Human Race

Ben ReillyDec 6, 2022, 9:38 PM

6 points

1 comment3 min readLW link

AI Safety in a Vulnerable World: Requesting Feedback on Preliminary Thoughts

Jordan ArelDec 6, 2022, 10:35 PM

3 points

2 comments3 min readLW link

In defense of probably wrong mechanistic models

evhubDec 6, 2022, 11:24 PM

41 points

10 comments2 min readLW link

ChatGPT: “An error occurred. If this issue persists...”

Bill BenzonDec 7, 2022, 3:41 PM

5 points

11 comments3 min readLW link

Where to be an AI Safety Professor

scasperDec 7, 2022, 7:09 AM

30 points

12 comments2 min readLW link

Thoughts on AGI organizations and capabilities work

Rob Bensinger and So8res

Dec 7, 2022, 7:46 PM

94 points

17 comments5 min readLW link

Riffing on the agent type

QuinnDec 8, 2022, 12:19 AM

16 points

0 comments4 min readLW link

Of pumpkins, the Falcon Heavy, and Groucho Marx: High-Level discourse structure in ChatGPT

Bill BenzonDec 8, 2022, 10:25 PM

2 points

0 comments8 min readLW link

Why I’m Sceptical of Foom

DragonGodDec 8, 2022, 10:01 AM

19 points

26 comments3 min readLW link

If Wentworth is right about natural abstractions, it would be bad for alignment

Wuschel SchulzDec 8, 2022, 3:19 PM

27 points

5 comments4 min readLW link

Take 7: You should talk about “the human’s utility function” less.

Charlie SteinerDec 8, 2022, 8:14 AM

47 points

22 comments2 min readLW link

Notes on OpenAI’s alignment plan

Alex FlintDec 8, 2022, 7:13 PM

47 points

5 comments7 min readLW link

We need to make scary AIs

Igor IvanovDec 9, 2022, 10:04 AM

3 points

8 comments5 min readLW link

I Believe we are in a Hardware Overhang

nemDec 8, 2022, 11:18 PM

8 points

0 comments1 min readLW link

[Question] What are your thoughts on the future of AI-assisted software development?

RomanHaukssonDec 9, 2022, 10:04 AM

4 points

2 comments1 min readLW link

ChatGPT’s Misalignment Isn’t What You Think

stavrosDec 9, 2022, 11:11 AM

3 points

12 comments1 min readLW link

Simulators and Mindcrime

DragonGodDec 9, 2022, 3:20 PM

0 points

4 comments3 min readLW link

Working towards AI alignment is better

Johannes C. MayerDec 9, 2022, 3:39 PM

7 points

2 comments2 min readLW link

[Question] How would you improve ChatGPT’s filtering?

Noah ScalesDec 10, 2022, 8:05 AM

9 points

6 comments1 min readLW link

Inspiration as a Scarce Resource

zenbu zenbu zenbu zenbuDec 10, 2022, 3:23 PM

7 points

0 comments4 min readLW link

(inflorescence.substack.com)

Poll Results on AGI

Niclas KupperDec 10, 2022, 9:25 PM

10 points

0 comments2 min readLW link

The Opportunity and Risks of Learning Human Values In-Context

Past AccountDec 10, 2022, 9:40 PM

1 point

4 comments5 min readLW link

High level discourse structure in ChatGPT: Part 2 [Quasi-symbolic?]

Bill BenzonDec 10, 2022, 10:26 PM

7 points

0 comments6 min readLW link

ChatGPT goes through a wormhole hole in our Shandyesque universe [virtual wacky weed]

Bill BenzonDec 11, 2022, 11:59 AM

−1 points

2 comments3 min readLW link

Questions about AI that bother me

Eleni AngelouDec 11, 2022, 6:14 PM

11 points

2 comments2 min readLW link

Reflections on the PIBBSS Fellowship 2022

Nora_Ammann and particlemania

Dec 11, 2022, 9:53 PM

31 points

0 comments18 min readLW link

Benchmarks for Comparing Human and AI Intelligence

MrThinkDec 11, 2022, 10:06 PM

8 points

4 comments2 min readLW link

a rough sketch of formal aligned AI using QACI

Tamsin LeakeDec 11, 2022, 11:40 PM

14 points

0 comments4 min readLW link

(carado.moe)

Trivial GPT-3.5 limitation workaround

Dave LindberghDec 12, 2022, 8:42 AM

5 points

4 comments1 min readLW link

[Question] Thought experiment. If human minds could be harnessed into one universal consciousness of humanity, would we discover things that have been quite difficult to reach with the means of modern science? And would the consciousness of humanity be more comprehensive than the future power of artificial intelligence?

lotta liedesDec 12, 2022, 2:43 PM

−1 points

0 comments1 min readLW link

Meaningful things are those the universe possesses a semantics for

Abhimanyu Pallavi SudhirDec 12, 2022, 4:03 PM

7 points

14 comments14 min readLW link

Let’s go meta: Grammatical knowledge and self-referential sentences [ChatGPT]

Bill BenzonDec 12, 2022, 9:50 PM

5 points

0 comments9 min readLW link

[Question] Are lawsuits against AGI companies extending AGI timelines?

SlowingAGIDec 13, 2022, 6:00 AM

1 point

1 comment1 min readLW link

An exploration of GPT-2’s embedding weights

Adam ScherlisDec 13, 2022, 12:46 AM

26 points

2 comments10 min readLW link

Revisiting algorithmic progress

Tamay and Ege Erdil

Dec 13, 2022, 1:39 AM

92 points

8 comments2 min readLW link

(arxiv.org)

Alignment with argument-networks and assessment-predictions

Tor Økland BarstadDec 13, 2022, 2:17 AM

7 points

3 comments45 min readLW link

Limits of Superintelligence

Aleksei PetrenkoDec 13, 2022, 12:19 PM

1 point

0 comments1 min readLW link

[Question] Best introductory overviews of AGI safety?

JakubKDec 13, 2022, 7:01 PM

14 points

5 comments2 min readLW link

(forum.effectivealtruism.org)

Seeking participants for study of AI safety researchers

joelegardnerDec 13, 2022, 9:58 PM

2 points

0 comments1 min readLW link

Assessing the Capabilities of ChatGPT through Success Rates

Past AccountDec 13, 2022, 9:16 PM

5 points

0 comments2 min readLW link

Discovering Latent Knowledge in Language Models Without Supervision

XodarapDec 14, 2022, 12:32 PM

45 points

1 comment1 min readLW link

(arxiv.org)

all claw, no world — and other thoughts on the universal distribution

Tamsin LeakeDec 14, 2022, 6:55 PM

14 points

0 comments7 min readLW link

(carado.moe)

Contrary to List of Lethality’s point 22, alignment’s door number 2

False NameDec 14, 2022, 10:01 PM

0 points

1 comment22 min readLW link

ChatGPT has a HAL Problem

Paul AndersonDec 14, 2022, 9:31 PM

1 point

0 comments1 min readLW link

How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

CollinDec 15, 2022, 6:22 PM

124 points

18 comments16 min readLW link

Avoiding Psychopathic AI

Cameron BergDec 19, 2022, 5:01 PM

28 points

3 comments20 min readLW link

We’ve stepped over the threshold into the Fourth Arena, but don’t recognize it

Bill BenzonDec 15, 2022, 8:22 PM

2 points

0 comments7 min readLW link

AI Safety Movement Builders should help the community to optimise three factors: contributors, contributions and coordination

peterslatteryDec 15, 2022, 10:50 PM

4 points

0 comments6 min readLW link

Proper scoring rules don’t guarantee predicting fixed points

Johannes Treutlein, Rubi J. Hudson and Caspar Oesterheld

Dec 16, 2022, 6:22 PM

55 points

5 comments21 min readLW link

A learned agent is not the same as a learning agent

Ben AmitayDec 16, 2022, 5:27 PM

4 points

4 comments2 min readLW link

Abstract concepts and metalingual definition: Does ChatGPT understand justice and charity?

Bill BenzonDec 16, 2022, 9:01 PM

2 points

0 comments13 min readLW link

Using Information Theory to tackle AI Alignment: A Practical Approach

Daniel SalamiDec 17, 2022, 1:37 AM

6 points

4 comments8 min readLW link

Looking for an alignment tutor

JanBDec 17, 2022, 7:08 PM

15 points

2 comments1 min readLW link

What we owe the microbiome

weverkaDec 17, 2022, 7:40 PM

2 points

0 comments1 min readLW link

(forum.effectivealtruism.org)

Bad at Arithmetic, Promising at Math

cohenmacaulayDec 18, 2022, 5:40 AM

91 points

17 comments20 min readLW link

AGI is here, but nobody wants it. Why should we even care?

MGowDec 20, 2022, 7:14 PM

−20 points

0 comments17 min readLW link

Hacker-AI and Cyberwar 2.0+

Erland WittkotterDec 19, 2022, 11:46 AM

2 points

0 comments15 min readLW link

Does ChatGPT’s performance warrant working on a tutor for children? [It’s time to take it to the lab.]

Bill BenzonDec 19, 2022, 3:12 PM

13 points

2 comments4 min readLW link

(new-savanna.blogspot.com)

Results from a survey on tool use and workflows in alignment research

jacquesthibs, Jan, janus and Logan Riggs

Dec 19, 2022, 3:19 PM

50 points

2 comments19 min readLW link

Proliferating Education

Haris RashidDec 20, 2022, 7:22 PM

−1 points

2 comments5 min readLW link

(www.harisrab.com)

[Question] Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

Yann DuboisDec 19, 2022, 10:42 PM

5 points

6 comments1 min readLW link

AGI Timelines in Governance: Different Strategies for Different Timeframes

simeon_c and AmberDawn

Dec 19, 2022, 9:31 PM

47 points

15 comments10 min readLW link

(Extremely) Naive Gradient Hacking Doesn’t Work

ojorgensenDec 20, 2022, 2:35 PM

6 points

0 comments6 min readLW link

An Open Agency Architecture for Safe Transformative AI

davidadDec 20, 2022, 1:04 PM

18 points

12 comments4 min readLW link

Properties of current AIs and some predictions of the evolution of AI from the perspective of scale-free theories of agency and regulative development

Roman LeventovDec 20, 2022, 5:13 PM

7 points

0 comments36 min readLW link

I believe some AI doomers are overconfident

FTPickleDec 20, 2022, 5:09 PM

10 points

14 comments2 min readLW link

Performing an SVD on a time-series matrix of gradient updates on an MNIST network produces 92.5 singular values

Garrett BakerDec 21, 2022, 12:44 AM

8 points

10 comments5 min readLW link

CIRL Corrigibility is Fragile

Rachel Freedman and AdamGleave

Dec 21, 2022, 1:40 AM

21 points

1 comment12 min readLW link

New AI risk intro from Vox [link post]

JakubKDec 21, 2022, 6:00 AM

5 points

1 comment2 min readLW link

(www.vox.com)

habryka Oct 3, 2021, 4:21 AM
2 points
Some of the recent edits changed some of the links to no longer have the ?showPostCount=true&useTagName=true query parameters in the links, which changes how they are displayed and makes the display inconsistent. Seems like we should fix this.
- plex Oct 4, 2021, 4:50 PM
  1 point
  Yep, that was me adding some new ones without the parameter (though I think I didn’t remove it from any which already had it), did not know that was needed, fixed now (and fixed on portal page).
plex Aug 29, 2021, 3:58 PM
2 points
I think Basic Alignment Theory should be renamed; very little of it is basic. I propose either Alignment Theory or Conceptual Alignment (credit to @adamshimi for the name).