AI Risk

Tag

Last edit: 2 Nov 2022 20:27 UTC by brook

AI Risk is the analysis of the risks associated with building powerful AI systems.

Superintelligence FAQ

Scott Alexander20 Sep 2016 19:00 UTC

141 points

39 comments27 min readLW link

What failure looks like

paulfchristiano17 Mar 2019 20:18 UTC

437 points

55 comments8 min readLW link 2 reviews

AGI Ruin: A List of Lethalities

Eliezer Yudkowsky5 Jun 2022 22:05 UTC

955 points

711 comments30 min readLW link 3 reviews

Specification gaming examples in AI

Vika3 Apr 2018 12:30 UTC

48 points

9 comments1 min readLW link 2 reviews

An artificially structured argument for expecting AGI ruin

Rob Bensinger7 May 2023 21:52 UTC

91 points

26 comments19 min readLW link

Where I agree and disagree with Eliezer

paulfchristiano19 Jun 2022 19:15 UTC

907 points

224 comments18 min readLW link 2 reviews

Discussion with Eliezer Yudkowsky on AGI interventions

Rob Bensinger and Eliezer Yudkowsky

11 Nov 2021 3:01 UTC

328 points

253 comments34 min readLW link 1 review

“Corrigibility at some small length” by dath ilan

Christopher King5 Apr 2023 1:47 UTC

32 points

3 comments9 min readLW link

(www.glowfic.com)

Intuitions about goal-directed behavior

Rohin Shah1 Dec 2018 4:25 UTC

55 points

15 comments6 min readLW link

On how various plans miss the hard bits of the alignment challenge

So8res12 Jul 2022 2:49 UTC

315 points

89 comments29 min readLW link 3 reviews

Epistemological Framing for AI Alignment Research

adamShimi8 Mar 2021 22:05 UTC

58 points

7 comments9 min readLW link

Conjecture internal survey: AGI timelines and probability of human extinction from advanced AI

Maris Sala22 May 2023 14:31 UTC

155 points

5 comments3 min readLW link

(www.conjecture.dev)

AGI in sight: our look at the game board

Andrea_Miotti and Gabriel Alfour

18 Feb 2023 22:17 UTC

228 points

135 comments6 min readLW link

(andreamiotti.substack.com)

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley5 Jan 2024 8:46 UTC

37 points

4 comments2 min readLW link

Open Problems in AI X-Risk [PAIS #5]

Dan H and TW123

10 Jun 2022 2:08 UTC

61 points

6 comments36 min readLW link

The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda

Cameron Berg, Judd Rosenblatt, Trent Hodgeson and Marc Carauleanu

18 Dec 2023 20:35 UTC

187 points

23 comments12 min readLW link 1 review

What can the principal-agent literature tell us about AI risk?

apc8 Feb 2020 21:28 UTC

104 points

29 comments16 min readLW link

Bing chat is the AI fire alarm

Ratios17 Feb 2023 6:51 UTC

115 points

63 comments3 min readLW link

Stampy’s AI Safety Info—New Distillations #1 [March 2023]

markov7 Apr 2023 11:06 UTC

42 points

0 comments2 min readLW link

(aisafety.info)

Another (outer) alignment failure story

paulfchristiano7 Apr 2021 20:12 UTC

249 points

38 comments12 min readLW link 1 review

Interpreting the Learning of Deceit

RogerDearnaley18 Dec 2023 8:12 UTC

30 points

14 comments9 min readLW link

MIRI announces new “Death With Dignity” strategy

Eliezer Yudkowsky2 Apr 2022 0:43 UTC

377 points

547 comments18 min readLW link 1 review

A transcript of the TED talk by Eliezer Yudkowsky

Mikhail Samin12 Jul 2023 12:12 UTC

105 points

13 comments4 min readLW link

Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue)

Jacy Reese Anthis22 Nov 2022 16:50 UTC

93 points

64 comments1 min readLW link

(www.science.org)

Don’t Share Information Exfohazardous on Others’ AI-Risk Models

Thane Ruthenis19 Dec 2023 20:09 UTC

68 points

11 comments1 min readLW link

AI will change the world, but won’t take it over by playing “3-dimensional chess”.

boazbarak and benedelman

22 Nov 2022 18:57 UTC

134 points

97 comments24 min readLW link

A Gym Gridworld Environment for the Treacherous Turn

Michaël Trazzi28 Jul 2018 21:27 UTC

74 points

9 comments3 min readLW link

(github.com)

[Question] Will OpenAI’s work unintentionally increase existential risks related to AI?

adamShimi11 Aug 2020 18:16 UTC

53 points

55 comments1 min readLW link

Robin Hanson’s latest AI risk position statement

Liron3 Mar 2023 14:25 UTC

55 points

18 comments1 min readLW link

(www.overcomingbias.com)

Developmental Stages of GPTs

orthonormal26 Jul 2020 22:03 UTC

140 points

72 comments7 min readLW link 1 review

A Path out of Insufficient Views

Unreal24 Sep 2024 20:00 UTC

44 points

64 comments9 min readLW link

How AI Takeover Might Happen in 2 Years

joshc7 Feb 2025 17:10 UTC

431 points

139 comments29 min readLW link

(x.com)

Don’t accelerate problems you’re trying to solve

Andrea_Miotti and remember

15 Feb 2023 18:11 UTC

100 points

27 comments4 min readLW link

Architects of Our Own Demise: We Should Stop Developing AI Carelessly

Roko26 Oct 2023 0:36 UTC

170 points

75 comments3 min readLW link

The Hidden Complexity of Wishes

Eliezer Yudkowsky24 Nov 2007 0:12 UTC

180 points

199 comments8 min readLW link

An Appeal to AI Superintelligence: Reasons to Preserve Humanity

James_Miller18 Mar 2023 16:22 UTC

41 points

73 comments12 min readLW link

How good is humanity at coordination?

Buck21 Jul 2020 20:01 UTC

83 points

44 comments3 min readLW link

My Objections to “We’re All Gonna Die with Eliezer Yudkowsky”

Quintin Pope21 Mar 2023 0:06 UTC

363 points

233 comments39 min readLW link 1 review

Intent alignment should not be the goal for AGI x-risk reduction

John Nay26 Oct 2022 1:24 UTC

1 point

10 comments3 min readLW link

Counterarguments to the basic AI x-risk case

KatjaGrace14 Oct 2022 13:00 UTC

373 points

125 comments34 min readLW link 1 review

(aiimpacts.org)

AI Could Defeat All Of Us Combined

HoldenKarnofsky9 Jun 2022 15:50 UTC

170 points

42 comments17 min readLW link

(www.cold-takes.com)

Soft takeoff can still lead to decisive strategic advantage

Daniel Kokotajlo23 Aug 2019 16:39 UTC

122 points

47 comments8 min readLW link 4 reviews

Being at peace with Doom

Johannes C. Mayer9 Apr 2023 14:53 UTC

23 points

14 comments4 min readLW link 1 review

“Endgame safety” for AGI

Steven Byrnes24 Jan 2023 14:15 UTC

85 points

10 comments6 min readLW link

[Question] How likely are scenarios where AGI ends up overtly or de facto torturing us? How likely are scenarios where AGI prevents us from committing suicide or dying?

JohnGreer28 Mar 2023 18:00 UTC

11 points

4 comments1 min readLW link

Are minimal circuits deceptive?

evhub7 Sep 2019 18:11 UTC

78 points

11 comments8 min readLW link

DL towards the unaligned Recursive Self-Optimization attractor

jacob_cannell18 Dec 2021 2:15 UTC

32 points

22 comments4 min readLW link

Devil’s Advocate: Adverse Selection Against Conscientiousness

lionhearted (Sebastian Marshall)28 May 2023 17:53 UTC

10 points

2 comments1 min readLW link

Announcing Apollo Research

Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni and Jérémy Scheurer

30 May 2023 16:17 UTC

217 points

11 comments8 min readLW link

Did Bengio and Tegmark lose a debate about AI x-risk against LeCun and Mitchell?

Karl von Wendt25 Jun 2023 16:59 UTC

107 points

53 comments7 min readLW link

On Solving Problems Before They Appear: The Weird Epistemologies of Alignment

adamShimi11 Oct 2021 8:20 UTC

110 points

10 comments15 min readLW link

Self-Other Overlap: A Neglected Approach to AI Alignment

Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena, Cameron Berg and Trent Hodgeson

30 Jul 2024 16:22 UTC

226 points

51 comments12 min readLW link

A challenge for AGI organizations, and a challenge for readers

Rob Bensinger and Eliezer Yudkowsky

1 Dec 2022 23:11 UTC

302 points

33 comments2 min readLW link

Alexander and Yudkowsky on AGI goals

Scott Alexander and Eliezer Yudkowsky

24 Jan 2023 21:09 UTC

179 points

53 comments26 min readLW link 1 review

Should we postpone AGI until we reach safety?

otto.barten18 Nov 2020 15:43 UTC

27 points

36 comments3 min readLW link

Request to AGI organizations: Share your views on pausing AI progress

Orpheus16 and simeon_c

11 Apr 2023 17:30 UTC

141 points

11 comments1 min readLW link

Truthful LMs as a warm-up for aligned AGI

Jacob_Hilton17 Jan 2022 16:49 UTC

65 points

14 comments13 min readLW link

A Bear Case: My Predictions Regarding AI Progress

Thane Ruthenis5 Mar 2025 16:41 UTC

375 points

163 comments9 min readLW link

[RETRACTED] It’s time for EA leadership to pull the short-timelines fire alarm.

Not Relevant8 Apr 2022 16:07 UTC

116 points

166 comments4 min readLW link

World-Model Interpretability Is All We Need

Thane Ruthenis14 Jan 2023 19:37 UTC

36 points

22 comments21 min readLW link

The alignment problem from a deep learning perspective

Richard_Ngo10 Aug 2022 22:46 UTC

107 points

15 comments27 min readLW link 1 review

(4 min read) An intuitive explanation of the AI influence situation

trevor13 Jan 2024 17:34 UTC

12 points

26 comments4 min readLW link

Cortés, AI Risk, and the Dynamics of Competing Conquerors

James_Miller2 Jan 2024 16:37 UTC

14 points

3 comments3 min readLW link

The Fusion Power Generator Scenario

johnswentworth8 Aug 2020 18:31 UTC

156 points

30 comments3 min readLW link

Full Transcript: Eliezer Yudkowsky on the Bankless podcast

remember and Andrea_Miotti

23 Feb 2023 12:34 UTC

138 points

89 comments75 min readLW link

Stuxnet, not Skynet: Humanity’s disempowerment by AI

Roko4 Nov 2023 22:23 UTC

107 points

24 comments6 min readLW link

Why the technological singularity by AGI may never happen

hippke3 Sep 2021 14:19 UTC

5 points

14 comments1 min readLW link

Brainrot

Jesse Hoogland26 Jan 2025 5:35 UTC

43 points

0 comments3 min readLW link

Slow motion videos as AI risk intuition pumps

Andrew_Critch14 Jun 2022 19:31 UTC

241 points

41 comments2 min readLW link 1 review

[Question] How load-bearing is KL divergence from a known-good base model in modern RL?

faul_sname22 May 2025 12:08 UTC

12 points

2 comments4 min readLW link

Reframing the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous)

Orpheus1625 Apr 2023 18:49 UTC

27 points

11 comments3 min readLW link

(childrenoficarus.substack.com)

My thoughts on OpenAI’s alignment plan

Orpheus1630 Dec 2022 19:33 UTC

55 points

3 comments20 min readLW link

Neuroscience and Alignment

Garrett Baker18 Mar 2024 21:09 UTC

42 points

25 comments2 min readLW link

The Control Problem: Unsolved or Unsolvable?

Remmelt2 Jun 2023 15:42 UTC

57 points

46 comments13 min readLW link

[Linkpost] Scott Alexander reacts to OpenAI’s latest post

Orpheus1611 Mar 2023 22:24 UTC

27 points

0 comments5 min readLW link

(astralcodexten.substack.com)

AI Takeover Scenario with Scaled LLMs

simeon_c16 Apr 2023 23:28 UTC

42 points

15 comments8 min readLW link

Reliability, Security, and AI risk: Notes from infosec textbook chapter 1

Orpheus167 Apr 2023 15:47 UTC

34 points

1 comment4 min readLW link

Assessment of AI safety agendas: think about the downside risk

Roman Leventov19 Dec 2023 9:00 UTC

13 points

1 comment1 min readLW link

ea.domains—Domains Free to a Good Home

plex12 Jan 2023 13:32 UTC

24 points

0 comments4 min readLW link

Empowerment is (almost) All We Need

jacob_cannell23 Oct 2022 21:48 UTC

61 points

44 comments17 min readLW link

Sam Altman and Ezra Klein on the AI Revolution

Zack_M_Davis27 Jun 2021 4:53 UTC

38 points

17 comments1 min readLW link

(www.nytimes.com)

Some disjunctive reasons for urgency on AI risk

Wei Dai15 Feb 2019 20:43 UTC

36 points

24 comments1 min readLW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC

127 points

9 comments15 min readLW link

We don’t want to post again “This might be the last AI Safety Camp”

Remmelt, Linda Linsefors and Robert Kralisch

21 Jan 2025 12:03 UTC

36 points

17 comments1 min readLW link

(manifund.org)

Thinking soberly about the context and consequences of Friendly AI

Mitchell_Porter16 Oct 2012 4:33 UTC

21 points

39 comments1 min readLW link

Podcast: Shoshannah Tekofsky on skilling up in AI safety, visiting Berkeley, and developing novel research ideas

Orpheus1625 Nov 2022 20:47 UTC

37 points

2 comments9 min readLW link

The other side of the tidal wave

KatjaGrace3 Nov 2023 5:40 UTC

189 points

86 comments1 min readLW link

(worldspiritsockpuppet.com)

Racing through a minefield: the AI deployment problem

HoldenKarnofsky22 Dec 2022 16:10 UTC

38 points

2 comments13 min readLW link

(www.cold-takes.com)

Epistemological Vigilance for Alignment

adamShimi6 Jun 2022 0:27 UTC

66 points

11 comments10 min readLW link

Low P(x-risk) as the Bailey for Low P(doom)

Vladimir_Nesov29 Jul 2025 18:01 UTC

48 points

29 comments2 min readLW link

What would a compute monitoring plan look like? [Linkpost]

Orpheus1626 Mar 2023 19:33 UTC

158 points

10 comments4 min readLW link

(arxiv.org)

Where’s the foom?

Fergus Fettes11 Apr 2023 15:50 UTC

34 points

27 comments2 min readLW link

All AGI Safety questions welcome (especially basic ones) [~monthly thread]

mwatkins and Robert Miles

26 Jan 2023 21:01 UTC

39 points

81 comments2 min readLW link

AMA Conjecture, A New Alignment Startup

adamShimi9 Apr 2022 9:43 UTC

47 points

42 comments1 min readLW link

Reply to Holden on ‘Tool AI’

Eliezer Yudkowsky12 Jun 2012 18:00 UTC

152 points

356 comments17 min readLW link

Plan for mediocre alignment of brain-like [model-based RL] AGI

Steven Byrnes13 Mar 2023 14:11 UTC

68 points

25 comments12 min readLW link

An AI risk argument that resonates with NYTimes readers

Julian Bradshaw12 Mar 2023 23:09 UTC

212 points

14 comments1 min readLW link

Beyond Hyperanthropomorphism

PointlessOne21 Aug 2022 17:55 UTC

3 points

17 comments1 min readLW link

(studio.ribbonfarm.com)

Levels of safety for AI and other technologies

jasoncrawford28 Jun 2023 18:35 UTC

16 points

0 comments2 min readLW link

(rootsofprogress.org)

Is progress in ML-assisted theorem-proving beneficial?

mako yass28 Sep 2021 1:54 UTC

11 points

3 comments1 min readLW link

A plea for solutionism on AI safety

jasoncrawford9 Jun 2023 16:29 UTC

72 points

6 comments6 min readLW link

(rootsofprogress.org)

[Question] Measure of complexity allowed by the laws of the universe and relative theory?

dr_s7 Sep 2023 12:21 UTC

8 points

22 comments1 min readLW link

The Rising Sea

Jesse Hoogland25 Jan 2025 20:48 UTC

96 points

5 comments2 min readLW link

An overview of 11 proposals for building safe advanced AI

evhub29 May 2020 20:38 UTC

220 points

37 comments38 min readLW link 2 reviews

The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better

Thane Ruthenis21 Feb 2025 20:15 UTC

152 points

53 comments6 min readLW link

Risks from AI Overview: Summary

Dan H, Mantas Mazeika and TW123

18 Aug 2023 1:21 UTC

25 points

1 comment13 min readLW link

(www.safe.ai)

Some conceptual highlights from “Disjunctive Scenarios of Catastrophic AI Risk”

Kaj_Sotala12 Feb 2018 12:30 UTC

45 points

4 comments6 min readLW link

(kajsotala.fi)

“Carefully Bootstrapped Alignment” is organizationally hard

Raemon17 Mar 2023 18:00 UTC

265 points

23 comments11 min readLW link 1 review

[Linkpost] Biden-Harris Executive Order on AI

beren30 Oct 2023 15:20 UTC

3 points

0 comments1 min readLW link

How Would an Utopia-Maximizer Look Like?

Thane Ruthenis20 Dec 2023 20:01 UTC

32 points

23 comments10 min readLW link

How dangerous is human-level AI?

Alex_Altair10 Jun 2022 17:38 UTC

21 points

4 comments8 min readLW link

My AI Model Delta Compared To Yudkowsky

johnswentworth10 Jun 2024 16:12 UTC

291 points

103 comments4 min readLW link

All AGI Safety questions welcome (especially basic ones) [April 2023]

steven04618 Apr 2023 4:21 UTC

57 points

89 comments2 min readLW link

Challenges with Breaking into MIRI-Style Research

Chris_Leong17 Jan 2022 9:23 UTC

75 points

16 comments2 min readLW link

AGI Safety Literature Review (Everitt, Lea & Hutter 2018)

Kaj_Sotala4 May 2018 8:56 UTC

14 points

1 comment1 min readLW link

(arxiv.org)

Sensor Exposure can Compromise the Human Brain in the 2020s

trevor26 Oct 2023 3:31 UTC

17 points

6 comments10 min readLW link

Difficulties in making powerful aligned AI

DanielFilan14 May 2023 20:50 UTC

41 points

1 comment10 min readLW link

(danielfilan.com)

Ten Levels of AI Alignment Difficulty

Sammy Martin3 Jul 2023 20:20 UTC

140 points

24 comments12 min readLW link 1 review

[Question] Suggestions of posts on the AF to review

adamShimi16 Feb 2021 12:40 UTC

56 points

20 comments1 min readLW link

I’m trying out “asteroid mindset”

Alex_Altair3 Jun 2022 13:35 UTC

90 points

5 comments4 min readLW link

Thoughts on Robin Hanson’s AI Impacts interview

Steven Byrnes24 Nov 2019 1:40 UTC

25 points

3 comments7 min readLW link

Lurking in the Noise

J Bostock25 Jun 2025 13:36 UTC

37 points

2 comments4 min readLW link

Confusions in My Model of AI Risk

peterbarnett7 Jul 2022 1:05 UTC

22 points

9 comments5 min readLW link

Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Ben Pace4 Oct 2019 4:08 UTC

221 points

61 comments15 min readLW link 2 reviews

Paper: On measuring situational awareness in LLMs

Owain_Evans, Daniel Kokotajlo, Mikita Balesni, Tomek Korbak, Asa Cooper Stickland, Meg and Maximilian Kaufmann

4 Sep 2023 12:54 UTC

109 points

17 comments5 min readLW link

(arxiv.org)

The Shortest Path Between Scylla and Charybdis

Thane Ruthenis18 Dec 2023 20:08 UTC

50 points

8 comments5 min readLW link

Response to Aschenbrenner’s “Situational Awareness”

Rob Bensinger6 Jun 2024 22:57 UTC

197 points

27 comments3 min readLW link

DeepMind alignment team opinions on AGI ruin arguments

Vika12 Aug 2022 21:06 UTC

397 points

37 comments14 min readLW link 1 review

Activation additions in a small residual network

Garrett Baker22 May 2023 20:28 UTC

22 points

4 comments3 min readLW link

“Taking AI Risk Seriously” (thoughts by Critch)

Raemon29 Jan 2018 9:27 UTC

110 points

68 comments13 min readLW link

[Link] Sarah Constantin: “Why I am Not An AI Doomer”

lbThingrb12 Apr 2023 1:52 UTC

61 points

13 comments1 min readLW link

(sarahconstantin.substack.com)

Wizards and prophets of AI [draft for comment]

jasoncrawford31 Mar 2023 20:22 UTC

16 points

11 comments6 min readLW link

I Think Eliezer Should Go on Glenn Beck

Lao Mein30 Jun 2023 3:12 UTC

29 points

24 comments1 min readLW link

Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic

Orpheus1620 Dec 2022 21:39 UTC

18 points

2 comments11 min readLW link

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis15 Dec 2023 20:16 UTC

132 points

157 comments8 min readLW link 1 review

How evals might (or might not) prevent catastrophic risks from AI

Orpheus167 Feb 2023 20:16 UTC

45 points

0 comments9 min readLW link

Stanford Encyclopedia of Philosophy on AI ethics and superintelligence

Kaj_Sotala2 May 2020 7:35 UTC

43 points

19 comments7 min readLW link

(plato.stanford.edu)

Minimum Viable Exterminator

Richard Horvath29 May 2023 16:32 UTC

14 points

5 comments5 min readLW link

Work on Security Instead of Friendliness?

Wei Dai21 Jul 2012 18:28 UTC

71 points

107 comments2 min readLW link

[Linkpost] Existential Risk Analysis in Empirical Research Papers

Dan H2 Jul 2022 0:09 UTC

40 points

0 comments1 min readLW link

(arxiv.org)

Epistemic Strategies of Safety-Capabilities Tradeoffs

adamShimi22 Oct 2021 8:22 UTC

5 points

0 comments6 min readLW link

Will Artificial Superintelligence Kill Us?

James_Miller23 May 2023 16:27 UTC

33 points

2 comments22 min readLW link

Poster Session on AI Safety

Neil Crawford12 Nov 2022 3:50 UTC

7 points

8 comments4 min readLW link

Interview with Eliezer Yudkowsky on Rationality and Systematic Misunderstanding of AI Alignment

Liron15 Sep 2025 18:35 UTC

89 points

21 comments93 min readLW link

(www.youtube.com)

Detachment vs attachment [AI risk and mental health]

Neil 15 Jan 2024 0:41 UTC

15 points

4 comments3 min readLW link

Against “argument from overhang risk”

RobertM16 May 2024 4:44 UTC

31 points

11 comments5 min readLW link

Bankless Podcast: 159 - We’re All Gonna Die with Eliezer Yudkowsky

bayesed20 Feb 2023 16:42 UTC

83 points

54 comments1 min readLW link

(www.youtube.com)

The Wizard of Oz Problem: How incentives and narratives can skew our perception of AI developments

Orpheus1620 Mar 2023 20:44 UTC

16 points

3 comments6 min readLW link

Twitter thread on AI takeover scenarios

Richard_Ngo31 Jul 2024 0:24 UTC

37 points

0 comments2 min readLW link

(x.com)

The AI Revolution in Biology

Roman Leventov26 May 2024 9:30 UTC

13 points

0 comments1 min readLW link

(www.cognitiverevolution.ai)

AISN #25: White House Executive Order on AI, UK AI Safety Summit, and Progress on Voluntary Evaluations of AI Risks

Dan H31 Oct 2023 19:34 UTC

35 points

1 comment6 min readLW link

(newsletter.safe.ai)

[Question] Would (myopic) general public good producers significantly accelerate the development of AGI?

mako yass2 Mar 2022 23:47 UTC

25 points

10 comments1 min readLW link

AXRP Episode 13 - First Principles of AGI Safety with Richard Ngo

DanielFilan31 Mar 2022 5:20 UTC

25 points

1 comment48 min readLW link

ChatGPT (and now GPT4) is very easily distracted from its rules

dmcs15 Mar 2023 17:55 UTC

180 points

42 comments1 min readLW link

The basic reasons I expect AGI ruin

Rob Bensinger18 Apr 2023 3:37 UTC

189 points

73 comments14 min readLW link

Google’s Ethical AI team and AI Safety

magfrump20 Feb 2021 9:42 UTC

12 points

16 comments7 min readLW link

Mechanism Design for AI Safety—Reading Group Curriculum

Rubi J. Hudson25 Oct 2022 3:54 UTC

15 points

3 comments4 min readLW link

Request: stop advancing AI capabilities

So8res26 May 2023 17:42 UTC

154 points

24 comments1 min readLW link

A Case for the Least Forgiving Take On Alignment

Thane Ruthenis2 May 2023 21:34 UTC

100 points

85 comments22 min readLW link

My disagreements with “AGI ruin: A List of Lethalities”

Noosphere8915 Sep 2024 17:22 UTC

36 points

46 comments18 min readLW link

My current uncertainties regarding AI, alignment, and the end of the world

dominicq14 Nov 2021 14:08 UTC

2 points

3 comments2 min readLW link

[Question] First and Last Questions for GPT-5*

Mitchell_Porter24 Nov 2023 5:03 UTC

20 points

5 comments1 min readLW link

Epistemic Strategies of Selection Theorems

adamShimi18 Oct 2021 8:57 UTC

33 points

1 comment12 min readLW link

What if we approach AI safety like a technical engineering safety problem

zeshen20 Aug 2022 10:29 UTC

36 points

4 comments7 min readLW link

The strategy-stealing assumption

paulfchristiano16 Sep 2019 15:23 UTC

87 points

54 comments12 min readLW link 3 reviews

Complex Systems for AI Safety [Pragmatic AI Safety #3]

Dan H and TW123

24 May 2022 0:00 UTC

58 points

3 comments21 min readLW link

Contra “Strong Coherence”

DragonGod4 Mar 2023 20:05 UTC

39 points

24 comments1 min readLW link

More Is Different for AI

jsteinhardt4 Jan 2022 19:30 UTC

140 points

24 comments3 min readLW link 1 review

(bounded-regret.ghost.io)

My research agenda in agent foundations

Alex_Altair28 Jun 2023 18:00 UTC

76 points

9 comments11 min readLW link

Catastrophic Risks from AI #3: AI Race

Dan H, Mantas Mazeika and TW123

23 Jun 2023 19:21 UTC

18 points

9 comments29 min readLW link

(arxiv.org)

Non-Adversarial Goodhart and AI Risks

Davidmanheim27 Mar 2018 1:39 UTC

22 points

11 comments6 min readLW link

Bayeswatch 7: Wildfire

lsusr8 Sep 2021 5:35 UTC

52 points

6 comments3 min readLW link

Winners of AI Alignment Awards Research Contest

Orpheus16 and Olive Branch

13 Jul 2023 16:14 UTC

115 points

4 comments12 min readLW link

(alignmentawards.com)

Quote quiz: “drifting into dependence”

jasoncrawford27 Apr 2023 15:13 UTC

7 points

6 comments1 min readLW link

(rootsofprogress.org)

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Rohin Shah2 Jan 2020 18:20 UTC

36 points

95 comments10 min readLW link

(mailchi.mp)

Brainstorming additional AI risk reduction ideas

John_Maxwell14 Jun 2012 7:55 UTC

19 points

37 comments1 min readLW link

[Question] What are good alignment conference papers?

adamShimi28 Aug 2021 13:35 UTC

12 points

2 comments1 min readLW link

Six AI Risk/Strategy Ideas

Wei Dai27 Aug 2019 0:40 UTC

73 points

17 comments4 min readLW link 1 review

AISN #35: Lobbying on AI Regulation Plus, New Models from OpenAI and Google, and Legal Regimes for Training on Copyrighted Data

Dan H and Corin Katzke

16 May 2024 14:29 UTC

2 points

3 comments6 min readLW link

(newsletter.safe.ai)

Comparing Four Approaches to Inner Alignment

Lucas Teixeira29 Jul 2022 21:06 UTC

38 points

1 comment9 min readLW link

Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem

Zack_M_Davis17 Sep 2020 2:23 UTC

72 points

12 comments5 min readLW link

(aima.cs.berkeley.edu)

[Question] Does the Structure of an algorithm matter for AI Risk and/or consciousness?

Logan Zoellner3 Dec 2021 18:31 UTC

7 points

4 comments1 min readLW link

“If we go extinct due to misaligned AI, at least nature will continue, right? … right?”

plex18 May 2024 14:09 UTC

54 points

23 comments2 min readLW link

(aisafety.info)

Introducing AI Lab Watch

Zach Stein-Perlman30 Apr 2024 17:00 UTC

225 points

30 comments1 min readLW link

(ailabwatch.org)

Catastrophic Risks from AI #2: Malicious Use

Dan H, Mantas Mazeika and TW123

22 Jun 2023 17:10 UTC

38 points

1 comment17 min readLW link

(arxiv.org)

Rogue AGI Embodies Valuable Intellectual Property

Mark Xu and CarlShulman

3 Jun 2021 20:37 UTC

71 points

9 comments3 min readLW link

A shortcoming of concrete demonstrations as AGI risk advocacy

Steven Byrnes11 Dec 2024 16:48 UTC

106 points

27 comments2 min readLW link

Robustness to Scaling Down: More Important Than I Thought

adamShimi23 Jul 2022 11:40 UTC

38 points

5 comments3 min readLW link

Projects I would like to see (possibly at AI Safety Camp)

Linda Linsefors27 Sep 2023 21:27 UTC

22 points

12 comments4 min readLW link

List your AI X-Risk cruxes!

Aryeh Englander28 Apr 2024 18:26 UTC

42 points

7 comments2 min readLW link

AI companies aren’t really using external evaluators

Zach Stein-Perlman24 May 2024 16:01 UTC

242 points

15 comments4 min readLW link

[Question] Did AI pioneers not worry much about AI risks?

lisperati9 Feb 2020 19:58 UTC

42 points

9 comments1 min readLW link

The Plan - 2023 Version

johnswentworth29 Dec 2023 23:34 UTC

152 points

40 comments31 min readLW link 1 review

Applications for AI Safety Camp 2022 Now Open!

adamShimi17 Nov 2021 21:42 UTC

47 points

3 comments1 min readLW link

Apply to lead a project during the next virtual AI Safety Camp

Linda Linsefors and Remmelt

13 Sep 2023 13:29 UTC

19 points

0 comments5 min readLW link

(aisafety.camp)

Distilled—AGI Safety from First Principles

Harrison G29 May 2022 0:57 UTC

11 points

1 comment14 min readLW link

Wentworth and Larsen on buying time

Orpheus16, Thomas Larsen and johnswentworth

9 Jan 2023 21:31 UTC

74 points

6 comments12 min readLW link

Pausing AI is Positive Expected Value

Liron10 Mar 2024 17:10 UTC

9 points

2 comments3 min readLW link

(twitter.com)

A Rocket–Interpretability Analogy

plex21 Oct 2024 13:55 UTC

155 points

31 comments1 min readLW link

The Mask Comes Off: A Trio of Tales

Zvi14 Feb 2025 15:30 UTC

81 points

1 comment13 min readLW link

(thezvi.wordpress.com)

Gradual Disempowerment, Shell Games and Flinches

Jan_Kulveit2 Feb 2025 14:47 UTC

133 points

36 comments6 min readLW link

AI Safety “Success Stories”

Wei Dai7 Sep 2019 2:54 UTC

128 points

27 comments4 min readLW link 1 review

[Linkpost] TIME article: DeepMind’s CEO Helped Take AI Mainstream. Now He’s Urging Caution

Orpheus1621 Jan 2023 16:51 UTC

58 points

2 comments3 min readLW link

(time.com)

Eliezer and I wrote a book: If Anyone Builds It, Everyone Dies

So8res14 May 2025 19:00 UTC

648 points

114 comments2 min readLW link

Most People Don’t Realize We Have No Idea How Our AIs Work

Thane Ruthenis21 Dec 2023 20:02 UTC

159 points

42 comments1 min readLW link

New survey: 46% of Americans are concerned about extinction from AI; 69% support a six-month pause in AI development

Orpheus165 Apr 2023 1:26 UTC

47 points

9 comments1 min readLW link

(today.yougov.com)

Making alignment a law of the universe

Richard Juggins25 Feb 2025 10:44 UTC

0 points

3 comments15 min readLW link

Mistral Large 2 (123B) seems to exhibit alignment faking

Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Cameron Berg, Judd Rosenblatt, Mike Vaiana and Trent Hodgeson

27 Mar 2025 15:39 UTC

81 points

4 comments13 min readLW link

[Question] Why not constrain wetlabs instead of AI?

Lone Pine21 Mar 2023 18:02 UTC

15 points

10 comments1 min readLW link

A Quick Guide to Confronting Doom

Ruby13 Apr 2022 19:30 UTC

245 points

33 comments2 min readLW link

AI Safety Newsletter #1 [CAIS Linkpost]

Orpheus16, Dan H and ozhang

10 Apr 2023 20:18 UTC

45 points

0 comments4 min readLW link

(newsletter.safe.ai)

Results from the language model hackathon

Esben Kran10 Oct 2022 8:29 UTC

22 points

1 comment4 min readLW link

ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so

Christopher King15 Mar 2023 0:29 UTC

116 points

22 comments2 min readLW link

A moral backlash against AI will probably slow down AGI development

geoffreymiller7 Jun 2023 20:39 UTC

51 points

10 comments14 min readLW link

n=3 AI Risk Quick Math and Reasoning

lionhearted (Sebastian Marshall)7 Apr 2023 20:27 UTC

6 points

3 comments4 min readLW link

Catastrophic Risks from AI #4: Organizational Risks

Dan H, Mantas Mazeika and TW123

26 Jun 2023 19:36 UTC

23 points

0 comments21 min readLW link

(arxiv.org)

Against a General Factor of Doom

Jeffrey Heninger23 Nov 2022 16:50 UTC

63 points

19 comments4 min readLW link 1 review

(aiimpacts.org)

April drafts

AI Impacts1 Apr 2021 18:10 UTC

49 points

2 comments1 min readLW link

(aiimpacts.org)

AI 2027 Thoughts

PeterMcCluskey26 Apr 2025 0:00 UTC

29 points

2 comments6 min readLW link

(bayesianinvestor.com)

AI Safety Seems Hard to Measure

HoldenKarnofsky8 Dec 2022 19:50 UTC

71 points

6 comments14 min readLW link

(www.cold-takes.com)

Confused why a “capabilities research is good for alignment progress” position isn’t discussed more

Kaj_Sotala2 Jun 2022 21:41 UTC

131 points

27 comments4 min readLW link

Interview with Skynet

lsusr30 Sep 2021 2:20 UTC

49 points

1 comment2 min readLW link

Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025)

Linda Linsefors, Remmelt Ellen and Robert Kralisch

23 Aug 2024 14:18 UTC

17 points

2 comments4 min readLW link

My motivation and theory of change for working in AI healthtech

Andrew_Critch12 Oct 2024 0:36 UTC

180 points

39 comments14 min readLW link

Contra Hanson on AI Risk

Liron4 Mar 2023 8:02 UTC

36 points

23 comments8 min readLW link

[Question] Should people build productizations of open source AI models?

lc2 Nov 2023 1:26 UTC

23 points

0 comments1 min readLW link

The Dissolution of AI Safety

Roko12 Dec 2024 10:34 UTC

8 points

44 comments1 min readLW link

(www.transhumanaxiology.com)

AI Safety Newsletter #5: Geoffrey Hinton speaks out on AI risk, the White House meets with AI labs, and Trojan attacks on language models

Dan H and Orpheus16

9 May 2023 15:26 UTC

28 points

1 comment4 min readLW link

(newsletter.safe.ai)

AI safety tax dynamics

owencb23 Oct 2024 12:18 UTC

22 points

0 comments6 min readLW link

(strangecities.substack.com)

Clarifying “What failure looks like”

Sam Clarke20 Sep 2020 20:40 UTC

97 points

14 comments17 min readLW link

Why do misalignment risks increase as AIs get more capable?

ryan_greenblatt11 Apr 2025 3:06 UTC

33 points

6 comments3 min readLW link

AI Safety Memes Wiki

plex and Vishakha

24 Jul 2024 18:53 UTC

37 points

2 comments1 min readLW link

(aisafety.info)

Chad Jones paper modeling AI and x-risk vs. growth

jasoncrawford26 Apr 2023 20:07 UTC

39 points

7 comments2 min readLW link

(web.stanford.edu)

What Failure Looks Like: Distilling the Discussion

Ben Pace29 Jul 2020 21:49 UTC

82 points

14 comments7 min readLW link

Synthesizing Standalone World-Models (+ Bounties, Seeking Funding)

Thane Ruthenis22 Sep 2025 19:06 UTC

64 points

22 comments11 min readLW link

A case for AI alignment being difficult

jessicata31 Dec 2023 19:55 UTC

106 points

59 comments15 min readLW link 1 review

(unstableontology.com)

The first future and the best future

KatjaGrace25 Apr 2024 6:40 UTC

106 points

12 comments1 min readLW link

(worldspiritsockpuppet.com)

Disentangling arguments for the importance of AI safety

Richard_Ngo21 Jan 2019 12:41 UTC

133 points

23 comments8 min readLW link

Do not delete your misaligned AGI.

mako yass24 Mar 2024 21:37 UTC

63 points

13 comments3 min readLW link

“warning about ai doom” is also “announcing capabilities progress to noobs”

the gears to ascension8 Apr 2023 23:42 UTC

23 points

5 comments3 min readLW link

Reducing LLM deception at scale with self-other overlap fine-tuning

Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Cameron Berg, Mike Vaiana and Trent Hodgeson

13 Mar 2025 19:09 UTC

162 points

46 comments6 min readLW link

[Question] Good taxonomies of all risks (small or large) from AI?

Aryeh Englander5 Mar 2024 18:15 UTC

6 points

1 comment1 min readLW link

Will working here advance AGI? Help us not destroy the world!

Yonatan Cale29 May 2022 11:42 UTC

30 points

46 comments1 min readLW link

Eight Short Studies On Excuses

Scott Alexander20 Apr 2010 23:01 UTC

872 points

254 comments10 min readLW link

4 ways to think about democratizing AI [GovAI Linkpost]

Orpheus1613 Feb 2023 18:06 UTC

24 points

4 comments1 min readLW link

(www.governance.ai)

No One-Size-Fit-All Epistemic Strategy

adamShimi20 Aug 2022 12:56 UTC

24 points

2 comments2 min readLW link

[Question] What does it look like for AI to significantly improve human coordination, before superintelligence?

Bird Concept15 Jan 2024 19:22 UTC

22 points

2 comments1 min readLW link

Paradigm-building from first principles: Effective altruism, AGI, and alignment

Cameron Berg8 Feb 2022 16:12 UTC

29 points

5 comments14 min readLW link

Talk to me about your summer/career plans

Orpheus1631 Jan 2023 18:29 UTC

31 points

3 comments2 min readLW link

Announcing Human-aligned AI Summer School

Jan_Kulveit and Tomáš Gavenčiak

22 May 2024 8:55 UTC

51 points

0 comments1 min readLW link

(humanaligned.ai)

Our Reality: A Simulation Run by a Paperclip Maximizer

James_Miller and avturchin

27 Apr 2025 16:17 UTC

21 points

65 comments5 min readLW link

[Question] Why are we sure that AI will “want” something?

Shmi16 Sep 2022 20:35 UTC

31 points

57 comments1 min readLW link

AI Safety is Dropping the Ball on Clown Attacks

trevor22 Oct 2023 20:09 UTC

74 points

83 comments34 min readLW link

Framing approaches to alignment and the hard problem of AI cognition

ryan_greenblatt15 Dec 2021 19:06 UTC

16 points

15 comments27 min readLW link

Talking publicly about AI risk

Jan_Kulveit21 Apr 2023 11:28 UTC

180 points

9 comments6 min readLW link

[Question] How much do personal biases in risk assessment affect assessment of AI risks?

Gordon Seidoh Worley3 May 2023 6:12 UTC

10 points

8 comments1 min readLW link

AI Doom Is Not (Only) Disjunctive

NickGabs30 Mar 2023 1:42 UTC

12 points

0 comments5 min readLW link

DeepMind and Google Brain are merging [Linkpost]

Orpheus1620 Apr 2023 18:47 UTC

55 points

5 comments1 min readLW link

(www.deepmind.com)

Oversight Misses 100% of Thoughts The AI Does Not Think

johnswentworth12 Aug 2022 16:30 UTC

113 points

49 comments1 min readLW link

On A List of Lethalities

Zvi13 Jun 2022 12:30 UTC

165 points

50 comments54 min readLW link 1 review

(thezvi.wordpress.com)

25 Min Talk on MetaEthical.AI with Questions from Stuart Armstrong

June Ku29 Apr 2021 15:38 UTC

21 points

7 comments1 min readLW link

A guide to Iterated Amplification & Debate

Rafael Harth15 Nov 2020 17:14 UTC

75 points

12 comments15 min readLW link

Predictable updating about AI risk

Joe Carlsmith8 May 2023 21:53 UTC

295 points

25 comments36 min readLW link 1 review

All AGI Safety questions welcome (especially basic ones) [May 2023]

steven04618 May 2023 22:30 UTC

33 points

44 comments2 min readLW link

AI #76: Six Shorts Stories About OpenAI

Zvi8 Aug 2024 13:50 UTC

53 points

10 comments48 min readLW link

(thezvi.wordpress.com)

[Talk transcript] What “structure” is and why it matters

Alex_Altair25 Jul 2024 15:49 UTC

23 points

0 comments5 min readLW link

(www.youtube.com)

Paradigm-building: Introduction

Cameron Berg8 Feb 2022 0:06 UTC

28 points

0 comments2 min readLW link

Are we there yet?

theflowerpot20 Jun 2022 11:19 UTC

2 points

2 comments1 min readLW link

Taboo P(doom)

NathanBarnard3 Feb 2023 10:37 UTC

14 points

10 comments1 min readLW link

Clarifying some key hypotheses in AI alignment

Ben Cottier and Rohin Shah

15 Aug 2019 21:29 UTC

79 points

12 comments9 min readLW link

Rejecting Violence as an AI Safety Strategy

James_Miller22 Sep 2025 16:34 UTC

57 points

5 comments3 min readLW link

The Alignment Problem

lsusr11 Jul 2022 3:03 UTC

47 points

18 comments3 min readLW link

Relaxed adversarial training for inner alignment

evhub10 Sep 2019 23:03 UTC

69 points

27 comments27 min readLW link

Why I don’t believe in doom

mukashi7 Jun 2022 23:49 UTC

6 points

30 comments4 min readLW link

The Main Sources of AI Risk?

Daniel Kokotajlo and Wei Dai

21 Mar 2019 18:28 UTC

128 points

26 comments2 min readLW link

A (EtA: quick) note on terminology: AI Alignment != AI x-safety

David Scott Krueger (formerly: capybaralet)8 Feb 2023 22:33 UTC

46 points

20 comments1 min readLW link

Refine’s Third Blog Post Day/Week

adamShimi17 Sep 2022 17:03 UTC

18 points

0 comments1 min readLW link

If you wish to make an apple pie, you must first become dictator of the universe

jasoncrawford5 Jul 2023 18:14 UTC

27 points

9 comments13 min readLW link

(rootsofprogress.org)

Three Stories for How AGI Comes Before FAI

John_Maxwell17 Sep 2019 23:26 UTC

27 points

5 comments6 min readLW link

Worrisome misunderstanding of the core issues with AI transition

Roman Leventov18 Jan 2024 10:05 UTC

5 points

2 comments4 min readLW link

Using Brain-Computer Interfaces to get more data for AI alignment

Robbo7 Nov 2021 0:00 UTC

43 points

10 comments7 min readLW link

Drug addicts and deceptively aligned agents—a comparative analysis

Jan5 Nov 2021 21:42 UTC

42 points

2 comments12 min readLW link

(universalprior.substack.com)

A shift in arguments for AI risk

Richard_Ngo28 May 2019 13:47 UTC

32 points

7 comments1 min readLW link

(fragile-credences.github.io)

Activation additions in a simple MNIST network

Garrett Baker18 May 2023 2:49 UTC

26 points

0 comments2 min readLW link

Catastrophic Risks from AI #5: Rogue AIs

Dan H, Mantas Mazeika and TW123

27 Jun 2023 22:06 UTC

15 points

0 comments22 min readLW link

(arxiv.org)

We need (a lot) more rogue agent honeypots

Ozyrus23 Mar 2025 22:24 UTC

37 points

12 comments4 min readLW link

Douglas Hofstadter changes his mind on Deep Learning & AI risk (June 2023)?

gwern3 Jul 2023 0:48 UTC

428 points

54 comments7 min readLW link

(www.youtube.com)

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Andrew_Critch14 Jun 2024 0:16 UTC

357 points

38 comments4 min readLW link

My guess at Conjecture’s vision: triggering a narrative bifurcation

Alexandre Variengien6 Feb 2024 19:10 UTC

75 points

12 comments16 min readLW link

Many AI governance proposals have a tradeoff between usefulness and feasibility

Orpheus16 and Carson Ezell

3 Feb 2023 18:49 UTC

22 points

2 comments2 min readLW link

AI Alignment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC

126 points

6 comments35 min readLW link

Thoughts on refusing harmful requests to large language models

William_S19 Jan 2023 19:49 UTC

32 points

4 comments2 min readLW link

MIRI’s April 2024 Newsletter

Harlan12 Apr 2024 23:38 UTC

95 points

0 comments3 min readLW link

(intelligence.org)

Why AI Safety is Hard

Simon Möller22 Mar 2023 10:44 UTC

1 point

0 comments6 min readLW link

Using blinders to help you see things for what they are

Adam Zerner11 Nov 2021 7:07 UTC

13 points

2 comments2 min readLW link

Behavioral Sufficient Statistics for Goal-Directedness

adamShimi11 Mar 2021 15:01 UTC

21 points

12 comments9 min readLW link

AI Safety Info Distillation Fellowship

Robert Miles and mwatkins

17 Feb 2023 16:16 UTC

47 points

3 comments3 min readLW link

AI #2

Zvi2 Mar 2023 14:50 UTC

66 points

18 comments55 min readLW link

(thezvi.wordpress.com)

The Inner Alignment Problem

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

4 Jun 2019 1:20 UTC

105 points

17 comments13 min readLW link

It Looks Like You’re Trying To Take Over The World

gwern9 Mar 2022 16:35 UTC

410 points

120 comments1 min readLW link 1 review

(www.gwern.net)

AI #1: Sydney and Bing

Zvi21 Feb 2023 14:00 UTC

171 points

45 comments61 min readLW link 1 review

(thezvi.wordpress.com)

AI Fire Alarm Scenarios

PeterMcCluskey28 Dec 2021 2:20 UTC

10 points

1 comment6 min readLW link

(www.bayesianinvestor.com)

RA Bounty: Looking for feedback on screenplay about AI Risk

Writer26 Oct 2023 13:23 UTC

32 points

6 comments1 min readLW link

AI Safety Microgrant Round

Chris_Leong14 Nov 2022 4:25 UTC

22 points

1 comment3 min readLW link

Financial Times: We must slow down the race to God-like AI

trevor13 Apr 2023 19:55 UTC

113 points

17 comments16 min readLW link

(www.ft.com)

An unaligned benchmark

paulfchristiano17 Nov 2018 15:51 UTC

31 points

0 comments9 min readLW link

All AGI Safety questions welcome (especially basic ones) [~monthly thread]

Robert Miles1 Nov 2022 23:23 UTC

68 points

106 comments2 min readLW link

Deceptive Alignment

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

5 Jun 2019 20:16 UTC

118 points

20 comments17 min readLW link

We’re Not Ready: thoughts on “pausing” and responsible scaling policies

HoldenKarnofsky27 Oct 2023 15:19 UTC

200 points

33 comments8 min readLW link

AI Safety Newsletter #6: Examples of AI safety progress, Yoshua Bengio proposes a ban on AI agents, and lessons from nuclear arms control

Dan H and Orpheus16

16 May 2023 15:14 UTC

31 points

0 comments6 min readLW link

(newsletter.safe.ai)

Questions about Conjecure’s CoEm proposal

Orpheus16 and NicholasKees

9 Mar 2023 19:32 UTC

51 points

4 comments2 min readLW link

My Overview of the AI Alignment Landscape: Threat Models

Neel Nanda25 Dec 2021 23:07 UTC

53 points

3 comments28 min readLW link

Complex Systems are Hard to Control

jsteinhardt4 Apr 2023 0:00 UTC

42 points

5 comments10 min readLW link

(bounded-regret.ghost.io)

Review of “Fun with +12 OOMs of Compute”

adamShimi, Joe Collman and Gyrodiot

28 Mar 2021 14:55 UTC

65 points

21 comments8 min readLW link 1 review

Why I’m Worried About AI

peterbarnett23 May 2022 21:13 UTC

22 points

2 comments12 min readLW link

Four lenses on AI risks

jasoncrawford28 Mar 2023 21:52 UTC

23 points

5 comments3 min readLW link

(rootsofprogress.org)

A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans

Thane Ruthenis17 Dec 2023 20:28 UTC

29 points

7 comments11 min readLW link

Catastrophe through Chaos

Marius Hobbhahn31 Jan 2025 14:19 UTC

187 points

17 comments12 min readLW link

Announcement: AI alignment prize winners and next round

cousin_it15 Jan 2018 14:33 UTC

81 points

68 comments2 min readLW link

The Preference Fulfillment Hypothesis

Kaj_Sotala26 Feb 2023 10:55 UTC

66 points

63 comments11 min readLW link

A Narrow Path: a plan to deal with AI extinction risk

Andrea_Miotti, davekasten and Tolga

7 Oct 2024 13:02 UTC

74 points

12 comments2 min readLW link

(www.narrowpath.co)

Drexler on AI Risk

PeterMcCluskey1 Feb 2019 5:11 UTC

35 points

10 comments9 min readLW link

(www.bayesianinvestor.com)

Announcing AISIC 2022 - the AI Safety Israel Conference, October 19-20

Davidmanheim21 Sep 2022 19:32 UTC

13 points

0 comments1 min readLW link

Learning societal values from law as part of an AGI alignment strategy

John Nay21 Oct 2022 2:03 UTC

5 points

18 comments54 min readLW link

Consider the humble rock (or: why the dumb thing kills you)

pleiotroth4 Jul 2024 13:54 UTC

62 points

11 comments4 min readLW link

[Question] Has there been any work on attempting to use Pascal’s Mugging to make an AGI behave?

Chris_Leong15 Jun 2022 8:33 UTC

7 points

17 comments1 min readLW link

Hands-On Experience Is Not Magic

Thane Ruthenis27 May 2023 16:57 UTC

22 points

14 comments5 min readLW link

Why don’t singularitarians bet on the creation of AGI by buying stocks?

John_Maxwell11 Mar 2020 16:27 UTC

43 points

21 comments4 min readLW link

“Why can’t you just turn it off?”

Roko19 Nov 2023 14:46 UTC

48 points

25 comments1 min readLW link

How I Formed My Own Views About AI Safety

Neel Nanda27 Feb 2022 18:50 UTC

66 points

6 comments13 min readLW link

(www.neelnanda.io)

AI risk hub in Singapore?

Daniel Kokotajlo29 Oct 2020 11:45 UTC

59 points

19 comments4 min readLW link

Another plausible scenario of AI risk: AI builds military infrastructure while collaborating with humans, defects later.

avturchin10 Jun 2022 17:24 UTC

10 points

2 comments1 min readLW link

OpenAI: Fallout

Zvi28 May 2024 13:20 UTC

204 points

25 comments36 min readLW link

(thezvi.wordpress.com)

Adam Smith Meets AI Doomers

James_Miller31 Jan 2024 15:53 UTC

35 points

10 comments5 min readLW link

All images from the WaitButWhy sequence on AI

trevor8 Apr 2023 7:36 UTC

73 points

5 comments2 min readLW link

Information warfare historically revolved around human conduits

trevor28 Aug 2023 18:54 UTC

37 points

7 comments3 min readLW link

deleted

funnyfranco15 Mar 2025 6:08 UTC

8 points

0 comments1 min readLW link

Will GPT-5 be able to self-improve?

Nathan Helm-Burger29 Apr 2023 17:34 UTC

18 points

22 comments3 min readLW link

Risks from Learned Optimization: Conclusion and Related Work

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

7 Jun 2019 19:53 UTC

82 points

5 comments6 min readLW link

Shapes of Mind and Pluralism in Alignment

adamShimi13 Aug 2022 10:01 UTC

33 points

2 comments2 min readLW link

Systems that cannot be unsafe cannot be safe

Davidmanheim2 May 2023 8:53 UTC

62 points

27 comments2 min readLW link

What I Learned Running Refine

adamShimi24 Nov 2022 14:49 UTC

108 points

5 comments4 min readLW link

Are we dropping the ball on Recommendation AIs?

Charbel-Raphaël23 Oct 2024 17:48 UTC

49 points

17 comments6 min readLW link

How Josiah became an AI safety researcher

Neil Crawford6 Sep 2022 17:17 UTC

4 points

0 comments1 min readLW link

Conditions for Mesa-Optimization

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

1 Jun 2019 20:52 UTC

86 points

48 comments12 min readLW link

Kessler’s Second Syndrome

Jesse Hoogland26 Jan 2025 7:04 UTC

70 points

2 comments3 min readLW link

Staged release

Zach Stein-Perlman17 Apr 2024 16:00 UTC

11 points

4 comments2 min readLW link

Gradient Descent on the Human Brain

Jozdien and gaspode

1 Apr 2024 22:39 UTC

59 points

5 comments2 min readLW link

Response to Oren Etzioni’s “How to know if artificial intelligence is about to destroy civilization”

Daniel Kokotajlo27 Feb 2020 18:10 UTC

27 points

5 comments8 min readLW link

Shortest damn doomsplainer in world history

lemonhope20 May 2025 9:07 UTC

6 points

5 comments1 min readLW link

[Book Review] “The Alignment Problem” by Brian Christian

lsusr20 Sep 2021 6:36 UTC

72 points

16 comments6 min readLW link

AISafety.com – Resources for AI Safety

Søren Elverlin, plex, Bryce Robertson and Melissa Samworth

17 May 2024 15:57 UTC

83 points

3 comments1 min readLW link

Microdooms averted by working on AI Safety

Nikola Jurkovic17 Sep 2023 21:46 UTC

34 points

3 comments3 min readLW link

(forum.effectivealtruism.org)

The case for removing alignment and ML research from the training dataset

beren30 May 2023 20:54 UTC

50 points

8 comments5 min readLW link

[Question] Is this a Pivotal Weak Act? Creating bacteria that decompose metal

doomyeser11 Sep 2024 18:07 UTC

9 points

9 comments3 min readLW link

AISN #24: Kissinger Urges US-China Cooperation on AI, China’s New AI Law, US Export Controls, International Institutions, and Open Source AI

Dan H and Corin Katzke

18 Oct 2023 17:06 UTC

14 points

0 comments6 min readLW link

(newsletter.safe.ai)

AI Safety Newsletter #7: Disinformation, Governance Recommendations for AI labs, and Senate Hearings on AI

Dan H and Orpheus16

23 May 2023 21:47 UTC

25 points

0 comments6 min readLW link

(newsletter.safe.ai)

All AGI safety questions welcome (especially basic ones) [Sept 2022]

plex8 Sep 2022 11:56 UTC

22 points

48 comments3 min readLW link

Theory of Change for AI Safety Camp

Linda Linsefors22 Jan 2025 22:07 UTC

36 points

3 comments7 min readLW link

Catastrophic Risks from AI #1: Introduction

Dan H, Mantas Mazeika and TW123

22 Jun 2023 17:09 UTC

40 points

1 comment5 min readLW link

(arxiv.org)

Critiquing “What failure looks like”

Grue_Slinky27 Dec 2019 23:59 UTC

35 points

6 comments3 min readLW link

Continuity Assumptions

Jan_Kulveit13 Jun 2022 21:31 UTC

44 points

13 comments4 min readLW link

[Question] Why don’t quantilizers also cut off the upper end of the distribution?

Alex_Altair15 May 2023 1:40 UTC

25 points

2 comments1 min readLW link

But exactly how complex and fragile?

KatjaGrace3 Nov 2019 18:20 UTC

87 points

32 comments3 min readLW link 1 review

(meteuphoric.com)

A Simple Explanation of AGI Risk

TurnTrout1 Jul 2025 16:18 UTC

66 points

4 comments5 min readLW link

(turntrout.com)

Less Realistic Tales of Doom

Mark Xu6 May 2021 23:01 UTC

113 points

13 comments4 min readLW link

Alignment Risk Doesn’t Require Superintelligence

JustisMills15 Jun 2022 3:12 UTC

35 points

4 comments2 min readLW link

Existing Safety Frameworks Imply Unreasonable Confidence

Joe Rogero, yams and Joe Collman

10 Apr 2025 16:31 UTC

46 points

3 comments15 min readLW link

(intelligence.org)

Incentives and Selection: A Missing Frame From AI Threat Discussions?

DragonGod26 Feb 2023 1:18 UTC

11 points

16 comments2 min readLW link

Approaches to gradient hacking

adamShimi14 Aug 2021 15:16 UTC

16 points

8 comments8 min readLW link

Ngo and Yudkowsky on alignment difficulty

Eliezer Yudkowsky and Richard_Ngo

15 Nov 2021 20:31 UTC

259 points

152 comments99 min readLW link 1 review

Alex Turner’s Research, Comprehensive Information Gathering

adamShimi23 Jun 2021 9:44 UTC

15 points

3 comments3 min readLW link

The Paris AI Anti-Safety Summit

Zvi12 Feb 2025 14:00 UTC

129 points

21 comments21 min readLW link

(thezvi.wordpress.com)

How difficult is AI Alignment?

Sammy Martin13 Sep 2024 15:47 UTC

44 points

6 comments23 min readLW link

The Overton Window widens: Examples of AI risk in the media

Orpheus1623 Mar 2023 17:10 UTC

107 points

24 comments6 min readLW link

Thoughts on AGI safety from the top

jylin042 Feb 2022 20:06 UTC

36 points

3 comments32 min readLW link

AI alignment as a translation problem

Roman Leventov5 Feb 2024 14:14 UTC

22 points

2 comments3 min readLW link

AISN #23: New OpenAI Models, News from Anthropic, and Representation Engineering

Dan H4 Oct 2023 17:37 UTC

15 points

2 comments5 min readLW link

(newsletter.safe.ai)

The AI Safety Game (UPDATED)

Daniel Kokotajlo5 Dec 2020 10:27 UTC

45 points

10 comments3 min readLW link

Thoughts on ‘List of Lethalities’

Alex Lawsen 17 Aug 2022 18:33 UTC

27 points

0 comments10 min readLW link

Against most, but not all, AI risk analogies

Matthew Barnett14 Jan 2024 3:36 UTC

63 points

41 comments7 min readLW link

Some abstract, non-technical reasons to be non-maximally-pessimistic about AI alignment

Rob Bensinger12 Dec 2021 2:08 UTC

70 points

35 comments7 min readLW link

Paper: Tell, Don’t Show- Declarative facts influence how LLMs generalize

Owain_Evans and AlexMeinke

19 Dec 2023 19:14 UTC

45 points

4 comments6 min readLW link

(arxiv.org)

Gaia Network: a practical, incremental pathway to Open Agency Architecture

Roman Leventov and Rafael Kaufmann Nedal

20 Dec 2023 17:11 UTC

22 points

8 comments16 min readLW link

Permanent Disempowerment is the Baseline

Vladimir_Nesov4 Aug 2025 17:43 UTC

75 points

23 comments6 min readLW link

Evil autocomplete: Existential Risk and Next-Token Predictors

Yitz28 Feb 2023 8:47 UTC

9 points

3 comments5 min readLW link

Let’s build a fire alarm for AGI

chaosmage15 May 2023 9:16 UTC

−1 points

0 comments2 min readLW link

AI Neorealism: a threat model & success criterion for existential safety

davidad15 Dec 2022 13:42 UTC

67 points

1 comment3 min readLW link

Confusion about neuroscience/cognitive science as a danger for AI Alignment

Samuel Nellessen22 Jun 2022 17:59 UTC

3 points

1 comment3 min readLW link

(snellessen.com)

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

Jan_Kulveit, Raymond Douglas, Nora_Ammann, Deger Turan, David Scott Krueger (formerly: capybaralet) and David Duvenaud

30 Jan 2025 17:03 UTC

167 points

65 comments2 min readLW link

(gradual-disempowerment.ai)

[Question] Conditional on the first AGI being aligned correctly, is a good outcome even still likely?

iamthouthouarti6 Sep 2021 17:30 UTC

2 points

1 comment1 min readLW link

Levels of Doom: Eutopia, Disempowerment, Extinction

Vladimir_Nesov5 Jun 2025 19:08 UTC

34 points

0 comments2 min readLW link

The problem/solution matrix: Calculating the probability of AI safety “on the back of an envelope”

John_Maxwell20 Oct 2019 8:03 UTC

22 points

4 comments2 min readLW link

Top lesson from GPT: we will probably destroy humanity “for the lulz” as soon as we are able.

Shmi16 Apr 2023 20:27 UTC

63 points

28 comments1 min readLW link

How we could stumble into AI catastrophe

HoldenKarnofsky13 Jan 2023 16:20 UTC

71 points

18 comments18 min readLW link

(www.cold-takes.com)

Recommender Alignment for Lock-In Risk

alamerton24 Mar 2025 12:56 UTC

8 points

0 comments7 min readLW link

Catastrophic Risks from AI #6: Discussion and FAQ

Dan H, Mantas Mazeika and TW123

27 Jun 2023 23:23 UTC

24 points

1 comment13 min readLW link

(arxiv.org)

New voluntary commitments (AI Seoul Summit)

Zach Stein-Perlman21 May 2024 11:00 UTC

81 points

17 comments7 min readLW link

(www.gov.uk)

Perform Tractable Research While Avoiding Capabilities Externalities [Pragmatic AI Safety #4]

Dan H and TW123

30 May 2022 20:25 UTC

51 points

3 comments25 min readLW link

Uber Self-Driving Crash

jefftk7 Nov 2019 15:00 UTC

109 points

1 comment2 min readLW link

(www.jefftk.com)

“Humanity vs. AGI” Will Never Look Like “Humanity vs. AGI” to Humanity

Thane Ruthenis16 Dec 2023 20:08 UTC

192 points

34 comments5 min readLW link

Risks from Learned Optimization: Introduction

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

31 May 2019 23:44 UTC

187 points

42 comments12 min readLW link 3 reviews

Survey: What (de)motivates you about AI risk?

Daniel_Friedrich3 Aug 2022 19:17 UTC

1 point

0 comments1 min readLW link

(forms.gle)

Counter-considerations on AI arms races

Mateusz Bagiński and JustinShovelain

15 May 2025 14:54 UTC

23 points

0 comments18 min readLW link

Environmental Structure Can Cause Instrumental Convergence

TurnTrout22 Jun 2021 22:26 UTC

71 points

43 comments16 min readLW link

(arxiv.org)

A conversation about Katja’s counterarguments to AI risk

Matthew Barnett and Ege Erdil

18 Oct 2022 18:40 UTC

43 points

9 comments33 min readLW link

Summary of the Acausal Attack Issue for AIXI

Diffractor13 Dec 2021 8:16 UTC

12 points

6 comments4 min readLW link

Against ubiquitous alignment taxes

beren6 Mar 2023 19:50 UTC

59 points

10 comments2 min readLW link

Orthogonality is expensive

beren3 Apr 2023 10:20 UTC

43 points

9 comments3 min readLW link

Thoughts on sharing information about language model capabilities

paulfchristiano31 Jul 2023 16:04 UTC

211 points

44 comments11 min readLW link 1 review

Towards the Operationalization of Philosophy & Wisdom

Thane Ruthenis28 Oct 2024 19:45 UTC

20 points

2 comments33 min readLW link

(aiimpacts.org)

We Are Conjecture, A New Alignment Research Startup

Connor Leahy8 Apr 2022 11:40 UTC

197 points

25 comments4 min readLW link

The 6D effect: When companies take risks, one email can be very powerful.

scasper4 Nov 2023 20:08 UTC

286 points

42 comments3 min readLW link

An LLM-based “exemplary actor”

Roman Leventov29 May 2023 11:12 UTC

16 points

0 comments12 min readLW link

My Plan to Build Aligned Superintelligence

apollonianblues21 Aug 2022 13:16 UTC

18 points

7 comments8 min readLW link

Responses to Catastrophic AGI Risk: A Survey

lukeprog8 Jul 2013 14:33 UTC

17 points

8 comments1 min readLW link

Coordination by common knowledge to prevent uncontrollable AI

Karl von Wendt14 May 2023 13:37 UTC

10 points

2 comments9 min readLW link

An Increasingly Manipulative Newsfeed

Michaël Trazzi1 Jul 2019 15:26 UTC

63 points

16 comments5 min readLW link

A critique of Soares “4 background claims”

YanLyutnev27 Jan 2025 20:27 UTC

−8 points

0 comments14 min readLW link

Ramble on STUFF: intelligence, simulation, AI, doom, default mode, the usual

Bill Benzon26 Aug 2023 15:49 UTC

5 points

0 comments4 min readLW link

“Diamondoid bacteria” nanobots: deadly threat or dead-end? A nanotech investigation

titotal29 Sep 2023 14:01 UTC

160 points

79 comments20 min readLW link

(titotal.substack.com)

Gradient Filtering

Jozdien and janus

18 Jan 2023 20:09 UTC

56 points

16 comments13 min readLW link

When discussing AI doom barriers propose specific plausible scenarios

anithite18 Aug 2023 4:06 UTC

5 points

0 comments3 min readLW link

Top 9+2 myths about AI risk

Stuart_Armstrong29 Jun 2015 20:41 UTC

68 points

44 comments2 min readLW link

What does it mean for an AGI to be ‘safe’?

So8res7 Oct 2022 4:13 UTC

74 points

29 comments3 min readLW link

The Shutdown Problem: Incomplete Preferences as a Solution

EJT23 Feb 2024 16:01 UTC

53 points

33 comments42 min readLW link

AISC 2024 - Project Summaries

NickyP27 Nov 2023 22:32 UTC

48 points

3 comments18 min readLW link

Response to Katja Grace’s AI x-risk counterarguments

Erik Jenner and Johannes Treutlein

19 Oct 2022 1:17 UTC

77 points

18 comments15 min readLW link

Conditioning Generative Models for Alignment

Jozdien18 Jul 2022 7:11 UTC

60 points

8 comments20 min readLW link

A response to the Richards et al.’s “The Illusion of AI’s Existential Risk”

Harrison Fell26 Jul 2023 17:34 UTC

1 point

0 comments10 min readLW link

On the future of language models

owencb20 Dec 2023 16:58 UTC

105 points

17 comments36 min readLW link

Why do we post our AI safety plans on the Internet?

Peter S. Park3 Nov 2022 16:02 UTC

4 points

4 comments11 min readLW link

Christiano (ARC) and GA (Conjecture) Discuss Alignment Cruxes

Andrea_Miotti, paulfchristiano, Gabriel Alfour and Olive Branch

24 Feb 2023 23:03 UTC

61 points

7 comments47 min readLW link

Causal confusion as an argument against the scaling hypothesis

RobertKirk and David Scott Krueger (formerly: capybaralet)

20 Jun 2022 10:54 UTC

86 points

30 comments15 min readLW link

The mind-killer

Paul Crowley2 May 2009 16:49 UTC

29 points

160 comments2 min readLW link

The AI governance gaps in developing countries

ntran17 Jun 2023 2:50 UTC

20 points

1 comment14 min readLW link

4 types of AGI selection, and how to constrain them

Remmelt8 Aug 2023 10:02 UTC

−4 points

3 comments3 min readLW link

Q&A with Jürgen Schmidhuber on risks from AI

XiXiDu15 Jun 2011 15:51 UTC

63 points

45 comments4 min readLW link

[Question] Superintelligence Strategy: A Pragmatic Path to… Doom?

Mr Beastly19 Mar 2025 22:30 UTC

8 points

0 comments3 min readLW link

Artificial Intelligence and Living Wisdom

TMFOW29 Mar 2024 7:41 UTC

−6 points

1 comment17 min readLW link

(tmfow.substack.com)

My Assessment of the Chinese AI Safety Community

Lao Mein25 Apr 2023 4:21 UTC

252 points

95 comments3 min readLW link

Summary of “AGI Ruin: A List of Lethalities”

Stephen McAleese10 Jun 2022 22:35 UTC

45 points

2 comments8 min readLW link

The alignment problem in different capability regimes

Buck9 Sep 2021 19:46 UTC

88 points

12 comments5 min readLW link

Responsible Scaling Policies Are Risk Management Done Wrong

simeon_c25 Oct 2023 23:46 UTC

123 points

35 comments22 min readLW link 1 review

(www.navigatingrisks.ai)

Niceness is unnatural

So8res13 Oct 2022 1:30 UTC

134 points

20 comments8 min readLW link 1 review

[FICTION] Unboxing Elysium: An AI’S Escape

Super AGI10 Jun 2023 4:41 UTC

−16 points

4 comments14 min readLW link

Agentic Mess (A Failure Story)

Karl von Wendt, Sofia Bharadia, PeterDrotos, Artem Korotkov, mespa and mruwnik

6 Jun 2023 13:09 UTC

46 points

5 comments13 min readLW link

Henry Kissinger: AI Could Mean the End of Human History

ESRogs15 May 2018 20:11 UTC

17 points

12 comments1 min readLW link

(www.theatlantic.com)

All AGI Safety questions welcome (especially basic ones) [July 2023]

smallsilo20 Jul 2023 20:20 UTC

38 points

41 comments2 min readLW link

(forum.effectivealtruism.org)

Widening Overton Window—Open Thread

Prometheus31 Mar 2023 10:03 UTC

23 points

8 comments1 min readLW link

Metaphors for AI, and why I don’t like them

boazbarak28 Jun 2023 22:47 UTC

44 points

18 comments12 min readLW link

Static vs Dynamic Alignment

Gracie Green21 Mar 2024 17:44 UTC

5 points

0 comments12 min readLW link

[Question] What’s in your list of unsolved problems in AI alignment?

jacquesthibs7 Mar 2023 18:58 UTC

60 points

9 comments1 min readLW link

[Question] Would a Misaligned SSI Really Kill Us All?

DragonGod14 Sep 2022 12:15 UTC

6 points

7 comments6 min readLW link

Risk aversion and GPT-3

casualphysicsenjoyer13 Sep 2022 20:50 UTC

1 point

0 comments1 min readLW link

Incremental AI Risks from Proxy-Simulations

kmenou19 Dec 2023 18:56 UTC

2 points

0 comments1 min readLW link

(individual.utoronto.ca)

AI Safety in a Vulnerable World: Requesting Feedback on Preliminary Thoughts

Jordan Arel6 Dec 2022 22:35 UTC

4 points

2 comments3 min readLW link

[Question] How does the ever-increasing use of AI in the military for the direct purpose of murdering people affect your p(doom)?

Justausername6 Apr 2024 6:31 UTC

19 points

16 comments1 min readLW link

A Tractarian Filter for Safer Language Models

Konstantinos Tsermenidis8 Jun 2025 8:19 UTC

0 points

0 comments3 min readLW link

Seeking Input to AI Safety Book for non-technical audience

Darren McKee10 Aug 2023 17:58 UTC

10 points

4 comments1 min readLW link

1hr talk: Intro to AGI safety

Steven Byrnes18 Jun 2019 21:41 UTC

36 points

4 comments24 min readLW link

A Friendly Face (Another Failure Story)

Karl von Wendt, Sofia Bharadia, PeterDrotos, Artem Korotkov, mespa and mruwnik

20 Jun 2023 10:31 UTC

65 points

21 comments16 min readLW link

AI as a Civilizational Risk Part 2/6: Behavioral Modification

PashaKamyshev30 Oct 2022 16:57 UTC

9 points

0 comments10 min readLW link

Representational Tethers: Tying AI Latents To Human Ones

Paul Bricman16 Sep 2022 14:45 UTC

30 points

0 comments16 min readLW link

Knowledge Base 8: The truth as an attractor in the information space

iwis25 Apr 2024 15:28 UTC

−8 points

0 comments2 min readLW link

Mesa-Optimization: Explain it like I’m 10 Edition

brook26 Aug 2023 23:04 UTC

20 points

1 comment6 min readLW link

[Question] Why not use active SETI to prevent AI Doom?

RomanS5 May 2023 14:41 UTC

13 points

13 comments1 min readLW link

Is AI Safety dropping the ball on privacy?

markov13 Sep 2023 13:07 UTC

50 points

17 comments7 min readLW link

Death with Awesomeness

osmarks1 Apr 2024 20:24 UTC

5 points

2 comments2 min readLW link

The burden of knowing

arisAlexis28 Feb 2023 18:40 UTC

5 points

0 comments2 min readLW link

Paths to failure

Karl von Wendt and mespa

25 Apr 2023 8:03 UTC

29 points

1 comment8 min readLW link

Grinding slimes in the dungeon of AI alignment research

Max H24 Mar 2023 4:51 UTC

10 points

2 comments4 min readLW link

Why I think it’s net harmful to do technical safety research at AGI labs

Remmelt7 Feb 2024 4:17 UTC

26 points

24 comments1 min readLW link

It Looks Like You’re Trying To Take Over The Narrative

George3d624 Aug 2022 13:36 UTC

3 points

20 comments9 min readLW link

(www.epistem.ink)

Limiting factors to predict AI take-off speed

Alfonso Pérez Escudero31 May 2023 23:19 UTC

1 point

0 comments6 min readLW link

Modeling Failure Modes of High-Level Machine Intelligence

Ben Cottier, Daniel_Eth and Sammy Martin

6 Dec 2021 13:54 UTC

54 points

1 comment12 min readLW link

Yoshua Bengio: “Slowing down development of AI systems passing the Turing test”

Roman Leventov6 Apr 2023 3:31 UTC

49 points

2 comments5 min readLW link

(yoshuabengio.org)

Yoshua Bengio: How Rogue AIs may Arise

harfe23 May 2023 18:28 UTC

92 points

12 comments18 min readLW link

(yoshuabengio.org)

Why was the AI Alignment community so unprepared for this moment?

Ras151315 Jul 2023 0:26 UTC

123 points

65 comments2 min readLW link

The need for multi-agent experiments

Martín Soto1 Aug 2024 17:14 UTC

43 points

3 comments9 min readLW link

Ideological Inference Engines: Making Deontology Differentiable*

Paul Bricman12 Sep 2022 12:00 UTC

6 points

0 comments14 min readLW link

Safety of Self-Assembled Neuromorphic Hardware

Can26 Dec 2022 18:51 UTC

16 points

2 comments10 min readLW link

(forum.effectivealtruism.org)

A Double-Feature on The Extropians

Maxwell Tabarrok3 Jun 2023 18:27 UTC

59 points

4 comments1 min readLW link

Does natural selection favor AIs over humans?

cdkg3 Oct 2024 18:47 UTC

20 points

1 comment1 min readLW link

(link.springer.com)

Summaries: Alignment Fundamentals Curriculum

Leon Lang18 Sep 2022 13:08 UTC

44 points

3 comments1 min readLW link

(docs.google.com)

Responding to ‘Beyond Hyperanthropomorphism’

ukc1001414 Sep 2022 20:37 UTC

9 points

0 comments16 min readLW link

Case Story: Lack of Consumer Protection Procedures AI Manipulation and the Threat of Fund Concentration in Crypto Seeking Assistance to Fund a Civil Case to Establish Facts and Protect Vulnerable Consumers from Damage Caused by Automated Systems

Petr 'Margot' Andreev8 Aug 2024 5:55 UTC

−9 points

0 comments9 min readLW link

2 unusual reasons for why we can avoid being turned into paperclips

Artem Panush19 Aug 2023 10:28 UTC

1 point

0 comments4 min readLW link

DeepMind: Model evaluation for extreme risks

Zach Stein-Perlman25 May 2023 3:00 UTC

94 points

12 comments1 min readLW link 1 review

(arxiv.org)

[Question] How Politics interacts with AI ?

qbolec26 Mar 2023 9:53 UTC

−11 points

4 comments1 min readLW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné23 Sep 2023 16:21 UTC

30 points

8 comments5 min readLW link

Everything’s normal until it’s not

Eleni Angelou10 Mar 2023 2:02 UTC

7 points

0 comments3 min readLW link

Reactive devaluation: Bias in Evaluating AGI X-Risks

Remmelt and flandry19

30 Dec 2022 9:02 UTC

−13 points

9 comments1 min readLW link

Deceptive Alignment is <1% Likely by Default

DavidW21 Feb 2023 15:09 UTC

89 points

31 comments14 min readLW link 1 review

The limited upside of interpretability

Peter S. Park15 Nov 2022 18:46 UTC

13 points

11 comments10 min readLW link

[untitled post]

NeuralSystem_e5e127 Apr 2023 17:37 UTC

3 points

0 comments1 min readLW link

A Solution for AGI/ASI Safety

Weibing Wang18 Dec 2024 19:44 UTC

50 points

29 comments1 min readLW link

[Crosspost] An AI Pause Is Humanity’s Best Bet For Preventing Extinction (TIME)

otto.barten24 Jul 2023 10:07 UTC

12 points

0 comments7 min readLW link

(time.com)

Reflections on the PIBBSS Fellowship 2022

Nora_Ammann and particlemania

11 Dec 2022 21:53 UTC

32 points

0 comments18 min readLW link

ICA Simulacra

Ozyrus5 Apr 2023 6:41 UTC

26 points

2 comments7 min readLW link

Please help us communicate AI xrisk. It could save the world.

otto.barten4 Jul 2022 21:47 UTC

4 points

7 comments2 min readLW link

The Underreaction to OpenAI

Sherrinford18 Jan 2024 22:08 UTC

21 points

0 comments6 min readLW link

AI Apocalypse and the Buddha

pchvykov22 Feb 2025 16:33 UTC

−17 points

6 comments9 min readLW link

Artificial Intelligence as exit strategy from the age of acute existential risk

Arturo Macias12 Apr 2023 14:48 UTC

−7 points

15 comments7 min readLW link

Emily Brontë on: Psychology Required for Serious™ AGI Safety Research

robertzk14 Sep 2022 14:47 UTC

2 points

0 comments1 min readLW link

Siren worlds and the perils of over-optimised search

Stuart_Armstrong7 Apr 2014 11:00 UTC

84 points

418 comments7 min readLW link

Conversation with Paul Christiano

abergal11 Sep 2019 23:20 UTC

44 points

6 comments30 min readLW link

(aiimpacts.org)

How To Go From Interpretability To Alignment: Just Retarget The Search

johnswentworth10 Aug 2022 16:08 UTC

212 points

34 comments3 min readLW link 1 review

Scale Was All We Needed, At First

Gabe M14 Feb 2024 1:49 UTC

296 points

35 comments8 min readLW link

(aiacumen.substack.com)

Misalignment-by-default in multi-agent systems

Edouard Harris and simonsdsuo

13 Oct 2022 15:38 UTC

21 points

8 comments20 min readLW link

(www.gladstone.ai)

Alignment is hard. Communicating that, might be harder

Eleni Angelou1 Sep 2022 16:57 UTC

7 points

8 comments3 min readLW link

Cooperation and Alignment in Delegation Games: You Need Both!

Oliver Sourbut, Lewis Hammond and HarrietW

3 Aug 2024 10:16 UTC

8 points

0 comments14 min readLW link

(www.oliversourbut.net)

Q&A with Abram Demski on risks from AI

XiXiDu17 Jan 2012 9:43 UTC

33 points

71 comments9 min readLW link

Hardcode the AGI to need our approval indefinitely?

MichaelStJules11 Nov 2021 7:04 UTC

2 points

2 comments1 min readLW link

Introduction: Bias in Evaluating AGI X-Risks

Remmelt and flandry19

27 Dec 2022 10:27 UTC

1 point

0 comments3 min readLW link

[Question] Is a Self-Iterating AGI Vulnerable to Thompson-style Trojans?

sxae25 Mar 2021 14:46 UTC

15 points

6 comments3 min readLW link

What is the most evil AI that we could build, today?

ThomasJ1 Nov 2021 19:58 UTC

−2 points

14 comments1 min readLW link

Some alignment ideas

SelonNerias10 Aug 2023 17:51 UTC

1 point

0 comments11 min readLW link

Preserving and continuing alignment research through a severe global catastrophe

A_donor6 Mar 2022 18:43 UTC

48 points

11 comments5 min readLW link

Benchmarking Proposals on Risk Scenarios

Paul Bricman20 Aug 2022 10:01 UTC

25 points

2 comments14 min readLW link

AGI goal space is big, but narrowing might not be as hard as it seems.

Jacy Reese Anthis12 Apr 2023 19:03 UTC

15 points

0 comments3 min readLW link

AGI doesn’t need understanding, intention, or consciousness in order to kill us, only intelligence

James Blaha20 Feb 2023 0:55 UTC

10 points

2 comments18 min readLW link

Dreams of Friendliness

Eliezer Yudkowsky31 Aug 2008 1:20 UTC

29 points

81 comments9 min readLW link

What can we learn from Lex Fridman’s interview with Sam Altman?

Karl von Wendt27 Mar 2023 6:27 UTC

56 points

22 comments9 min readLW link

Disproving and partially fixing a fully homomorphic encryption scheme with perfect secrecy

Lysandre Terrisse26 May 2024 14:56 UTC

16 points

1 comment18 min readLW link

Pivotal acts using an unaligned AGI?

Simon Fischer21 Aug 2022 17:13 UTC

28 points

3 comments7 min readLW link

[Question] How likely do you think worse-than-extinction type fates to be?

span11 Aug 2022 4:08 UTC

3 points

3 comments1 min readLW link

Podcast Transcript: Daniela and Dario Amodei on Anthropic

remember7 Mar 2023 16:47 UTC

46 points

2 comments79 min readLW link

(futureoflife.org)

No, really, it predicts next tokens.

simon18 Apr 2023 3:47 UTC

58 points

55 comments3 min readLW link

METR’s preliminary evaluation of o3 and o4-mini

Christopher King16 Apr 2025 20:23 UTC

14 points

7 comments1 min readLW link

(metr.github.io)

Key Questions for Digital Minds

Jacy Reese Anthis22 Mar 2023 17:13 UTC

22 points

0 comments7 min readLW link

(www.sentienceinstitute.org)

Exploring non-anthropocentric aspects of AI existential safety

mishka3 Apr 2023 18:07 UTC

9 points

0 comments3 min readLW link

AI as a Civilizational Risk Part 4/6: Bioweapons and Philosophy of Modification

PashaKamyshev1 Nov 2022 20:50 UTC

7 points

1 comment8 min readLW link

New AI risk intro from Vox [link post]

JakubK21 Dec 2022 6:00 UTC

5 points

1 comment2 min readLW link

(www.vox.com)

AI Summer Fellows Program

colm21 Mar 2018 15:32 UTC

21 points

0 comments1 min readLW link

Ilya: The AI scientist shaping the world

David Varga20 Nov 2023 13:09 UTC

11 points

0 comments4 min readLW link

Taboo “human-level intelligence”

Sherrinford26 Feb 2023 20:42 UTC

12 points

7 comments1 min readLW link

Three AI Safety Related Ideas

Wei Dai13 Dec 2018 21:32 UTC

70 points

38 comments2 min readLW link

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2

Christopher King31 Mar 2023 17:05 UTC

6 points

4 comments4 min readLW link

Manifold Predicted the AI Extinction Statement and CAIS Wanted it Deleted

David Chee12 Jun 2023 15:54 UTC

71 points

15 comments12 min readLW link

Investigating AI Takeover Scenarios

Sammy Martin17 Sep 2021 18:47 UTC

31 points

1 comment27 min readLW link

Technical AI Safety Research Landscape [Slides]

Magdalena Wache18 Sep 2023 13:56 UTC

49 points

2 comments4 min readLW link

AISC team report: Soft-optimization, Bayes and Goodhart

Simon Fischer, benjaminko, jazcarretao, DFNaiff and Jeremy Gillen

27 Jun 2023 6:05 UTC

38 points

2 comments15 min readLW link

How will they feed us

meijer19731 Jun 2023 8:49 UTC

4 points

3 comments5 min readLW link

Does GPT-4 exhibit agency when summarizing articles?

Christopher King24 Mar 2023 15:49 UTC

16 points

2 comments5 min readLW link

[Question] How easy is it to supervise processes vs outcomes?

Noosphere8918 Oct 2022 17:48 UTC

3 points

0 comments1 min readLW link

Research Questions from Stained Glass Windows

StefanHex8 Jun 2022 12:38 UTC

4 points

0 comments2 min readLW link

Discussing how to align Transformative AI if it’s developed very soon

elifland28 Nov 2022 16:17 UTC

37 points

2 comments28 min readLW link

Analysing: Dangerous messages from future UFAI via Oracles

Stuart_Armstrong22 Nov 2019 14:17 UTC

22 points

16 comments4 min readLW link

I Recommend More Training Rationales

Gianluca Calcagni31 Dec 2024 14:06 UTC

2 points

0 comments6 min readLW link

AIS Hungary is hiring a part-time Technical Lead! (Deadline: Dec 31st)

gergogaspar17 Dec 2024 14:12 UTC

1 point

0 comments2 min readLW link

Cap Model Size for AI Safety

research_prime_space6 Mar 2023 1:11 UTC

0 points

4 comments1 min readLW link

[Question] What does pulling the fire alarm look like?

nem20 Mar 2023 21:45 UTC

2 points

0 comments1 min readLW link

AI demands unprecedented reliability

Jono9 Jan 2024 16:30 UTC

22 points

5 comments2 min readLW link

Am I secretly excited for AI getting weird?

porby29 Oct 2022 22:16 UTC

116 points

4 comments4 min readLW link

Freedom Is All We Need

Leo Glisic27 Apr 2023 0:09 UTC

−1 points

8 comments10 min readLW link

Pessimism about AI Safety

Max_He-Ho and Peter Kuhn

2 Apr 2023 7:43 UTC

4 points

1 comment25 min readLW link

Algo trading is a central example of AI risk

Vanessa Kosoy28 Jul 2018 20:31 UTC

27 points

5 comments1 min readLW link

Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program

Christopher King2 Jun 2023 21:54 UTC

7 points

4 comments16 min readLW link

Announcing the AI Safety Summit Talks with Yoshua Bengio

otto.barten14 May 2024 12:52 UTC

9 points

1 comment1 min readLW link

Computational signatures of psychopathy

Cameron Berg19 Dec 2022 17:01 UTC

30 points

3 comments20 min readLW link

Is There a Power Play Overhang?

crispweed8 May 2024 17:39 UTC

3 points

0 comments1 min readLW link

(upcoder.com)

AI-Plans.com—a contributable compendium

Iknownothing25 Jun 2023 14:40 UTC

39 points

7 comments4 min readLW link

(ai-plans.com)

The Polarity Problem [Draft]

Dan H, cdkg and Simon Goldstein

23 May 2023 21:05 UTC

24 points

3 comments44 min readLW link

Evaluating the feasibility of SI’s plan

JoshuaFox10 Jan 2013 8:17 UTC

39 points

187 comments4 min readLW link

Beware of black boxes in AI alignment research

cousin_it18 Jan 2018 15:07 UTC

39 points

10 comments1 min readLW link

I created an Asi Alignment Tier List

TimeGoat21 Apr 2024 18:44 UTC

−6 points

0 comments1 min readLW link

What if AGI is near?

Wulky Wilkinsen14 Apr 2021 0:05 UTC

11 points

5 comments1 min readLW link

Truthful and honest AI

abergal, Nick_Beckstead and Owain_Evans

29 Oct 2021 7:28 UTC

42 points

1 comment13 min readLW link

A Story of AI Risk: InstructGPT-N

peterbarnett26 May 2022 23:22 UTC

24 points

0 comments8 min readLW link

Rohin Shah on reasons for AI optimism

abergal31 Oct 2019 12:10 UTC

40 points

58 comments1 min readLW link

(aiimpacts.org)

AI Tracker: monitoring current and near-future risks from superscale models

Edouard Harris and Jeremie Harris

23 Nov 2021 19:16 UTC

67 points

13 comments3 min readLW link

(aitracker.org)

Second call: CFP for Rebellion and Disobedience in AI workshop

Ram Rachum5 Feb 2023 12:18 UTC

2 points

0 comments2 min readLW link

Review of METR’s public evaluation protocol

nahoj and JaimeRV

30 Jun 2024 22:03 UTC

10 points

0 comments5 min readLW link

Steering systems

Max H4 Apr 2023 0:56 UTC

50 points

1 comment15 min readLW link

Giving away your predictions

8 Sep 2024 13:11 UTC

1 point

0 comments1 min readLW link

Worlds Where Iterative Design Fails

johnswentworth30 Aug 2022 20:48 UTC

222 points

30 comments10 min readLW link 1 review

Agency engineering: is AI-alignment “to human intent” enough?

catubc2 Sep 2022 18:14 UTC

9 points

10 comments6 min readLW link

[Question] Should I do it?

MrLight19 Nov 2020 1:08 UTC

−3 points

16 comments2 min readLW link

On the Impossibility of Intelligent Paperclip Maximizers

Michael Simkin29 May 2023 16:55 UTC

−21 points

5 comments4 min readLW link

3 levels of threat obfuscation

HoldenKarnofsky2 Aug 2023 14:58 UTC

69 points

14 comments7 min readLW link

Open-ended ethics of phenomena (a desiderata with universal morality)

Ryo 8 Nov 2023 20:10 UTC

1 point

0 comments8 min readLW link

When is it appropriate to use statistical models and probabilities for decision making ?

Younes Kamel5 Jul 2022 12:34 UTC

10 points

7 comments4 min readLW link

(youneskamel.substack.com)

How to solve the misuse problem assuming that in 10 years the default scenario is that AGI agents are capable of synthetizing pathogens

jeremtti27 Nov 2024 21:17 UTC

6 points

0 comments9 min readLW link

[Question] Whom Do You Trust?

JackOfAllTrades26 Feb 2024 19:38 UTC

1 point

0 comments1 min readLW link

Scenario planning for AI x-risk

Corin Katzke10 Feb 2024 0:14 UTC

24 points

12 comments14 min readLW link

(forum.effectivealtruism.org)

I Believe I Know Why AI Models Hallucinate

Richard Aragon19 Apr 2023 21:07 UTC

−10 points

6 comments7 min readLW link

(turingssolutions.com)

Timelines to Transformative AI: an investigation

Zershaaneh Qureshi26 Mar 2024 18:28 UTC

20 points

2 comments50 min readLW link

Friendly and Unfriendly AGI are Indistinguishable

ErgoEcho29 Dec 2022 22:13 UTC

−4 points

4 comments4 min readLW link

(neologos.co)

Friendship is Optimal: A My Little Pony fanfic about an optimization process

iceman8 Sep 2012 6:16 UTC

117 points

152 comments1 min readLW link

Optimising Society to Constrain Risk of War from an Artificial Superintelligence

JohnCDraper30 Apr 2020 10:47 UTC

4 points

1 comment51 min readLW link

Power-Seeking AI and Existential Risk

Antonio Franca11 Oct 2022 22:50 UTC

7 points

0 comments9 min readLW link

Anchoring focalism and the Identifiable victim effect: Bias in Evaluating AGI X-Risks

Remmelt7 Jan 2023 9:59 UTC

1 point

2 comments1 min readLW link

[Question] Does VETLM solve AI superalignment?

Oleg Trott8 Aug 2024 18:22 UTC

−1 points

10 comments1 min readLW link

GPT as an “Intelligence Forklift.”

boazbarak19 May 2023 21:15 UTC

49 points

27 comments3 min readLW link

The Magnitude of His Own Folly

Eliezer Yudkowsky30 Sep 2008 11:31 UTC

123 points

130 comments6 min readLW link

Become a PIBBSS Research Affiliate

Nora_Ammann and DusanDNesic

10 Oct 2023 7:41 UTC

24 points

6 comments6 min readLW link

Is Alignment enough?

gchu16 Aug 2024 11:46 UTC

1 point

0 comments1 min readLW link

[Linkpost] “Blueprint for an AI Bill of Rights”—Office of Science and Technology Policy, USA (2022)

T4315 Oct 2022 16:42 UTC

9 points

4 comments2 min readLW link

(www.whitehouse.gov)

Reflections on My Own Missing Mood

Lone Pine21 Apr 2022 16:19 UTC

53 points

25 comments5 min readLW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper

7 Nov 2023 17:59 UTC

38 points

2 comments2 min readLW link

(arxiv.org)

AISU 2021

Linda Linsefors30 Jan 2021 17:40 UTC

28 points

2 comments1 min readLW link

A Critique of AI Alignment Pessimism

ExCeph19 Jul 2022 2:28 UTC

9 points

1 comment9 min readLW link

How To Prevent a Dystopia

ank29 Jan 2025 14:16 UTC

−3 points

4 comments1 min readLW link

[Question] Will the first AGI agent have been designed as an agent (in addition to an AGI)?

nahoj3 Dec 2022 20:32 UTC

1 point

8 comments1 min readLW link

Internal Interfaces Are a High-Priority Interpretability Target

Thane Ruthenis29 Dec 2022 17:49 UTC

26 points

6 comments7 min readLW link

Survey of 2,778 AI authors: six parts in pictures

KatjaGrace6 Jan 2024 4:43 UTC

80 points

1 comment2 min readLW link

Linkpost: A Contra AI FOOM Reading List

DavidW13 Mar 2023 14:45 UTC

25 points

4 comments1 min readLW link

(magnusvinding.com)

Cataloguing Priors in Theory and Practice

Paul Bricman13 Oct 2022 12:36 UTC

13 points

8 comments7 min readLW link

A concise sum-up of the basic argument for AI doom

Mergimio H. Doefevmil24 Apr 2023 17:37 UTC

11 points

6 comments2 min readLW link

The (local) unit of intelligence is FLOPs

boazbarak5 Jun 2023 18:23 UTC

42 points

7 comments5 min readLW link

Reframing the AI Risk

Thane Ruthenis1 Jul 2022 18:44 UTC

26 points

7 comments6 min readLW link

Does AI care about reality or just its own perception?

RedFishBlueFish5 Jan 2024 4:05 UTC

−6 points

8 comments1 min readLW link

A Semiotic Critique of the Orthogonality Thesis

Nicolas Villarreal4 Jun 2024 18:52 UTC

3 points

10 comments15 min readLW link

Conjecture Second Hiring Round

Connor Leahy, Sid Black, Gabriel Alfour and Chris Scammell

23 Nov 2022 17:11 UTC

92 points

0 comments1 min readLW link

The Virus—Short Story

Michael Soareverix13 Apr 2023 18:18 UTC

4 points

0 comments4 min readLW link

Adapting to Change: Overcoming Chronostasis in AI Language Models

RationalMindset28 Mar 2023 14:32 UTC

−1 points

0 comments6 min readLW link

Enhancing Corrigibility in AI Systems through Robust Feedback Loops

Justausername24 Aug 2023 3:53 UTC

1 point

0 comments6 min readLW link

[Question] Could Simulating an AGI Taking Over the World Actually Lead to a LLM Taking Over the World?

simeon_c13 Jan 2023 6:33 UTC

15 points

1 comment1 min readLW link

Summary of 80k’s AI problem profile

JakubK1 Jan 2023 7:30 UTC

7 points

0 comments5 min readLW link

(forum.effectivealtruism.org)

Taking Away the Guns First: The Fundamental Flaw in AI Development

s-ice26 Nov 2024 22:11 UTC

1 point

0 comments17 min readLW link

Alignment Newsletter #13: 07/02/18

Rohin Shah2 Jul 2018 16:10 UTC

70 points

12 comments8 min readLW link

(mailchi.mp)

Rationalising humans: another mugging, but not Pascal’s

Stuart_Armstrong14 Nov 2017 15:46 UTC

7 points

1 comment3 min readLW link

Call for Cruxes by Rhyme, a Longtermist History Consultancy

Lara1 Mar 2023 18:39 UTC

1 point

0 comments3 min readLW link

(forum.effectivealtruism.org)

Takeaways from safety by default interviews

AI Impacts and abergal

3 Apr 2020 17:20 UTC

28 points

2 comments13 min readLW link

(aiimpacts.org)

Anti-squatted AI x-risk domains index

plex12 Aug 2022 12:01 UTC

59 points

6 comments1 min readLW link

Confronting the legion of doom.

Spiritus Dei13 Nov 2024 17:03 UTC

−20 points

3 comments5 min readLW link

Race to the Top: Benchmarks for AI Safety

Isabella Duan4 Dec 2022 18:48 UTC

29 points

6 comments1 min readLW link

Anthropic: Core Views on AI Safety: When, Why, What, and How

jonmenaster9 Mar 2023 17:34 UTC

17 points

1 comment22 min readLW link

(www.anthropic.com)

The Alignment Problem from a Deep Learning Perspective (major rewrite)

SoerenMind, Richard_Ngo and LawrenceC

10 Jan 2023 16:06 UTC

84 points

9 comments39 min readLW link

(arxiv.org)

Why Recursive Self-Improvement Might Not Be the Existential Risk We Fear

Nassim_A24 Nov 2024 17:17 UTC

1 point

0 comments9 min readLW link

Super intelligent AIs that don’t require alignment

Yair Halberstadt16 Nov 2021 19:55 UTC

10 points

2 comments6 min readLW link

AI Safety Newsletter #2: ChaosGPT, Natural Selection, and AI Safety in the Media

ozhang, Dan H and Orpheus16

18 Apr 2023 18:44 UTC

30 points

0 comments4 min readLW link

(newsletter.safe.ai)

CHAT Diplomacy: LLMs and National Security

SebastianG 5 May 2023 19:45 UTC

25 points

6 comments7 min readLW link

Truth Terminal: A reconstruction of events

crvr.fr and MTorrents

17 Nov 2024 23:51 UTC

5 points

1 comment7 min readLW link

AI 2030 – AI Policy Roadmap

LTM17 May 2024 23:29 UTC

8 points

0 comments1 min readLW link

HIRING: Inform and shape a new project on AI safety at Partnership on AI

Madhulika Srikumar24 Nov 2021 8:27 UTC

6 points

0 comments1 min readLW link

[Question] Updates on FLI’s Value Alignment Map?

T43117 Sep 2022 22:27 UTC

17 points

4 comments1 min readLW link

The convergent dynamic we missed

Remmelt12 Dec 2023 23:19 UTC

2 points

2 comments3 min readLW link

A Visualization of Nick Bostrom’s Superintelligence

[deleted]23 Jul 2014 0:24 UTC

62 points

28 comments3 min readLW link

New AI risks research institute at Oxford University

lukeprog16 Nov 2011 18:52 UTC

36 points

10 comments1 min readLW link

An Alternate History of the Future, 2025-2040

Mr Beastly24 Feb 2025 5:53 UTC

5 points

5 comments10 min readLW link

Why Uncontrollable AI Looks More Likely Than Ever

otto.barten and Roman_Yampolskiy

8 Mar 2023 15:41 UTC

18 points

0 comments4 min readLW link

(time.com)

Announcing the London Initiative for Safe AI (LISA)

James Fox, mike_safeAI and Ryan Kidd

2 Feb 2024 23:17 UTC

98 points

0 comments9 min readLW link

[Question] What qualities does an AGI need to have to realize the risk of false vacuum, without hardcoding physics theories into it?

RationalSieve3 Feb 2023 16:00 UTC

1 point

4 comments1 min readLW link

Human study on AI spear phishing campaigns

Simon Lermen, Fred Heiding and Andrew Kao

3 Jan 2025 15:11 UTC

81 points

8 comments5 min readLW link

Announcing AI Alignment workshop at the ALIFE 2023 conference

rorygreig8 Jul 2023 13:52 UTC

16 points

0 comments1 min readLW link

(humanvaluesandartificialagency.com)

A Proposal for AI Alignment: Using Directly Opposing Models

Arne B27 Apr 2023 18:05 UTC

0 points

5 comments3 min readLW link

Could Roko’s basilisk acausally bargain with a paperclip maximizer?

Christopher King13 Mar 2023 18:21 UTC

1 point

8 comments1 min readLW link

Polluting the agentic commons

hamandcheese13 Apr 2023 17:42 UTC

7 points

4 comments2 min readLW link

(www.secondbest.ca)

Reshaping the AI Industry

Thane Ruthenis29 May 2022 22:54 UTC

148 points

35 comments21 min readLW link

Six Thoughts on AI Safety

boazbarak24 Jan 2025 22:20 UTC

92 points

55 comments15 min readLW link

OpenAI: Our approach to AI safety

Jacob G-W5 Apr 2023 20:26 UTC

1 point

1 comment1 min readLW link

(openai.com)

Research proposal: Leveraging Jungian archetypes to create values-based models

MiguelDev5 Mar 2023 17:39 UTC

5 points

2 comments2 min readLW link

Mere exposure effect: Bias in Evaluating AGI X-Risks

Remmelt and flandry19

27 Dec 2022 14:05 UTC

0 points

2 comments1 min readLW link

A better analogy and example for teaching AI takeover: the ML Inferno

Christopher King14 Mar 2023 19:14 UTC

18 points

0 comments5 min readLW link

Tort Law Can Play an Important Role in Mitigating AI Risk

Gabriel Weil12 Feb 2024 17:17 UTC

39 points

9 comments5 min readLW link

Introducing the AI Alignment Forum (FAQ)

habryka, Ben Pace, Raemon and jimrandomh

29 Oct 2018 21:07 UTC

87 points

8 comments6 min readLW link

A general model of safety-oriented AI development

Wei Dai11 Jun 2018 21:00 UTC

68 points

8 comments1 min readLW link

[Question] Who are some prominent reasonable people who are confident that AI won’t kill everyone?

Optimization Process5 Dec 2022 9:12 UTC

72 points

54 comments1 min readLW link

A better “Statement on AI Risk?”

Knight Lee25 Nov 2024 4:50 UTC

9 points

6 comments3 min readLW link

Engaging First Introductions to AI Risk

Rob Bensinger19 Aug 2013 6:26 UTC

31 points

21 comments3 min readLW link

Halloween Problem

Saint Blasphemer24 Oct 2023 16:46 UTC

−10 points

1 comment1 min readLW link

Against sacrificing AI transparency for generality gains

Ape in the coat7 May 2023 6:52 UTC

4 points

0 comments2 min readLW link

On Generality

Eris Discordia26 Sep 2022 4:06 UTC

2 points

0 comments5 min readLW link

Imagine a world where Microsoft employees used Bing

Christopher King31 Mar 2023 18:36 UTC

6 points

2 comments2 min readLW link

The unspoken but ridiculous assumption of AI doom: the hidden doom assumption

Christopher King1 Jun 2023 17:01 UTC

−9 points

1 comment3 min readLW link

Without a trajectory change, the development of AGI is likely to go badly

Max H29 May 2023 23:42 UTC

16 points

2 comments13 min readLW link

G.K. Chesterton On AI Risk

Scott Alexander1 Apr 2017 19:00 UTC

23 points

0 comments7 min readLW link

Alignment of AutoGPT agents

Ozyrus12 Apr 2023 12:54 UTC

14 points

1 comment4 min readLW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC

58 points

0 comments59 min readLW link

The Genie in the Bottle: An Introduction to AI Alignment and Risk

Snorkelfarsan25 May 2023 16:30 UTC

5 points

1 comment25 min readLW link

Infant AI Scenario

Nathan112312 Aug 2022 21:20 UTC

1 point

0 comments3 min readLW link

Interpretability Tools Are an Attack Channel

Thane Ruthenis17 Aug 2022 18:47 UTC

42 points

14 comments1 min readLW link

[Question] What is to be done? (About the profit motive)

Connor Barber8 Sep 2023 19:27 UTC

1 point

21 comments1 min readLW link

Populectomy.ai

YonatanK24 Mar 2025 22:06 UTC

7 points

2 comments2 min readLW link

Grey Goo Requires AI

harsimony15 Jan 2021 4:45 UTC

11 points

11 comments4 min readLW link

(harsimony.wordpress.com)

The Dumbest Possible Gets There First

Artaxerxes13 Aug 2022 10:20 UTC

44 points

7 comments2 min readLW link

[Question] Employer considering partnering with major AI labs. What to do?

GraduallyMoreAgitated21 Mar 2023 17:43 UTC

37 points

7 comments2 min readLW link

Evaluating Superhuman Models with Consistency Checks

Daniel Paleka and Lukas Fluri

1 Aug 2023 7:51 UTC

21 points

2 comments9 min readLW link

(arxiv.org)

AI oracles on blockchain

Caravaggio6 Apr 2021 20:13 UTC

5 points

0 comments3 min readLW link

Book review: Architects of Intelligence by Martin Ford (2018)

Ofer11 Aug 2020 17:30 UTC

15 points

0 comments2 min readLW link

AI Risk in Terms of Unstable Nuclear Software

Thane Ruthenis26 Aug 2022 18:49 UTC

30 points

1 comment6 min readLW link

AI: How We Got Here—A Neuroscience Perspective

Mordechai Rorvig19 Jan 2025 23:51 UTC

5 points

0 comments2 min readLW link

(www.kickstarter.com)

OpenAI’s Alignment Plan is not S.M.A.R.T.

Søren Elverlin18 Jan 2023 6:39 UTC

9 points

19 comments4 min readLW link

Existentially relevant thought experiment: To kill or not to kill, a sniper, a man and a button.

AlexFromSafeTransition14 Aug 2023 10:53 UTC

−18 points

6 comments4 min readLW link

Thousands of malicious actors on the future of AI misuse

Zershaaneh Qureshi, Corin Katzke and Convergence Analysis

1 Apr 2024 10:08 UTC

37 points

0 comments1 min readLW link

Why I’m Optimistic About Near-Term AI Risk

harsimony15 May 2022 23:05 UTC

57 points

27 comments1 min readLW link

Video and Transcript of Presentation on Existential Risk from Power-Seeking AI

Joe Carlsmith8 May 2022 3:50 UTC

20 points

1 comment29 min readLW link

LLMs seem (relatively) safe

JustisMills25 Apr 2024 22:13 UTC

53 points

24 comments7 min readLW link

(justismills.substack.com)

An Overview of AI risks—the Flyer

Charbel-Raphaël, Jonathan Claybrough and tchauvin

17 Jul 2023 12:03 UTC

20 points

0 comments1 min readLW link

(docs.google.com)

How to Train Your AGI Dragon

Eris Discordia21 Sep 2022 22:28 UTC

−1 points

3 comments5 min readLW link

Should we cry “wolf”?

Tapatakt18 Feb 2023 11:24 UTC

24 points

5 comments1 min readLW link

Interviews with 97 AI Researchers: Quantitative Analysis

Maheen Shermohammed and Vael Gates

2 Feb 2023 1:01 UTC

23 points

0 comments7 min readLW link

Killswitch

Junio18 Nov 2023 22:53 UTC

2 points

0 comments3 min readLW link

[Question] Would a scope-insensitive AGI be less likely to incapacitate humanity?

Jim Buhler21 Jul 2024 14:15 UTC

2 points

3 comments1 min readLW link

I had a chat with GPT-4 on the future of AI and AI safety

Kristian Freed28 Mar 2023 17:47 UTC

1 point

0 comments8 min readLW link

Agentic Language Model Memes

FactorialCode1 Aug 2020 18:03 UTC

16 points

1 comment2 min readLW link

When is unaligned AI morally valuable?

paulfchristiano25 May 2018 1:57 UTC

81 points

53 comments10 min readLW link

[Question] Will OpenAI also require a “Super Red Team Agent” for its “Superalignment” Project?

Super AGI30 Mar 2024 5:25 UTC

2 points

2 comments1 min readLW link

Swimming Upstream: A Case Study in Instrumental Rationality

TurnTrout3 Jun 2018 3:16 UTC

77 points

7 comments8 min readLW link

Intrinsic vs. Extrinsic Alignment

Alfonso Pérez Escudero1 Jun 2023 1:06 UTC

1 point

1 comment3 min readLW link

2019 AI Alignment Literature Review and Charity Comparison

Larks19 Dec 2019 3:00 UTC

130 points

18 comments62 min readLW link

“Reframing Superintelligence” + LLMs + 4 years

Eric Drexler10 Jul 2023 13:42 UTC

118 points

9 comments12 min readLW link

Toy model of the AI control problem: animated version

Stuart_Armstrong10 Oct 2017 11:06 UTC

23 points

8 comments1 min readLW link

Un-unpluggability—can’t we just unplug it?

Oliver Sourbut15 May 2023 13:23 UTC

26 points

10 comments12 min readLW link

(www.oliversourbut.net)

A rejection of the Orthogonality Thesis

ArisC24 May 2023 16:37 UTC

−2 points

11 comments2 min readLW link

(medium.com)

[Question] What do you make of AGI:unaligned::spaceships:not enough food?

Ronny Fernandez22 Feb 2020 14:14 UTC

4 points

3 comments1 min readLW link

[Question] Resources on quantifiably forecasting future progress or reviewing past progress in AI safety?

C.S.W.13 Sep 2025 23:24 UTC

2 points

1 comment1 min readLW link

Corrigibility Via Thought-Process Deference

Thane Ruthenis24 Nov 2022 17:06 UTC

18 points

5 comments9 min readLW link

Static Place AI Makes Agentic AI Redundant: Multiversal AI Alignment & Rational Utopia

ank13 Feb 2025 22:35 UTC

1 point

2 comments11 min readLW link

Curiosity as a Solution to AGI Alignment

Harsha G.26 Feb 2023 23:36 UTC

7 points

7 comments3 min readLW link

Transcript: Yudkowsky on Bankless follow-up Q&A

vonk28 Feb 2023 3:46 UTC

54 points

40 comments22 min readLW link

We Should Talk About This More. Epistemic World Collapse as Imminent Safety Risk of Generative AI.

Joerg Weiss16 Nov 2023 18:46 UTC

11 points

2 comments29 min readLW link

Fear mitigated the nuclear threat, can it do the same to AGI risks?

Igor Ivanov9 Dec 2022 10:04 UTC

6 points

8 comments5 min readLW link

Andrew Ng wants to have a conversation about extinction risk from AI

Leon Lang5 Jun 2023 22:29 UTC

32 points

2 comments1 min readLW link

(twitter.com)

Reframing the Problem of AI Progress

Wei Dai12 Apr 2012 19:31 UTC

32 points

47 comments1 min readLW link

Eli’s review of “Is power-seeking AI an existential risk?”

elifland30 Sep 2022 12:21 UTC

67 points

0 comments3 min readLW link

(docs.google.com)

A necessary Membrane formalism feature

ThomasCederborg10 Sep 2024 21:33 UTC

20 points

6 comments11 min readLW link

Capabilities Denial: The Danger of Underestimating AI

Christopher King21 Mar 2023 1:24 UTC

6 points

5 comments3 min readLW link

Ask AI companies about what they are doing for AI safety?

mic9 Mar 2022 15:14 UTC

51 points

0 comments2 min readLW link

Distribution Shifts and The Importance of AI Safety

Leon Lang29 Sep 2022 22:38 UTC

17 points

2 comments9 min readLW link

Oversight Leagues: The Training Game as a Feature

Paul Bricman9 Sep 2022 10:08 UTC

20 points

6 comments10 min readLW link

Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive

Justausername23 Jul 2023 16:08 UTC

4 points

1 comment3 min readLW link

[Question] Are Mixture-of-Experts Transformers More Interpretable Than Dense Transformers?

simeon_c31 Dec 2022 11:34 UTC

8 points

5 comments1 min readLW link

Briefly how I’ve updated since ChatGPT

rime25 Apr 2023 14:47 UTC

48 points

2 comments2 min readLW link

A toy model of the treacherous turn

Stuart_Armstrong8 Jan 2016 12:58 UTC

43 points

13 comments6 min readLW link

AI as a Civilizational Risk Part 5/6: Relationship between C-risk and X-risk

PashaKamyshev3 Nov 2022 2:19 UTC

2 points

0 comments7 min readLW link

POWERplay: An open-source toolchain to study AI power-seeking

Edouard Harris24 Oct 2022 20:03 UTC

29 points

0 comments1 min readLW link

(github.com)

Two Tales of AI Takeover: My Doubts

Violet Hour5 Mar 2024 15:51 UTC

30 points

8 comments29 min readLW link

Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible?

Christopher King20 Feb 2023 15:11 UTC

27 points

15 comments1 min readLW link

List of technical AI safety exercises and projects

JakubK19 Jan 2023 9:35 UTC

41 points

5 comments1 min readLW link

(docs.google.com)

AI Incident Sharing—Best practices from other fields and a comprehensive list of existing platforms

Štěpán Los28 Jun 2023 17:21 UTC

20 points

0 comments4 min readLW link

Investigating Alternative Futures: Human and Superintelligence Interaction Scenarios

Hiroshi Yamakawa3 Jan 2024 23:46 UTC

1 point

0 comments17 min readLW link

AGI & War

Calecute29 Jun 2023 22:20 UTC

9 points

1 comment1 min readLW link

Q&A with experts on risks from AI #1

XiXiDu8 Jan 2012 11:46 UTC

45 points

67 comments9 min readLW link

Morality as Cooperation Part II: Theory and Experiment

DeLesley Hutchins5 Dec 2024 9:04 UTC

2 points

0 comments17 min readLW link

AI Risk Intro 1: Advanced AI Might Be Very Bad

CallumMcDougall and L Rudolf L

11 Sep 2022 10:57 UTC

46 points

13 comments30 min readLW link

Consciousness is irrelevant—instead solve alignment by asking this question

Oliver Siegel4 Mar 2023 22:06 UTC

−10 points

6 comments1 min readLW link

Experts’ AI timelines are longer than you have been told?

Vasco Grilo16 Jan 2025 18:03 UTC

10 points

4 comments3 min readLW link

(bayes.net)

AI Alignment Research Overview (by Jacob Steinhardt)

Ben Pace6 Nov 2019 19:24 UTC

44 points

0 comments7 min readLW link

(docs.google.com)

Survey on intermediate goals in AI governance

MichaelA and MaxRa

17 Mar 2023 13:12 UTC

25 points

3 comments1 min readLW link

Using Consensus Mechanisms as an approach to Alignment

Prometheus10 Jun 2023 23:38 UTC

9 points

2 comments6 min readLW link

AGI Timelines in Governance: Different Strategies for Different Timeframes

simeon_c and AmberDawn

19 Dec 2022 21:31 UTC

65 points

28 comments10 min readLW link

Shah (DeepMind) and Leahy (Conjecture) Discuss Alignment Cruxes

Olive Branch, Rohin Shah, Connor Leahy and Andrea_Miotti

1 May 2023 16:47 UTC

96 points

10 comments30 min readLW link

MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures

corey morris27 Sep 2023 17:54 UTC

18 points

3 comments4 min readLW link

(medium.com)

[Question] Self-censoring on AI x-risk discussions?

Decaeneus1 Jul 2024 18:24 UTC

17 points

2 comments1 min readLW link

[FICTION] Prometheus Rising: The Emergence of an AI Consciousness

Super AGI10 Jun 2023 4:41 UTC

−14 points

0 comments9 min readLW link

Retrospective on the 2022 Conjecture AI Discussions

Andrea_Miotti24 Feb 2023 22:41 UTC

90 points

5 comments2 min readLW link

Announcing Convergence Analysis: An Institute for AI Scenario & Governance Research

David_Kristoffersson and Deric Cheng

7 Mar 2024 21:37 UTC

23 points

1 comment4 min readLW link

[Question] What should an Einstein-like figure in Machine Learning do?

Razied5 Aug 2020 23:52 UTC

7 points

4 comments1 min readLW link

[Question] What are some claims or opinions about multi-multi delegation you’ve seen in the memeplex that you think deserve scrutiny?

Quinn27 Jun 2021 17:44 UTC

17 points

6 comments2 min readLW link

[Question] Realistic near-future scenarios of AI doom understandable for non-techy people?

RomanS28 Apr 2023 14:45 UTC

4 points

4 comments1 min readLW link

How teams went about their research at AI Safety Camp edition 8

Remmelt, Linda Linsefors and Kristi Uustalu

9 Sep 2023 16:34 UTC

28 points

0 comments13 min readLW link

Announcing aisafety.training

JJ Hepburn21 Jan 2023 1:01 UTC

61 points

4 comments1 min readLW link

Muehlhauser-Goertzel Dialogue, Part 1

lukeprog16 Mar 2012 17:12 UTC

42 points

161 comments33 min readLW link

What’s the Least Impressive Thing GPT-4 Won’t be Able to Do

Algon20 Aug 2022 19:48 UTC

80 points

125 comments1 min readLW link

[Question] Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

Yann Dubois19 Dec 2022 22:42 UTC

5 points

6 comments1 min readLW link

[Question] Does it become easier, or harder, for the world to coordinate around not building AGI as time goes on?

Eli Tyre29 Jul 2019 22:59 UTC

86 points

31 comments3 min readLW link 2 reviews

[Question] How much of a concern are open-source LLMs in the short, medium and long terms?

JavierCC10 May 2023 9:14 UTC

5 points

0 comments1 min readLW link

[Question] AI interpretability could be harmful?

Roman Leventov10 May 2023 20:43 UTC

13 points

2 comments1 min readLW link

Ideation and Trajectory Modelling in Language Models

NickyP5 Oct 2023 19:21 UTC

16 points

2 comments10 min readLW link

Is technical AI alignment research a net positive?

cranberry_bear12 Apr 2022 13:07 UTC

6 points

2 comments2 min readLW link

Is this a weak pivotal act: creating nanobots that eat evil AGIs (but nothing else)?

Christopher King10 Feb 2023 19:26 UTC

0 points

3 comments1 min readLW link

New US Senate Bill on X-Risk Mitigation [Linkpost]

Evan R. Murphy4 Jul 2022 1:25 UTC

35 points

12 comments1 min readLW link

(www.hsgac.senate.gov)

Carl Shulman On Dwarkesh Podcast June 2023

Moonicker11 Feb 2024 21:02 UTC

18 points

0 comments159 min readLW link

AI X-risk is a possible solution to the Fermi Paradox

magic9mushroom30 May 2023 17:42 UTC

5 points

22 comments2 min readLW link 2 reviews

[Question] Have you ever considered taking the ‘Turing Test’ yourself?

Super AGI27 Jul 2023 3:48 UTC

2 points

6 comments1 min readLW link

Linkpost: ‘Dissolving’ AI Risk – Parameter Uncertainty in AI Future Forecasting

DavidW13 Mar 2023 16:52 UTC

6 points

0 comments1 min readLW link

(forum.effectivealtruism.org)

against “AI risk”

Wei Dai11 Apr 2012 22:46 UTC

35 points

91 comments1 min readLW link

Thoughts after the Wolfram and Yudkowsky discussion

Tahp14 Nov 2024 1:43 UTC

25 points

13 comments6 min readLW link

What I would like the SIAI to publish

XiXiDu1 Nov 2010 14:07 UTC

36 points

225 comments3 min readLW link

Why We MUST Create an AGI that Disempowers Humanity. For Real.

twkaiser22 Mar 2023 23:01 UTC

−17 points

1 comment4 min readLW link

AI Risk & Policy Forecasts from Metaculus & FLI’s AI Pathways Workshop

_will_16 May 2023 18:06 UTC

11 points

4 comments8 min readLW link

Critique of ‘Many People Fear A.I. They Shouldn’t’ by David Brooks.

Axel Ahlqvist15 Aug 2024 18:38 UTC

12 points

8 comments3 min readLW link

Video essay: How Will We Know When AI is Conscious?

JanPro6 Sep 2023 18:10 UTC

11 points

7 comments1 min readLW link

(www.youtube.com)

All AGI safety questions welcome (especially basic ones) [July 2022]

plex and Robert Miles

16 Jul 2022 12:57 UTC

84 points

132 comments3 min readLW link

2+2: Ontological Framework

Lyrialtus1 Feb 2022 1:07 UTC

−15 points

2 comments12 min readLW link

Conceptual issues in AI safety: the paradigmatic gap

vedevazz24 Jun 2018 15:09 UTC

33 points

0 comments1 min readLW link

(www.foldl.me)

Tetherware #2: What every human should know about our most likely AI future

Jáchym Fibír28 Feb 2025 11:12 UTC

3 points

0 comments11 min readLW link

(tetherware.substack.com)

Convergence Towards World-Models: A Gears-Level Model

Thane Ruthenis4 Aug 2022 23:31 UTC

38 points

1 comment13 min readLW link

Towards shutdownable agents via stochastic choice

EJT, alexr, christosi and LAThomson

8 Jul 2024 10:14 UTC

59 points

11 comments23 min readLW link

(arxiv.org)

Could you have stopped Chernobyl?

Carlos Ramirez27 Aug 2021 1:48 UTC

29 points

17 comments8 min readLW link

Embedding Ethical Priors into AI Systems: A Bayesian Approach

Justausername3 Aug 2023 15:31 UTC

−5 points

3 comments21 min readLW link

Averting Catastrophe: Decision Theory for COVID-19, Climate Change, and Potential Disasters of All Kinds

JakubK2 May 2023 22:50 UTC

10 points

0 comments1 min readLW link

(nyupress.org)

The potential Ai risk

The White Death7 May 2024 20:31 UTC

1 point

0 comments1 min readLW link

Conjecture: a retrospective after 8 months of work

Connor Leahy, Sid Black, Gabriel Alfour and Chris Scammell

23 Nov 2022 17:10 UTC

180 points

9 comments8 min readLW link

Proposing the Conditional AI Safety Treaty (linkpost TIME)

otto.barten15 Nov 2024 13:59 UTC

11 points

9 comments3 min readLW link

(time.com)

Reflections on “Making the Atomic Bomb”

boazbarak17 Aug 2023 2:48 UTC

51 points

7 comments8 min readLW link

A Playbook for AI Risk Reduction (focused on misaligned AI)

HoldenKarnofsky6 Jun 2023 18:05 UTC

90 points

42 comments14 min readLW link 1 review

Ideas for studies on AGI risk

dr_s20 Apr 2023 18:17 UTC

5 points

1 comment11 min readLW link

[Question] Is “brittle alignment” good enough?

the8thbit23 May 2023 17:35 UTC

9 points

5 comments3 min readLW link

Proof of posteriority: a defense against AI-generated misinformation

jchan17 Jul 2023 12:04 UTC

33 points

3 comments5 min readLW link

[Question] What’s your viewpoint on the likelihood of GPT-5 being able to autonomously create, train, and implement an AI superior to GPT-5?

Super AGI26 May 2023 1:43 UTC

7 points

15 comments1 min readLW link

(Structural) Stability of Coupled Optimizers

Paul Bricman30 Sep 2022 11:28 UTC

25 points

0 comments10 min readLW link

Back to the Past to the Future

Prometheus18 Oct 2023 16:51 UTC

5 points

0 comments1 min readLW link

Transcription of Eliezer’s January 2010 video Q&A

curiousepic14 Nov 2011 17:02 UTC

112 points

9 comments56 min readLW link

Corporate Governance for Frontier AI Labs: A Research Agenda

Matthew Wearden28 Feb 2024 11:29 UTC

5 points

0 comments16 min readLW link

(matthewwearden.co.uk)

Open-source LLMs may prove Bostrom’s vulnerable world hypothesis

Roope Ahvenharju15 Apr 2023 19:16 UTC

1 point

1 comment1 min readLW link

Is it time to talk about AI doomsday prepping yet?

bokov5 Mar 2023 21:17 UTC

0 points

8 comments1 min readLW link

AI Alignment Meme Viruses

RationalDino15 Jan 2025 15:55 UTC

5 points

0 comments2 min readLW link

[Question] How should we think about the decision relevance of models estimating p(doom)?

Mo Putera11 May 2023 4:16 UTC

12 points

1 comment3 min readLW link

Limit intelligent weapons

Lucas Pfeifer23 Mar 2023 17:54 UTC

−11 points

36 comments1 min readLW link

Biosafety Regulations (BMBL) and their relevance for AI

Štěpán Los29 Jun 2023 19:22 UTC

4 points

0 comments4 min readLW link

[Question] Daisy-chaining epsilon-step verifiers

Decaeneus6 Apr 2023 2:07 UTC

2 points

1 comment1 min readLW link

Alignment—Path to AI as ally, not slave nor foe

ozb30 Mar 2023 14:54 UTC

10 points

3 comments2 min readLW link

A Guide to Forecasting AI Science Capabilities

Eleni Angelou29 Apr 2023 23:24 UTC

6 points

1 comment4 min readLW link

Follow along with Columbia EA’s Advanced AI Safety Fellowship!

RohanS2 Jul 2022 17:45 UTC

3 points

0 comments2 min readLW link

(forum.effectivealtruism.org)

The Evil AI Overlord List

Stuart_Armstrong20 Nov 2012 17:02 UTC

44 points

80 comments1 min readLW link

Oh, Think of the Bananas

Jeffs1 Jun 2023 6:46 UTC

3 points

0 comments2 min readLW link

Safe Search is off: root causes of AI catastrophic risks

Jemal Young31 Jan 2025 18:22 UTC

4 points

0 comments3 min readLW link

How truthful is GPT-3? A benchmark for language models

Owain_Evans16 Sep 2021 10:09 UTC

58 points

24 comments6 min readLW link

AI Researchers On AI Risk

Scott Alexander22 May 2015 11:16 UTC

21 points

0 comments16 min readLW link

How can I reduce existential risk from AI?

lukeprog13 Nov 2012 21:56 UTC

63 points

92 comments8 min readLW link

Focus on existential risk is a distraction from the real issues. A false fallacy

Nik Samoylov30 Oct 2023 23:42 UTC

−19 points

11 comments2 min readLW link

[Question] Can singularity emerge from transformers?

MP8 Apr 2024 14:26 UTC

−3 points

1 comment1 min readLW link

Runaway Optimizers in Mind Space

silentbob16 Jul 2023 14:26 UTC

16 points

0 comments12 min readLW link

ACI#4: Seed AI is the new Perpetual Motion Machine

Akira Pyinya8 Jul 2023 1:17 UTC

−1 points

0 comments6 min readLW link

A problem shared by many different alignment targets

ThomasCederborg15 Jan 2025 14:22 UTC

13 points

18 comments36 min readLW link

Monthly Doom Argument Threads? Doom Argument Wiki?

LVSN4 Feb 2023 16:59 UTC

3 points

0 comments1 min readLW link

Controlling AGI Risk

TeaSea15 Mar 2024 4:56 UTC

6 points

8 comments4 min readLW link

Early situational awareness and its implications, a story

Jacob Pfau6 Feb 2023 20:45 UTC

29 points

6 comments3 min readLW link

Announcing the AI Safety Nudge Competition to Help Beat Procrastination

Marc Carauleanu1 Oct 2022 1:49 UTC

10 points

0 comments2 min readLW link

[Question] Seeking AI Alignment Tutor/Advisor: $100–150/hr

MrThink5 Oct 2024 21:28 UTC

28 points

3 comments2 min readLW link

AI alignment landscape

paulfchristiano13 Oct 2019 2:10 UTC

40 points

3 comments1 min readLW link

(ai-alignment.com)

Two Neglected Problems in Human-AI Safety

Wei Dai16 Dec 2018 22:13 UTC

102 points

25 comments2 min readLW link

My (naive) take on Risks from Learned Optimization

artkpv31 Oct 2022 10:59 UTC

7 points

0 comments5 min readLW link

Green goo is plausible

anithite18 Apr 2023 0:04 UTC

67 points

31 comments4 min readLW link 1 review

‘Dumb’ AI observes and manipulates controllers

Stuart_Armstrong13 Jan 2015 13:35 UTC

51 points

19 comments2 min readLW link

AI Safety Unconference NeurIPS 2022

Orpheus7 Nov 2022 15:39 UTC

25 points

0 comments1 min readLW link

(aisafetyevents.org)

Hands of gods

Anders L28 May 2023 15:15 UTC

1 point

0 comments9 min readLW link

(woodfromeden.substack.com)

Code Generation as an AI risk setting

Not Relevant17 Apr 2022 22:27 UTC

92 points

16 comments2 min readLW link

Clarifying AI X-risk

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

127 points

24 comments4 min readLW link 1 review

Will AI kill everyone? Here’s what the godfathers of AI have to say [RA video]

Writer19 Aug 2023 17:29 UTC

58 points

8 comments2 min readLW link

(youtu.be)

The $100B plan with “70% risk of killing us all” w Stephen Fry [video]

Oleg Trott21 Jul 2024 20:06 UTC

35 points

8 comments1 min readLW link

(www.youtube.com)

AI Safety via Luck

Jozdien1 Apr 2023 20:13 UTC

82 points

7 comments11 min readLW link

The Social Alignment Problem

irving28 Apr 2023 14:16 UTC

99 points

13 comments8 min readLW link

AI Incident Reporting: A Regulatory Review

Deric Cheng and Elliot Mckernon

11 Mar 2024 21:03 UTC

16 points

0 comments6 min readLW link

Jimmy Apples, source of the rumor that OpenAI has achieved AGI internally, is a credible insider.

Jorterder28 Sep 2023 1:20 UTC

−6 points

2 comments1 min readLW link

(twitter.com)

Neural program synthesis is a dangerous technology

syllogism12 Jan 2018 16:19 UTC

10 points

6 comments2 min readLW link

Taking the parameters which seem to matter and rotating them until they don’t

Garrett Baker26 Aug 2022 18:26 UTC

120 points

48 comments1 min readLW link

Brainstorming: Slow Takeoff

David Piepgrass23 Jan 2024 6:58 UTC

3 points

0 comments51 min readLW link

Rational Effective Utopia & Narrow Way There: Math-Proven Safe Static Multiversal mAX-Intelligence (AXI), Multiversal Alignment, New Ethicophysics… (Aug 11)

ank11 Feb 2025 3:21 UTC

13 points

8 comments38 min readLW link

One Does Not Simply Replace the Humans

JerkyTreats6 Apr 2023 20:56 UTC

9 points

3 comments4 min readLW link

(www.lesswrong.com)

Tyler Cowen’s challenge to develop an ‘actual mathematical model’ for AI X-Risk

Joe Brenton16 May 2023 11:57 UTC

6 points

4 comments1 min readLW link

The Gradient – The Artificiality of Alignment

mic8 Oct 2023 4:06 UTC

12 points

1 comment5 min readLW link

(thegradient.pub)

Separating the “control problem” from the “alignment problem”

Yi-Yang11 May 2023 9:41 UTC

12 points

1 comment4 min readLW link

How to safely use an optimizer

Simon Fischer28 Mar 2024 16:11 UTC

47 points

21 comments7 min readLW link

AI safety advocates should consider providing gentle pushback following the events at OpenAI

civilsociety22 Dec 2023 18:55 UTC

16 points

5 comments3 min readLW link

Announcing Atlas Computing

miyazono11 Apr 2024 15:56 UTC

45 points

4 comments4 min readLW link

Military AI as a Convergent Goal of Self-Improving AI

avturchin13 Nov 2017 12:17 UTC

5 points

3 comments1 min readLW link

AI takeoff story: a continuation of progress by other means

Edouard Harris27 Sep 2021 15:55 UTC

76 points

13 comments10 min readLW link

Exploring Last-Resort Measures for AI Alignment: Humanity’s Extinction Switch

0xPetra23 Jun 2023 17:01 UTC

7 points

0 comments2 min readLW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher King29 Jun 2023 16:56 UTC

7 points

0 comments2 min readLW link

Critique my Model: The EV of AGI to Selfish Individuals

ozziegooen8 Apr 2018 20:04 UTC

19 points

9 comments4 min readLW link

The AI alignment problem in socio-technical systems from a computational perspective: A Top-Down-Top view and outlook

zhaoweizhang15 Jul 2024 18:56 UTC

3 points

0 comments9 min readLW link

Thoughts on the In-Context Scheming AI Experiment

ExCeph9 Jan 2025 2:19 UTC

2 points

0 comments4 min readLW link

Gearing Up for Long Timelines in a Hard World

Dalcy14 Jul 2023 6:11 UTC

18 points

0 comments4 min readLW link

Some Thoughts on Singularity Strategies

Wei Dai13 Jul 2011 2:41 UTC

45 points

30 comments3 min readLW link

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Johannes Treutlein and Owain_Evans

21 Jun 2024 15:54 UTC

163 points

13 comments8 min readLW link

(arxiv.org)

Linkpost: Are Emergent Abilities in Large Language Models just In-Context Learning?

Erich_Grunewald8 Oct 2023 12:14 UTC

12 points

7 comments2 min readLW link

(arxiv.org)

Making Nanobots isn’t a one-shot process, even for an artificial superintelligance

dankrad25 Apr 2023 0:39 UTC

20 points

13 comments6 min readLW link

Announcing #AISummitTalks featuring Professor Stuart Russell and many others

otto.barten24 Oct 2023 10:11 UTC

17 points

1 comment1 min readLW link

I designed an AI safety course (for a philosophy department)

Eleni Angelou23 Sep 2023 22:03 UTC

37 points

15 comments2 min readLW link

HIRING: Inform and shape a new project on AI safety at Partnership on AI

madhu_lika7 Dec 2021 19:37 UTC

1 point

0 comments1 min readLW link

Value Formation: An Overarching Model

Thane Ruthenis15 Nov 2022 17:16 UTC

34 points

20 comments34 min readLW link

Nonperson Predicates

Eliezer Yudkowsky27 Dec 2008 1:47 UTC

68 points

177 comments6 min readLW link

OpenAI Credit Account (2510$)

Emirhan BULUT21 Jan 2024 2:32 UTC

1 point

0 comments1 min readLW link

In favor of accelerating problems you’re trying to solve

Christopher King11 Apr 2023 18:15 UTC

2 points

2 comments4 min readLW link

Alignment, Goals, and The Gut-Head Gap: A Review of Ngo. et al.

Violet Hour11 May 2023 18:06 UTC

20 points

2 comments13 min readLW link

[Link post] Promising Paths to Alignment—Connor Leahy | Talk

frances_lorenz14 May 2022 16:01 UTC

34 points

0 comments1 min readLW link

Response to “Coordinated pausing: An evaluation-based coordination scheme for frontier AI developers”

Matthew Wearden30 Oct 2023 17:27 UTC

5 points

2 comments6 min readLW link

(matthewwearden.co.uk)

Regulate or Compete? The China Factor in U.S. AI Policy (NAIR #2)

charles_m5 May 2023 17:43 UTC

2 points

1 comment7 min readLW link

(navigatingairisks.substack.com)

[Question] Why is so much discussion happening in private Google Docs?

Wei Dai12 Jan 2019 2:19 UTC

108 points

22 comments1 min readLW link

Strategic implications of AIs’ ability to coordinate at low cost, for example by merging

Wei Dai25 Apr 2019 5:08 UTC

69 points

46 comments2 min readLW link 1 review

Sources of evidence in Alignment

Martín Soto2 Jul 2023 20:38 UTC

22 points

0 comments11 min readLW link

What is your timelines for ADI (artificial disempowering intelligence)?

Christopher King17 Apr 2023 17:01 UTC

3 points

3 comments2 min readLW link

Ideas for improving epistemics in AI safety outreach

mic21 Aug 2023 19:55 UTC

64 points

6 comments3 min readLW link

Soon: a weekly AI Safety prerequisites module on LessWrong

null30 Apr 2018 13:23 UTC

35 points

10 comments1 min readLW link

Aligning an H-JEPA agent via training on the outputs of an LLM-based “exemplary actor”

Roman Leventov29 May 2023 11:08 UTC

12 points

10 comments30 min readLW link

Why AI may not save the World

Alberto Zannoni9 Jun 2023 17:42 UTC

0 points

0 comments4 min readLW link

(a16z.com)

Decomposing independent generalizations in neural networks via Hessian analysis

Dmitry Vaintrob and Nina Panickssery

14 Aug 2023 17:04 UTC

84 points

4 comments1 min readLW link

Let’s talk about “Convergent Rationality”

David Scott Krueger (formerly: capybaralet)12 Jun 2019 21:53 UTC

44 points

33 comments6 min readLW link

Current AI Safety Roles for Software Engineers

ozziegooen9 Nov 2018 20:57 UTC

70 points

9 comments4 min readLW link

Counterfactual Oracles = online supervised learning with random selection of training episodes

Wei Dai10 Sep 2019 8:29 UTC

52 points

26 comments3 min readLW link

Thoughts on Ben Garfinkel’s “How sure are we about this AI stuff?”

David Scott Krueger (formerly: capybaralet)6 Feb 2019 19:09 UTC

25 points

17 comments1 min readLW link

[Question] Why don’t we consider large forms of social organization (economic and political forms, in particular) to qualify as AGI and Transformative AI?

T L16 Jul 2023 18:54 UTC

1 point

0 comments2 min readLW link

2017 AI Safety Literature Review and Charity Comparison

Larks24 Dec 2017 18:52 UTC

41 points

5 comments23 min readLW link

Reflecting on the transhumanist rebuttal to AI existential risk and critique of our debate methodologies and misuse of statistics

catgirlsruletheworld20 Aug 2024 1:59 UTC

−5 points

0 comments4 min readLW link

[Question] AI misalignment risk from GPT-like systems?

fiso6419 Jun 2022 17:35 UTC

10 points

8 comments1 min readLW link

[LINK] NYT Article about Existential Risk from AI

[deleted]28 Jan 2013 10:37 UTC

38 points

23 comments1 min readLW link

Explaining inner alignment to myself

Jeremy Gillen24 May 2022 23:10 UTC

9 points

2 comments10 min readLW link

[Question] Pink Shoggoths: What does alignment look like in practice?

Yuli_Ban25 Feb 2023 12:23 UTC

25 points

13 comments11 min readLW link

A more grounded idea of AI risk

Iknownothing11 May 2023 9:48 UTC

3 points

4 comments1 min readLW link

[Question] fake alignment solutions????

KvmanThinking11 Dec 2024 3:31 UTC

1 point

6 comments1 min readLW link

GPT-4 aligning with acasual decision theory when instructed to play games, but includes a CDT explanation that’s incorrect if they differ

Christopher King23 Mar 2023 16:16 UTC

7 points

4 comments8 min readLW link

AI can exploit safety plans posted on the Internet

Peter S. Park4 Dec 2022 12:17 UTC

−15 points

4 comments1 min readLW link

AI Safety Camp: Machine Learning for Scientific Discovery

Eleni Angelou6 Jan 2023 3:21 UTC

3 points

0 comments1 min readLW link

AI Safety without Alignment: How humans can WIN against AI

vicchain29 Jun 2023 17:53 UTC

1 point

1 comment2 min readLW link

[Crosspost] A recent write-up of the case for AI (existential) risk

Timsey18 May 2023 13:13 UTC

6 points

0 comments19 min readLW link

Toward a Dynamic Definition of Ethical Violation in AI Systems: A Risk-Based Systems Perspective

Parthiv21 Jun 2025 10:28 UTC

1 point

0 comments4 min readLW link

Autonomous Alignment Oversight Framework (AAOF)

Justausername25 Jul 2023 10:25 UTC

−9 points

0 comments4 min readLW link

NYT: Google will “recalibrate” the risk of releasing AI due to competition with OpenAI

Michael Huang22 Jan 2023 8:38 UTC

47 points

2 comments1 min readLW link

(www.nytimes.com)

AI community building: EliezerKart

Christopher King1 Apr 2023 15:25 UTC

46 points

0 comments2 min readLW link

On urgency, priority and collective reaction to AI-Risks: Part I

Denreik16 Apr 2023 19:14 UTC

−10 points

15 comments5 min readLW link

The genie knows, but doesn’t care

Rob Bensinger6 Sep 2013 6:42 UTC

123 points

495 comments8 min readLW link

Universality and the “Filter”

maggiehayes16 Dec 2021 0:47 UTC

10 points

2 comments11 min readLW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov8 May 2023 21:26 UTC

18 points

2 comments7 min readLW link

(yoshuabengio.org)

[Question] What projects and efforts are there to promote AI safety research?

Christopher King24 May 2023 0:33 UTC

4 points

0 comments1 min readLW link

Problems in AI Alignment that philosophers could potentially contribute to

Wei Dai17 Aug 2019 17:38 UTC

79 points

14 comments2 min readLW link

Understanding Conjecture: Notes from Connor Leahy interview

Orpheus1615 Sep 2022 18:37 UTC

107 points

23 comments15 min readLW link

Three scenarios of pseudo-alignment

Eleni Angelou3 Sep 2022 12:47 UTC

9 points

0 comments3 min readLW link

We Should Prepare for a Larger Representation of Academia in AI Safety

Leon Lang13 Aug 2023 18:03 UTC

90 points

14 comments5 min readLW link

Defining Boundaries on Outcomes

Takk7 Jun 2023 17:41 UTC

1 point

0 comments1 min readLW link

Status quo bias; System justification: Bias in Evaluating AGI X-Risks

Remmelt and flandry19

3 Jan 2023 2:50 UTC

−11 points

0 comments1 min readLW link

AI as a powerful meme, via CGP Grey

TheManxLoiner30 Oct 2024 18:31 UTC

47 points

8 comments4 min readLW link

I Have No Mouth but I Must Speak

Jack5 Apr 2025 7:42 UTC

7 points

8 comments8 min readLW link

I read every major AI lab’s safety plan so you don’t have to

sarahhw16 Dec 2024 18:51 UTC

20 points

0 comments12 min readLW link

(longerramblings.substack.com)

We will be around in 30 years

mukashi7 Jun 2022 3:47 UTC

12 points

205 comments2 min readLW link

We don’t trade with ants

KatjaGrace10 Jan 2023 23:50 UTC

277 points

109 comments7 min readLW link 1 review

(worldspiritsockpuppet.com)

Is there a ML agent that abandons it’s utility function out-of-distribution without losing capabilities?

Christopher King22 Feb 2023 16:49 UTC

1 point

7 comments1 min readLW link

We Shouldn’t Expect AI to Ever be Fully Rational

OneManyNone18 May 2023 17:09 UTC

19 points

31 comments6 min readLW link

Morality as Cooperation Part I: Humans

DeLesley Hutchins5 Dec 2024 8:16 UTC

5 points

0 comments19 min readLW link

Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results

Esben Kran, Fazl, Sabrina Zaki, gabrielrecc and rz2383

23 Feb 2023 10:48 UTC

8 points

0 comments6 min readLW link

Taxonomy of AI-risk counterarguments

Odd anon16 Oct 2023 0:12 UTC

65 points

12 comments8 min readLW link

Ted Kaczyinski proves instrumental convergence?

xXAlphaSigmaXx28 Jun 2024 3:50 UTC

0 points

0 comments1 min readLW link

Recursion in AI is scary. But let’s talk solutions.

Oleg Trott16 Jul 2024 20:34 UTC

3 points

10 comments2 min readLW link

Why I think that teaching philosophy is high impact

Eleni Angelou19 Dec 2022 3:11 UTC

5 points

0 comments2 min readLW link

Thoughts on the Feasibility of Prosaic AGI Alignment?

iamthouthouarti21 Aug 2020 23:25 UTC

8 points

10 comments1 min readLW link

ChatGPT Plugins—The Beginning of the End

Bary Levy25 Mar 2023 11:45 UTC

15 points

4 comments1 min readLW link

[Question] [timeboxed exercise] write me your model of AI human-existential safety and the alignment problems in 15 minutes

Quinn4 May 2021 19:10 UTC

6 points

2 comments1 min readLW link

Logic vs intuition ⇔ algorithm vs ML

pchvykov4 Jan 2025 9:06 UTC

5 points

0 comments7 min readLW link

A brief review of the reasons multi-objective RL could be important in AI Safety Research

Ben Smith and Roland Pihlakas

29 Sep 2021 17:09 UTC

30 points

7 comments10 min readLW link

Machine learning could be fundamentally unexplainable

George3d616 Dec 2020 13:32 UTC

26 points

15 comments15 min readLW link

(cerebralab.com)

[Question] Why Do AI researchers Rate the Probability of Doom So Low?

Aorou24 Sep 2022 2:33 UTC

7 points

6 comments3 min readLW link

Truthful AI: Developing and governing AI that does not lie

Owain_Evans, owencb and Lukas Finnveden

18 Oct 2021 18:37 UTC

82 points

9 comments10 min readLW link

Proposed Alignment Technique: OSNR (Output Sanitization via Noising and Reconstruction) for Safer Usage of Potentially Misaligned AGI

sudo29 May 2023 1:35 UTC

14 points

9 comments6 min readLW link

Introduction to Towards Causal Foundations of Safe AGI

tom4everitt, Lewis Hammond, Francis Rhys Ward, RyanCarey, James Fox, mattmacdermott and sbenthall

12 Jun 2023 17:55 UTC

73 points

6 comments4 min readLW link

RAND report finds no effect of current LLMs on viability of bioterrorism attacks

StellaAthena25 Jan 2024 19:17 UTC

94 points

14 comments1 min readLW link

(www.rand.org)

Generative, Episodic Objectives for Safe AI

Michael Glass5 Oct 2022 23:18 UTC

11 points

3 comments8 min readLW link

AISN #20: LLM Proliferation, AI Deception, and Continuing Drivers of AI Capabilities

Dan H29 Aug 2023 15:07 UTC

12 points

0 comments8 min readLW link

(newsletter.safe.ai)

How to measure FLOP/s for Neural Networks empirically?

Marius Hobbhahn29 Nov 2021 15:18 UTC

16 points

5 comments7 min readLW link

[Question] Convince me that humanity isn’t doomed by AGI

Yitz15 Apr 2022 17:26 UTC

61 points

50 comments1 min readLW link

What is the ground reality of countries taking steps to recalibrate AI development towards Alignment first?

Nebuch29 Jan 2023 13:26 UTC

8 points

6 comments3 min readLW link

Instrumental convergence in single-agent systems

Edouard Harris and simonsdsuo

12 Oct 2022 12:24 UTC

33 points

4 comments8 min readLW link

(www.gladstone.ai)

[Question] Best introductory overviews of AGI safety?

JakubK13 Dec 2022 19:01 UTC

21 points

9 comments2 min readLW link

(forum.effectivealtruism.org)

The Intelligence Curse: an essay series

L Rudolf L and lukedrago

24 Apr 2025 12:59 UTC

72 points

10 comments2 min readLW link

Mauhn Releases AI Safety Documentation

Berg Severens3 Jul 2021 21:23 UTC

4 points

0 comments1 min readLW link

Biosecurity and AI: Risks and Opportunities

Steve Newman27 Feb 2024 18:45 UTC

11 points

1 comment7 min readLW link

(www.safe.ai)

The “Everyone Can’t Be Wrong” Prior causes AI risk denial but helped prehistoric people

Knight Lee9 Jan 2025 5:54 UTC

1 point

0 comments2 min readLW link

And the AI would have got away with it too, if...

Stuart_Armstrong22 May 2019 21:35 UTC

75 points

7 comments1 min readLW link

Report on Frontier Model Training

YafahEdelman30 Aug 2023 20:02 UTC

122 points

21 comments21 min readLW link

(docs.google.com)

Provably Honest—A First Step

Srijanak De5 Nov 2022 19:18 UTC

10 points

2 comments8 min readLW link

AI as a Civilizational Risk Part 3/6: Anti-economy and Signal Pollution

PashaKamyshev31 Oct 2022 17:03 UTC

7 points

4 comments14 min readLW link

A Letter to the Editor of MIT Technology Review

Jeffs30 Aug 2023 16:59 UTC

0 points

0 comments2 min readLW link

SIGMI Certification Criteria

a littoral wizard20 Jan 2025 2:41 UTC

6 points

0 comments1 min readLW link

My summary of “Pragmatic AI Safety”

Eleni Angelou5 Nov 2022 12:54 UTC

3 points

0 comments5 min readLW link

Confusions and updates on STEM AI

Eleni Angelou19 May 2023 21:34 UTC

23 points

0 comments3 min readLW link

[Question] What if AGI had its own universe to maybe wreck?

mseale26 Oct 2023 17:49 UTC

−1 points

2 comments1 min readLW link

Levels of AI Self-Improvement

avturchin29 Apr 2018 11:45 UTC

11 points

1 comment39 min readLW link

Course recommendations for Friendliness researchers

Louie9 Jan 2013 14:33 UTC

96 points

112 comments10 min readLW link

An Appeal to AI Superintelligence: Reasons Not to Preserve (most of) Humanity

Alex Beyman22 Mar 2023 4:09 UTC

−14 points

6 comments19 min readLW link

How reasonable is taking extinction risk?

FVelde23 Jul 2024 18:05 UTC

2 points

4 comments4 min readLW link

[Question] Have we seen any “ReLU instead of sigmoid-type improvements” recently

KvmanThinking23 Nov 2024 3:51 UTC

2 points

4 comments1 min readLW link

[Question] Is there any policy for a fair treatment of AIs whose friendliness is in doubt?

nahoj18 Nov 2022 19:01 UTC

16 points

10 comments1 min readLW link

AISN #21: Google DeepMind’s GPT-4 Competitor, Military Investments in Autonomous Drones, The UK AI Safety Summit, and Case Studies in AI Policy

Dan H5 Sep 2023 15:03 UTC

15 points

0 comments5 min readLW link

(newsletter.safe.ai)

Announcement: AI alignment prize round 4 winners

cousin_it20 Jan 2019 14:46 UTC

74 points

41 comments1 min readLW link

I made AI Risk Propaganda

monkymind29 Mar 2023 14:26 UTC

−3 points

0 comments1 min readLW link

Labor Participation is a High-Priority AI Alignment Risk

alex17 Jun 2024 18:09 UTC

7 points

0 comments17 min readLW link

Williams-Beuren Syndrome: Frendly Mutations

Takk5 Apr 2023 20:59 UTC

−1 points

1 comment1 min readLW link

Towards AI Safety Infrastructure: Talk & Outline

Paul Bricman7 Jan 2024 9:31 UTC

11 points

0 comments2 min readLW link

(www.youtube.com)

Reframing misaligned AGI’s: well-intentioned non-neurotypical assistants

zhukeepa1 Apr 2018 1:22 UTC

46 points

14 comments2 min readLW link

(retired article) AGI With Internet Access: Why we won’t stuff the genie back in its bottle.

Max TK18 Mar 2023 3:43 UTC

5 points

10 comments4 min readLW link

Palisade is hiring: Exec Assistant, Content Lead, Ops Lead, and Policy Lead

Charlie Rogers-Smith9 Oct 2024 0:04 UTC

11 points

0 comments4 min readLW link

Quantitative cruxes in Alignment

Martín Soto2 Jul 2023 20:38 UTC

19 points

0 comments23 min readLW link

Acceptability Verification: A Research Agenda

David Udell and evhub

12 Jul 2022 20:11 UTC

50 points

0 comments1 min readLW link

(docs.google.com)

ChatGPT’s “fuzzy alignment” isn’t evidence of AGI alignment: the banana test

Michael Tontchev23 Mar 2023 7:12 UTC

23 points

6 comments4 min readLW link

Is “red” for GPT-4 the same as “red” for you?

Yusuke Hayashi6 May 2023 17:55 UTC

9 points

6 comments2 min readLW link

Symbiotic self-alignment of AIs.

Spiritus Dei7 Nov 2023 17:18 UTC

1 point

0 comments3 min readLW link

The Importance of AI Alignment, explained in 5 points

Daniel_Eth11 Feb 2023 2:56 UTC

33 points

2 comments13 min readLW link

Rethink Priorities: Seeking Expressions of Interest for Special Projects Next Year

kierangreig29 Nov 2023 13:59 UTC

4 points

0 comments5 min readLW link

$250K in Prizes: SafeBench Competition Announcement

ozhang3 Apr 2024 22:07 UTC

26 points

0 comments1 min readLW link

Curse of knowledge and Naive realism: Bias in Evaluating AGI X-Risks

Remmelt and flandry19

31 Dec 2022 13:33 UTC

−7 points

1 comment1 min readLW link

(www.lesswrong.com)

Why kill everyone?

arisAlexis5 Mar 2023 11:53 UTC

7 points

5 comments2 min readLW link

Two Stupid AI Alignment Ideas

aphyer16 Nov 2021 16:13 UTC

27 points

3 comments4 min readLW link

An “Observatory” For a Shy Super AI?

Sherrinford27 Sep 2024 21:22 UTC

5 points

0 comments1 min readLW link

(robreid.substack.com)

How to express this system for ethically aligned AGI as a Mathematical formula?

Oliver Siegel19 Apr 2023 20:13 UTC

−1 points

0 comments1 min readLW link

[Question] Has Eliezer publicly and satisfactorily responded to attempted rebuttals of the analogy to evolution?

kaler28 Jul 2024 12:23 UTC

10 points

14 comments1 min readLW link

Limits of safe and aligned AI

Shivam8 Oct 2024 21:30 UTC

2 points

0 comments4 min readLW link

[Linkpost] A survey on over 300 works about interpretability in deep networks

scasper12 Sep 2022 19:07 UTC

97 points

7 comments2 min readLW link

(arxiv.org)

Yoshua Bengio: Reasoning through arguments against taking AI safety seriously

Judd Rosenblatt11 Jul 2024 23:53 UTC

70 points

3 comments1 min readLW link

(yoshuabengio.org)

Qualitative Strategies of Friendliness

Eliezer Yudkowsky30 Aug 2008 2:12 UTC

30 points

56 comments12 min readLW link

Two ideas for alignment, perpetual mutual distrust and induction

APaleBlueDot25 May 2023 0:56 UTC

1 point

2 comments4 min readLW link

“Sorcerer’s Apprentice” from Fantasia as an analogy for alignment

awg29 Mar 2023 18:21 UTC

9 points

4 comments1 min readLW link

(video.disney.com)

Possible miracles

Orpheus16 and Thomas Larsen

9 Oct 2022 18:17 UTC

64 points

34 comments8 min readLW link

A conversation with Pi, a conversational AI.

Spiritus Dei15 Sep 2023 23:13 UTC

1 point

0 comments1 min readLW link

AI Risk US Presidential Candidate

Simon Berens11 Apr 2023 19:31 UTC

5 points

3 comments1 min readLW link

False Positives in Entity-Level Hallucination Detection: A Technical Challenge

MaxKamachee14 Jan 2025 19:22 UTC

1 point

0 comments2 min readLW link

What will the first human-level AI look like, and how might things go wrong?

EuanMcLean23 May 2024 11:17 UTC

20 points

2 comments15 min readLW link

What mistakes has the AI safety movement made?

EuanMcLean23 May 2024 11:19 UTC

64 points

29 comments12 min readLW link

My Most Likely Reason to Die Young is AI X-Risk

AISafetyIsNotLongtermist4 Jul 2022 17:08 UTC

61 points

24 comments4 min readLW link

(forum.effectivealtruism.org)

Charbel-Raphaël and Lucius discuss interpretability

Mateusz Bagiński, Charbel-Raphaël and Lucius Bushnaq

30 Oct 2023 5:50 UTC

112 points

7 comments21 min readLW link

Capability and Agency as Cornerstones of AI risk — My current model

wilm15 Sep 2022 8:25 UTC

10 points

4 comments12 min readLW link

Pausing AI Developments Isn’t Enough. We Need to Shut it All Down by Eliezer Yudkowsky

jacquesthibs29 Mar 2023 23:16 UTC

293 points

297 comments3 min readLW link

(time.com)

Do we have a plan for the “first critical try” problem?

Christopher King3 Apr 2023 16:27 UTC

−3 points

14 comments1 min readLW link

A trick for Safer GPT-N

Razied23 Aug 2020 0:39 UTC

7 points

1 comment2 min readLW link

“Smarter than us” is out!

Stuart_Armstrong25 Feb 2014 15:50 UTC

41 points

57 comments1 min readLW link

AMA on Truthful AI: Owen Cotton-Barratt, Owain Evans & co-authors

Owain_Evans22 Oct 2021 16:23 UTC

31 points

15 comments1 min readLW link

Instrumental convergence: scale and physical interactions

Edouard Harris and simonsdsuo

14 Oct 2022 15:50 UTC

22 points

0 comments17 min readLW link

(www.gladstone.ai)

[Question] Clarifying how misalignment can arise from scaling LLMs

Util19 Aug 2023 14:16 UTC

3 points

1 comment1 min readLW link

Darwinian Traps and Existential Risks

KristianRonn25 Aug 2024 22:37 UTC

85 points

14 comments10 min readLW link

AI as Super-Demagogue

RationalDino5 Nov 2023 21:21 UTC

11 points

12 comments9 min readLW link

Lightning Post: Things people in AI Safety should stop talking about

Prometheus20 Jun 2023 15:00 UTC

23 points

6 comments2 min readLW link

Open Source LLMs Can Now Actively Lie

Josh Levy1 Jun 2023 22:03 UTC

6 points

0 comments3 min readLW link

On the possibility of impossibility of AGI Long-Term Safety

Roman Yen13 May 2023 18:38 UTC

8 points

3 comments9 min readLW link

Support me in a Week-Long Picketing Campaign Near OpenAI’s HQ: Seeking Support and Ideas from the LessWrong Community

Percy30 Apr 2023 17:48 UTC

−21 points

15 comments1 min readLW link

[LQ] Some Thoughts on Messaging Around AI Risk

DragonGod25 Jun 2022 13:53 UTC

5 points

3 comments6 min readLW link

Factoring P(doom) into a bayesian network

Joseph Gardi17 Oct 2024 17:55 UTC

1 point

0 comments1 min readLW link

[untitled post]

superads91 6 Feb 2022 20:39 UTC

−5 points

8 comments1 min readLW link

Navigating public AI x-risk hype while pursuing technical solutions

Dan Braun19 Feb 2023 12:22 UTC

18 points

0 comments2 min readLW link

[Question] What do you mean with ‘alignment is solvable in principle’?

Remmelt17 Jan 2025 15:03 UTC

3 points

9 comments1 min readLW link

[Question] Is there any literature on using socialization for AI alignment?

Nathan1123 19 Apr 2023 22:16 UTC

10 points

9 comments2 min readLW link

The Dark Side of Cognition Hypothesis

Cameron Berg3 Oct 2021 20:10 UTC

19 points

1 comment16 min readLW link

Proposal: we should start referring to the risk from unaligned AI as a type of accident risk

Christopher King16 May 2023 15:18 UTC

22 points

6 comments2 min readLW link

LLMs stifle creativity, eliminate opportunities for serendipitous discovery and disrupt intergenerational transfer of wisdom

Ghdz5 Aug 2024 18:27 UTC

6 points

2 comments7 min readLW link

[Question] AI Safety orgs- what’s your biggest bottleneck right now?

Kabir Kumar16 Nov 2023 2:02 UTC

1 point

0 comments1 min readLW link

Proposal: labs should precommit to pausing if an AI argues for itself to be improved

NickGabs2 Jun 2023 22:31 UTC

3 points

3 comments4 min readLW link

Leveraging Legal Informatics to Align AI

John Nay18 Sep 2022 20:39 UTC

11 points

0 comments3 min readLW link

(forum.effectivealtruism.org)

Dario Amodei’s “Machines of Loving Grace” sound incredibly dangerous, for Humans

Super AGI27 Oct 2024 5:05 UTC

8 points

1 comment1 min readLW link

AGI ruin scenarios are likely (and disjunctive)

So8res27 Jul 2022 3:21 UTC

177 points

38 comments6 min readLW link

Plausibly, almost every powerful algorithm would be manipulative

Stuart_Armstrong6 Feb 2020 11:50 UTC

38 points

25 comments3 min readLW link

The way AGI wins could look very stupid

Christopher King12 May 2023 16:34 UTC

56 points

22 comments1 min readLW link

Thoughts on “AI is easy to control” by Pope & Belrose

Steven Byrnes1 Dec 2023 17:30 UTC

197 points

63 comments14 min readLW link 1 review

[Question] Is there anything that can stop AGI development in the near term?

Wulky Wilkinsen22 Apr 2021 20:37 UTC

5 points

5 comments1 min readLW link

[Question] Would more model evals teams be good?

Ryan Kidd25 Feb 2023 22:01 UTC

20 points

4 comments1 min readLW link

AI as a Civilizational Risk Part 6/6: What can be done

PashaKamyshev3 Nov 2022 19:48 UTC

2 points

4 comments4 min readLW link

More Than Just A, T, C, and G: Screening for Hidden Dangers in DNA Sequences

sgd21 Apr 2025 20:12 UTC

1 point

0 comments11 min readLW link

AI Risk Intro 2: Solving The Problem

CallumMcDougall and L Rudolf L

22 Sep 2022 13:55 UTC

22 points

0 comments27 min readLW link

Reflections on Trusting Trust & AI

Itay Yona16 Jan 2023 6:36 UTC

10 points

1 comment3 min readLW link

(mentaleap.ai)

[Question] What are the biggest current impacts of AI?

Sam Clarke7 Mar 2021 21:44 UTC

15 points

5 comments1 min readLW link

The humanity’s biggest mistake

RomanS10 Mar 2023 16:30 UTC

0 points

1 comment2 min readLW link

Alignment being impossible might be better than it being really difficult

Martín Soto25 Jul 2022 23:57 UTC

13 points

2 comments2 min readLW link

[Thought Experiment] Tomorrow’s Echo—The future of synthetic companionship.

Vimal Naran26 Oct 2023 17:54 UTC

−7 points

2 comments2 min readLW link

Oren’s Field Guide of Bad AGI Outcomes

Eris Discordia26 Sep 2022 4:06 UTC

0 points

0 comments1 min readLW link

A plausible story about AI risk.

DeLesley Hutchins10 Jun 2022 2:08 UTC

16 points

2 comments4 min readLW link

The public supports regulating AI for safety

Zach Stein-Perlman17 Feb 2023 4:10 UTC

114 points

9 comments1 min readLW link

(aiimpacts.org)

Alignment’s phlogiston

Eleni Angelou18 Aug 2022 22:27 UTC

10 points

2 comments2 min readLW link

What should AI safety be trying to achieve?

EuanMcLean23 May 2024 11:17 UTC

17 points

1 comment13 min readLW link

Saying the quiet part out loud: trading off x-risk for personal immortality

disturbance2 Nov 2023 17:43 UTC

84 points

89 comments5 min readLW link

In Defense of Wrapper-Minds

Thane Ruthenis28 Dec 2022 18:28 UTC

24 points

38 comments3 min readLW link

Why empiricists should believe in AI risk

Knight Lee11 Dec 2024 3:51 UTC

5 points

0 comments1 min readLW link

Paper: Identifying the Risks of LM Agents with an LM-Emulated Sandbox—University of Toronto 2023 - Benchmark consisting of 36 high-stakes tools and 144 test cases!

Singularian2501 9 Oct 2023 0:00 UTC

6 points

0 comments1 min readLW link

The Governance Problem and the “Pretty Good” X-Risk

Zach Stein-Perlman29 Aug 2021 18:00 UTC

5 points

2 comments11 min readLW link

What Environment Properties Select Agents For World-Modeling?

Thane Ruthenis23 Jul 2022 19:27 UTC

25 points

1 comment12 min readLW link

A&I (Rihanna ‘S&M’ parody lyrics)

nahoj21 May 2023 22:34 UTC

−2 points

0 comments2 min readLW link

Towards empathy in RL agents and beyond: Insights from cognitive science for AI Alignment

Marc Carauleanu3 Apr 2023 19:59 UTC

15 points

6 comments1 min readLW link

(clipchamp.com)

Challenge to the notion that anything is (maybe) possible with AGI

Remmelt and flandry19

1 Jan 2023 3:57 UTC

−27 points

4 comments1 min readLW link

(mflb.com)

Contra EY: Can AGI destroy us without trial & error?

nsokolsky13 Jun 2022 18:26 UTC

137 points

72 comments15 min readLW link

The simple picture on AI safety

Alex Flint27 May 2018 19:43 UTC

31 points

10 comments2 min readLW link

Goal Alignment Is Robust To the Sharp Left Turn

Thane Ruthenis13 Jul 2022 20:23 UTC

43 points

16 comments4 min readLW link

Measuring artificial intelligence on human benchmarks is naive

Anomalous11 Apr 2023 11:34 UTC

11 points

4 comments1 min readLW link

(forum.effectivealtruism.org)

Opportunities for individual donors in AI safety

Alex Flint31 Mar 2018 18:37 UTC

30 points

3 comments11 min readLW link

[Question] Why is violence against AI labs a taboo?

ArisC26 May 2023 8:00 UTC

−21 points

63 comments1 min readLW link

[Linkpost] Mark Zuckerberg confronted about Meta’s Llama 2 AI’s ability to give users detailed guidance on making anthrax—Business Insider

mic26 Sep 2023 12:05 UTC

18 points

11 comments2 min readLW link

(www.businessinsider.com)

Hope to live or fear to die?

Knight Lee27 Nov 2024 10:42 UTC

3 points

0 comments1 min readLW link

The two paragraph argument for AI risk

CronoDAS25 Nov 2023 2:01 UTC

19 points

8 comments1 min readLW link

The benefits and risks of optimism (about AI safety)

Karl von Wendt3 Dec 2023 12:45 UTC

−7 points

6 comments5 min readLW link

Accurate Models of AI Risk Are Hyperexistential Exfohazards

Thane Ruthenis25 Dec 2022 16:50 UTC

33 points

38 comments9 min readLW link

The Alignment Problem Needs More Positive Fiction

Netcentrica21 Aug 2022 22:01 UTC

6 points

3 comments5 min readLW link

Interlude: But Who Optimizes The Optimizer?

Paul Bricman23 Sep 2022 15:30 UTC

15 points

0 comments10 min readLW link

Belief Bias: Bias in Evaluating AGI X-Risks

Remmelt and flandry19

2 Jan 2023 8:59 UTC

−10 points

1 comment1 min readLW link

An International Manhattan Project for Artificial Intelligence

Glenn Clayton27 Apr 2023 17:34 UTC

−9 points

2 comments5 min readLW link

[Question] Oracle AGI—How can it escape, other than security issues? (Steganography?)

RationalSieve25 Dec 2022 20:14 UTC

3 points

6 comments1 min readLW link

Agency As a Natural Abstraction

Thane Ruthenis13 May 2022 18:02 UTC

55 points

9 comments13 min readLW link

Complexity No Bar to AI (Or, why Computational Complexity matters less than you think for real life problems)

Noosphere89 7 Aug 2022 19:55 UTC

17 points

14 comments3 min readLW link

(www.gwern.net)

DeepMind’s generalist AI, Gato: A non-technical explainer

frances_lorenz, Nora Belrose and jonmenaster

16 May 2022 21:21 UTC

63 points

6 comments6 min readLW link

Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics

ank22 Feb 2025 0:12 UTC

1 point

0 comments6 min readLW link

Post series on “Liability Law for reducing Existential Risk from AI”

Nora_Ammann29 Feb 2024 4:39 UTC

42 points

1 comment1 min readLW link

(forum.effectivealtruism.org)

AI Alignment Prize: Round 2 due March 31, 2018

Zvi12 Mar 2018 12:10 UTC

28 points

2 comments3 min readLW link

(thezvi.wordpress.com)

Morality as Cooperation Part III: Failure Modes

DeLesley Hutchins5 Dec 2024 9:39 UTC

4 points

0 comments20 min readLW link

Superintelligence will outsmart us or it isn’t superintelligence

Neil 3 Apr 2023 15:01 UTC

−4 points

4 comments1 min readLW link

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

Simon Lermen and Jeffrey Ladish

12 Oct 2023 19:58 UTC

151 points

29 comments14 min readLW link

Prediction: any uncontrollable AI will turn earth into a giant computer

Karl von Wendt17 Apr 2023 12:30 UTC

11 points

8 comments3 min readLW link

Introducing AI Alignment Inc., a California public benefit corporation...

TherapistAI7 Mar 2023 18:47 UTC

1 point

4 comments1 min readLW link

Will the world’s elites navigate the creation of AI just fine?

lukeprog31 May 2013 18:49 UTC

36 points

266 comments2 min readLW link

Aligned Behavior is not Evidence of Alignment Past a Certain Level of Intelligence

Ronny Fernandez5 Dec 2022 15:19 UTC

19 points

5 comments7 min readLW link

Idea: Open Access AI Safety Journal

Gordon Seidoh Worley23 Mar 2018 18:27 UTC

28 points

11 comments1 min readLW link

AGI-Automated Interpretability is Suicide

__RicG__10 May 2023 14:20 UTC

25 points

33 comments7 min readLW link

Something Is Lost When AI Makes Art

utilistrutil18 Aug 2024 22:53 UTC

18 points

1 comment10 min readLW link

[Question] Any further work on AI Safety Success Stories?

Krieger2 Oct 2022 9:53 UTC

8 points

6 comments1 min readLW link

Analysis of key AI analogies

Kevin Kohler29 Jun 2024 10:55 UTC

10 points

2 comments15 min readLW link

World and Mind in Artificial Intelligence: arguments against the AI pause

Arturo Macias18 Apr 2023 14:40 UTC

1 point

0 comments1 min readLW link

(forum.effectivealtruism.org)

Allegory On AI Risk, Game Theory, and Mithril

James_Miller13 Feb 2017 20:41 UTC

45 points

57 comments3 min readLW link

A call for a quantitative report card for AI bioterrorism threat models

Juno4 Dec 2023 6:35 UTC

12 points

0 comments10 min readLW link

Reflection Mechanisms as an Alignment Target—Attitudes on “near-term” AI

elandgre, Beth Barnes and Marius Hobbhahn

2 Mar 2023 4:29 UTC

21 points

0 comments8 min readLW link

[Question] What’s the protocol for if a novice has ML ideas that are unlikely to work, but might improve capabilities if they do work?

drocta9 Jan 2024 22:51 UTC

6 points

2 comments2 min readLW link

The Last Light

Bridgett Kay14 Apr 2025 15:41 UTC

31 points

2 comments4 min readLW link

A god in a box

predict-woo29 Jan 2025 0:55 UTC

1 point

0 comments7 min readLW link

Notes on “the hot mess theory of AI misalignment”

JakubK21 Apr 2023 10:07 UTC

16 points

0 comments5 min readLW link

(sohl-dickstein.github.io)

Planning to build a cryptographic box with perfect secrecy

Lysandre Terrisse31 Dec 2023 9:31 UTC

40 points

6 comments11 min readLW link

On unfixably unsafe AGI architectures

Steven Byrnes19 Feb 2020 21:16 UTC

33 points

8 comments5 min readLW link

Big Picture AI Safety: Introduction

EuanMcLean23 May 2024 11:15 UTC

46 points

7 comments5 min readLW link

AI Alternative Futures: Scenario Mapping Artificial Intelligence Risk—Request for Participation (Closed)

Kakili27 Apr 2022 22:07 UTC

10 points

2 comments8 min readLW link

Do Earths with slower economic growth have a better chance at FAI?

Eliezer Yudkowsky12 Jun 2013 19:54 UTC

59 points

175 comments4 min readLW link

Causal representation learning as a technique to prevent goal misgeneralization

PabloAMC4 Jan 2023 0:07 UTC

21 points

0 comments8 min readLW link

[Question] Can someone explain to me why MIRI is so pessimistic of our chances of survival?

iamthouthouarti14 Apr 2022 20:28 UTC

10 points

7 comments1 min readLW link

How I Learned To Stop Worrying And Love The Shoggoth

Peter Merel12 Jul 2023 17:47 UTC

9 points

15 comments5 min readLW link

Knowledge Base 1: Could it increase intelligence and make it safer?

iwis30 Sep 2024 16:00 UTC

−4 points

0 comments4 min readLW link

Proposing the Post-Singularity Symbiotic Researches

Hiroshi Yamakawa20 Jun 2024 4:05 UTC

6 points

1 comment12 min readLW link

A flaw in the A.G.I. Ruin Argument

Cole Wyeth19 May 2023 19:40 UTC

1 point

7 comments3 min readLW link

(colewyeth.com)

“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?

Jemal Young29 Mar 2023 15:56 UTC

27 points

3 comments6 min readLW link

[Crosspost] Organizing a debate with experts and MPs to raise AI xrisk awareness: a possible blueprint

otto.barten19 Apr 2023 11:45 UTC

8 points

0 comments4 min readLW link

(forum.effectivealtruism.org)

AI Safety Newsletter #4: AI and Cybersecurity, Persuasive AIs, Weaponization, and Geoffrey Hinton talks AI risks

ozhang, Dan H and Orpheus16

2 May 2023 18:41 UTC

32 points

0 comments5 min readLW link

(newsletter.safe.ai)

Formal Solution to the Inner Alignment Problem

michaelcohen18 Feb 2021 14:51 UTC

49 points

123 comments2 min readLW link

Formulating the AI Doom Argument for Analytic Philosophers

JonathanErhardt12 May 2023 7:54 UTC

13 points

0 comments2 min readLW link

Loss of control of AI is not a likely source of AI x-risk

squek7 Nov 2022 18:44 UTC

−6 points

0 comments5 min readLW link

Reviews of “Is power-seeking AI an existential risk?”

Joe Carlsmith16 Dec 2021 20:48 UTC

80 points

20 comments1 min readLW link

Why I’m not worried about imminent doom

kwiat.dev10 Apr 2023 15:31 UTC

7 points

2 comments4 min readLW link

The Universal Lockpick Hypothesis: A Structural Vulnerability of All Life

Cole Holin23 Mar 2025 19:27 UTC

1 point

0 comments2 min readLW link

Why Yudkowsky Is Wrong And What He Does Can Be More Dangerous

idontagreewiththat6 Jun 2023 17:59 UTC

−38 points

4 comments3 min readLW link

Bandwagon effect: Bias in Evaluating AGI X-Risks

Remmelt and flandry19

28 Dec 2022 7:54 UTC

1 point

0 comments1 min readLW link

A long reply to Ben Garfinkel on Scrutinizing Classic AI Risk Arguments

Søren Elverlin27 Sep 2020 17:51 UTC

17 points

6 comments1 min readLW link

Access to AI: a human right?

dmtea25 Jul 2020 9:38 UTC

5 points

3 comments2 min readLW link

I Vouch For MIRI

Zvi17 Dec 2017 17:50 UTC

39 points

9 comments5 min readLW link

(thezvi.wordpress.com)

A Modest Pivotal Act

anonymousaisafety13 Jun 2022 19:24 UTC

−16 points

1 comment5 min readLW link

Formalizing the “AI x-risk is unlikely because it is ridiculous” argument

Christopher King3 May 2023 18:56 UTC

48 points

17 comments3 min readLW link

Is AI Gain-of-Function research a thing?

MadHatter12 Nov 2022 2:33 UTC

9 points

2 comments2 min readLW link

[Question] AI Rights: In your view, what would be required for an AGI to gain rights and protections from the various Governments of the World?

Super AGI9 Jun 2023 1:24 UTC

10 points

26 comments1 min readLW link

Is AGI suicidality the golden ray of hope?

Alex Kirko4 Apr 2023 23:29 UTC

−18 points

4 comments1 min readLW link

New, Brief Popular-Level Introduction to AI Risks and Superintelligence

LyleN23 Jan 2015 15:43 UTC

33 points

3 comments1 min readLW link

Towards Gears-Level Understanding of Agency

Thane Ruthenis16 Jun 2022 22:00 UTC

25 points

4 comments18 min readLW link

Half-baked alignment idea

ozb28 Mar 2023 17:47 UTC

6 points

27 comments1 min readLW link

Community Building for Graduate Students: A Targeted Approach

Neil Crawford6 Sep 2022 17:17 UTC

6 points

0 comments4 min readLW link

[Question] Wouldn’t an intelligent agent keep us alive and help us align itself to our values in order to prevent risk ? by Risk I mean experimentation by trying to align potentially smarter replicas?

Terrence Rotoufle21 Mar 2023 17:44 UTC

−3 points

1 comment2 min readLW link

European Master’s Programs in Machine Learning, Artificial Intelligence, and related fields

Master Programs ML/AI14 Nov 2020 15:51 UTC

34 points

6 comments1 min readLW link

H-JEPA might be technically alignable in a modified form

Roman Leventov8 May 2023 23:04 UTC

12 points

2 comments7 min readLW link

PCAST Working Group on Generative AI Invites Public Input

Christopher King13 May 2023 22:49 UTC

7 points

0 comments1 min readLW link

(terrytao.wordpress.com)

Exploring the Precautionary Principle in AI Development: Historical Analogies and Lessons Learned

Christopher King21 Mar 2023 3:53 UTC

−1 points

2 comments9 min readLW link

AI Safety Research Project Ideas

Owain_Evans21 May 2021 13:39 UTC

58 points

2 comments3 min readLW link

Miles Brundage resigned from OpenAI, and his AGI readiness team was disbanded

garrison23 Oct 2024 23:40 UTC

118 points

1 comment7 min readLW link

(garrisonlovely.substack.com)

Q&A with experts on risks from AI #2

XiXiDu9 Jan 2012 19:40 UTC

22 points

29 comments7 min readLW link

The Compendium, A full argument about extinction risk from AGI

adamShimi, Gabriel Alfour, Connor Leahy, Chris Scammell and Andrea_Miotti

31 Oct 2024 12:01 UTC

196 points

52 comments2 min readLW link

(www.thecompendium.ai)

The Friendly AI Game

bentarm15 Mar 2011 16:45 UTC

50 points

178 comments1 min readLW link

Memes and Rational Decisions

inferential9 Jan 2015 6:42 UTC

35 points

18 comments10 min readLW link

Big list of AI safety videos

JakubK9 Jan 2023 6:12 UTC

11 points

2 comments1 min readLW link

(docs.google.com)

Three pillars for avoiding AGI catastrophe: Technical alignment, deployment decisions, and coordination

LintzA3 Aug 2022 23:15 UTC

24 points

0 comments11 min readLW link

Value drift threat models

Garrett Baker12 May 2023 23:03 UTC

27 points

4 comments5 min readLW link

4 Key Assumptions in AI Safety

Prometheus7 Nov 2022 10:50 UTC

20 points

5 comments7 min readLW link

Stability AI releases StableLM, an open-source ChatGPT counterpart

Ozyrus20 Apr 2023 6:04 UTC

11 points

3 comments1 min readLW link

(github.com)

AI Safety Camp, Virtual Edition 2023

Linda Linsefors6 Jan 2023 11:09 UTC

40 points

10 comments3 min readLW link

(aisafety.camp)

Geoffrey Hinton on the Past, Present, and Future of AI

Stephen McAleese12 Oct 2024 16:41 UTC

23 points

5 comments18 min readLW link

[Question] Danger(s) of theorem-proving AI?

Yitz16 Mar 2022 2:47 UTC

8 points

8 comments1 min readLW link

AI acceleration, DeepSeek, moral philosophy

Josh H2 Feb 2025 0:08 UTC

2 points

0 comments12 min readLW link

Ilya Sutskever’s thoughts on AI safety (July 2023): a transcript with my comments

mishka10 Aug 2023 19:07 UTC

22 points

3 comments5 min readLW link

Sticky goals: a concrete experiment for understanding deceptive alignment

evhub2 Sep 2022 21:57 UTC

39 points

13 comments3 min readLW link

I just watched don’t look up.

ATheCoder23 Jun 2023 21:22 UTC

0 points

5 comments2 min readLW link

[Question] Does agency necessarily imply self-preservation instinct?

Mislav Jurić1 May 2023 16:06 UTC

5 points

8 comments1 min readLW link

Breaking Oracles: superrationality and acausal trade

Stuart_Armstrong25 Nov 2019 10:40 UTC

26 points

15 comments1 min readLW link

Unpredictability and the Increasing Difficulty of AI Alignment for Increasingly Intelligent AI

Max_He-Ho31 May 2023 22:25 UTC

5 points

2 comments20 min readLW link

AGI deployment as an act of aggression

dr_s5 Apr 2023 6:39 UTC

28 points

30 comments13 min readLW link

The Astronomical Sacrifice Dilemma

Matthew McRedmond11 Mar 2024 19:58 UTC

15 points

3 comments4 min readLW link

From Paperclips to Bombs: The Evolution of AI Risk Discourse on LessWrong

David Harket16 Jun 2025 5:16 UTC

3 points

0 comments24 min readLW link

Quick Thoughts on Language Models

RohanS18 Jul 2023 20:38 UTC

6 points

0 comments4 min readLW link

We don’t need AGI for an amazing future

Karl von Wendt4 May 2023 12:10 UTC

19 points

32 comments5 min readLW link

Takeoff speeds presentation at Anthropic

Tom Davidson4 Jun 2024 22:46 UTC

93 points

0 comments25 min readLW link

Field-Building and Deep Models

Ben Pace13 Jan 2018 21:16 UTC

21 points

12 comments4 min readLW link

CFP for Rebellion and Disobedience in AI workshop

Ram Rachum29 Dec 2022 16:08 UTC

15 points

0 comments1 min readLW link

More experiments in GPT-4 agency: writing memos

Christopher King24 Mar 2023 17:51 UTC

5 points

2 comments10 min readLW link

Speculation on mapping the moral landscape for future Ai Alignment

Sven Heinz (Welwordion)16 Apr 2023 13:43 UTC

1 point

0 comments1 min readLW link

Risk Map of AI Systems

VojtaKovarik and Jan_Kulveit

15 Dec 2020 9:16 UTC

28 points

3 comments8 min readLW link

Illusion of truth effect and Ambiguity effect: Bias in Evaluating AGI X-Risks

Remmelt5 Jan 2023 4:05 UTC

−13 points

2 comments1 min readLW link

Ethical Deception: Should AI Ever Lie?

Jason Reid2 Aug 2024 17:53 UTC

5 points

2 comments7 min readLW link

On excluding dangerous information from training

ShayBenMoshe17 Nov 2023 11:14 UTC

23 points

5 comments3 min readLW link

But What If We Actually Want To Maximize Paperclips?

snerx25 May 2023 7:13 UTC

−17 points

6 comments7 min readLW link

Distinguishing AI takeover scenarios

Sam Clarke and Sammy Martin

8 Sep 2021 16:19 UTC

74 points

11 comments14 min readLW link

FAI Research Constraints and AGI Side Effects

JustinShovelain3 Jun 2015 19:25 UTC

27 points

59 comments7 min readLW link

Lenses of Control

WillPetillo22 Oct 2024 7:51 UTC

14 points

0 comments9 min readLW link

TED talk by Eliezer Yudkowsky: Unleashing the Power of Artificial Intelligence

bayesed7 May 2023 5:45 UTC

49 points

36 comments1 min readLW link

(www.youtube.com)

Foresight for AGI Safety Strategy: Mitigating Risks and Identifying Golden Opportunities

jacquesthibs5 Dec 2022 16:09 UTC

28 points

6 comments8 min readLW link

The Consciousness Conundrum: Why We Can’t Dismiss Machine Sentience

SystematicApproach13 Aug 2024 18:01 UTC

−22 points

1 comment3 min readLW link

Annual AGI Benchmarking Event

Lawrence Phillips27 Aug 2022 0:06 UTC

24 points

3 comments2 min readLW link

(www.metaculus.com)

The Human Alignment Problem for AIs

rife22 Jan 2025 4:06 UTC

10 points

5 comments3 min readLW link

A gentle apocalypse

pchvykov16 Aug 2021 5:03 UTC

3 points

5 comments3 min readLW link

AGI Clinics: A Safe Haven for Humanity’s First Encounters with Superintelligence

portr.17 Apr 2023 1:52 UTC

−5 points

1 comment1 min readLW link

Call for submissions: Choice of Futures survey questions

c.trout30 Apr 2023 6:59 UTC

4 points

0 comments2 min readLW link

(airtable.com)

Online AI Safety Discussion Day

Linda Linsefors8 Oct 2020 12:11 UTC

5 points

0 comments1 min readLW link

[Question] Accuracy of arguments that are seen as ridiculous and intuitively false but don’t have good counter-arguments

Christopher King29 Apr 2023 23:58 UTC

30 points

39 comments1 min readLW link

[Question] What is being improved in recursive self improvement?

Lone Pine25 Apr 2022 18:30 UTC

7 points

6 comments1 min readLW link

More Thoughts on the Human-AGI War

Seth Ahrenbach27 Dec 2023 1:03 UTC

−3 points

4 comments7 min readLW link

Current AI harms are also sci-fi

Christopher King8 Jun 2023 17:49 UTC

26 points

3 comments1 min readLW link

[Question] What criterion would you use to select companies likely to cause AI doom?

momom213 Jul 2023 20:31 UTC

8 points

4 comments1 min readLW link

Alignment: “Do what I would have wanted you to do”

Oleg Trott12 Jul 2024 16:47 UTC

11 points

48 comments1 min readLW link

Someone should fund an AGI Blockbuster

pinto28 Jul 2025 21:14 UTC

5 points

11 comments4 min readLW link

AI Incident Monitoring: A Brief Analysis

Spencer Ames2 May 2025 15:06 UTC

3 points

0 comments5 min readLW link

Non-loss of control AGI-related catastrophes are out of control too

Yi-Yang, Mo Putera and zeshen

12 Jun 2023 12:01 UTC

2 points

3 comments24 min readLW link

Are Generative World Models a Mesa-Optimization Risk?

Thane Ruthenis29 Aug 2022 18:37 UTC

14 points

2 comments3 min readLW link

Near-Term Risks of an Obedient Artificial Intelligence

ymeskhout18 Feb 2023 18:30 UTC

20 points

1 comment6 min readLW link

[Question] Term/Category for AI with Neutral Impact?

isomic11 May 2023 22:00 UTC

6 points

1 comment1 min readLW link

Help Understanding Preferences And Evil

Netcentrica27 Aug 2022 3:42 UTC

6 points

7 comments2 min readLW link

A great talk for AI noobs (according to an AI noob)

dov23 Apr 2023 5:34 UTC

10 points

1 comment1 min readLW link

(forum.effectivealtruism.org)

An AGI kill switch with defined security properties

Peterpiper5 Jul 2023 17:40 UTC

−5 points

6 comments1 min readLW link

What would we do if alignment were futile?

Grant Demaree14 Nov 2021 8:09 UTC

75 points

39 comments3 min readLW link

Next steps after AGISF at UMich

JakubK25 Jan 2023 20:57 UTC

10 points

0 comments5 min readLW link

(docs.google.com)

The Intelligence Curse

lukedrago3 Jan 2025 19:07 UTC

138 points

27 comments18 min readLW link

(lukedrago.substack.com)

Should ethicists be inside or outside a profession?

Eliezer Yudkowsky12 Dec 2018 1:40 UTC

97 points

7 comments9 min readLW link

Too Many Metaphors: A Case for Plain Talk in AI Safety

David Harket30 May 2025 19:29 UTC

0 points

8 comments2 min readLW link

Modeling AGI Safety Frameworks with Causal Influence Diagrams

Ramana Kumar21 Jun 2019 12:50 UTC

43 points

6 comments1 min readLW link

(arxiv.org)

AI Safety Research Camp—Project Proposal

David_Kristoffersson2 Feb 2018 4:25 UTC

29 points

11 comments8 min readLW link

A rant against robots

Lê Nguyên Hoang14 Jan 2020 22:03 UTC

65 points

7 comments5 min readLW link

Smarter Models Lie Less

Expertium20 Jun 2025 13:31 UTC

6 points

0 comments2 min readLW link

A decade of lurking, a month of posting

Max H9 Apr 2023 0:21 UTC

70 points

4 comments5 min readLW link

It’s (not) how you use it

Eleni Angelou7 Sep 2022 17:15 UTC

8 points

1 comment2 min readLW link

A Game About AI Alignment (& Meta-Ethics): What Are the Must Haves?

JonathanErhardt5 Sep 2022 7:55 UTC

18 points

15 comments2 min readLW link

Worldview iPeople—Future Fund’s AI Worldview Prize

Toni MUENDEL28 Oct 2022 1:53 UTC

−21 points

4 comments9 min readLW link

[Question] Is an AI religion justified?

p4rziv4l6 Aug 2024 15:42 UTC

−35 points

11 comments1 min readLW link

Q&A with Stan Franklin on risks from AI

XiXiDu11 Jun 2011 15:22 UTC

36 points

10 comments2 min readLW link

Poorly-Aimed Death Rays

Thane Ruthenis11 Jun 2022 18:29 UTC

48 points

5 comments4 min readLW link

Speed running everyone through the bad alignment bingo. $5k bounty for a LW conversational agent

ArthurB9 Mar 2023 9:26 UTC

140 points

33 comments2 min readLW link

Cheat sheet of AI X-risk

momom229 Jun 2023 4:28 UTC

19 points

1 comment7 min readLW link

Launched: Friendship is Optimal

iceman15 Nov 2012 4:57 UTC

77 points

32 comments1 min readLW link

[Question] What would you expect a massive multimodal online federated learner to be capable of?

Aryeh Englander27 Aug 2022 17:31 UTC

13 points

4 comments1 min readLW link

Proposal: Using Monte Carlo tree search instead of RLHF for alignment research

Christopher King20 Apr 2023 19:57 UTC

2 points

7 comments3 min readLW link

Survey on AI existential risk scenarios

Sam Clarke, apc and Jonas Schuett

8 Jun 2021 17:12 UTC

65 points

11 comments7 min readLW link

Immortality or death by AGI

ImmortalityOrDeathByAGI21 Sep 2023 23:59 UTC

47 points

30 comments4 min readLW link

(forum.effectivealtruism.org)

The idea of an “aligned superintelligence” seems misguided

ssadler27 Feb 2023 11:19 UTC

6 points

7 comments3 min readLW link

(ssadler.substack.com)

On taking AI risk seriously

Eleni Angelou13 Mar 2023 5:50 UTC

6 points

0 comments1 min readLW link

(www.nytimes.com)

AI Safety proposal—Influencing the superintelligence explosion

Morgan22 May 2024 23:31 UTC

0 points

2 comments7 min readLW link

Can we achieve AGI Alignment by balancing multiple human objectives?

Ben Smith3 Jul 2022 2:51 UTC

11 points

1 comment4 min readLW link

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

Yohan Mathew, joanv, robert mccarthy, ollie, Nandi and Dylan Cope

25 Sep 2024 14:52 UTC

37 points

2 comments4 min readLW link

(arxiv.org)

AI Regulation May Be More Important Than AI Alignment For Existential Safety

otto.barten24 Aug 2023 11:41 UTC

65 points

39 comments5 min readLW link

AGI’s Opposing Force

SimonBaars16 Aug 2024 4:18 UTC

9 points

2 comments1 min readLW link

Greed Is the Root of This Evil

Thane Ruthenis13 Oct 2022 20:40 UTC

21 points

7 comments8 min readLW link

Precise P(doom) isn’t very important for prioritization or strategy

harsimony14 Sep 2022 17:19 UTC

14 points

6 comments1 min readLW link

Threat Model Literature Review

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

79 points

4 comments25 min readLW link

On being sort of back and sort of new here

Loki zen16 Jul 2025 12:55 UTC

32 points

13 comments3 min readLW link

6-paragraph AI risk intro for MAISI

JakubK19 Jan 2023 9:22 UTC

11 points

0 comments2 min readLW link

(www.maisi.club)

A response to Conjecture’s CoEm proposal

Kristian Freed24 Apr 2023 17:23 UTC

7 points

0 comments4 min readLW link

New economic system for AI era

ksme sho17 Mar 2023 17:42 UTC

−1 points

1 comment5 min readLW link

The Game of Dominance

Karl von Wendt27 Aug 2023 11:04 UTC

24 points

15 comments6 min readLW link

Do you want a first-principled preparedness guide to prepare yourself and loved ones for potential catastrophes?

Ulrik Horn14 Nov 2023 12:13 UTC

16 points

5 comments15 min readLW link

How harmful are improvements in AI? + Poll

tilmanr and Marius Hobbhahn

15 Feb 2022 18:16 UTC

15 points

4 comments8 min readLW link

AISN #19: US-China Competition on AI Chips, Measuring Language Agent Developments, Economic Analysis of Language Model Propaganda, and White House AI Cyber Challenge

Dan H15 Aug 2023 16:10 UTC

21 points

0 comments5 min readLW link

(newsletter.safe.ai)

The Overlooked Necessity of Complete Semantic Representation in AI Safety and Alignment

williamsae15 Aug 2024 19:42 UTC

−1 points

0 comments3 min readLW link

Alignment Is Not All You Need

Adam Jones2 Jan 2025 17:50 UTC

43 points

10 comments6 min readLW link

(adamjones.me)

Winners-take-how-much?

YonatanK29 May 2023 21:56 UTC

3 points

2 comments3 min readLW link

Elon Musk donates $10M to the Future of Life Institute to keep AI beneficial

Paul Crowley15 Jan 2015 16:33 UTC

79 points

52 comments1 min readLW link

Linkpost: A tale of 2.5 orthogonality theses

DavidW13 Mar 2023 14:19 UTC

9 points

3 comments1 min readLW link

(forum.effectivealtruism.org)

“Destroy humanity” as an immediate subgoal

Seth Ahrenbach22 Dec 2023 18:52 UTC

3 points

13 comments3 min readLW link

What Failure Looks Like is not an existential risk (and alignment is not the solution)

otto.barten2 Feb 2024 18:59 UTC

13 points

12 comments9 min readLW link

Alignment Can Reduce Performance on Simple Ethical Questions

Daan Henselmans3 Feb 2025 19:35 UTC

16 points

7 comments6 min readLW link

Simple alignment plan that maybe works

Iknownothing18 Jul 2023 22:48 UTC

4 points

8 comments1 min readLW link

AI existential risk probabilities are too unreliable to inform policy

Oleg Trott28 Jul 2024 0:59 UTC

18 points

5 comments1 min readLW link

(www.aisnakeoil.com)

AI Safety Discussion Day

Linda Linsefors15 Sep 2020 14:40 UTC

20 points

0 comments1 min readLW link

Humanity’s Lack of Unity Will Lead to AGI Catastrophe

MiguelDev19 Mar 2023 19:18 UTC

3 points

2 comments4 min readLW link

MIT FutureTech are hiring for an Operations and Project Management role.

peterslattery17 May 2024 23:21 UTC

2 points

0 comments3 min readLW link

What are red flags for Neural Network suffering?

Marius Hobbhahn8 Nov 2021 12:51 UTC

29 points

15 comments12 min readLW link

Actionable-guidance and roadmap recommendations for the NIST AI Risk Management Framework

Dan H and Tony Barrett

17 May 2022 15:26 UTC

26 points

0 comments3 min readLW link

Who Aligns the Alignment Researchers?

Ben Smith5 Mar 2023 23:22 UTC

48 points

0 comments11 min readLW link

Strategies to Prevent AI Annihilation

lastchanceformankind4 Apr 2023 8:59 UTC

−2 points

0 comments4 min readLW link

Is There a Valley of Bad Civilizational Adequacy?

lbThingrb11 Mar 2022 19:49 UTC

13 points

1 comment2 min readLW link

Do you feel that AGI Alignment could be achieved in a Type 0 civilization?

Super AGI6 Jul 2023 4:52 UTC

−2 points

1 comment1 min readLW link

Aligned Objectives Prize Competition

Prometheus15 Jun 2023 12:42 UTC

8 points

0 comments2 min readLW link

(app.impactmarkets.io)
