Research Agendas

Last edit: 16 Sep 2021 15:08 UTC by plex

Research Agendas lay out the areas of research which individuals or groups are working on, or those that they believe would be valuable for others to work on. They help make research more legible and encourage discussion of priorities.

The Learning-Theoretic AI Alignment Research Agenda

Vanessa Kosoy · 4 Jul 2018 9:53 UTC
92 points
37 comments · 32 min read · LW link

New safety research agenda: scalable agent alignment via reward modeling

Vika · 20 Nov 2018 17:29 UTC
34 points
12 comments · 1 min read · LW link
(medium.com)

Embedded Agents

29 Oct 2018 19:53 UTC
228 points
41 comments · 1 min read · LW link · 2 reviews

On how various plans miss the hard bits of the alignment challenge

So8res · 12 Jul 2022 2:49 UTC
305 points
88 comments · 29 min read · LW link · 3 reviews

AI Governance: A Research Agenda

habryka · 5 Sep 2018 18:00 UTC
25 points
3 comments · 1 min read · LW link
(www.fhi.ox.ac.uk)

Research Agenda v0.9: Synthesising a human’s preferences into a utility function

Stuart_Armstrong · 17 Jun 2019 17:46 UTC
70 points
26 comments · 33 min read · LW link

Paul’s research agenda FAQ

zhukeepa · 1 Jul 2018 6:25 UTC
128 points
74 comments · 19 min read · LW link · 1 review

Our take on CHAI’s research agenda in under 1500 words

Alex Flint · 17 Jun 2020 12:24 UTC
113 points
18 comments · 5 min read · LW link

An overview of 11 proposals for building safe advanced AI

evhub · 29 May 2020 20:38 UTC
213 points
36 comments · 38 min read · LW link · 2 reviews

the QACI alignment plan: table of contents

Tamsin Leake · 21 Mar 2023 20:22 UTC
107 points
1 comment · 1 min read · LW link
(carado.moe)

AISC Project: Modelling Trajectories of Language Models

NickyP · 13 Nov 2023 14:33 UTC
27 points
0 comments · 12 min read · LW link

Embedded Agency (full-text version)

15 Nov 2018 19:49 UTC
201 points
17 comments · 54 min read · LW link

The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda

18 Dec 2023 20:35 UTC
166 points
21 comments · 12 min read · LW link

Trying to isolate objectives: approaches toward high-level interpretability

Jozdien · 9 Jan 2023 18:33 UTC
48 points
14 comments · 8 min read · LW link

MIRI’s technical research agenda

So8res · 23 Dec 2014 18:45 UTC
55 points
52 comments · 3 min read · LW link

Some conceptual alignment research projects

Richard_Ngo · 25 Aug 2022 22:51 UTC
176 points
15 comments · 3 min read · LW link

Deconfusing Human Values Research Agenda v1

Gordon Seidoh Worley · 23 Mar 2020 16:25 UTC
28 points
12 comments · 4 min read · LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

19 Apr 2023 16:09 UTC
159 points
34 comments · 21 min read · LW link

Thoughts on Human Models

21 Feb 2019 9:10 UTC
126 points
32 comments · 10 min read · LW link · 1 review

The Learning-Theoretic Agenda: Status 2023

Vanessa Kosoy · 19 Apr 2023 5:21 UTC
135 points
13 comments · 55 min read · LW link

Research agenda update

Steven Byrnes · 6 Aug 2021 19:24 UTC
55 points
40 comments · 7 min read · LW link

Preface to CLR’s Research Agenda on Cooperation, Conflict, and TAI

JesseClifton · 13 Dec 2019 21:02 UTC
62 points
10 comments · 2 min read · LW link

Announcing the Alignment of Complex Systems Research Group

4 Jun 2022 4:10 UTC
91 points
20 comments · 5 min read · LW link

New year, new research agenda post

Charlie Steiner · 12 Jan 2022 17:58 UTC
29 points
4 comments · 16 min read · LW link

Towards Hodge-podge Alignment

Cleo Nardo · 19 Dec 2022 20:12 UTC
93 points
30 comments · 9 min read · LW link

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

3 Sep 2020 18:27 UTC
68 points
11 comments · 2 min read · LW link

Key Questions for Digital Minds

Jacy Reese Anthis · 22 Mar 2023 17:13 UTC
22 points
0 comments · 7 min read · LW link
(www.sentienceinstitute.org)

The space of systems and the space of maps

22 Mar 2023 14:59 UTC
39 points
0 comments · 5 min read · LW link

Theories of impact for Science of Deep Learning

Marius Hobbhahn · 1 Dec 2022 14:39 UTC
24 points
0 comments · 11 min read · LW link

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey · 3 Apr 2024 12:34 UTC
94 points
22 comments · 22 min read · LW link

a narrative explanation of the QACI alignment plan

Tamsin Leake · 15 Feb 2023 3:28 UTC
56 points
29 comments · 6 min read · LW link
(carado.moe)

Sections 1 & 2: Introduction, Strategy and Governance

JesseClifton · 17 Dec 2019 21:27 UTC
35 points
8 comments · 14 min read · LW link

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

scasper · 21 May 2024 20:15 UTC
157 points
16 comments · 3 min read · LW link

Gaia Network: a practical, incremental pathway to Open Agency Architecture

20 Dec 2023 17:11 UTC
22 points
8 comments · 16 min read · LW link

The Shortest Path Between Scylla and Charybdis

Thane Ruthenis · 18 Dec 2023 20:08 UTC
50 points
8 comments · 5 min read · LW link

Announcing Human-aligned AI Summer School

22 May 2024 8:55 UTC
50 points
0 comments · 1 min read · LW link
(humanaligned.ai)

Assessment of AI safety agendas: think about the downside risk

Roman Leventov · 19 Dec 2023 9:00 UTC
13 points
1 comment · 1 min read · LW link

The Plan − 2023 Version

johnswentworth · 29 Dec 2023 23:34 UTC
146 points
40 comments · 31 min read · LW link

Research Jan/Feb 2024

Stephen Fowler · 1 Jan 2024 6:02 UTC
9 points
0 comments · 2 min read · LW link

Four visions of Transformative AI success

Steven Byrnes · 17 Jan 2024 20:45 UTC
112 points
22 comments · 15 min read · LW link

Worrisome misunderstanding of the core issues with AI transition

Roman Leventov · 18 Jan 2024 10:05 UTC
5 points
2 comments · 4 min read · LW link

Gradient Descent on the Human Brain

1 Apr 2024 22:39 UTC
52 points
5 comments · 2 min read · LW link

Constructability: Plainly-coded AGIs may be feasible in the near future

27 Apr 2024 16:04 UTC
82 points
13 comments · 13 min read · LW link

The Prop-room and Stage Cognitive Architecture

Robert Kralisch · 29 Apr 2024 0:48 UTC
13 points
4 comments · 14 min read · LW link

What and Why: Developmental Interpretability of Reinforcement Learning

Garrett Baker · 9 Jul 2024 14:09 UTC
67 points
4 comments · 6 min read · LW link

Towards the Operationalization of Philosophy & Wisdom

Thane Ruthenis · 28 Oct 2024 19:45 UTC
20 points
2 comments · 33 min read · LW link
(aiimpacts.org)

Self-prediction acts as an emergent regularizer

23 Oct 2024 22:27 UTC
84 points
4 comments · 4 min read · LW link

Seeking Collaborators

abramdemski · 1 Nov 2024 17:13 UTC
57 points
14 comments · 7 min read · LW link

Ultra-simplified research agenda

Stuart_Armstrong · 22 Nov 2019 14:29 UTC
34 points
4 comments · 1 min read · LW link

Embedded Curiosities

8 Nov 2018 14:19 UTC
91 points
1 comment · 2 min read · LW link

Subsystem Alignment

6 Nov 2018 16:16 UTC
99 points
12 comments · 1 min read · LW link

Robust Delegation

4 Nov 2018 16:38 UTC
116 points
10 comments · 1 min read · LW link

Embedded World-Models

2 Nov 2018 16:07 UTC
96 points
16 comments · 1 min read · LW link

Decision Theory

31 Oct 2018 18:41 UTC
120 points
45 comments · 1 min read · LW link

Research agenda: Supervising AIs improving AIs

29 Apr 2023 17:09 UTC
76 points
5 comments · 19 min read · LW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper · 5 Dec 2023 16:48 UTC
123 points
30 comments · 13 min read · LW link

Sections 3 & 4: Credibility, Peaceful Bargaining Mechanisms

JesseClifton · 17 Dec 2019 21:46 UTC
20 points
2 comments · 12 min read · LW link

Sections 5 & 6: Contemporary Architectures, Humans in the Loop

JesseClifton · 20 Dec 2019 3:52 UTC
27 points
4 comments · 10 min read · LW link

Section 7: Foundations of Rational Agency

JesseClifton · 22 Dec 2019 2:05 UTC
14 points
4 comments · 8 min read · LW link

Acknowledgements & References

JesseClifton · 14 Dec 2019 7:04 UTC
6 points
0 comments · 14 min read · LW link

Alignment proposals and complexity classes

evhub · 16 Jul 2020 0:27 UTC
40 points
26 comments · 13 min read · LW link

Orthogonal’s Formal-Goal Alignment theory of change

Tamsin Leake · 5 May 2023 22:36 UTC
68 points
13 comments · 4 min read · LW link
(carado.moe)

The Goodhart Game

John_Maxwell · 18 Nov 2019 23:22 UTC
13 points
5 comments · 5 min read · LW link

[Linkpost] Interpretability Dreams

DanielFilan · 24 May 2023 21:08 UTC
39 points
2 comments · 2 min read · LW link
(transformer-circuits.pub)

My AI Alignment Research Agenda and Threat Model, right now (May 2023)

Nicholas / Heather Kross · 28 May 2023 3:23 UTC
25 points
0 comments · 6 min read · LW link
(www.thinkingmuchbetter.com)

Abstraction is Bigger than Natural Abstraction

Nicholas / Heather Kross · 31 May 2023 0:00 UTC
18 points
0 comments · 5 min read · LW link
(www.thinkingmuchbetter.com)

[Question] Does anyone’s full-time job include reading and understanding all the most-promising formal AI alignment work?

Nicholas / Heather Kross · 16 Jun 2023 2:24 UTC
15 points
2 comments · 1 min read · LW link

My research agenda in agent foundations

Alex_Altair · 28 Jun 2023 18:00 UTC
71 points
9 comments · 11 min read · LW link

My Alignment Timeline

Nicholas / Heather Kross · 3 Jul 2023 1:04 UTC
22 points
0 comments · 2 min read · LW link

My Central Alignment Priority (2 July 2023)

Nicholas / Heather Kross · 3 Jul 2023 1:46 UTC
12 points
1 comment · 3 min read · LW link

Immobile AI makes a move: anti-wireheading, ontology change, and model splintering

Stuart_Armstrong · 17 Sep 2021 15:24 UTC
32 points
3 comments · 2 min read · LW link

Testing The Natural Abstraction Hypothesis: Project Update

johnswentworth · 20 Sep 2021 3:44 UTC
87 points
17 comments · 8 min read · LW link · 1 review

AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism

Stuart_Armstrong · 20 Sep 2021 11:56 UTC
14 points
4 comments · 3 min read · LW link

The Plan

johnswentworth · 10 Dec 2021 23:41 UTC
254 points
78 comments · 14 min read · LW link · 1 review

Paradigm-building: Introduction

Cameron Berg · 8 Feb 2022 0:06 UTC
28 points
0 comments · 2 min read · LW link

Acceptability Verification: A Research Agenda

12 Jul 2022 20:11 UTC
50 points
0 comments · 1 min read · LW link
(docs.google.com)

(My understanding of) What Everyone in Technical Alignment is Doing and Why

29 Aug 2022 1:23 UTC
413 points
90 comments · 37 min read · LW link · 1 review

Distilled Representations Research Agenda

18 Oct 2022 20:59 UTC
15 points
2 comments · 8 min read · LW link

My AGI safety research—2022 review, ’23 plans

Steven Byrnes · 14 Dec 2022 15:15 UTC
51 points
10 comments · 7 min read · LW link

An overview of some promising work by junior alignment researchers

Akash · 26 Dec 2022 17:23 UTC
34 points
0 comments · 4 min read · LW link

World-Model Interpretability Is All We Need

Thane Ruthenis · 14 Jan 2023 19:37 UTC
35 points
22 comments · 21 min read · LW link

Selection Theorems: A Program For Understanding Agents

johnswentworth · 28 Sep 2021 5:03 UTC
123 points
28 comments · 6 min read · LW link · 2 reviews

Why I’m not working on {debate, RRM, ELK, natural abstractions}

Steven Byrnes · 10 Feb 2023 19:22 UTC
71 points
19 comments · 9 min read · LW link

Remarks 1–18 on GPT (compressed)

Cleo Nardo · 20 Mar 2023 22:27 UTC
148 points
35 comments · 31 min read · LW link

[Question] Research ideas (AI Interpretability & Neurosciences) for a 2-months project

flux · 8 Jan 2023 15:36 UTC
3 points
1 comment · 1 min read · LW link

Research Agenda in reverse: what *would* a solution look like?

Stuart_Armstrong · 25 Jun 2019 13:52 UTC
35 points
25 comments · 1 min read · LW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper · 21 Feb 2023 16:59 UTC
14 points
4 comments · 3 min read · LW link

Forecasting AI Progress: A Research Agenda

10 Aug 2020 1:04 UTC
39 points
4 comments · 1 min read · LW link

Technical AGI safety research outside AI

Richard_Ngo · 18 Oct 2019 15:00 UTC
43 points
3 comments · 3 min read · LW link

Why I am not currently working on the AAMLS agenda

jessicata · 1 Jun 2017 17:57 UTC
28 points
3 comments · 5 min read · LW link

Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program

Christopher King · 2 Jun 2023 21:54 UTC
7 points
4 comments · 16 min read · LW link

Gaia Network: An Illustrated Primer

18 Jan 2024 18:23 UTC
3 points
2 comments · 15 min read · LW link

EIS XI: Moving Forward

scasper · 22 Feb 2023 19:05 UTC
19 points
2 comments · 9 min read · LW link

A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL)

MiguelDev · 19 Jun 2023 2:32 UTC
4 points
2 comments · 7 min read · LW link

The AI Control Problem in a wider intellectual context

philosophybear · 13 Jan 2023 0:28 UTC
11 points
3 comments · 12 min read · LW link

EIS XII: Summary

scasper · 23 Feb 2023 17:45 UTC
18 points
0 comments · 6 min read · LW link

Partial Simulation Extrapolation: A Proposal for Building Safer Simulators

lukemarks · 17 Jun 2023 13:55 UTC
16 points
0 comments · 10 min read · LW link

[UPDATE: deadline extended to July 24!] New wind in rationality’s sails: Applications for Epistea Residency 2023 are now open

11 Jul 2023 11:02 UTC
80 points
7 comments · 3 min read · LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy · 12 May 2022 20:01 UTC
58 points
0 comments · 59 min read · LW link

Research agenda: Formalizing abstractions of computations

Erik Jenner · 2 Feb 2023 4:29 UTC
92 points
10 comments · 31 min read · LW link

Which of these five AI alignment research projects ideas are no good?

rmoehn · 8 Aug 2019 7:17 UTC
25 points
13 comments · 1 min read · LW link

[Linkpost] Interpretable Analysis of Features Found in Open-source Sparse Autoencoder (partial replication)

Fernando Avalos · 9 Sep 2024 3:33 UTC
6 points
1 comment · 1 min read · LW link
(forum.effectivealtruism.org)

Funding Good Research

lukeprog · 27 May 2012 6:41 UTC
38 points
44 comments · 2 min read · LW link

The Löbian Obstacle, And Why You Should Care

lukemarks · 7 Sep 2023 23:59 UTC
18 points
6 comments · 2 min read · LW link

Please voice your support for stem cell research

zaph · 22 May 2009 18:45 UTC
−5 points
4 comments · 1 min read · LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

3 Oct 2023 7:45 UTC
17 points
0 comments · 5 min read · LW link

Thoughts On (Solving) Deep Deception

Jozdien · 21 Oct 2023 22:40 UTC
69 points
2 comments · 6 min read · LW link

Notes on effective-altruism-related research, writing, testing fit, learning, and the EA Forum

MichaelA · 28 Mar 2021 23:43 UTC
14 points
0 comments · 4 min read · LW link

Labor Participation is a High-Priority AI Alignment Risk

alex · 17 Jun 2024 18:09 UTC
4 points
0 comments · 17 min read · LW link

The Metaethics and Normative Ethics of AGI Value Alignment: Many Questions, Some Implications

Eleos Arete Citrini · 16 Sep 2021 16:13 UTC
6 points
0 comments · 8 min read · LW link

Introducing Leap Labs, an AI interpretability startup

Jessica Rumbelow · 6 Mar 2023 16:16 UTC
103 points
12 comments · 1 min read · LW link

A multi-disciplinary view on AI safety research

Roman Leventov · 8 Feb 2023 16:50 UTC
43 points
4 comments · 26 min read · LW link

AI Safety in a World of Vulnerable Machine Learning Systems

8 Mar 2023 2:40 UTC
70 points
28 comments · 29 min read · LW link
(far.ai)

EIS IV: A Spotlight on Feature Attribution/Saliency

scasper · 15 Feb 2023 18:46 UTC
19 points
1 comment · 4 min read · LW link

AI learns betrayal and how to avoid it

Stuart_Armstrong · 30 Sep 2021 9:39 UTC
30 points
4 comments · 2 min read · LW link

A FLI postdoctoral grant application: AI alignment via causal analysis and design of agents

PabloAMC · 13 Nov 2021 1:44 UTC
4 points
0 comments · 7 min read · LW link

Framing approaches to alignment and the hard problem of AI cognition

ryan_greenblatt · 15 Dec 2021 19:06 UTC
16 points
15 comments · 27 min read · LW link

EIS II: What is “Interpretability”?

scasper · 9 Feb 2023 16:48 UTC
28 points
6 comments · 4 min read · LW link

An Open Philanthropy grant proposal: Causal representation learning of human preferences

PabloAMC · 11 Jan 2022 11:28 UTC
19 points
6 comments · 8 min read · LW link

[Question] What should I do? (long term plan about starting an AI lab)

not_a_cat · 9 Jun 2024 0:45 UTC
2 points
1 comment · 2 min read · LW link

Paradigm-building: The hierarchical question framework

Cameron Berg · 9 Feb 2022 16:47 UTC
11 points
15 comments · 3 min read · LW link

Question 1: Predicted architecture of AGI learning algorithm(s)

Cameron Berg · 10 Feb 2022 17:22 UTC
13 points
1 comment · 7 min read · LW link

Question 2: Predicted bad outcomes of AGI learning architecture

Cameron Berg · 11 Feb 2022 22:23 UTC
5 points
1 comment · 10 min read · LW link

Question 3: Control proposals for minimizing bad outcomes

Cameron Berg · 12 Feb 2022 19:13 UTC
5 points
1 comment · 7 min read · LW link

Question 5: The timeline hyperparameter

Cameron Berg · 14 Feb 2022 16:38 UTC
8 points
3 comments · 7 min read · LW link

Paradigm-building: Conclusion and practical takeaways

Cameron Berg · 15 Feb 2022 16:11 UTC
5 points
1 comment · 2 min read · LW link

EIS III: Broad Critiques of Interpretability Research

scasper · 14 Feb 2023 18:24 UTC
20 points
2 comments · 11 min read · LW link

Elicit: Language Models as Research Assistants

9 Apr 2022 14:56 UTC
71 points
6 comments · 13 min read · LW link

What should AI safety be trying to achieve?

EuanMcLean · 23 May 2024 11:17 UTC
16 points
0 comments · 13 min read · LW link

Towards White Box Deep Learning

Maciej Satkiewicz · 27 Mar 2024 18:20 UTC
17 points
5 comments · 1 min read · LW link
(arxiv.org)

Conditioning Generative Models for Alignment

Jozdien · 18 Jul 2022 7:11 UTC
59 points
8 comments · 20 min read · LW link

How I think about alignment

Linda Linsefors · 13 Aug 2022 10:01 UTC
31 points
11 comments · 5 min read · LW link

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

8 Aug 2023 1:30 UTC
312 points
29 comments · 18 min read · LW link · 1 review

EIS V: Blind Spots In AI Safety Interpretability Research

scasper · 16 Feb 2023 19:09 UTC
54 points
24 comments · 10 min read · LW link

Shard Theory: An Overview

David Udell · 11 Aug 2022 5:44 UTC
166 points
34 comments · 10 min read · LW link

Eliciting Latent Knowledge (ELK) - Distillation/Summary

Marius Hobbhahn · 8 Jun 2022 13:18 UTC
69 points
2 comments · 21 min read · LW link

[Question] How can we secure more research positions at our universities for x-risk researchers?

Neil Crawford · 6 Sep 2022 17:17 UTC
11 points
0 comments · 1 min read · LW link

Alignment Org Cheat Sheet

20 Sep 2022 17:36 UTC
70 points
8 comments · 4 min read · LW link

For alignment, we should simultaneously use multiple theories of cognition and value

Roman Leventov · 24 Apr 2023 10:37 UTC
23 points
5 comments · 5 min read · LW link

Towards empathy in RL agents and beyond: Insights from cognitive science for AI Alignment

Marc Carauleanu · 3 Apr 2023 19:59 UTC
15 points
6 comments · 1 min read · LW link
(clipchamp.com)

Generative, Episodic Objectives for Safe AI

Michael Glass · 5 Oct 2022 23:18 UTC
11 points
3 comments · 8 min read · LW link

Science of Deep Learning—a technical agenda

Marius Hobbhahn · 18 Oct 2022 14:54 UTC
36 points
7 comments · 4 min read · LW link

EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety

scasper · 17 Feb 2023 20:48 UTC
49 points
9 comments · 12 min read · LW link

AI researchers announce NeuroAI agenda

Cameron Berg · 24 Oct 2022 0:14 UTC
37 points
12 comments · 6 min read · LW link
(arxiv.org)

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

27 Oct 2022 1:32 UTC
135 points
14 comments · 12 min read · LW link

AI Existential Safety Fellowships

mmfli · 28 Oct 2023 18:07 UTC
5 points
0 comments · 1 min read · LW link

Trying to understand John Wentworth’s research agenda

20 Oct 2023 0:05 UTC
92 points
13 comments · 12 min read · LW link

Agency overhang as a proxy for Sharp left turn

7 Nov 2024 12:14 UTC
5 points
0 comments · 5 min read · LW link

AISC project: TinyEvals

Jett Janiak · 22 Nov 2023 20:47 UTC
22 points
0 comments · 4 min read · LW link

All life’s helpers’ beliefs

Tehdastehdas · 28 Oct 2022 5:47 UTC
−12 points
1 comment · 5 min read · LW link

AISC 2024 - Project Summaries

NickyP · 27 Nov 2023 22:32 UTC
48 points
3 comments · 18 min read · LW link

NAO Updates, Fall 2024

jefftk · 18 Oct 2024 0:00 UTC
32 points
2 comments · 1 min read · LW link
(naobservatory.org)

Reinforcement Learning using Layered Morphology (RLLM)

MiguelDev · 1 Dec 2023 5:18 UTC
7 points
0 comments · 29 min read · LW link

A call for a quantitative report card for AI bioterrorism threat models

Juno · 4 Dec 2023 6:35 UTC
12 points
0 comments · 10 min read · LW link

What’s new at FAR AI

4 Dec 2023 21:18 UTC
41 points
0 comments · 5 min read · LW link
(far.ai)

Interview with Vanessa Kosoy on the Value of Theoretical Research for AI

WillPetillo · 4 Dec 2023 22:58 UTC
37 points
0 comments · 35 min read · LW link

My summary of “Pragmatic AI Safety”

Eleni Angelou · 5 Nov 2022 12:54 UTC
3 points
0 comments · 5 min read · LW link

Why Academia is Mostly Not Truth-Seeking

Zero Contradictions · 16 Oct 2024 19:14 UTC
−6 points
6 comments · 1 min read · LW link
(thewaywardaxolotl.blogspot.com)

Natural abstractions are observer-dependent: a conversation with John Wentworth

Martín Soto · 12 Feb 2024 17:28 UTC
39 points
13 comments · 7 min read · LW link

EIS VII: A Challenge for Mechanists

scasper · 18 Feb 2023 18:27 UTC
36 points
4 comments · 3 min read · LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper · 19 Feb 2023 15:25 UTC
30 points
5 comments · 4 min read · LW link

EIS IX: Interpretability and Adversaries

scasper · 20 Feb 2023 18:25 UTC
30 points
8 comments · 8 min read · LW link

Resources for AI Alignment Cartography

Gyrodiot · 4 Apr 2020 14:20 UTC
45 points
8 comments · 9 min read · LW link

Introducing the Longevity Research Institute

sarahconstantin · 8 May 2018 3:30 UTC
54 points
20 comments · 1 min read · LW link
(srconstantin.wordpress.com)

Announcement: AI alignment prize round 3 winners and next round

cousin_it · 15 Jul 2018 7:40 UTC
93 points
7 comments · 1 min read · LW link

Machine Learning Projects on IDA

24 Jun 2019 18:38 UTC
49 points
3 comments · 2 min read · LW link

AI Alignment Research Overview (by Jacob Steinhardt)

Ben Pace · 6 Nov 2019 19:24 UTC
44 points
0 comments · 7 min read · LW link
(docs.google.com)

Creating Welfare Biology: A Research Proposal

ozymandias · 16 Nov 2017 19:06 UTC
20 points
5 comments · 4 min read · LW link

Announcing: The Independent AI Safety Registry

Shoshannah Tekofsky · 26 Dec 2022 21:22 UTC
53 points
9 comments · 1 min read · LW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov · 8 May 2023 21:26 UTC
18 points
2 comments · 7 min read · LW link
(yoshuabengio.org)

H-JEPA might be technically alignable in a modified form

Roman Leventov · 8 May 2023 23:04 UTC
12 points
2 comments · 7 min read · LW link

Roadmap for a collaborative prototype of an Open Agency Architecture

Deger Turan · 10 May 2023 17:41 UTC
31 points
0 comments · 12 min read · LW link

Notes on the importance and implementation of safety-first cognitive architectures for AI

Brendon_Wong · 11 May 2023 10:03 UTC
3 points
0 comments · 3 min read · LW link