Aligned AI Proposals (Tag)
Last edit: 20 Dec 2022 15:20 UTC by particlemania
How to safely use an optimizer · Simon Fischer · 28 Mar 2024 16:11 UTC · 47 points · 21 comments · 7 min read · LW link
Strong-Misalignment: Does Yudkowsky (or Christiano, or TurnTrout, or Wolfram, or…etc.) Have an Elevator Speech I’m Missing? · Benjamin Bourlier · 15 Mar 2024 23:17 UTC · −4 points · 3 comments · 16 min read · LW link
Update on Developing an Ethics Calculator to Align an AGI to · sweenesm · 12 Mar 2024 12:33 UTC · 4 points · 2 comments · 8 min read · LW link
Alignment in Thought Chains · Faust Nemesis · 4 Mar 2024 19:24 UTC · 1 point · 0 comments · 2 min read · LW link
Requirements for a Basin of Attraction to Alignment · RogerDearnaley · 14 Feb 2024 7:10 UTC · 20 points · 6 comments · 31 min read · LW link
Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis · RogerDearnaley · 1 Feb 2024 21:15 UTC · 4 points · 15 comments · 13 min read · LW link
Proposal for an AI Safety Prize · sweenesm · 31 Jan 2024 18:35 UTC · 3 points · 0 comments · 2 min read · LW link
Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? · RogerDearnaley · 11 Jan 2024 12:56 UTC · 22 points · 4 comments · 39 min read · LW link
Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor · RogerDearnaley · 9 Jan 2024 20:42 UTC · 46 points · 8 comments · 36 min read · LW link
Striking Implications for Learning Theory, Interpretability — and Safety? · RogerDearnaley · 5 Jan 2024 8:46 UTC · 35 points · 4 comments · 2 min read · LW link
Safety First: safety before full alignment. The deontic sufficiency hypothesis. · Chipmonk · 3 Jan 2024 17:55 UTC · 47 points · 3 comments · 3 min read · LW link
AI Alignment Metastrategy · Vanessa Kosoy · 31 Dec 2023 12:06 UTC · 113 points · 12 comments · 7 min read · LW link
Interpreting the Learning of Deceit · RogerDearnaley · 18 Dec 2023 8:12 UTC · 30 points · 8 comments · 9 min read · LW link
Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem · Ansh Radhakrishnan, Buck, ryan_greenblatt and Fabien Roger · 16 Dec 2023 5:49 UTC · 71 points · 3 comments · 6 min read · LW link
Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment · RogerDearnaley · 7 Dec 2023 6:14 UTC · 3 points · 0 comments · 11 min read · LW link
How to Control an LLM’s Behavior (why my P(DOOM) went down) · RogerDearnaley · 28 Nov 2023 19:56 UTC · 64 points · 30 comments · 11 min read · LW link
Is Interpretability All We Need? · RogerDearnaley · 14 Nov 2023 5:31 UTC · 1 point · 1 comment · 1 min read · LW link
We have promising alignment plans with low taxes · Seth Herd · 10 Nov 2023 18:51 UTC · 30 points · 9 comments · 5 min read · LW link
AI Alignment: A Comprehensive Survey · Stephen McAleer · 1 Nov 2023 17:35 UTC · 15 points · 1 comment · 1 min read · LW link (arxiv.org)
The (partial) fallacy of dumb superintelligence · Seth Herd · 18 Oct 2023 21:25 UTC · 27 points · 5 comments · 4 min read · LW link
A list of core AI safety problems and how I hope to solve them · davidad · 26 Aug 2023 15:12 UTC · 161 points · 23 comments · 5 min read · LW link
Enhancing Corrigibility in AI Systems through Robust Feedback Loops · Justausername · 24 Aug 2023 3:53 UTC · 1 point · 0 comments · 6 min read · LW link
[Question] Bostrom’s Solution · James Blackmon · 14 Aug 2023 17:09 UTC · 1 point · 0 comments · 1 min read · LW link
Reducing the risk of catastrophically misaligned AI by avoiding the Singleton scenario: the Manyton Variant · GravitasGradient · 6 Aug 2023 14:24 UTC · −6 points · 0 comments · 3 min read · LW link
Embedding Ethical Priors into AI Systems: A Bayesian Approach · Justausername · 3 Aug 2023 15:31 UTC · −5 points · 3 comments · 21 min read · LW link
Autonomous Alignment Oversight Framework (AAOF) · Justausername · 25 Jul 2023 10:25 UTC · −9 points · 0 comments · 4 min read · LW link
Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive · Justausername · 23 Jul 2023 16:08 UTC · 4 points · 1 comment · 3 min read · LW link
Desiderata for an AI · Nathan Helm-Burger · 19 Jul 2023 16:18 UTC · 8 points · 0 comments · 4 min read · LW link
Two paths to win the AGI transition · Nathan Helm-Burger · 6 Jul 2023 21:59 UTC · 11 points · 8 comments · 4 min read · LW link
an Evangelion dialogue explaining the QACI alignment plan · Tamsin Leake · 10 Jun 2023 3:28 UTC · 45 points · 15 comments · 43 min read · LW link (carado.moe)
An LLM-based “exemplary actor” · Roman Leventov · 29 May 2023 11:12 UTC · 16 points · 0 comments · 12 min read · LW link
Aligning an H-JEPA agent via training on the outputs of an LLM-based “exemplary actor” · Roman Leventov · 29 May 2023 11:08 UTC · 12 points · 10 comments · 30 min read · LW link
The Goal Misgeneralization Problem · Myspy · 18 May 2023 23:40 UTC · 1 point · 0 comments · 1 min read · LW link (drive.google.com)
Proposal: Align Systems Earlier In Training · OneManyNone · 16 May 2023 16:24 UTC · 18 points · 0 comments · 11 min read · LW link
Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?” · Roman Leventov · 8 May 2023 21:26 UTC · 18 points · 2 comments · 7 min read · LW link (yoshuabengio.org)
Against sacrificing AI transparency for generality gains · Ape in the coat · 7 May 2023 6:52 UTC · 4 points · 0 comments · 2 min read · LW link
A Proposal for AI Alignment: Using Directly Opposing Models · Arne B · 27 Apr 2023 18:05 UTC · 0 points · 5 comments · 3 min read · LW link
How to express this system for ethically aligned AGI as a Mathematical formula? · Oliver Siegel · 19 Apr 2023 20:13 UTC · −1 points · 0 comments · 1 min read · LW link
Speculation on mapping the moral landscape for future Ai Alignment · Sven Heinz (Welwordion) · 16 Apr 2023 13:43 UTC · 1 point · 0 comments · 1 min read · LW link
An Open Agency Architecture for Safe Transformative AI · davidad · 20 Dec 2022 13:04 UTC · 79 points · 22 comments · 4 min read · LW link
Moral realism and AI alignment · Caspar Oesterheld · 3 Sep 2018 18:46 UTC · 13 points · 10 comments · 1 min read · LW link (casparoesterheld.com)