Aligned AI Proposals (Tag)
Last edit: 20 Dec 2022 15:20 UTC by particlemania
How to safely use an optimizer · Simon Fischer · 28 Mar 2024 16:11 UTC · 47 points · 21 comments · 7 min read · LW link
Strong-Misalignment: Does Yudkowsky (or Christiano, or TurnTrout, or Wolfram, or…etc.) Have an Elevator Speech I’m Missing? · Benjamin Bourlier · 15 Mar 2024 23:17 UTC · −4 points · 3 comments · 16 min read · LW link
Update on Developing an Ethics Calculator to Align an AGI to · sweenesm · 12 Mar 2024 12:33 UTC · 4 points · 2 comments · 8 min read · LW link
Alignment in Thought Chains · Faust Nemesis · 4 Mar 2024 19:24 UTC · 1 point · 0 comments · 2 min read · LW link
Requirements for a Basin of Attraction to Alignment · RogerDearnaley · 14 Feb 2024 7:10 UTC · 20 points · 6 comments · 31 min read · LW link
Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis · RogerDearnaley · 1 Feb 2024 21:15 UTC · 4 points · 15 comments · 13 min read · LW link
Proposal for an AI Safety Prize · sweenesm · 31 Jan 2024 18:35 UTC · 3 points · 0 comments · 2 min read · LW link
Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? · RogerDearnaley · 11 Jan 2024 12:56 UTC · 22 points · 4 comments · 39 min read · LW link
Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor · RogerDearnaley · 9 Jan 2024 20:42 UTC · 46 points · 8 comments · 36 min read · LW link
Striking Implications for Learning Theory, Interpretability — and Safety? · RogerDearnaley · 5 Jan 2024 8:46 UTC · 35 points · 4 comments · 2 min read · LW link
Safety First: safety before full alignment. The deontic sufficiency hypothesis. · Chipmonk · 3 Jan 2024 17:55 UTC · 47 points · 3 comments · 3 min read · LW link
AI Alignment Metastrategy · Vanessa Kosoy · 31 Dec 2023 12:06 UTC · 113 points · 12 comments · 7 min read · LW link
Interpreting the Learning of Deceit · RogerDearnaley · 18 Dec 2023 8:12 UTC · 30 points · 8 comments · 9 min read · LW link
Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem · Ansh Radhakrishnan, Buck, ryan_greenblatt and Fabien Roger · 16 Dec 2023 5:49 UTC · 71 points · 3 comments · 6 min read · LW link
Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment · RogerDearnaley · 7 Dec 2023 6:14 UTC · 3 points · 0 comments · 11 min read · LW link
How to Control an LLM’s Behavior (why my P(DOOM) went down) · RogerDearnaley · 28 Nov 2023 19:56 UTC · 64 points · 30 comments · 11 min read · LW link
Is Interpretability All We Need? · RogerDearnaley · 14 Nov 2023 5:31 UTC · 1 point · 1 comment · 1 min read · LW link
We have promising alignment plans with low taxes · Seth Herd · 10 Nov 2023 18:51 UTC · 30 points · 9 comments · 5 min read · LW link
AI Alignment: A Comprehensive Survey · Stephen McAleer · 1 Nov 2023 17:35 UTC · 15 points · 1 comment · 1 min read · LW link (arxiv.org)
The (partial) fallacy of dumb superintelligence · Seth Herd · 18 Oct 2023 21:25 UTC · 27 points · 5 comments · 4 min read · LW link
A list of core AI safety problems and how I hope to solve them · davidad · 26 Aug 2023 15:12 UTC · 161 points · 23 comments · 5 min read · LW link
Enhancing Corrigibility in AI Systems through Robust Feedback Loops · Justausername · 24 Aug 2023 3:53 UTC · 1 point · 0 comments · 6 min read · LW link
[Question] Bostrom’s Solution · James Blackmon · 14 Aug 2023 17:09 UTC · 1 point · 0 comments · 1 min read · LW link
Reducing the risk of catastrophically misaligned AI by avoiding the Singleton scenario: the Manyton Variant · GravitasGradient · 6 Aug 2023 14:24 UTC · −6 points · 0 comments · 3 min read · LW link
Embedding Ethical Priors into AI Systems: A Bayesian Approach · Justausername · 3 Aug 2023 15:31 UTC · −5 points · 3 comments · 21 min read · LW link
Autonomous Alignment Oversight Framework (AAOF) · Justausername · 25 Jul 2023 10:25 UTC · −9 points · 0 comments · 4 min read · LW link
Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive · Justausername · 23 Jul 2023 16:08 UTC · 4 points · 1 comment · 3 min read · LW link
Desiderata for an AI · Nathan Helm-Burger · 19 Jul 2023 16:18 UTC · 8 points · 0 comments · 4 min read · LW link
Two paths to win the AGI transition · Nathan Helm-Burger · 6 Jul 2023 21:59 UTC · 11 points · 8 comments · 4 min read · LW link
an Evangelion dialogue explaining the QACI alignment plan · Tamsin Leake · 10 Jun 2023 3:28 UTC · 45 points · 15 comments · 43 min read · LW link (carado.moe)
An LLM-based “exemplary actor” · Roman Leventov · 29 May 2023 11:12 UTC · 16 points · 0 comments · 12 min read · LW link
Aligning an H-JEPA agent via training on the outputs of an LLM-based “exemplary actor” · Roman Leventov · 29 May 2023 11:08 UTC · 12 points · 10 comments · 30 min read · LW link
The Goal Misgeneralization Problem · Myspy · 18 May 2023 23:40 UTC · 1 point · 0 comments · 1 min read · LW link (drive.google.com)
Proposal: Align Systems Earlier In Training · OneManyNone · 16 May 2023 16:24 UTC · 18 points · 0 comments · 11 min read · LW link
Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?” · Roman Leventov · 8 May 2023 21:26 UTC · 18 points · 2 comments · 7 min read · LW link (yoshuabengio.org)
Against sacrificing AI transparency for generality gains · Ape in the coat · 7 May 2023 6:52 UTC · 4 points · 0 comments · 2 min read · LW link
A Proposal for AI Alignment: Using Directly Opposing Models · Arne B · 27 Apr 2023 18:05 UTC · 0 points · 5 comments · 3 min read · LW link
How to express this system for ethically aligned AGI as a Mathematical formula? · Oliver Siegel · 19 Apr 2023 20:13 UTC · −1 points · 0 comments · 1 min read · LW link
Speculation on mapping the moral landscape for future Ai Alignment · Sven Heinz (Welwordion) · 16 Apr 2023 13:43 UTC · 1 point · 0 comments · 1 min read · LW link
An Open Agency Architecture for Safe Transformative AI · davidad · 20 Dec 2022 13:04 UTC · 79 points · 22 comments · 4 min read · LW link
Moral realism and AI alignment · Caspar Oesterheld · 3 Sep 2018 18:46 UTC · 13 points · 10 comments · 1 min read · LW link (casparoesterheld.com)