Alignment Forum Archive (sorted by New, page 1)
Criticism of the main framework in AI alignment
Michele Campolo · 31 Jan 2023 23:01 UTC · 8 points · 0 comments · 7 min read · LW link

Mechanistic Interpretability Quickstart Guide
Neel Nanda · 31 Jan 2023 16:35 UTC · 25 points · 1 comment · 6 min read · LW link (www.neelnanda.io)

Inner Misalignment in “Simulator” LLMs
Adam Scherlis · 31 Jan 2023 8:33 UTC · 48 points · 3 comments · 4 min read · LW link

Why I hate the “accident vs. misuse” AI x-risk dichotomy (quick thoughts on “structural risk”)
David Scott Krueger (formerly: capybaralet) · 30 Jan 2023 18:50 UTC · 26 points · 25 comments · 2 min read · LW link

Call for submissions: “(In)human Values and Artificial Agency”, ALIFE 2023
the gears to ascenscion · 30 Jan 2023 17:37 UTC · 27 points · 3 comments · 1 min read · LW link (humanvaluesandartificialagency.com)

What I mean by “alignment is in large part about making cognition aimable at all”
So8res · 30 Jan 2023 15:22 UTC · 113 points · 8 comments · 2 min read · LW link

Structure, creativity, and novelty
TsviBT · 29 Jan 2023 14:30 UTC · 14 points · 1 comment · 7 min read · LW link

Stop-gradients lead to fixed point predictions
Johannes_Treutlein, Caspar42, Rubi J. Hudson and Emery Cooper · 28 Jan 2023 22:47 UTC · 26 points · 0 comments · 24 min read · LW link

Spooky action at a distance in the loss landscape
jhoogland and Filip Sondej · 28 Jan 2023 0:22 UTC · 43 points · 3 comments · 3 min read · LW link (www.jessehoogland.com)

The role of Bayesian ML in AI safety—an overview
Marius Hobbhahn · 27 Jan 2023 19:40 UTC · 20 points · 4 comments · 10 min read · LW link

AGI will have learnt utility functions
beren · 25 Jan 2023 19:42 UTC · 26 points · 2 comments · 13 min read · LW link

Thoughts on the impact of RLHF research
paulfchristiano · 25 Jan 2023 17:23 UTC · 186 points · 88 comments · 9 min read · LW link

Quick thoughts on “scalable oversight” / “super-human feedback” research
David Scott Krueger (formerly: capybaralet) · 25 Jan 2023 12:55 UTC · 25 points · 5 comments · 2 min read · LW link

Alexander and Yudkowsky on AGI goals
Scott Alexander and Eliezer Yudkowsky · 24 Jan 2023 21:09 UTC · 141 points · 49 comments · 26 min read · LW link

Inverse Scaling Prize: Second Round Winners
Ian McKenzie, Sam Bowman and Ethan Perez · 24 Jan 2023 20:12 UTC · 48 points · 13 comments · 15 min read · LW link

How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!
StefanHex · 24 Jan 2023 18:45 UTC · 38 points · 3 comments · 13 min read · LW link

Gradient hacking is extremely difficult
beren · 24 Jan 2023 15:45 UTC · 139 points · 18 comments · 5 min read · LW link

“Endgame safety” for AGI
Steven Byrnes · 24 Jan 2023 14:15 UTC · 72 points · 4 comments · 5 min read · LW link

Thoughts on hardware / compute requirements for AGI
Steven Byrnes · 24 Jan 2023 14:03 UTC · 28 points · 28 comments · 21 min read · LW link

Some of my disagreements with List of Lethalities
TurnTrout · 24 Jan 2023 0:25 UTC · 67 points · 7 comments · 10 min read · LW link