Archive: December 2022 (page 1, posts sorted by points)
Let’s think about slowing down AI
KatjaGrace · Dec 22, 2022, 5:40 PM · 551 points · 182 comments · 38 min read · LW link · 3 reviews · (aiimpacts.org)

Staring into the abyss as a core life skill
benkuhn · Dec 22, 2022, 3:30 PM · 356 points · 22 comments · 12 min read · LW link · 1 review · (www.benkuhn.net)

Models Don’t “Get Reward”
Sam Ringer · Dec 30, 2022, 10:37 AM · 316 points · 62 comments · 5 min read · LW link · 1 review

A challenge for AGI organizations, and a challenge for readers
Rob Bensinger and Eliezer Yudkowsky · Dec 1, 2022, 11:11 PM · 302 points · 33 comments · 2 min read · LW link

Sazen
Duncan Sabien (Inactive) · Dec 21, 2022, 7:54 AM · 285 points · 83 comments · 12 min read · LW link · 2 reviews

AI alignment is distinct from its near-term applications
paulfchristiano · Dec 13, 2022, 7:10 AM · 255 points · 21 comments · 2 min read · LW link · (ai-alignment.com)

How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme
Collin · Dec 15, 2022, 6:22 PM · 244 points · 39 comments · 16 min read · LW link · 1 review

Jailbreaking ChatGPT on Release Day
Zvi · Dec 2, 2022, 1:10 PM · 242 points · 77 comments · 6 min read · LW link · 1 review · (thezvi.wordpress.com)

The Plan − 2022 Update
johnswentworth · Dec 1, 2022, 8:43 PM · 239 points · 37 comments · 8 min read · LW link · 1 review

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck and Nate Thomas · Dec 3, 2022, 12:58 AM · 206 points · 35 comments · 20 min read · LW link · 1 review

The next decades might be wild
Marius Hobbhahn · Dec 15, 2022, 4:10 PM · 175 points · 42 comments · 41 min read · LW link · 1 review

What AI Safety Materials Do ML Researchers Find Compelling?
Vael Gates and Collin · Dec 28, 2022, 2:03 AM · 175 points · 34 comments · 2 min read · LW link

Finite Factored Sets in Pictures
Magdalena Wache · Dec 11, 2022, 6:49 PM · 174 points · 35 comments · 12 min read · LW link

Using GPT-Eliezer against ChatGPT Jailbreaking
Stuart_Armstrong and rgorman · Dec 6, 2022, 7:54 PM · 170 points · 85 comments · 9 min read · LW link

Things that can kill you quickly: What everyone should know about first aid
jasoncrawford · Dec 27, 2022, 4:23 PM · 166 points · 21 comments · 2 min read · LW link · 1 review · (jasoncrawford.org)

Logical induction for software engineers
Alex Flint · Dec 3, 2022, 7:55 PM · 163 points · 8 comments · 27 min read · LW link · 1 review

[Interim research report] Taking features out of superposition with sparse autoencoders
Lee Sharkey, Dan Braun and beren · Dec 13, 2022, 3:41 PM · 150 points · 23 comments · 22 min read · LW link · 2 reviews

Shard Theory in Nine Theses: a Distillation and Critical Appraisal
LawrenceC · Dec 19, 2022, 10:52 PM · 150 points · 30 comments · 18 min read · LW link

Inner and outer alignment decompose one hard problem into two extremely hard problems
TurnTrout · Dec 2, 2022, 2:43 AM · 149 points · 22 comments · 47 min read · LW link · 3 reviews

A Year of AI Increasing AI Progress
TW123 · Dec 30, 2022, 2:09 AM · 148 points · 3 comments · 2 min read · LW link

K-complexity is silly; use cross-entropy instead
So8res · Dec 20, 2022, 11:06 PM · 147 points · 54 comments · 14 min read · LW link · 2 reviews

Updating my AI timelines
Matthew Barnett · Dec 5, 2022, 8:46 PM · 145 points · 50 comments · 2 min read · LW link

[Question] How to Convince my Son that Drugs are Bad
concerned_dad · Dec 17, 2022, 6:47 PM · 140 points · 84 comments · 2 min read · LW link

Deconfusing Direct vs Amortised Optimization
beren · Dec 2, 2022, 11:30 AM · 136 points · 19 comments · 10 min read · LW link

The case against AI alignment
andrew sauer · Dec 24, 2022, 6:57 AM · 128 points · 110 comments · 5 min read · LW link

Re-Examining LayerNorm
Eric Winsor · Dec 1, 2022, 10:20 PM · 127 points · 12 comments · 5 min read · LW link

Shared reality: a key driver of human behavior
kdbscott · Dec 24, 2022, 7:35 PM · 126 points · 25 comments · 4 min read · LW link

Did ChatGPT just gaslight me?
TW123 · Dec 1, 2022, 5:41 AM · 123 points · 45 comments · 9 min read · LW link · (aiwatchtower.substack.com)

[Question] Why The Focus on Expected Utility Maximisers?
DragonGod · Dec 27, 2022, 3:49 PM · 118 points · 84 comments · 3 min read · LW link

Trying to disambiguate different questions about whether RLHF is “good”
Buck · Dec 14, 2022, 4:03 AM · 108 points · 47 comments · 7 min read · LW link · 1 review

200 Concrete Open Problems in Mechanistic Interpretability: Introduction
Neel Nanda · Dec 28, 2022, 9:06 PM · 106 points · 0 comments · 10 min read · LW link

Language models are nearly AGIs but we don’t notice it because we keep shifting the bar
philosophybear · Dec 30, 2022, 5:15 AM · 105 points · 13 comments · 7 min read · LW link

But is it really in Rome? An investigation of the ROME model editing technique
jacquesthibs · Dec 30, 2022, 2:40 AM · 104 points · 2 comments · 18 min read · LW link

Finding gliders in the game of life
paulfchristiano · Dec 1, 2022, 8:40 PM · 104 points · 8 comments · 16 min read · LW link · (ai-alignment.com)

Slightly against aligning with neo-luddites
Matthew Barnett · Dec 26, 2022, 10:46 PM · 104 points · 31 comments · 4 min read · LW link

[Linkpost] The Story Of VaccinateCA
hath · Dec 9, 2022, 11:54 PM · 103 points · 4 comments · 10 min read · LW link · (www.worksinprogress.co)

Applied Linear Algebra Lecture Series
johnswentworth · Dec 22, 2022, 6:57 AM · 103 points · 8 comments · 1 min read · LW link

Thoughts on AGI organizations and capabilities work
Rob Bensinger and So8res · Dec 7, 2022, 7:46 PM · 102 points · 17 comments · 5 min read · LW link

Discovering Language Model Behaviors with Model-Written Evaluations
evhub and Ethan Perez · Dec 20, 2022, 8:08 PM · 100 points · 34 comments · 1 min read · LW link · (www.anthropic.com)

Bad at Arithmetic, Promising at Math
cohenmacaulay · Dec 18, 2022, 5:40 AM · 100 points · 19 comments · 20 min read · LW link · 1 review

[Link] Why I’m optimistic about OpenAI’s alignment approach
janleike · Dec 5, 2022, 10:51 PM · 98 points · 15 comments · 1 min read · LW link · (aligned.substack.com)

You can still fetch the coffee today if you’re dead tomorrow
davidad · Dec 9, 2022, 2:06 PM · 96 points · 19 comments · 5 min read · LW link

Towards Hodge-podge Alignment
Cleo Nardo · Dec 19, 2022, 8:12 PM · 95 points · 30 comments · 9 min read · LW link

The LessWrong 2021 Review: Intellectual Circle Expansion
Ruby and Raemon · Dec 1, 2022, 9:17 PM · 95 points · 55 comments · 8 min read · LW link

Revisiting algorithmic progress
Tamay and Ege Erdil · Dec 13, 2022, 1:39 AM · 95 points · 15 comments · 2 min read · LW link · 1 review · (arxiv.org)

A Comprehensive Mechanistic Interpretability Explainer & Glossary
Neel Nanda · Dec 21, 2022, 12:35 PM · 91 points · 6 comments · 2 min read · LW link · (neelnanda.io)

Can we efficiently distinguish different mechanisms?
paulfchristiano · Dec 27, 2022, 12:20 AM · 91 points · 30 comments · 16 min read · LW link · (ai-alignment.com)

Setting the Zero Point
Duncan Sabien (Inactive) · Dec 9, 2022, 6:06 AM · 90 points · 43 comments · 20 min read · LW link · 1 review

Local Memes Against Geometric Rationality
Scott Garrabrant · Dec 21, 2022, 3:53 AM · 90 points · 3 comments · 6 min read · LW link

Consider using reversible automata for alignment research
Alex_Altair · Dec 11, 2022, 1:00 AM · 88 points · 30 comments · 2 min read · LW link