- Short Notes on Research Process · Shoshannah Tekofsky · Feb 22, 2023, 11:41 PM · 21 points · 0 comments · 2 min read · LW link
- Video/animation: Neel Nanda explains what mechanistic interpretability is · DanielFilan · Feb 22, 2023, 10:42 PM · 24 points · 7 comments · 1 min read · LW link (youtu.be)
- A Telepathic Exam about AI and Consequentialism · alkexr · Feb 22, 2023, 9:00 PM · 4 points · 4 comments · 4 min read · LW link
- [Question] Injecting noise to GPT to get multiple answers · bipolo · Feb 22, 2023, 8:02 PM · 1 point · 1 comment · 1 min read · LW link
- EIS XI: Moving Forward · scasper · Feb 22, 2023, 7:05 PM · 19 points · 2 comments · 9 min read · LW link
- Building and Entertaining Couples · Jacob Falkovich · Feb 22, 2023, 7:02 PM · 86 points · 11 comments · 4 min read · LW link
- Can submarines swim? · jasoncrawford · Feb 22, 2023, 6:48 PM · 18 points · 14 comments · 13 min read · LW link (rootsofprogress.org)
- Is there a ML agent that abandons it’s utility function out-of-distribution without losing capabilities? · Christopher King · Feb 22, 2023, 4:49 PM · 1 point · 7 comments · 1 min read · LW link
- The male AI alignment solution · TekhneMakre · Feb 22, 2023, 4:34 PM · −25 points · 24 comments · 1 min read · LW link
- Progress links and tweets, 2023-02-22 · jasoncrawford · Feb 22, 2023, 4:23 PM · 13 points · 0 comments · 1 min read · LW link (rootsofprogress.org)
- Cyborg Periods: There will be multiple AI transitions · Jan_Kulveit and rosehadshar · Feb 22, 2023, 4:09 PM · 108 points · 9 comments · 6 min read · LW link
- The Open Agency Model · Eric Drexler · Feb 22, 2023, 10:35 AM · 114 points · 18 comments · 4 min read · LW link
- Intervening in the Residual Stream · MadHatter · Feb 22, 2023, 6:29 AM · 30 points · 1 comment · 9 min read · LW link
- What do language models know about fictional characters? · skybrian · Feb 22, 2023, 5:58 AM · 6 points · 0 comments · 4 min read · LW link
- Power-Seeking = Minimising free energy · Jonas Hallgren · Feb 22, 2023, 4:28 AM · 21 points · 10 comments · 7 min read · LW link
- The shallow reality of ‘deep learning theory’ · Jesse Hoogland · Feb 22, 2023, 4:16 AM · 34 points · 11 comments · 3 min read · LW link (www.jessehoogland.com)
- Candyland is Terrible · jefftk · Feb 22, 2023, 1:50 AM · 16 points · 2 comments · 1 min read · LW link (www.jefftk.com)
- A proof of inner Löb’s theorem · James Payor · Feb 21, 2023, 9:11 PM · 13 points · 0 comments · 2 min read · LW link
- Fighting For Our Lives—What Ordinary People Can Do · TinkerBird · Feb 21, 2023, 8:36 PM · 14 points · 18 comments · 4 min read · LW link
- The Emotional Type of a Decision · moridinamael · Feb 21, 2023, 8:35 PM · 13 points · 0 comments · 4 min read · LW link
- What is it like doing AI safety work? · KatWoods · Feb 21, 2023, 8:12 PM · 57 points · 2 comments · LW link
- Pretraining Language Models with Human Preferences · Tomek Korbak, Sam Bowman and Ethan Perez · Feb 21, 2023, 5:57 PM · 135 points · 20 comments · 11 min read · LW link · 2 reviews
- A Stranger Priority? Topics at the Outer Reaches of Effective Altruism (my dissertation) · Joe Carlsmith · Feb 21, 2023, 5:26 PM · 38 points · 16 comments · 1 min read · LW link
- EIS X: Continual Learning, Modularity, Compression, and Biological Brains · scasper · Feb 21, 2023, 4:59 PM · 14 points · 4 comments · 3 min read · LW link
- No Room for Political Philosophy · Arturo Macias · Feb 21, 2023, 4:11 PM · −1 points · 7 comments · 3 min read · LW link
- Deceptive Alignment is <1% Likely by Default · DavidW · Feb 21, 2023, 3:09 PM · 89 points · 31 comments · 14 min read · LW link · 1 review
- AI #1: Sydney and Bing · Zvi · Feb 21, 2023, 2:00 PM · 171 points · 45 comments · 61 min read · LW link · 1 review (thezvi.wordpress.com)
- You’re not a simulation, ’cause you’re hallucinating · Stuart_Armstrong · Feb 21, 2023, 12:12 PM · 25 points · 6 comments · 1 min read · LW link
- Basic facts about language models during training · beren · Feb 21, 2023, 11:46 AM · 98 points · 15 comments · 18 min read · LW link
- [Preprint] Pretraining Language Models with Human Preferences · Giulio · Feb 21, 2023, 11:44 AM · 12 points · 0 comments · 1 min read · LW link (arxiv.org)
- Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning · Roger Dearnaley · Feb 21, 2023, 9:05 AM · 10 points · 1 comment · 23 min read · LW link
- Medlife Crisis: “Why Do People Keep Falling For Things That Don’t Work?” · RomanHauksson · Feb 21, 2023, 6:22 AM · 12 points · 5 comments · 1 min read · LW link (www.youtube.com)
- A foundation model approach to value inference · sen · Feb 21, 2023, 5:09 AM · 6 points · 0 comments · 3 min read · LW link
- Instrumentality makes agents agenty · porby · Feb 21, 2023, 4:28 AM · 20 points · 7 comments · 6 min read · LW link
- Gamified narrow reverse imitation learning · TekhneMakre · Feb 21, 2023, 4:26 AM · 8 points · 0 comments · 2 min read · LW link
- Feelings are Good, Actually · Gordon Seidoh Worley · Feb 21, 2023, 2:38 AM · 18 points · 1 comment · 4 min read · LW link
- AI alignment researchers don’t (seem to) stack · So8res · Feb 21, 2023, 12:48 AM · 193 points · 40 comments · 3 min read · LW link
- EA & LW Forum Weekly Summary (6th − 19th Feb 2023) · Zoe Williams · Feb 21, 2023, 12:26 AM · 8 points · 0 comments · LW link
- What to think when a language model tells you it’s sentient · Robbo · Feb 21, 2023, 12:01 AM · 9 points · 6 comments · 6 min read · LW link
- On second thought, prompt injections are probably examples of misalignment · lc · Feb 20, 2023, 11:56 PM · 22 points · 5 comments · 1 min read · LW link
- Nothing Is Ever Taught Correctly · LVSN · Feb 20, 2023, 10:31 PM · 5 points · 3 comments · 1 min read · LW link
- Behavioral and mechanistic definitions (often confuse AI alignment discussions) · LawrenceC · Feb 20, 2023, 9:33 PM · 33 points · 5 comments · 6 min read · LW link
- Validator models: A simple approach to detecting goodharting · beren · Feb 20, 2023, 9:32 PM · 14 points · 1 comment · 4 min read · LW link
- There are no coherence theorems · Dan H and EJT · Feb 20, 2023, 9:25 PM · 149 points · 130 comments · 19 min read · LW link · 1 review
- [Question] Are there any AI safety relevant fully remote roles suitable for someone with 2-3 years of machine learning engineering industry experience? · Malleable_shape · Feb 20, 2023, 7:57 PM · 7 points · 2 comments · 1 min read · LW link
- A circuit for Python docstrings in a 4-layer attention-only transformer · StefanHex and Jett Janiak · Feb 20, 2023, 7:35 PM · 96 points · 8 comments · 21 min read · LW link
- Sydney the Bingenator Can’t Think, But It Still Threatens People · Valentin Baltadzhiev · Feb 20, 2023, 6:37 PM · −3 points · 2 comments · 8 min read · LW link
- EIS IX: Interpretability and Adversaries · scasper · Feb 20, 2023, 6:25 PM · 30 points · 8 comments · 8 min read · LW link
- What AI companies can do today to help with the most important century · HoldenKarnofsky · Feb 20, 2023, 5:00 PM · 38 points · 3 comments · 9 min read · LW link (www.cold-takes.com)
- Bankless Podcast: 159 - We’re All Gonna Die with Eliezer Yudkowsky · bayesed · Feb 20, 2023, 4:42 PM · 83 points · 54 comments · 1 min read · LW link (www.youtube.com)