Anomalous Tokens in DeepSeek-V3 and r1 · henry · Jan 25, 2025, 10:55 PM · 137 points · 3 comments · 7 min read · LW link
The Rising Sea · Jesse Hoogland · Jan 25, 2025, 8:48 PM · 92 points · 2 comments · 2 min read · LW link
Monet: Mixture of Monosemantic Experts for Transformers Explained · CalebMaresca · Jan 25, 2025, 7:37 PM · 20 points · 2 comments · 11 min read · LW link
AI and Non-Existence. · Eleven · Jan 25, 2025, 7:36 PM · −3 points · 9 comments · 2 min read · LW link
Agents don’t have to be aligned to help us achieve an indefinite pause. · Hastings · Jan 25, 2025, 6:51 PM · 29 points · 0 comments · 3 min read · LW link
[Question] AI Safety in secret · Michael Flood · Jan 25, 2025, 6:16 PM · 7 points · 0 comments · 1 min read · LW link
On polytopes · Dmitry Vaintrob · Jan 25, 2025, 1:56 PM · 56 points · 5 comments · 12 min read · LW link
Attribution-based parameter decomposition · Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel and Lee Sharkey · Jan 25, 2025, 1:12 PM · 108 points · 22 comments · 4 min read · LW link · (publications.apolloresearch.ai)
A concise definition of what it means to win · testingthewaters · Jan 25, 2025, 6:37 AM · 4 points · 1 comment · 5 min read · LW link · (aclevername.substack.com)
[Question] A Floating Cube—Rejected HLE submission · Shankar Sivarajan · Jan 25, 2025, 4:52 AM · 7 points · 1 comment · 1 min read · LW link
Why I’m Pouring Cold Water in My Left Ear, and You Should Too · Maloew · Jan 24, 2025, 11:13 PM · 12 points · 0 comments · 2 min read · LW link
Counterintuitive effects of minimum prices · dynomight · Jan 24, 2025, 11:05 PM · 25 points · 0 comments · 8 min read · LW link · (dynomight.net)
AXRP Episode 38.6 - Joel Lehman on Positive Visions of AI · DanielFilan · Jan 24, 2025, 11:00 PM · 10 points · 0 comments · 9 min read · LW link
Locating and Editing Knowledge in LMs · Dhananjay Ashok · Jan 24, 2025, 10:53 PM · 1 point · 0 comments · 4 min read · LW link
How are Those AI Participants Doing Anyway? · mushroomsoup · Jan 24, 2025, 10:37 PM · 4 points · 0 comments · 10 min read · LW link
Six Thoughts on AI Safety · boazbarak · Jan 24, 2025, 10:20 PM · 91 points · 55 comments · 15 min read · LW link
Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · johnswentworth and David Lorell · Jan 24, 2025, 8:20 PM · 181 points · 61 comments · 5 min read · LW link
Yudkowsky on The Trajectory podcast · Seth Herd · Jan 24, 2025, 7:52 PM · 71 points · 39 comments · 2 min read · LW link · (www.youtube.com)
Empirical Insights into Feature Geometry in Sparse Autoencoders · Jason Boxi Zhang · Jan 24, 2025, 7:02 PM · 7 points · 0 comments · 11 min read · LW link
Ideas for CoT Models: A Geometric Perspective on Latent Space Reasoning · Rohan Ganapavarapu · Jan 24, 2025, 7:01 PM · 2 points · 0 comments · 2 min read · LW link · (rohan.ga)
Liron Shapira vs Ken Stanley on Doom Debates. A review · TheManxLoiner · Jan 24, 2025, 6:01 PM · 9 points · 0 comments · 14 min read · LW link
Is there such a thing as an impossible protein? · Abhishaike Mahajan · Jan 24, 2025, 5:12 PM · 15 points · 3 comments · 4 min read · LW link · (www.owlposting.com)
Stargate AI-1 · Zvi · Jan 24, 2025, 3:20 PM · 85 points · 1 comment · 18 min read · LW link · (thezvi.wordpress.com)
QFT and neural nets: the basic idea · Dmitry Vaintrob · Jan 24, 2025, 1:54 PM · 26 points · 0 comments · 8 min read · LW link
Eliciting bad contexts · Geoffrey Irving, Joseph Bloom and Tomek Korbak · Jan 24, 2025, 10:39 AM · 34 points · 9 comments · 3 min read · LW link
Insights from “The Manga Guide to Physiology” · TurnTrout · Jan 24, 2025, 5:18 AM · 26 points · 3 comments · 1 min read · LW link · (turntrout.com)
[Question] Do you consider perfect surveillance inevitable? · samuelshadrach · Jan 24, 2025, 4:57 AM · 16 points · 34 comments · 1 min read · LW link
Uncontrollable: A Surprisingly Good Introduction to AI Risk · PeterMcCluskey · Jan 24, 2025, 4:30 AM · 11 points · 0 comments · 1 min read · LW link · (bayesianinvestor.com)
Contra Dances Getting Shorter and Earlier · jefftk · Jan 23, 2025, 11:30 PM · 11 points · 0 comments · 2 min read · LW link · (www.jefftk.com)
Starting Thoughts on RLHF · Michael Flood · Jan 23, 2025, 10:16 PM · 2 points · 0 comments · 5 min read · LW link
Updating and Editing Factual Knowledge in Language Models · Dhananjay Ashok · Jan 23, 2025, 7:34 PM · 2 points · 2 comments · 10 min read · LW link
AI companies are unlikely to make high-assurance safety cases if timelines are short · ryan_greenblatt · Jan 23, 2025, 6:41 PM · 145 points · 5 comments · 13 min read · LW link
AISN #46: The Transition · Corin Katzke and Dan H · Jan 23, 2025, 6:09 PM · 8 points · 0 comments · 5 min read · LW link · (newsletter.safe.ai)
What does success look like? · Raymond Douglas · Jan 23, 2025, 5:48 PM · 11 points · 0 comments · 3 min read · LW link
AI #100: Meet the New Boss · Zvi · Jan 23, 2025, 3:40 PM · 50 points · 4 comments · 69 min read · LW link · (thezvi.wordpress.com)
[Cross-post] Every Bay Area “Walled Compound” · davekasten · Jan 23, 2025, 3:05 PM · 37 points · 3 comments · 3 min read · LW link
Writing experiments and the banana escape valve · Dmitry Vaintrob · Jan 23, 2025, 1:11 PM · 34 points · 1 comment · 2 min read · LW link
MONA: Managed Myopia with Approval Feedback · Seb Farquhar, David Lindner and Rohin Shah · Jan 23, 2025, 12:24 PM · 80 points · 30 comments · 9 min read · LW link
[Question] How useful would alien alignment research be? · Donald Hobson · Jan 23, 2025, 10:59 AM · 17 points · 5 comments · 1 min read · LW link
What are the differences between AGI, transformative AI, and superintelligence? · Vishakha and Algon · Jan 23, 2025, 10:03 AM · 10 points · 3 comments · 3 min read · LW link · (aisafety.info)
Why Aligning an LLM is Hard, and How to Make it Easier · RogerDearnaley · Jan 23, 2025, 6:44 AM · 33 points · 3 comments · 4 min read · LW link
Tail SP 500 Call Options · sapphire · Jan 23, 2025, 5:21 AM · 67 points · 28 comments · 2 min read · LW link
A hierarchy of disagreement · Adam Zerner · Jan 23, 2025, 3:17 AM · 21 points · 4 comments · 8 min read · LW link
Early Experiments in Human Auditing for AI Control · Joey Yudelson and Buck · Jan 23, 2025, 1:34 AM · 27 points · 0 comments · 7 min read · LW link
You Have Two Brains · Eneasz · Jan 23, 2025, 12:52 AM · 24 points · 5 comments · 5 min read · LW link · (deathisbad.substack.com)
[Question] are there 2 types of alignment? · KvmanThinking · Jan 23, 2025, 12:08 AM · 4 points · 9 comments · 1 min read · LW link
Theory of Change for AI Safety Camp · Linda Linsefors · Jan 22, 2025, 10:07 PM · 36 points · 3 comments · 7 min read · LW link
On DeepSeek’s r1 · Zvi · Jan 22, 2025, 7:50 PM · 55 points · 2 comments · 35 min read · LW link · (thezvi.wordpress.com)
Detect Goodhart and shut down · Jeremy Gillen · Jan 22, 2025, 6:45 PM · 70 points · 21 comments · 7 min read · LW link
Recursive Self-Modeling as a Plausible Mechanism for Real-time Introspection in Current Language Models · rife · Jan 22, 2025, 6:36 PM · 8 points · 6 comments · 2 min read · LW link