The Compliment Sandwich 🥪 aka: How to criticize a normie without making them upset. · keltan · Mar 3, 2025, 11:15 PM · 13 points · 10 comments · 1 min read · LW link
AI Safety at the Frontier: Paper Highlights, February ’25 · gasteigerjo · Mar 3, 2025, 10:09 PM · 7 points · 0 comments · 7 min read · LW link (aisafetyfrontier.substack.com)
What goals will AIs have? A list of hypotheses · Daniel Kokotajlo · Mar 3, 2025, 8:08 PM · 87 points · 19 comments · 18 min read · LW link
Takeaways From Our Recent Work on SAE Probing · Josh Engels, Subhash Kantamneni, Senthooran Rajamanoharan and Neel Nanda · Mar 3, 2025, 7:50 PM · 30 points · 0 comments · 5 min read · LW link
Why People Commit White Collar Fraud (Ozy linkpost) · sapphire · Mar 3, 2025, 7:33 PM · 22 points · 1 comment · 1 min read · LW link (thingofthings.substack.com)
[Question] Ask Me Anything—Samuel · samuelshadrach · Mar 3, 2025, 7:24 PM · 0 points · 0 comments · 1 min read · LW link
Expanding HarmBench: Investigating Gaps & Extending Adversarial LLM Testing · racinkc1 · Mar 3, 2025, 7:23 PM · 1 point · 0 comments · 1 min read · LW link
Could Advanced AI Accelerate the Pace of AI Progress? Interviews with AI Researchers · jleibowich, Nikola Jurkovic and Tom Davidson · Mar 3, 2025, 7:05 PM · 43 points · 1 comment · 1 min read · LW link (papers.ssrn.com)
Middle School Choice · jefftk · Mar 3, 2025, 4:10 PM · 27 points · 10 comments · 4 min read · LW link (www.jefftk.com)
On GPT-4.5 · Zvi · Mar 3, 2025, 1:40 PM · 44 points · 12 comments · 22 min read · LW link (thezvi.wordpress.com)
Coalescence—Determinism In Ways We Care About · vitaliya · Mar 3, 2025, 1:20 PM · 12 points · 0 comments · 11 min read · LW link
Methods for strong human germline engineering · TsviBT · Mar 3, 2025, 8:13 AM · 149 points · 28 comments · 108 min read · LW link
[Question] Examples of self-fulfilling prophecies in AI alignment? · Chris Lakin · Mar 3, 2025, 2:45 AM · 22 points · 6 comments · 1 min read · LW link
[Question] Request for Comments on AI-related Prediction Market Ideas · PeterMcCluskey · Mar 2, 2025, 8:52 PM · 17 points · 1 comment · 3 min read · LW link
Statistical Challenges with Making Super IQ babies · Jan Christian Refsgaard · Mar 2, 2025, 8:26 PM · 154 points · 26 comments · 9 min read · LW link
Cautions about LLMs in Human Cognitive Loops · Alice Blair · Mar 2, 2025, 7:53 PM · 39 points · 11 comments · 7 min read · LW link
Self-fulfilling misalignment data might be poisoning our AI models · TurnTrout · Mar 2, 2025, 7:51 PM · 153 points · 28 comments · 1 min read · LW link (turntrout.com)
Spencer Greenberg hiring a personal/professional/research remote assistant for 5-10 hours per week · spencerg · Mar 2, 2025, 6:01 PM · 13 points · 0 comments · LW link
[Question] Will LLM agents become the first takeover-capable AGIs? · Seth Herd · Mar 2, 2025, 5:15 PM · 36 points · 10 comments · 1 min read · LW link
Not-yet-falsifiable beliefs? · Benjamin Hendricks · Mar 2, 2025, 2:11 PM · 6 points · 4 comments · 1 min read · LW link
Saving Zest · jefftk · Mar 2, 2025, 12:00 PM · 24 points · 1 comment · 1 min read · LW link (www.jefftk.com)
Open Thread Spring 2025 · Ben Pace · Mar 2, 2025, 2:33 AM · 19 points · 51 comments · 1 min read · LW link
[Question] help, my self image as rational is affecting my ability to empathize with others · KvmanThinking · Mar 2, 2025, 2:06 AM · 1 point · 13 comments · 1 min read · LW link
Maintaining Alignment during RSI as a Feedback Control Problem · beren · Mar 2, 2025, 12:21 AM · 66 points · 6 comments · 11 min read · LW link
AI Safety Policy Won’t Go On Like This – AI Safety Advocacy Is Failing Because Nobody Cares. · henophilia · Mar 1, 2025, 8:15 PM · 1 point · 1 comment · 1 min read · LW link (blog.hermesloom.org)
Meaning Machines · appromoximate · Mar 1, 2025, 7:16 PM · 0 points · 0 comments · 13 min read · LW link
[Question] Share AI Safety Ideas: Both Crazy and Not · ank · Mar 1, 2025, 7:08 PM · 16 points · 28 comments · 1 min read · LW link
Historiographical Compressions: Renaissance as An Example · adamShimi · Mar 1, 2025, 6:21 PM · 17 points · 4 comments · 7 min read · LW link (formethods.substack.com)
Real-Time Gigstats · jefftk · Mar 1, 2025, 2:10 PM · 9 points · 0 comments · 1 min read · LW link (www.jefftk.com)
Open problems in emergent misalignment · Jan Betley and Daniel Tan · Mar 1, 2025, 9:47 AM · 82 points · 13 comments · 7 min read · LW link
Estimating the Probability of Sampling a Trained Neural Network at Random · Adam Scherlis and Nora Belrose · Mar 1, 2025, 2:11 AM · 32 points · 10 comments · 1 min read · LW link (arxiv.org)
[Question] What nation did Trump prevent from going to war (Feb. 2025)? · James Camacho · Mar 1, 2025, 1:46 AM · 3 points · 3 comments · 1 min read · LW link
AXRP Episode 38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future · DanielFilan · Mar 1, 2025, 1:20 AM · 13 points · 0 comments · 13 min read · LW link
TamperSec is hiring for 3 Key Roles! · Jonathan_H · Feb 28, 2025, 11:10 PM · 15 points · 0 comments · 4 min read · LW link
Do we want alignment faking? · Florian_Dietz · Feb 28, 2025, 9:50 PM · 7 points · 4 comments · 1 min read · LW link
Few concepts mixing dark fantasy and science fiction · Marek Zegarek · Feb 28, 2025, 9:03 PM · 0 points · 0 comments · 3 min read · LW link
Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs · tenseisoham · Feb 28, 2025, 8:22 PM · 3 points · 0 comments · 9 min read · LW link
How to Contribute to Theoretical Reward Learning Research · Joar Skalse · Feb 28, 2025, 7:27 PM · 16 points · 0 comments · 21 min read · LW link
Other Papers About the Theory of Reward Learning · Joar Skalse · Feb 28, 2025, 7:26 PM · 16 points · 0 comments · 5 min read · LW link
Defining and Characterising Reward Hacking · Joar Skalse · Feb 28, 2025, 7:25 PM · 15 points · 0 comments · 4 min read · LW link
Misspecification in Inverse Reinforcement Learning—Part II · Joar Skalse · Feb 28, 2025, 7:24 PM · 9 points · 0 comments · 7 min read · LW link
STARC: A General Framework For Quantifying Differences Between Reward Functions · Joar Skalse · Feb 28, 2025, 7:24 PM · 11 points · 0 comments · 8 min read · LW link
Misspecification in Inverse Reinforcement Learning · Joar Skalse · Feb 28, 2025, 7:24 PM · 19 points · 0 comments · 11 min read · LW link
Partial Identifiability in Reward Learning · Joar Skalse · 28 Feb 2025 19:23 UTC · 16 points · 0 comments · 12 min read · LW link
The Theoretical Reward Learning Research Agenda: Introduction and Motivation · Joar Skalse · 28 Feb 2025 19:20 UTC · 26 points · 4 comments · 14 min read · LW link
An Open Letter To EA and AI Safety On Decelerating AI Development · kenneth_diao · 28 Feb 2025 17:21 UTC · 8 points · 0 comments · 14 min read · LW link (graspingatwaves.substack.com)
Dance Weekend Pay II · jefftk · 28 Feb 2025 15:10 UTC · 11 points · 0 comments · 1 min read · LW link (www.jefftk.com)
Existentialists and Trolleys · David Gross · 28 Feb 2025 14:01 UTC · 5 points · 3 comments · 7 min read · LW link
On Emergent Misalignment · Zvi · 28 Feb 2025 13:10 UTC · 88 points · 5 comments · 22 min read · LW link (thezvi.wordpress.com)
Do safety-relevant LLM steering vectors optimized on a single example generalize? · Jacob Dunefsky · 28 Feb 2025 12:01 UTC · 20 points · 1 comment · 14 min read · LW link (arxiv.org)