- [Question] Request for Comments on AI-related Prediction Market Ideas (PeterMcCluskey) · Mar 2, 2025, 8:52 PM · 17 points · 1 comment · 3 min read · LW link
- Statistical Challenges with Making Super IQ babies (Jan Christian Refsgaard) · Mar 2, 2025, 8:26 PM · 154 points · 26 comments · 9 min read · LW link
- Cautions about LLMs in Human Cognitive Loops (Alice Blair) · Mar 2, 2025, 7:53 PM · 39 points · 11 comments · 7 min read · LW link
- Self-fulfilling misalignment data might be poisoning our AI models (TurnTrout) · Mar 2, 2025, 7:51 PM · 153 points · 28 comments · 1 min read · LW link (turntrout.com)
- Spencer Greenberg hiring a personal/professional/research remote assistant for 5-10 hours per week (spencerg) · Mar 2, 2025, 6:01 PM · 13 points · 0 comments · LW link
- [Question] Will LLM agents become the first takeover-capable AGIs? (Seth Herd) · Mar 2, 2025, 5:15 PM · 36 points · 10 comments · 1 min read · LW link
- Not-yet-falsifiable beliefs? (Benjamin Hendricks) · Mar 2, 2025, 2:11 PM · 6 points · 4 comments · 1 min read · LW link
- Saving Zest (jefftk) · Mar 2, 2025, 12:00 PM · 24 points · 1 comment · 1 min read · LW link (www.jefftk.com)
- Open Thread Spring 2025 (Ben Pace) · Mar 2, 2025, 2:33 AM · 19 points · 50 comments · 1 min read · LW link
- [Question] help, my self image as rational is affecting my ability to empathize with others (KvmanThinking) · Mar 2, 2025, 2:06 AM · 1 point · 13 comments · 1 min read · LW link
- Maintaining Alignment during RSI as a Feedback Control Problem (beren) · Mar 2, 2025, 12:21 AM · 66 points · 6 comments · 11 min read · LW link
- AI Safety Policy Won’t Go On Like This – AI Safety Advocacy Is Failing Because Nobody Cares. (henophilia) · Mar 1, 2025, 8:15 PM · 1 point · 1 comment · 1 min read · LW link (blog.hermesloom.org)
- Meaning Machines (appromoximate) · Mar 1, 2025, 7:16 PM · 0 points · 0 comments · 13 min read · LW link
- [Question] Share AI Safety Ideas: Both Crazy and Not (ank) · Mar 1, 2025, 7:08 PM · 16 points · 28 comments · 1 min read · LW link
- Historiographical Compressions: Renaissance as An Example (adamShimi) · Mar 1, 2025, 6:21 PM · 17 points · 4 comments · 7 min read · LW link (formethods.substack.com)
- Real-Time Gigstats (jefftk) · Mar 1, 2025, 2:10 PM · 9 points · 0 comments · 1 min read · LW link (www.jefftk.com)
- Open problems in emergent misalignment (Jan Betley and Daniel Tan) · Mar 1, 2025, 9:47 AM · 82 points · 13 comments · 7 min read · LW link
- Estimating the Probability of Sampling a Trained Neural Network at Random (Adam Scherlis and Nora Belrose) · Mar 1, 2025, 2:11 AM · 32 points · 10 comments · 1 min read · LW link (arxiv.org)
- [Question] What nation did Trump prevent from going to war (Feb. 2025)? (James Camacho) · Mar 1, 2025, 1:46 AM · 3 points · 3 comments · 1 min read · LW link
- AXRP Episode 38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future (DanielFilan) · Mar 1, 2025, 1:20 AM · 13 points · 0 comments · 13 min read · LW link
- TamperSec is hiring for 3 Key Roles! (Jonathan_H) · Feb 28, 2025, 11:10 PM · 15 points · 0 comments · 4 min read · LW link
- Do we want alignment faking? (Florian_Dietz) · Feb 28, 2025, 9:50 PM · 7 points · 4 comments · 1 min read · LW link
- Few concepts mixing dark fantasy and science fiction (Marek Zegarek) · Feb 28, 2025, 9:03 PM · 0 points · 0 comments · 3 min read · LW link
- Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs (tenseisoham) · Feb 28, 2025, 8:22 PM · 3 points · 0 comments · 9 min read · LW link
- How to Contribute to Theoretical Reward Learning Research (Joar Skalse) · Feb 28, 2025, 7:27 PM · 16 points · 0 comments · 21 min read · LW link
- Other Papers About the Theory of Reward Learning (Joar Skalse) · Feb 28, 2025, 7:26 PM · 16 points · 0 comments · 5 min read · LW link
- Defining and Characterising Reward Hacking (Joar Skalse) · Feb 28, 2025, 7:25 PM · 15 points · 0 comments · 4 min read · LW link
- Misspecification in Inverse Reinforcement Learning—Part II (Joar Skalse) · Feb 28, 2025, 7:24 PM · 9 points · 0 comments · 7 min read · LW link
- STARC: A General Framework For Quantifying Differences Between Reward Functions (Joar Skalse) · Feb 28, 2025, 7:24 PM · 11 points · 0 comments · 8 min read · LW link
- Misspecification in Inverse Reinforcement Learning (Joar Skalse) · Feb 28, 2025, 7:24 PM · 19 points · 0 comments · 11 min read · LW link
- Partial Identifiability in Reward Learning (Joar Skalse) · Feb 28, 2025, 7:23 PM · 16 points · 0 comments · 12 min read · LW link
- The Theoretical Reward Learning Research Agenda: Introduction and Motivation (Joar Skalse) · Feb 28, 2025, 7:20 PM · 26 points · 4 comments · 14 min read · LW link
- An Open Letter To EA and AI Safety On Decelerating AI Development (kenneth_diao) · Feb 28, 2025, 5:21 PM · 8 points · 0 comments · 14 min read · LW link (graspingatwaves.substack.com)
- Dance Weekend Pay II (jefftk) · Feb 28, 2025, 3:10 PM · 11 points · 0 comments · 1 min read · LW link (www.jefftk.com)
- Existentialists and Trolleys (David Gross) · Feb 28, 2025, 2:01 PM · 5 points · 3 comments · 7 min read · LW link
- On Emergent Misalignment (Zvi) · Feb 28, 2025, 1:10 PM · 88 points · 5 comments · 22 min read · LW link (thezvi.wordpress.com)
- Do safety-relevant LLM steering vectors optimized on a single example generalize? (Jacob Dunefsky) · Feb 28, 2025, 12:01 PM · 20 points · 1 comment · 14 min read · LW link (arxiv.org)
- Tetherware #2: What every human should know about our most likely AI future (Jáchym Fibír) · Feb 28, 2025, 11:12 AM · 3 points · 0 comments · 11 min read · LW link (tetherware.substack.com)
- Notes on Superwisdom & Moral RSI (welfvh) · Feb 28, 2025, 10:34 AM · 1 point · 4 comments · 1 min read · LW link
- Cycles (a short story by Claude 3.7 and me) (Knight Lee) · Feb 28, 2025, 7:04 AM · 9 points · 0 comments · 5 min read · LW link
- January-February 2025 Progress in Guaranteed Safe AI (Quinn) · Feb 28, 2025, 3:10 AM · 15 points · 1 comment · 8 min read · LW link (gsai.substack.com)
- Exploring unfaithful/deceptive CoT in reasoning models (Lucy Wingard) · Feb 28, 2025, 2:54 AM · 4 points · 0 comments · 6 min read · LW link
- Weirdness Points (lsusr) · Feb 28, 2025, 2:23 AM · 62 points · 19 comments · 3 min read · LW link
- Do you need years more therapy, or could one conversation resolve the issue? (Chipmonk) · Feb 28, 2025, 12:06 AM · 9 points · 10 comments · 1 min read · LW link
- [New Jersey] HPMOR 10 Year Anniversary Party 🎉 (🟠UnlimitedOranges🟠) · Feb 27, 2025, 10:30 PM · 4 points · 0 comments · 1 min read · LW link
- OpenAI releases GPT-4.5 (Seth Herd) · Feb 27, 2025, 9:40 PM · 34 points · 12 comments · 3 min read · LW link (openai.com)
- The Elicitation Game: Evaluating capability elicitation techniques (Teun van der Weij, Felix Hofstätter, JaydenTeoh, HenningB and Francis Rhys Ward) · Feb 27, 2025, 8:33 PM · 10 points · 0 comments · 2 min read · LW link
- For the Sake of Pleasure Alone (Greenless Mirror) · Feb 27, 2025, 8:07 PM · 4 points · 14 comments · 12 min read · LW link
- Keeping AI Subordinate to Human Thought: A Proposal for Public AI Conversations (syh) · Feb 27, 2025, 8:00 PM · −1 points · 0 comments · 1 min read · LW link (medium.com)
- How to Corner Liars: A Miasma-Clearing Protocol (ymeskhout) · Feb 27, 2025, 5:18 PM · 62 points · 23 comments · 7 min read · LW link (www.ymeskhout.com)