Archive
In (highly contingent!) defense of interpretability-in-the-loop ML training
Steven Byrnes · 6 Feb 2026 16:32 UTC · 68 points · 3 comments · 3 min read · LW link

Increasing AI Strategic Competence as a Safety Approach
Wei Dai · 3 Feb 2026 1:08 UTC · 45 points · 8 comments · 1 min read · LW link

“Features” aren’t always the true computational primitives of a model, but that might be fine anyways
LawrenceC · 2 Feb 2026 18:41 UTC · 18 points · 0 comments · 5 min read · LW link

Are there lessons from high-reliability engineering for AGI safety?
Steven Byrnes · 2 Feb 2026 15:26 UTC · 90 points · 6 comments · 8 min read · LW link

Fitness-Seekers: Generalizing the Reward-Seeking Threat Model
Alex Mallen · 29 Jan 2026 19:42 UTC · 83 points · 4 comments · 17 min read · LW link

AlgZoo: uninterpreted models with fewer than 1,500 parameters
Jacob_Hilton · 26 Jan 2026 17:30 UTC · 176 points · 7 comments · 10 min read · LW link (www.alignment.org)

New version of “Intro to Brain-Like-AGI Safety”
Steven Byrnes · 23 Jan 2026 16:21 UTC · 58 points · 1 comment · 19 min read · LW link

When should we train against a scheming monitor?
Mary Phuong · 21 Jan 2026 20:48 UTC · 22 points · 4 comments · 5 min read · LW link

No instrumental convergence without AI psychology
TurnTrout · 20 Jan 2026 22:16 UTC · 68 points · 7 comments · 6 min read · LW link (turntrout.com)

Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training
RogerDearnaley · 19 Jan 2026 21:24 UTC · 102 points · 12 comments · 11 min read · LW link (arxiv.org)

Could LLM alignment research reduce x-risk if the first takeover-capable AI is not an LLM?
Tim Hua · 19 Jan 2026 18:09 UTC · 24 points · 1 comment · 6 min read · LW link

Desiderata of good problems to hand off to AIs
Jozdien · 19 Jan 2026 16:55 UTC · 29 points · 1 comment · 4 min read · LW link

Gradual Paths to Collective Flourishing
Nora_Ammann · 19 Jan 2026 7:52 UTC · 37 points · 10 comments · 13 min read · LW link

Test your interpretability techniques by de-censoring Chinese models
Khoi Tran, aryaj, Senthooran Rajamanoharan and Neel Nanda · 15 Jan 2026 16:33 UTC · 85 points · 10 comments · 20 min read · LW link

Apply to Vanessa’s mentorship at PIBBSS
Vanessa Kosoy · 14 Jan 2026 9:15 UTC · 39 points · 0 comments · 2 min read · LW link

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought
Riya Tyagi, daria, Arthur Conmy and Neel Nanda · 13 Jan 2026 20:40 UTC · 48 points · 0 comments · 18 min read · LW link

Brief Explorations in LLM Value Rankings
Tim Hua, Josh Engels, Neel Nanda and Senthooran Rajamanoharan · 12 Jan 2026 18:16 UTC · 37 points · 1 comment · 11 min read · LW link

Practical challenges of control monitoring in frontier AI deployments
David Lindner and charlie_griffin · 12 Jan 2026 16:45 UTC · 19 points · 0 comments · 1 min read · LW link (arxiv.org)

My 2003 Post on the Evolutionary Argument for AI Misalignment
Wei Dai · 6 Jan 2026 20:45 UTC · 37 points · 7 comments · 2 min read · LW link

How hard is it to inoculate against misalignment generalization?
Jozdien · 6 Jan 2026 17:30 UTC · 42 points · 4 comments · 14 min read · LW link