Bruce W. Lee
Karma: 521
Preliminary Explorations on Latent Side Task Uplift
Bruce W. Lee | 2 Apr 2026 2:23 UTC | 13 points | 0 comments | 4 min read
Reasoning Models Struggle to Control Their Chains of Thought
Yueh Han "John" Chen, robert mccarthy, Bruce W. Lee and Tomek Korbak | 5 Mar 2026 22:37 UTC | 76 points | 9 comments | 3 min read
Salient Directions in AI Control
Bruce W. Lee | 5 Mar 2026 19:38 UTC | 13 points | 0 comments | 14 min read | (brucewlee.com)
Training Agents to Self-Report Misbehavior
Bruce W. Lee, Yueh Han "John" Chen and Tomek Korbak | 25 Feb 2026 17:50 UTC | 26 points | 0 comments | 8 min read
Bruce W. Lee’s Shortform
Bruce W. Lee | 19 Feb 2026 1:53 UTC | 4 points | 2 comments | 1 min read
Bitter Lessons from Distillation Robustifies Unlearning
Bruce W. Lee | 28 Nov 2025 1:31 UTC | 27 points | 3 comments | 7 min read | (www.lesswrong.com)
Distillation Robustifies Unlearning
Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud and TurnTrout | 13 Jun 2025 13:45 UTC | 239 points | 43 comments | 8 min read | (arxiv.org)
Programming Refusal with Conditional Activation Steering
Bruce W. Lee | 11 Sep 2024 20:57 UTC | 41 points | 0 comments | 11 min read | (brucewlee.com)
Language Models Don’t Learn the Physical Manifestation of Language
Bruce W. Lee and Jaehyuk Lim | 22 Feb 2024 18:52 UTC | 39 points | 23 comments | 1 min read | (arxiv.org)