Bruce W. Lee

Karma: 521

Preliminary Explorations on Latent Side Task Uplift

Bruce W. Lee

2 Apr 2026 2:23 UTC

13 points

0 comments · 4 min read · LW link

Reasoning Models Struggle to Control Their Chains of Thought

Yueh Han "John" Chen, robert mccarthy, Bruce W. Lee and Tomek Korbak

5 Mar 2026 22:37 UTC

76 points

9 comments · 3 min read · LW link

Salient Directions in AI Control

Bruce W. Lee

5 Mar 2026 19:38 UTC

13 points

0 comments · 14 min read · LW link

(brucewlee.com)

Training Agents to Self-Report Misbehavior

Bruce W. Lee, Yueh Han "John" Chen and Tomek Korbak

25 Feb 2026 17:50 UTC

26 points

0 comments · 8 min read · LW link

Bruce W. Lee’s Shortform

Bruce W. Lee

19 Feb 2026 1:53 UTC

4 points

2 comments · 1 min read · LW link

Bitter Lessons from Distillation Robustifies Unlearning

Bruce W. Lee

28 Nov 2025 1:31 UTC

27 points

3 comments · 7 min read · LW link

(www.lesswrong.com)

Distillation Robustifies Unlearning

Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud and TurnTrout

13 Jun 2025 13:45 UTC

239 points

43 comments · 8 min read · LW link

(arxiv.org)

Programming Refusal with Conditional Activation Steering

Bruce W. Lee

11 Sep 2024 20:57 UTC

41 points

0 comments · 11 min read · LW link

(brucewlee.com)

Language Models Don’t Learn the Physical Manifestation of Language

Bruce W. Lee and Jaehyuk Lim

22 Feb 2024 18:52 UTC

39 points

23 comments · 1 min read · LW link

(arxiv.org)