cloud

Karma: 1,526

Apply for Alignment Mentorship from TurnTrout and Alex Cloud

TurnTrout and cloud

26 Dec 2025 17:20 UTC

42 points

0 comments · 2 min read

(turntrout.com)

[Paper] Output Supervision Can Obfuscate the CoT

jacob_drori, lukemarks, cloud and TurnTrout

20 Nov 2025 22:41 UTC

92 points

3 comments · 5 min read

(arxiv.org)

Omniscaling to MNIST

cloud

8 Nov 2025 19:42 UTC

104 points

3 comments · 10 min read

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

ariana_azarbal, Victor Gillioz, TurnTrout and cloud

14 Oct 2025 0:53 UTC

144 points

15 comments · 10 min read

cloud’s Shortform

cloud

17 Sep 2025 7:41 UTC

6 points

1 comment · 1 min read

Narrow finetuning is different

cloud and Stewy Slocum

5 Aug 2025 14:29 UTC

70 points

3 comments · 4 min read

[Research Note] Optimizing The Final Output Can Obfuscate CoT

lukemarks, jacob_drori, cloud and TurnTrout

30 Jul 2025 21:26 UTC

202 points

23 comments · 6 min read

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

cloud, mle and Owain_Evans

22 Jul 2025 16:37 UTC

348 points

40 comments · 4 min read

Selective Generalization: Improving Capabilities While Maintaining Alignment

ariana_azarbal, Matthew A. Clarke, Jorio Cocola, Cailley Factor and cloud

16 Jul 2025 21:25 UTC

82 points

6 comments · 7 min read

Distillation Robustifies Unlearning

Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud and TurnTrout

13 Jun 2025 13:45 UTC

239 points

43 comments · 8 min read

(arxiv.org)

Selective modularity: a research agenda

cloud and Jacob G-W

24 Mar 2025 4:12 UTC

72 points

3 comments · 24 min read

[Question] Is weak-to-strong generalization an alignment technique?

cloud

31 Jan 2025 7:13 UTC

22 points

1 comment · 2 min read

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

cloud, Jacob G-W, Evzen, Joseph Miller and TurnTrout

6 Dec 2024 22:19 UTC

180 points

16 comments · 11 min read · 1 review

(arxiv.org)