cloud

Karma: 1,460

Apply for Alignment Mentorship from TurnTrout and Alex Cloud

26 Dec 2025 17:20 UTC
40 points
0 comments · 2 min read · LW link
(turntrout.com)

[Paper] Output Supervision Can Obfuscate the CoT

20 Nov 2025 22:41 UTC
75 points
3 comments · 5 min read · LW link
(arxiv.org)

Omniscaling to MNIST

cloud · 8 Nov 2025 19:42 UTC
90 points
3 comments · 10 min read · LW link

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

14 Oct 2025 0:53 UTC
141 points
15 comments · 11 min read · LW link

cloud’s Shortform

cloud · 17 Sep 2025 7:41 UTC
6 points
1 comment · 1 min read · LW link

Narrow finetuning is different

5 Aug 2025 14:29 UTC
70 points
3 comments · 4 min read · LW link

[Research Note] Optimizing The Final Output Can Obfuscate CoT

30 Jul 2025 21:26 UTC
200 points
23 comments · 6 min read · LW link

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

22 Jul 2025 16:37 UTC
345 points
40 comments · 4 min read · LW link

Selective Generalization: Improving Capabilities While Maintaining Alignment

16 Jul 2025 21:25 UTC
75 points
6 comments · 7 min read · LW link

Distillation Robustifies Unlearning

13 Jun 2025 13:45 UTC
236 points
43 comments · 8 min read · LW link
(arxiv.org)

Selective modularity: a research agenda

24 Mar 2025 4:12 UTC
71 points
2 comments · 24 min read · LW link

[Question] Is weak-to-strong generalization an alignment technique?

cloud · 31 Jan 2025 7:13 UTC
22 points
1 comment · 2 min read · LW link

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

6 Dec 2024 22:19 UTC
172 points
15 comments · 11 min read · LW link · 1 review
(arxiv.org)