cloud

Karma: 1,382

[Paper] Output Supervision Can Obfuscate the CoT

20 Nov 2025 22:41 UTC
75 points
2 comments · 5 min read · LW link
(arxiv.org)

Omniscaling to MNIST

cloud · 8 Nov 2025 19:42 UTC
86 points
3 comments · 10 min read · LW link

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

14 Oct 2025 0:53 UTC
122 points
13 comments · 9 min read · LW link

cloud’s Shortform

cloud · 17 Sep 2025 7:41 UTC
6 points
1 comment · 1 min read · LW link

Narrow finetuning is different

5 Aug 2025 14:29 UTC
67 points
3 comments · 4 min read · LW link

[Research Note] Optimizing The Final Output Can Obfuscate CoT

30 Jul 2025 21:26 UTC
198 points
23 comments · 6 min read · LW link

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

22 Jul 2025 16:37 UTC
342 points
39 comments · 4 min read · LW link

Selective Generalization: Improving Capabilities While Maintaining Alignment

16 Jul 2025 21:25 UTC
67 points
4 comments · 7 min read · LW link

Distillation Robustifies Unlearning

13 Jun 2025 13:45 UTC
234 points
43 comments · 8 min read · LW link
(arxiv.org)