lukemarks

Karma: 717

[Paper] Output Supervision Can Obfuscate the CoT

20 Nov 2025 22:41 UTC
75 points
3 comments · 5 min read · LW link
(arxiv.org)

[Research Note] Optimizing The Final Output Can Obfuscate CoT

30 Jul 2025 21:26 UTC
200 points
23 comments · 6 min read · LW link

Steering Language Models in Multiple Directions Simultaneously

2 May 2025 15:27 UTC
18 points
0 comments · 7 min read · LW link

lukemarks’s Shortform

lukemarks · 2 Jul 2024 6:56 UTC
4 points
19 comments · 1 min read · LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

3 Oct 2023 7:45 UTC
18 points
0 comments · 5 min read · LW link

[Question] What Does LessWrong Think of Human Intelligence Augmentation in 2023?

lukemarks · 8 Jul 2023 11:42 UTC
84 points
28 comments · 2 min read · LW link

Direct Preference Optimization in One Minute

lukemarks · 26 Jun 2023 11:52 UTC
22 points
3 comments · 2 min read · LW link