lukemarks

Karma: 722

[Paper] Output Supervision Can Obfuscate the CoT

jacob_drori, lukemarks, cloud and TurnTrout

20 Nov 2025 22:41 UTC

78 points

3 comments · 5 min read · LW link

(arxiv.org)

[Research Note] Optimizing The Final Output Can Obfuscate CoT

lukemarks, jacob_drori, cloud and TurnTrout

30 Jul 2025 21:26 UTC

201 points

23 comments · 6 min read · LW link

Steering Language Models in Multiple Directions Simultaneously

lukemarks, Narmeen and Amirali Abdullah

2 May 2025 15:27 UTC

18 points

0 comments · 7 min read · LW link

lukemarks’s Shortform

lukemarks

2 Jul 2024 6:56 UTC

4 points

19 comments · 1 min read · LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

lukemarks, Amirali Abdullah, Rauno Arike, fbarez and nothoughtsheadempty

3 Oct 2023 7:45 UTC

18 points

0 comments · 5 min read · LW link

[Question] What Does LessWrong Think of Human Intelligence Augmentation in 2023?

lukemarks

8 Jul 2023 11:42 UTC

84 points

28 comments · 2 min read · LW link

Direct Preference Optimization in One Minute

lukemarks

26 Jun 2023 11:52 UTC

22 points

3 comments · 2 min read · LW link