lukemarks (Karma: 717)
[Paper] Output Supervision Can Obfuscate the CoT
jacob_drori, lukemarks, cloud and TurnTrout
20 Nov 2025 22:41 UTC | 75 points | 3 comments | 5 min read | (arxiv.org)
[Research Note] Optimizing The Final Output Can Obfuscate CoT
lukemarks, jacob_drori, cloud and TurnTrout
30 Jul 2025 21:26 UTC | 200 points | 23 comments | 6 min read
Steering Language Models in Multiple Directions Simultaneously
lukemarks, Narmeen and Amirali Abdullah
2 May 2025 15:27 UTC | 18 points | 0 comments | 7 min read
lukemarks’s Shortform
lukemarks
2 Jul 2024 6:56 UTC | 4 points | 19 comments | 1 min read
Early Experiments in Reward Model Interpretation Using Sparse Autoencoders
lukemarks, Amirali Abdullah, Rauno Arike, fbarez and nothoughtsheadempty
3 Oct 2023 7:45 UTC | 18 points | 0 comments | 5 min read
[Question] What Does LessWrong Think of Human Intelligence Augmentation in 2023?
lukemarks
8 Jul 2023 11:42 UTC | 84 points | 28 comments | 2 min read
Direct Preference Optimization in One Minute
lukemarks
26 Jun 2023 11:52 UTC | 22 points | 3 comments | 2 min read