Sam Marks
Karma: 3,185
Modifying LLM Beliefs with Synthetic Document Finetuning
RowanWang, Johannes Treutlein, Avery, Ethan Perez, Fabien Roger and Sam Marks
Apr 24, 2025, 9:15 PM | 69 points | 12 comments | 2 min read | (alignment.anthropic.com)
Downstream applications as validation of interpretability progress
Sam Marks
Mar 31, 2025, 1:35 AM | 112 points | 3 comments | 7 min read
Auditing language models for hidden objectives
Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub
Mar 13, 2025, 7:18 PM | 141 points | 15 comments | 13 min read
Recommendations for Technical AI Safety Research Directions
Sam Marks
Jan 10, 2025, 7:34 PM | 64 points | 1 comment | 17 min read | (alignment.anthropic.com)
Alignment Faking in Large Language Models
ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck
Dec 18, 2024, 5:19 PM | 483 points | 75 comments | 10 min read
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks and Neel Nanda
Dec 11, 2024, 6:30 AM | 82 points | 6 comments | 2 min read | (www.neuronpedia.org)
Evaluating Sparse Autoencoders with Board Game Models
Adam Karvonen, Sam Marks, Can, Benjamin Wright, Jannik Brinkmann, Logan Riggs and Rico Angell
Aug 2, 2024, 7:50 PM | 38 points | 1 comment | 9 min read
Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight
Sam Marks
Apr 18, 2024, 4:17 PM | 113 points | 10 comments | 12 min read
What’s up with LLMs representing XORs of arbitrary features?
Sam Marks
Jan 3, 2024, 7:44 PM | 158 points | 63 comments | 16 min read
Some open-source dictionaries and dictionary learning infrastructure
Sam Marks
Dec 5, 2023, 6:05 AM | 46 points | 7 comments | 5 min read
Thoughts on open source AI
Sam Marks
Nov 3, 2023, 3:35 PM | 62 points | 17 comments | 10 min read
Turning off lights with model editing
Sam Marks
12 May 2023 20:25 UTC | 68 points | 5 comments | 2 min read | (arxiv.org)
[Crosspost] ACX 2022 Prediction Contest Results
Scott Alexander, Eric Neyman and Sam Marks
24 Jan 2023 6:56 UTC | 48 points | 6 comments | 8 min read
AGISF adaptation for in-person groups
Sam Marks, Xander Davies and Richard_Ngo
13 Jan 2023 3:24 UTC | 44 points | 2 comments | 3 min read
Update on Harvard AI Safety Team and MIT AI Alignment
Xander Davies, Sam Marks, kaivu, tlevin, eleni, maxnadeau and Naomi Bashkansky
2 Dec 2022 0:56 UTC | 60 points | 4 comments | 8 min read
Recommend HAIST resources for assessing the value of RLHF-related alignment research
Sam Marks and Xander Davies
5 Nov 2022 20:58 UTC | 26 points | 9 comments | 3 min read
Caution when interpreting Deepmind’s In-context RL paper
Sam Marks
1 Nov 2022 2:42 UTC | 105 points | 8 comments | 4 min read
Safety considerations for online generative modeling
Sam Marks
7 Jul 2022 18:31 UTC | 42 points | 9 comments | 14 min read
Proxy misspecification and the capabilities vs. value learning race
Sam Marks
16 May 2022 18:58 UTC | 23 points | 3 comments | 4 min read
If you’re very optimistic about ELK then you should be optimistic about outer alignment
Sam Marks
27 Apr 2022 19:30 UTC | 17 points | 8 comments | 3 min read