RowanWang
Karma: 316
https://rowankwang.com/
Building and evaluating alignment auditing agents
Sam Marks, trentbrick, RowanWang, Sam Bowman, Euan Ong, Johannes Treutlein and evhub
24 Jul 2025 19:22 UTC, 47 points, 1 comment, 5 min read
Modifying LLM Beliefs with Synthetic Document Finetuning
RowanWang, Johannes Treutlein, Avery, Ethan Perez, Fabien Roger and Sam Marks
24 Apr 2025 21:15 UTC, 70 points, 12 comments, 2 min read (alignment.anthropic.com)
Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small
RowanWang, Alexandre Variengien, Arthur Conmy, Buck and jsteinhardt
28 Oct 2022 23:55 UTC, 101 points, 9 comments, 9 min read, 2 reviews (arxiv.org)
Gears-Level Mental Models of Transformer Interpretability
RowanWang
29 Mar 2022 20:09 UTC, 75 points, 4 comments, 6 min read
Lessons After a Couple Months of Trying to Do ML Research
RowanWang
22 Mar 2022 23:45 UTC, 71 points, 8 comments, 6 min read