RowanWang

Karma: 378

28 Apr 2026 19:02 UTC

41 points

2 comments 12 min read LW link

(alignment.anthropic.com)

RowanWang 17 Feb 2026 1:53 UTC
2 points
0
in reply to: Kei Nishimura-Gasparian’s comment on: Sam Marks’s Shortform
Oh man, it totally was wrong, sorry about that; I’ve updated the data again. I looked at the train datasets from the various models we trained and reran the data generation pipeline, and the results looked as expected, so I don’t think I trained models on the wrong data for the original results, but I’m not fully sure how this data mix came about. It looks like a combination of the followup and goals data; I think Claude might have accidentally mixed them when I was having it sanitize the data for release.

Also, fwiw, depending on what you’re using this data for, you should probably just regenerate it: it’s not that hard, and you could probably easily generate more diverse data. It probably also helps if the prompts actually elicit deception on the model you’re working with.

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

RowanWang, Sam Marks, Johannes Treutlein, evhub and Fabien Roger

25 Nov 2025 19:33 UTC

41 points

0 comments 4 min read LW link

(alignment.anthropic.com)

Building and evaluating alignment auditing agents

Sam Marks, trentbrick, RowanWang, Sam Bowman, Euan Ong, Johannes Treutlein and evhub

24 Jul 2025 19:22 UTC

47 points

1 comment 5 min read LW link

Modifying LLM Beliefs with Synthetic Document Finetuning

RowanWang, Johannes Treutlein, Avery, Ethan Perez, Fabien Roger and Sam Marks

24 Apr 2025 21:15 UTC

77 points

12 comments 2 min read LW link

(alignment.anthropic.com)

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

RowanWang, Alexandre Variengien, Arthur Conmy, Buck and jsteinhardt

28 Oct 2022 23:55 UTC

102 points

9 comments 9 min read LW link 2 reviews

(arxiv.org)

Gears-Level Mental Models of Transformer Interpretability

RowanWang 29 Mar 2022 20:09 UTC

77 points

4 comments 6 min read LW link

RowanWang 27 Mar 2022 21:43 UTC
2 points
0
in reply to: eeegnu’s comment on: Lessons After a Couple Months of Trying to Do ML Research
Hm, I think this tool would’ve been really helpful for me on a couple of occasions in the past. Usually if I want to save a cell output, I just won’t edit that cell and I’ll create a new one, even if it means redundant code.

Also +1 on keeping track of bugs! I should’ve added to the og post that one thing I do that’s really helpful for me is keeping track of procedural knowledge (e.g. how to set up a GPU, how to fix common issue X, etc.) in a personal Slack that I’ve created as, basically, a second brain. I found that I used the message-yourself-in-Slack feature a lot to keep small notes for myself, and since I did it so much, I created a whole private, personal Slack, and that’s been pretty useful for keeping track of bugs, etc.

RowanWang 27 Mar 2022 21:36 UTC
1 point
0
in reply to: Joe Kwon’s comment on: Lessons After a Couple Months of Trying to Do ML Research
Let’s definitely catch up!

RowanWang 27 Mar 2022 21:35 UTC
1 point
0
in reply to: ViktoriaMalyasova’s comment on: Lessons After a Couple Months of Trying to Do ML Research
Yes! Especially if you show you can provide relevant thoughts about their work, a lot of people will be happy to call or at least reply to some questions via email.

Lessons After a Couple Months of Trying to Do ML Research

RowanWang 22 Mar 2022 23:45 UTC

72 points

8 comments6 min readLW link

RowanWang

Introspection Adapters: Training LLMs to Report Their Learned Behaviors

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

Building and evaluating alignment auditing agents

Modifying LLM Beliefs with Synthetic Document Finetuning

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

Gears-Level Mental Models of Transformer Interpretability

Lessons After a Couple Months of Trying to Do ML Research
