RowanWang

Karma: 375

https://rowankwang.com/

Introspection Adapters: Training LLMs to Report Their Learned Behaviors

keshavs, RowanWang, abhayesian, Sam Marks and SoerenMind

28 Apr 2026 19:02 UTC

41 points

1 comment · 12 min read · LW link

(alignment.anthropic.com)

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

RowanWang, Sam Marks, Johannes Treutlein, evhub and Fabien Roger

25 Nov 2025 19:33 UTC

41 points

0 comments · 4 min read · LW link

(alignment.anthropic.com)

Building and evaluating alignment auditing agents

Sam Marks, trentbrick, RowanWang, Sam Bowman, Euan Ong, Johannes Treutlein and evhub

24 Jul 2025 19:22 UTC

47 points

1 comment · 5 min read · LW link

Modifying LLM Beliefs with Synthetic Document Finetuning

RowanWang, Johannes Treutlein, Avery, Ethan Perez, Fabien Roger and Sam Marks

24 Apr 2025 21:15 UTC

77 points

12 comments · 2 min read · LW link

(alignment.anthropic.com)

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

RowanWang, Alexandre Variengien, Arthur Conmy, Buck and jsteinhardt

28 Oct 2022 23:55 UTC

101 points

9 comments · 9 min read · LW link · 2 reviews

(arxiv.org)

Gears-Level Mental Models of Transformer Interpretability

RowanWang

29 Mar 2022 20:09 UTC

76 points

4 comments · 6 min read · LW link

Lessons After a Couple Months of Trying to Do ML Research

RowanWang

22 Mar 2022 23:45 UTC

71 points

8 comments · 6 min read · LW link