I’m a research scientist at Anthropic doing empirical AI safety research on language models. In the past, I’ve worked on automated red teaming of language models [1], the Inverse Scaling Prize [2], learning from human feedback [3][4], and empirically testing debate [5][6], iterated amplification [7], and other methods [8] for scalably supervising AI systems as they become more capable than us.

Inverse Scaling Prize: Second Round Winners

24 Jan 2023 20:12 UTC
Discovering Language Model Behaviors with Model-Written Evaluations

20 Dec 2022 20:08 UTC
Inverse Scaling Prize: Round 1 Winners

26 Sep 2022 19:57 UTC
