I’m a research lead at Anthropic working on language model safety. My past work includes introducing automated red teaming of language models [1], demonstrating the usefulness of AI safety via debate [2], showing that chain-of-thought reasoning can be unfaithful [3], discovering sycophancy in language models [4], initiating the model organisms of misalignment agenda [5][6], and developing constitutional classifiers, which achieve very high adversarial robustness to jailbreaks [7].
Website: https://ethanperez.net/