I’m a research lead at Anthropic doing safety research on language models. Some of my past work includes introducing automated red teaming of language models [1], showing the usefulness of AI safety via debate [2], demonstrating that chain-of-thought can be unfaithful [3], discovering sycophancy in language models [4], initiating the model organisms of misalignment agenda [5][6], and developing constitutional classifiers and showing they can be used to obtain very high levels of adversarial robustness to jailbreaks [7].
Website: https://ethanperez.net/
Thanks for the feedback. We agree with some of these points, and we’re working on an update to the post/paper. Reading this post and the comments, I think one place where we’ve caused a lot of confusion is in referring to the phenomenon we’re studying as “coherence” rather than something more specific; I’d more precisely call it something like “cross-sample error-consistency”. There are other notions of coherence which we didn’t study here and which are relevant to alignment, as others have pointed out, e.g. “how few errors models make” and “in-context error consistency” (whether the errors a model makes within a single transcript are highly correlated). Using the term “coherence” was confusing because it carries these other connotations as well, and we’re planning to revise the blog post and paper to (among other things) make it clearer that we’re studying just this one aspect of coherence.
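To make the distinction a bit more concrete, here’s a rough sketch (in Python, and not the exact metric from our paper) of how one might operationalize “cross-sample error-consistency”, as opposed to the in-context notion; the function names and data layout here are just illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch (not the paper's metric): one way to operationalize
# "cross-sample error-consistency". Given several independent samples of a
# model's answer to the same question, each scored 1 (correct) / 0 (incorrect),
# measure how often two randomly chosen samples agree with each other.

def cross_sample_error_consistency(scores: np.ndarray) -> float:
    """scores: shape (num_questions, num_samples), entries in {0, 1}.
    Returns expected pairwise agreement between two independent samples,
    averaged over questions (1.0 = the model is always right or always
    wrong on each question; 0.5 = errors are maximally inconsistent)."""
    p = scores.mean(axis=1)            # per-question accuracy
    agreement = p**2 + (1 - p)**2      # prob. two independent samples agree
    return float(agreement.mean())

# By contrast, "in-context error consistency" would ask whether the errors
# within a single transcript (e.g., across steps of one chain of thought)
# are correlated, which this function does not capture at all.
```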
I also agree that some of the writing overstates the implications for alignment (especially for superintelligence), and we’d like to update the blog post in particular to address this as well.