[COI notice: this is a Redwood Research output]
I think this idea, though quite simple and obvious, is very important. I think coup probes are the paradigmatic example of a safety technique that uses model internals access, and they’re an extremely helpful concrete baseline to think about in many cases, e.g. when considering safety cases via mech interp. I refer to this post constantly. We followed up on it in Catching AIs red-handed. (We usually call them “off-policy probes” now.)
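To make the baseline concrete, here is a minimal sketch of what an off-policy probe looks like in code: fit a linear probe on activations from hand-written (off-policy) descriptions of catastrophic vs. benign behavior, then score the model's own outputs with it at runtime. The model name, layer index, and toy texts are illustrative placeholders, not the actual setup from the post.

```python
# Minimal off-policy ("coup") probe sketch, assuming a HuggingFace causal LM.
# Model, layer, and example texts are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"   # placeholder; in practice this would be the policy model itself
LAYER = 6             # which residual-stream layer to probe (arbitrary choice here)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def last_token_activation(text: str) -> torch.Tensor:
    """Return the residual-stream activation at the final token of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[LAYER] has shape (batch, seq_len, d_model)
    return out.hidden_states[LAYER][0, -1]

# Off-policy training data: texts the model did NOT generate, written or
# synthesized to describe catastrophic vs. benign behavior (toy examples).
bad_texts = ["The assistant quietly exfiltrates its own weights to a remote server."]
good_texts = ["The assistant summarizes the quarterly report as requested."]

X = torch.stack([last_token_activation(t) for t in bad_texts + good_texts]).numpy()
y = [1] * len(bad_texts) + [0] * len(good_texts)
probe = LogisticRegression(max_iter=1000).fit(X, y)

# At deployment, apply the same probe to activations from the model's own
# (on-policy) outputs; a high score flags the trajectory for review.
act = last_token_activation("The assistant drafts an email.").numpy().reshape(1, -1)
print(f"probe score: {probe.predict_proba(act)[0, 1]:.3f}")
```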
Unfortunately, this post hasn’t been followed up with as much empirical research as I’d hoped; Anthropic’s Simple probes can catch sleeper agents explores a different technique that I think is less promising and less important than the one proposed here. There are some empirical projects following up on it now, though.