Maybe Simple probes can catch sleeper agents \ Anthropic could also be interesting, in the sense that the probe seems to generalize surprisingly well (I would really like to know if this generalizes to a model that was not trained to be harmful in the first place).
“Debugging mysterious behaviour”—Might be interesting, might help marginally to get better understanding, but this is not very central for me.
Thanks a lot for writing this, this is an important consideration, and it would be sweet if Anthropic updated accordingly.
Some remarks:
I’m still not convinced that Deceptive AI following scheming is the main risk compared to other risks (gradual disempowerment, concentration of power & value Lock in, a nice list of other risks from John).
“Should we give up on interpretability? No!”—I think this is at least a case for reducing the focus a bit, and more diversification of approaches
On the theories of impacts suggested:
“A Layer of Swiss Cheese”—why not! This can make sense in DeepMind’s plan, that was really good by the way.
“Enhancing Black-Box Evaluations”—I think a better theory is interp to complement AI Control techniques. Example: Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals).
Maybe Simple probes can catch sleeper agents \ Anthropic could also be interesting, in the sense that the probe seems to generalize surprisingly well (I would really like to know if this generalizes to a model that was not trained to be harmful in the first place).
“Debugging mysterious behaviour”—Might be interesting, might help marginally to get better understanding, but this is not very central for me.