Thanks! Some thoughts here:
The first is how to train oversight AIs when the oversight tasks are no longer easily verifiable—for example, sophisticated reward hacks that can fool expert coders, or hard-to-verify sandbagging behavior on safety-related research. You mentioned that this would get covered in the next post, so I’m looking forward to that.
I think the thing that helps you out here is compositionality: hopefully all of these properties reduce to simpler concepts that are themselves verifiable, so that, e.g., a smart enough interp assistant could understand the individual concepts and how they compose together, and use that to understand more complex latent reasoning that isn’t directly verifiable.
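To make the composition idea a bit more concrete, here’s a toy sketch (everything in it is hypothetical and made up for illustration, not any real tool’s API): imagine we have simple concept probes that we *can* check individually on easy cases, and the hard-to-verify property is expressed as a combination of them.

```python
# Toy sketch: individually verifiable concept probes composed into a
# detector for a harder-to-verify property. All concept names are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict

import numpy as np

# A "probe" here is anything mapping hidden activations to a score in [0, 1].
Probe = Callable[[np.ndarray], float]


@dataclass
class CompositeDetector:
    probes: Dict[str, Probe]  # each probe is checkable on its own, on easy cases

    def score(self, acts: np.ndarray) -> Dict[str, float]:
        return {name: probe(acts) for name, probe in self.probes.items()}

    def flags_covert_reward_hacking(self, acts: np.ndarray) -> bool:
        s = self.score(acts)
        # The complex property is just a conjunction of simpler ones.
        return (
            s["modifying_test_code"] > 0.8
            and s["believes_task_unsolved"] > 0.8
            and s["intends_to_report_success"] > 0.8
        )
```

Obviously real compositions won’t be clean three-probe conjunctions, but the hope is that each piece stays individually checkable even when the whole isn’t.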
The second is how robust these oversight mechanisms are to optimization. It seems like a bad idea to train directly against unwanted concepts in predictive concept decoders, but maybe training directly against investigator agents for unwanted behaviors is fine? Using Docent to surface issues in RL environments (and then fixing those issues) also seems good? In some cases, if we have good interpretability, we can actually understand where the undesirable behaviors come from (e.g., via data attribution) and address the source of the problem. That’s great!
But I don’t think this will always be the case, even with much better versions of existing tools. Of course, we could just use these tools mostly for auditing (as described in “Putting up bumpers”), but that limits their utility quite a lot. Clearer thinking about how much we can safely train against various discriminators seems pretty crucial to all of this.
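To spell out the pattern I’m worried about, here’s a rough sketch of “training against a discriminator” (the function names and the oversight model are all hypothetical, just to make the optimization pressure concrete):

```python
# Rough sketch, assuming a hypothetical oversight model with a
# prob_unwanted_behavior() method; not any real training setup.
def shaped_reward(trajectory, task_reward_fn, oversight_model, penalty=5.0):
    """Task reward minus a penalty whenever the overseer flags the trajectory.

    The worry: gradient pressure now points toward whatever evades the
    overseer, not only toward genuinely avoiding the unwanted behavior.
    If the overseer is imperfect, we may end up training evasion.
    """
    r_task = task_reward_fn(trajectory)
    p_flag = oversight_model.prob_unwanted_behavior(trajectory)  # assumed in [0, 1]
    return r_task - penalty * p_flag
```

The auditing-only alternative just never puts `p_flag` into the reward, which is exactly why it’s more robust to this kind of Goodharting, and also why it buys you less.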
I agree with most of this. I’d just add that it’s not totally obvious to me that RLHF is the way we should be doing alignment out into the future: it’s kind of like electroshock therapy for LLMs, which feels pretty pathological from a psychological standpoint. I’d guess there are more psychologically friendly ways to train LMs, and understanding the relationship between training data and behaviors feels like a good way to study this!
Related to the second point: this post seems to assume a non-adversarial dynamic between the supervisors and the supervised (e.g., the AIs are not scheming). To be fair, good oversight can be one of our best tools for preventing scheming/adversarial dynamics from arising in the first place, and many of the tools developed here can also be used to detect scheming (e.g., I wrote about using investigator agents to create realistic honeypots here). Still, more thought on how these problems play out in adversarial scenarios would probably be fruitful.
Do you mean that the oversight system is scheming, or the subject model? For the subject model, the hope is that a sufficiently powerful overseer can catch that. If you’re worried about the overseer, one reason for optimism is that oversight models can have significantly smaller parameter counts than frontier systems, and so are less likely to have weird emergent properties.
Is the worry that if the overseer is used at training time, the model will be eval aware and learn to behave differently when overseen?