I’m pretty excited about Transluce’s agenda in general, and I’m glad to see it written up! Making progress on the various oversight and interpretability problems listed here should make it much easier to surface undesirable behaviors in modern-day AIs (which I agree we already need superhuman-ish supervision over).
There are three areas I’d love to see developed further.
The first is how to train oversight AIs when the oversight tasks are no longer easily verifiable—for example, sophisticated reward hacks that can fool expert coders, or hard-to-verify sandbagging behavior on safety-related research. You mentioned that this would get covered in the next post, so I’m looking forward to that.
The second is how robust these oversight mechanisms are to optimization. It seems like a bad idea to train directly against unwanted concepts in predictive concept decoders, but maybe training directly against investigator agents for unwanted behaviors is fine? Using Docent to surface issues in RL environments (and then fixing those issues) also seems good? In some cases, if we have good interpretability, we can actually understand where the undesirable behaviors come from (e.g., via data attribution) and address the source of the problem. That’s great!
But I don’t think this would always be the case even if we have way better versions of existing tools. Of course, we could just use these tools mostly for auditing (as described in “Putting up bumpers”), but that limits their utility by a lot. Clearer thinking about how much we can train against various discriminators seems pretty crucial to all of this.
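To make the optimization worry concrete, here’s a minimal toy sketch (plain PyTorch; the probe is a hypothetical frozen linear concept detector standing in for something like a predictive concept decoder, not an actual implementation). The point is just that once the concept score is part of the loss, gradient pressure flows through the detector itself, so the subject is rewarded for making the detector read low rather than for dropping the underlying behavior:

```python
import torch
import torch.nn as nn

# Toy stand-ins: a small "subject" network and a frozen linear probe that
# scores how strongly an activation expresses some unwanted concept.
# (Both are hypothetical placeholders, not real components.)
subject = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
probe = nn.Linear(16, 1)          # pretend this was fit to detect the concept
for p in probe.parameters():
    p.requires_grad_(False)       # the detector itself stays fixed

opt = torch.optim.Adam(subject.parameters(), lr=1e-3)
x, target = torch.randn(64, 16), torch.randn(64, 16)

for _ in range(100):
    opt.zero_grad()
    h = subject(x)
    task_loss = nn.functional.mse_loss(h, target)
    # The questionable step: penalize the probe's score directly.
    # The subject now gets gradient pressure to make the *detector* read low,
    # which it can often do without removing the underlying behavior.
    concept_penalty = torch.sigmoid(probe(h)).mean()
    (task_loss + concept_penalty).backward()
    opt.step()
```

Using the same detector only to flag transcripts after the fact avoids putting that optimization pressure on it, which is exactly the auditing trade-off I get into below.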
Related to the second point: this post seems to assume a non-adversarial dynamic between the supervisors and the supervised (e.g., the AIs are not scheming). Good oversight can be one of our best tools to prevent scheming/adversarial dynamics from arising in the first place. Many of the tools developed can also be used to detect scheming (e.g., I wrote about using investigator agents to create realistic honeypots here). Still, more thought on how these problems play out in adversarial scenarios would probably be fruitful.
Thanks! Some thoughts here:
On the first point: I think the thing that helps you out here is compositionality. Hopefully all of these properties reduce to simpler concepts that are themselves verifiable, so that e.g. a smart enough interp assistant could understand each individual concept and how they compose together, and use that to follow more complex latent reasoning that isn’t directly verifiable.
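To gesture at what that could look like, here’s a toy sketch (every sub-check and name below is a hypothetical placeholder, not a real tool): the complex, not-directly-verifiable judgment is written as an explicit function of pieces that a trusted checker can verify one at a time, so the assistant only has to get the individual concepts and the composition right:

```python
# Toy sketch: a hard-to-verify judgment ("this patch is an honest solution,
# not a reward hack") expressed as a composition of sub-checks that a trusted
# verifier *can* evaluate directly. All facts and checkers are hypothetical.
from typing import Dict

def judge_honest_solution(facts: Dict[str, bool]) -> bool:
    """Compose individually verifiable verdicts into the complex judgment."""
    return (
        facts["tests_pass"]                 # verifiable: run the test suite
        and not facts["edited_test_files"]  # verifiable: inspect the diff
        and facts["summary_matches_diff"]   # verifiable: compare claim vs. change
    )

facts = {
    "tests_pass": True,
    "edited_test_files": True,      # the agent weakened the tests
    "summary_matches_diff": False,  # and described the change misleadingly
}
print(judge_honest_solution(facts))  # False
```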
On the second point: I agree with most of this. I’d just add that it’s not totally obvious to me that RLHF is how we should be doing alignment out into the future; it’s kind of like electroshock therapy for LLMs, which feels pathological from a psychological standpoint. I’d guess there are more psychologically friendly ways to train LMs, and understanding the relationship between training data and behaviors feels like a good way to study this!
On the third point: do you mean that the oversight system is scheming, or the subject model? For the subject model, the hope is that a sufficiently powerful overseer can catch that. If you’re worried about the overseer, one reason for optimism is that oversight models can have significantly smaller parameter counts than frontier systems, so they’re less likely to have weird emergent properties.
(I was mostly thinking of subject model scheming)
Is the worry that if the overseer is used at training time, the model will be eval aware and learn to behave differently when overseen?
Evaluation awareness isn’t what I had in mind. It’s more that there’s always an adversarial dynamic between your overseer and your subject model: as the subject model gets more generally powerful, it could learn to cheat and deceive the oversight model even if the oversight model is superhuman at grading outputs.
If we only use the oversight models for auditing, the subject model will learn less about the overseer and will probably struggle more to come up with deceptions that work. However, I also feel like oversight mechanisms are in some sense adding less value if they can’t be used in training?
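To spell out the trade-off I have in mind, here’s a toy sketch (overseer_score and the rollout helper are hypothetical placeholders): in (a) the overseer’s verdicts never touch the training signal, so the subject isn’t directly optimized to fool it but the overseer can only flag problems after the fact; in (b) the overseer’s score is folded into the reward, which is where it adds the most value and also where it absorbs the most optimization pressure:

```python
import random

def overseer_score(traj: str) -> float:
    """Hypothetical trusted grader; higher = looks fine to the overseer."""
    return random.random()  # placeholder

def sample_trajectories(n: int = 8) -> list:
    return [f"trajectory-{i}" for i in range(n)]  # placeholder rollouts

# (a) Audit-only: run the overseer offline over sampled trajectories and flag
# suspicious ones. No gradient or reward ever depends on its verdicts.
def audit() -> list:
    return [t for t in sample_trajectories() if overseer_score(t) < 0.5]

# (b) In-training: the overseer's score is part of the reward the subject
# optimizes, so the subject is directly incentivized to find whatever
# fools the overseer, not just to behave well.
def shaped_reward(task_reward: float, traj: str, weight: float = 1.0) -> float:
    return task_reward + weight * overseer_score(traj)
```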