I thought inner optimizers are supposed to be handled under “learning with catastrophe” / “optimizing for worst case”. In particular, inner optimizers would cause “malign” failures, which would constitute a catastrophe that techniques for learning with catastrophe / optimizing for worst case (such as adversarial training, verification, transparency) would detect and train the agent out of.
Yes. Inner optimizers should result either in low performance on the training distribution (in which case we have a hope of training them out, though we may get stuck in a local optimum), or in manifestly unacceptable behavior on some possible inputs.
Is “informed oversight” just another name for that problem, or a particular approach to solving it?
Informed oversight is being able to figure out everything your agent knows about how good a proposed action is. This seems like a prerequisite both for RL training (if you want a reward function that incentivizes the correct behavior) and for adversarial training to avoid unacceptable behavior.
If the latter, how is it different from “transparency”?
People discuss a bunch of techniques under the heading of transparency/interpretability, and have a bunch of goals.
In the context of this sequence, transparency is relevant for two goals:
Knowing what the agent knows, in order to evaluate its behavior.
Figuring out under what conditions the agent would behave differently, to facilitate adversarial training.
For both of those problems, it’s not obvious the solution will look anything like what is normally called transparency (or what people in that field would recognize as transparency). And even if it will look like transparency, it seems worth distinguishing different goals of that research.
So that’s why there is a different name.
I thought those ideas would be enough to solve the more recent motivating example for “informed oversight” that Paul gave (training an agent to defend against network attacks).
(I disagreed with this upthread. I don’t think “convince the overseer that an action is good” obviously incentivizes the right behavior, even if you are allowed to offer an explanation—certainly we don’t have any particular argument that it would incentivize the right behavior. It seems like informed oversight roughly captures what is needed in order for RL to create the right incentives.)