Sounds like the argument for quantilizers. The standard issues with quantilizers still apply here; for example, a series of actions that are each individually sampled from a human-like distribution will often add up to a long-term policy that is off-distribution. But if those problems could be surmounted, I agree that would be really good.
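For concreteness, here’s a minimal sketch of the quantilizer idea (the function names and the Monte Carlo approach are mine, purely illustrative): draw candidate actions from a human-imitative base distribution, then choose uniformly at random among the top q fraction as ranked by estimated utility.

```python
import random

def quantilize(base_sample, utility, q=0.1, n=1000):
    """Monte Carlo q-quantilizer sketch: sample candidate actions from the
    base (human-imitative) distribution, rank them by estimated utility,
    and pick uniformly at random from the top q fraction."""
    candidates = [base_sample() for _ in range(n)]
    candidates.sort(key=utility, reverse=True)
    top = candidates[: max(1, int(q * n))]
    return random.choice(top)

# The failure mode above: each single call stays close to the base
# distribution, but chaining many calls can compound into a trajectory
# the base distribution would essentially never produce.
```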
As to ODTs (online decision transformers), I’m not super optimistic, but I’m also not very expert. From a little thought, it seems like there are two kinds of benefit to ODT finetuning. One is a sort of “lowering expectations”: the system only attempts the behaviors it has actually learned to do correctly, even if humans do more difficult things to get higher reward. The other is a random search through policies (local in the NN representation of the policy) that might make gradual improvements; a toy sketch of what I mean is below. I’m not confident in the safety properties of that second thing, for reasons similar to Steve Byrnes’ here.
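To gesture at what I mean by that second benefit, here’s a toy caricature (entirely my own construction, not ODT’s actual training loop): random perturbations that are local in the policy’s parameter space, keeping whichever ones score better.

```python
import numpy as np

def local_random_search(theta, evaluate, sigma=0.01, steps=200, rng=None):
    """Caricature of the second benefit: hill-climbing by random
    perturbations local in the NN parameterization of the policy.
    Not how ODT training actually works; just the shape of the
    search process being gestured at."""
    rng = rng or np.random.default_rng()
    best_score = evaluate(theta)
    for _ in range(steps):
        candidate = theta + sigma * rng.standard_normal(theta.shape)
        score = evaluate(candidate)
        if score > best_score:  # keep perturbations that improve return
            theta, best_score = candidate, score
    return theta
```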