I think KL to an imitator, or base model, would be good to enforce as well! It is an action-space metric, though, and so shares the problems we raise about action-space entropy penalties. (Edit: Ryan’s proposal is different from what I had assumed.) It would definitely be valuable to see an empirical head-to-head here regardless.
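To make the comparison concrete, here is a minimal sketch (illustrative names and coefficient, not anything from either post) of what a per-token KL-to-base-model penalty could look like; the point is that, like an entropy bonus, it is computed purely in the action (token) space:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: penalize per-token KL(policy || base) on sampled rollouts,
# analogous to the KL term in standard RLHF-style objectives.
# `policy_logits` and `base_logits` are assumed to be [batch, seq, vocab] tensors
# from the trained policy and a frozen imitator/base model on the same tokens.
def kl_to_base_penalty(policy_logits: torch.Tensor,
                       base_logits: torch.Tensor,
                       beta: float = 0.1) -> torch.Tensor:
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    base_logp = F.log_softmax(base_logits, dim=-1)
    # Full KL over the vocabulary at each position, averaged over tokens and batch.
    kl = (policy_logp.exp() * (policy_logp - base_logp)).sum(dim=-1).mean()
    return beta * kl

# Illustrative usage: total_loss = rl_loss + kl_to_base_penalty(policy_logits, base_logits)
```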
I don’t agree it shares these problems any more than the proposal you discuss shares these problems:
However, in practice, such methods run into two obstacles. First, we expect solving for optimal max-entropy policies is intractable in most cases, so properties of the optima are not sufficient to argue for safety. Second, away from optima, entropy can be increased without improving safety – e.g. in the case of token-space entropy, by increasing entropy over variable-name choice.
I don’t love KL to a base model because we might be worried the base model is scheming, or, if it is not likely to be scheming, it might be much less capable than humans.
Thanks for the push. I previously didn’t click through to your post, and after doing so I realized you’re suggesting something different from what I’d assumed.
From a skim, the immediate concerns with your DAgger-like RL setup are that you are bottlenecked at human capability level and that you introduce a new need for online sampling from humans, as you mention in the post. For the AI R&D setting (AGI-level capabilities) I have in mind, these are not affordances I want to assume we have.
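For reference, a minimal sketch of the standard DAgger-style loop (Ross et al., 2011), with hypothetical helper names, to make the online-sampling point concrete; `query_human` stands in for whatever online demonstration channel the method assumes:

```python
# Minimal DAgger-style sketch with hypothetical helpers (train, rollout, query_human).
# Key point for the discussion above: `query_human` is called on states visited by
# the *current* policy, so the method needs online access to sufficiently capable
# humans throughout training, not just a fixed upfront demonstration set.
def dagger(initial_demos, train, rollout, query_human, n_iters=10):
    dataset = list(initial_demos)                # (state, human_action) pairs
    policy = train(dataset)                      # fit initial policy on demos
    for _ in range(n_iters):
        states = rollout(policy)                 # states visited by current policy
        for s in states:
            dataset.append((s, query_human(s)))  # online human labels
        policy = train(dataset)                  # retrain on aggregated dataset
    return policy
```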
If, counterfactually, we went ahead with assuming cheap access to sufficiently capable humans, then I could imagine being convinced the linked method is preferable. Two points that seem relevant for your method: (1) the sample efficiency of your method w.r.t. the human demonstrations; (2) the time complexity of training away a malign initialization (e.g. the first solution found imports an insecure package in its first chunk).