My favorite approach for the exploitable search problem is to try to minimize KL relative to human imitation. More precisely, consider any token prefix partway through a trajectory from an AI. We’ll then have a human continue this trajectory (as effectively and safely as they can) and then KL-regularize relative to this. To efficiently compute the KL penalty, we’ll train an AI to imitate the human’s continuation behavior. We discuss all of this in more detail here; see “Appendix: KL divergence prevents some low-stakes failures” as well as the overall post.
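As a minimal sketch of what the resulting penalty term might look like during RL (assuming we already have logits from the policy and from the trained imitator on the same prefix; the function name and penalty weight are made up for illustration, not from the linked post):

```python
import torch
import torch.nn.functional as F

def kl_to_imitator_penalty(policy_logits: torch.Tensor,
                           imitator_logits: torch.Tensor,
                           beta: float = 0.1) -> torch.Tensor:
    """Return beta * KL(policy || imitator), summed over continuation tokens.

    Both inputs are assumed to be [seq_len, vocab] logits over the same prefix:
    the RL policy's logits and the logits of a model trained to imitate human
    continuations of that prefix.
    """
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    imitator_logprobs = F.log_softmax(imitator_logits, dim=-1)
    # KL(p || q) = sum_x p(x) * (log p(x) - log q(x)), per token position.
    kl_per_token = (policy_logprobs.exp() * (policy_logprobs - imitator_logprobs)).sum(dim=-1)
    return beta * kl_per_token.sum()
```

During training this penalty would be subtracted from the task reward, so trajectories that drift far from plausible human continuations are discouraged.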
This seems better than (many) alternatives as it actively pushes against actions which are highly anomalous from a human perspective, which is probably correlated with exploitation in many cases. I’m skeptical that the sort of proposal described in this post outperforms KL regularization relative to a faithful human continuation prior.
You could penalize the model relative to a weighted combination of different heuristics like this KL penalty, and this could improve things further if the regularization both forces a small number of solutions and is correlated with avoiding bad outcomes.
So, a generalization would be “incentivize the model to perform well on metrics which we think are correlated with non-malign outputs and which also force a narrower range of solutions when maximized”. I currently think this is more promising than the sort of approach discussed in the post.
See also Fabien’s early empirical experiments on this topic.
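A rough sketch of the weighted-combination idea above (the heuristic names and weights here are hypothetical, not from either post):

```python
from typing import Callable, Sequence

Trajectory = list[str]  # e.g. a token or action sequence

def combined_penalty(trajectory: Trajectory,
                     heuristics: Sequence[tuple[Callable[[Trajectory], float], float]]) -> float:
    """Weighted sum of heuristic penalties; higher means more anomalous or less constrained."""
    return sum(weight * heuristic(trajectory) for heuristic, weight in heuristics)

# Hypothetical usage, subtracting the combined penalty from the task reward:
# penalties = [(kl_to_human_imitator, 1.0), (distance_from_reference_solution, 0.3)]
# reward = task_reward(trajectory) - combined_penalty(trajectory, penalties)
```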
I think KL to an imitator, or to a base model, would be good to enforce as well! It is an action-space metric, though, and so shares the problems we raise about action-space entropy penalties. (Edit: Ryan’s proposal is different from what I had assumed.) It would definitely be valuable to see an empirical head-to-head here regardless.
I don’t agree that it shares these problems any more than the proposal you discuss does:
However, in practice, such methods run into two obstacles. First, we expect solving for optimal max-entropy policies is intractable in most cases, so properties of the optima are not sufficient to argue for safety. Second, away from optima, entropy can be increased without improving safety – e.g. in the case of token-space entropy, by increasing entropy over variable-name choice.
I don’t love KL to a base model because we might be worried that the base model is scheming, or, if it isn’t likely to be scheming, that it is much less capable than humans.
Thanks for the push; I hadn’t previously clicked through to your post, and after doing so I realized you’re suggesting something different from what I’d assumed.
From a skim, the immediate concerns with your DAgger-like RL setup are that you are bottlenecked at the human capability level and that you introduce a new need for online sampling from humans, as you mention in the post. For the AI R&D setting (AGI-level capabilities) I have in mind, these are not affordances I want to assume we have.
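To make the online-sampling concern concrete, here is a generic DAgger-style loop (not necessarily the linked post’s exact setup; `rollout`, `human_label`, and `train_imitator` are hypothetical stand-ins): each iteration needs fresh human labels on the states the current policy actually visits.

```python
def dagger_style_training(policy, environment, rollout, human_label, train_imitator,
                          num_iterations: int):
    """Generic DAgger-style loop: imitate the human on states the learner visits."""
    dataset = []
    for _ in range(num_iterations):
        # Roll out the *current* policy to collect the states it actually reaches.
        states = rollout(policy, environment)
        # Fresh human labels are required on those states every iteration; this is
        # the online human-sampling requirement mentioned above.
        dataset.extend((state, human_label(state)) for state in states)
        # Retrain the policy to imitate the human on the aggregated dataset.
        policy = train_imitator(dataset)
    return policy
```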
If, counterfactually, we went ahead and assumed cheap access to sufficiently capable humans, then I could imagine being convinced the linked method is preferable. Two points that seem relevant for your method: (1) the sample efficiency of your method with respect to the human demonstrations, and (2) the time complexity of training away a malign initialization (e.g., the first solution found imports an insecure package in its first chunk).