I think KL to an imitator, or base model, would be good to enforce as well! It is an action-space metric, though, and so shares the problems we raise about action-space entropy penalties. (Edit: Ryan’s proposal is different from what I had assumed.) It would definitely be valuable to see an empirical head-to-head here regardless.
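To make the comparison concrete, here is a minimal sketch (illustrative names and coefficient, not anything from either post) of what a per-token KL-to-base-model penalty could look like; the point is that, like an entropy bonus, it is computed purely in the action (token) space:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: penalize per-token KL(policy || base) on sampled rollouts,
# analogous to the KL term in standard RLHF-style objectives.
# `policy_logits` and `base_logits` are assumed to be [batch, seq, vocab] tensors
# from the trained policy and a frozen imitator/base model on the same tokens.
def kl_to_base_penalty(policy_logits: torch.Tensor,
                       base_logits: torch.Tensor,
                       beta: float = 0.1) -> torch.Tensor:
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    base_logp = F.log_softmax(base_logits, dim=-1)
    # Full KL over the vocabulary at each position, averaged over tokens and batch.
    kl = (policy_logp.exp() * (policy_logp - base_logp)).sum(dim=-1).mean()
    return beta * kl

# Illustrative usage: total_loss = rl_loss + kl_to_base_penalty(policy_logits, base_logits)
```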
I don’t agree it shares these problems any more than the proposal you discuss shares these problems:
However, in practice, such methods run into two obstacles. First, we expect solving for optimal max-entropy policies is intractable in most cases, so properties of the optima are not sufficient to argue for safety. Second, away from optima, entropy can be increased without improving safety – e.g. in the case of token-space entropy, by increasing entropy over variable-name choice.
I don’t love KL to a base model because we might be worried the base model is scheming, or, if it is not likely to be scheming, it might be much less capable than humans.
Thanks for the push. I previously didn’t click through to your post, and after doing so I realized you’re suggesting something different from what I’d assumed.
From a skim, the immediate concerns with your DAgger-like RL setup are that you are bottlenecked at human capability level and that you introduce a new need for online sampling from humans, as you mention in the post. For the AI R&D setting (AGI-level capabilities) I have in mind, these are not affordances I want to assume we have.
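For reference, a minimal sketch of the standard DAgger-style loop (Ross et al., 2011), with hypothetical helper names, to make the online-sampling point concrete; `query_human` stands in for whatever online demonstration channel the method assumes:

```python
# Minimal DAgger-style sketch with hypothetical helpers (train, rollout, query_human).
# Key point for the discussion above: `query_human` is called on states visited by
# the *current* policy, so the method needs online access to sufficiently capable
# humans throughout training, not just a fixed upfront demonstration set.
def dagger(initial_demos, train, rollout, query_human, n_iters=10):
    dataset = list(initial_demos)                # (state, human_action) pairs
    policy = train(dataset)                      # fit initial policy on demos
    for _ in range(n_iters):
        states = rollout(policy)                 # states visited by current policy
        for s in states:
            dataset.append((s, query_human(s)))  # online human labels
        policy = train(dataset)                  # retrain on aggregated dataset
    return policy
```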
If, counterfactually, we went ahead with assuming cheap access to sufficiently capable humans, then I could imagine being convinced the linked method is preferable. Two points that seem relevant for your method: (1) the sample efficiency of your method w.r.t. the human demonstrations; (2) the time complexity of training away a malign initialization (e.g. the first solution found imports an insecure package in its first chunk).