Jacob Pfau comments on Unexploitable search: blocking malicious use of free parameters

Jacob Pfau 22 May 2025 11:40 UTC
LW: 1 AF: 1
0
AF
Thanks for the push, I previously didn’t click through to your post, and after doing so I realized you’re suggesting something different from what I’d assumed.

From a skim the immediate concerns with your Dagger-like RL setup is that you are bottlenecked past human capability level and you introduce a new need for online sampling from humans—as you mention in the post. For the AI R&D setting (AGI-level capabilities) I have in mind, these are not affordances I want to assume we have.

If, counterfactually, we went ahead with assuming cheap access to sufficiently capable humans, then I could imagine being convinced the linked method is preferrable. Two points that seem relevant for your method: (1) Sample efficiency of your method w.r.t. the human demonstrations. (2) Time complexity of training away malign initialization (e.g. the first solution found imports in the first chunk an insecure package).