Separately, my guess is that all the action lives in how much wiggle room there is in suboptimality. So, you'd want to argue that you converge to the right thing relatively quickly. This likely requires putting some structure on the problem via a view on which types of outputs might be dangerous; that in turn naturally suggests something like regularizing toward human imitation, or toward some other prior we think is benign, since this more directly pushes away from bad outputs.
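To make the regularization idea concrete, here is a minimal sketch of one standard way to implement it: a reward objective with a KL penalty pulling the tuned policy back toward a frozen reference policy (e.g., an imitation-trained model treated as the benign prior). This assumes a token-level logit interface; the function name `kl_regularized_loss` and the coefficient `beta` are illustrative, not from the original text.

```python
import torch
import torch.nn.functional as F

def kl_regularized_loss(policy_logits: torch.Tensor,
                        ref_logits: torch.Tensor,
                        reward: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Reward objective with a KL penalty toward a frozen reference policy.

    The KL term pushes the tuned policy back toward the (presumed benign)
    prior, so outputs the prior assigns low probability are penalized even
    when they score well on the raw reward.
    """
    log_p = F.log_softmax(policy_logits, dim=-1)  # tuned policy
    log_q = F.log_softmax(ref_logits, dim=-1)     # frozen reference / prior
    # KL(policy || prior), summed over the vocabulary dimension.
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)
    # Maximize (reward - beta * KL); return the negation as a loss to minimize.
    return -(reward - beta * kl).mean()
```

The design choice here is the size of `beta`: larger values shrink the "wiggle room" by keeping the policy close to the prior, at the cost of also limiting how far it can improve on the prior.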