Separately, my guess is that all the action lives in how much wiggle room there is in suboptimality. So, you'd want to argue that you converge to the right thing relatively quickly. This likely requires putting some structure on the problem via a view on which types of outputs might be dangerous; that in turn naturally suggests something like regularizing toward human imitation, or toward some other prior we think is benign, since this more directly pushes away from bad outputs.
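To make the regularization idea concrete, here is a minimal sketch of one standard way to implement it: a reward objective with a KL penalty pulling the tuned policy back toward a frozen reference policy (e.g., an imitation-trained model treated as the benign prior). This assumes a token-level logit interface; the function name `kl_regularized_loss` and the coefficient `beta` are illustrative, not from the original text.

```python
import torch
import torch.nn.functional as F

def kl_regularized_loss(policy_logits: torch.Tensor,
                        ref_logits: torch.Tensor,
                        reward: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Reward objective with a KL penalty toward a frozen reference policy.

    The KL term pushes the tuned policy back toward the (presumed benign)
    prior, so outputs the prior assigns low probability are penalized even
    when they score well on the raw reward.
    """
    log_p = F.log_softmax(policy_logits, dim=-1)  # tuned policy
    log_q = F.log_softmax(ref_logits, dim=-1)     # frozen reference / prior
    # KL(policy || prior), summed over the vocabulary dimension.
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)
    # Maximize (reward - beta * KL); return the negation as a loss to minimize.
    return -(reward - beta * kl).mean()
```

The design choice here is the size of `beta`: larger values shrink the "wiggle room" by keeping the policy close to the prior, at the cost of also limiting how far it can improve on the prior.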