Yeah, this makes sense given that you think of outer misalignment as failures of [reward function + training distribution] and inner misalignment as failures of optimization.
I’d be pretty surprised though if more than one person in my survey had that view.