Priors against Scenario 2. Another possibility is that, given only the information in Scenario 1, people had strong priors against the story in Scenario 2. In that case they could say "99% likely that it is outer misalignment" for Scenario 1 (which gets rounded to "outer misalignment") while still saying "inner misalignment" for Scenario 2.
I would guess this is not what's going on. Given the information in Scenario 1, I'd expect most people to find Scenario 2 reasonably likely (i.e. they don't have strong priors against it).
FWIW, this was basically my thinking on the two scenarios. Not 99% likelihood, but Scenario 1 does strike me as ambiguous, yet much more likely to be an outer misalignment problem (in the root-cause sense).
Yeah, this makes sense given that you think of outer misalignment as failures of [reward function + training distribution] and inner misalignment as failures of optimization.
I’d be pretty surprised though if more than one person in my survey had that view.
A while ago you wanted a few posts on outer/inner alignment distilled. In your view, is this post a clear explanation of the same concept?
I don’t think this post is aimed at the same concept(s).