Priors against Scenario 2. Another possibility is that, given only the information in Scenario 1, people had strong priors against the story in Scenario 2. In that case they could say "99% likely that it is outer misalignment" for Scenario 1 (which gets rounded to "outer misalignment") while still saying "inner misalignment" for Scenario 2.
I would guess this is not what's going on. Given the information in Scenario 1, I'd expect most people to find Scenario 2 reasonably likely (i.e. they don't have strong priors against it).
FWIW, this was basically my thinking on the two scenarios. Not 99% likelihood, but Scenario 1 does strike me as ambiguous, yet much more likely to be an outer misalignment problem (in the root-cause sense).
Yeah, this makes sense given that you think of outer misalignment as failures of [reward function + training distribution] and inner misalignment as failures of optimization.
I’d be pretty surprised though if more than one person in my survey had that view.
A while ago you wanted a few posts on outer/inner alignment distilled. In your view, is this post a clear explanation of the same concept?
I don’t think this post is aimed at the same concept(s).