plex comments on Categorizing failures as “outer” or “inner” misalignment is often confused

plex 6 Jan 2023 19:09 UTC
4 points
I classified the first as Outer misalignment, and the second as Deceptive outer misalignment, before reading on.
I agree that the following is the worthwhile use of the term inner alignment, as opposed to the ones you argue against:
Another use of the terms “outer” and “inner” is to describe the situation in which an “outer” optimizer like gradient descent is used to find a learned model that is itself performing optimization (the “inner” optimizer). This usage seems fine to me.
I could imagine that the term is being blurred and used in less helpful ways by many people. But I’d be wary of discouraging the inner vs outer alignment ontology too hard, as the internal optimizer failure mode feels like a key one, and it is worth keeping as a clear category within goal misgeneralization.
As for why so many researchers were tripped up, I imagine that the framing of the pop quiz made a big difference. The first case was obviously outer alignment and the question was inner vs outer, so the prior for a non-trick quiz was that the second case was inner. People don’t like calling out high-status people for springing trick questions on them without strong evidence (especially in front of a group of peers). There is also a vague story, one that doesn’t hold up if you look too closely, along the lines of internal models not being displayed, which could lead a person to brush over the error.