abramdemski comments on A misalignment taxonomy

abramdemski 21 Jun 2026 22:08 UTC
3 points
0
I think “gradient misalignment” here is definitely a type of outer misalignment, the way I think about things.
Another sort of inner misalignment (which isn’t obvious from your typology, but which might fit somewhere in your typology already according to you, or perhaps multiple places) is optimization failure: perhaps the training examples were good and the reward/loss was specified well (outer-aligned), but for whatever reason, the optimization of the weights (e.g., gradient descent) introduces a strong misaligned bias.
- Alec Harris 22 Jun 2026 1:25 UTC
  2 points
  0
  > I think “gradient misalignment” here is definitely a type of outer misalignment, the way I think about things.
  Yes, that’s fair; I think it’s non-obvious where it falls. I chose to categorize it under inner misalignment because it constitutes a reason the AI might be misaligned with the objective function.
  
  I figure if you are going to call gradient misalignment a specification error, you could just as well call any inner misalignment a failure to specify the right training regimen that gives you what you want. More generally, any failure to generalize can be construed as a failure to specify. So it seems more meaningful to me to define the outer/inner dichotomy as “objective misaligned with intention” vs. “AI misaligned with objective” as opposed to “failure to specify” vs. “failure to generalize.”
  > Another sort of inner misalignment (which isn’t obvious from your typology, but which might fit somewhere in your typology already according to you, or perhaps multiple places) is optimization failure: perhaps the training examples were good and the reward/loss was specified well (outer-aligned), but for whatever reason, the optimization of the weights (eg, gradient descent) introduces a strong misaligned bias.
  Hmm, yeah, I think I would call this “perfect-correlate misalignment”: somehow the inductive biases of training are such that it converges toward misaligned goals. This is somewhat unclear from my title and description. I guess I think, in practice, the result of this misaligned bias is likely to be based around some kind of correlate and so this is a convenient way to think about it.
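
A toy illustration of the kind of thing being discussed here, as a minimal sketch and not anyone's formal definition: the data and loss below are both "correct" (the target is exactly the intended feature), yet plain gradient descent from zero weights ends up relying mostly on a perfectly correlated, larger-scale feature. The two-feature linear model, the 10x scale factor, and all names (`train`, `predict`, `xs`, `ys`) are assumptions made up for this example.

```python
# Toy setup (assumed): the intended signal is feature 0; feature 1 is a
# perfect correlate of it during training, but on a 10x larger scale.
intended = (-1.0, -0.5, 0.5, 1.0)
xs = [(a, 10.0 * a) for a in intended]   # (intended feature, correlate)
ys = [a for a in intended]               # target equals the intended feature


def predict(w, x):
    """Linear model with two weights and no bias term."""
    return w[0] * x[0] + w[1] * x[1]


def train(xs, ys, lr=0.01, steps=200):
    """Full-batch gradient descent on mean squared error / 2, from zero init."""
    w = [0.0, 0.0]
    n = len(xs)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in zip(xs, ys):
            err = predict(w, x) - y
            grad[0] += err * x[0] / n
            grad[1] += err * x[1] / n
        w[0] -= lr * grad[0]
        w[1] -= lr * grad[1]
    return w


w = train(xs, ys)

# Training loss is essentially zero: the objective is satisfied on-distribution.
train_loss = sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Gradient descent from zero stays in the span of (1, 10), so it converges to
# the minimum-norm solution w = (1/101, 10/101): ten times more weight on the
# correlate than on the intended feature.
print("weights:", w)
print("train loss:", train_loss)

# Off-distribution, with the correlate switched off, the model predicts about
# 1/101 where the intended answer would be 1.0.
print("prediction with correlate removed:", predict(w, (1.0, 0.0)))
```

Nothing in the loss prefers the correlate; the preference comes entirely from the optimizer's implicit bias (here, toward minimum-norm weights), which is why it only shows up once the correlation breaks.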