Inner alignment: what are we pointing at?

Proof that a model is an optimizer says very little about the model. I do not know what a research group studying outer alignment is actually studying. Inner alignment seems to cover the entire problem at the limit. Whether an optimizer counts as a mesa-optimizer depends on your point of view. These terms seem to be a magnet for confusion and debate; I have to do background reading on someone just to understand what claim they’re making. These are all indicators that we are using the wrong terms.

What are we actually pointing at? What questions do we want answered?

  1. Do we care if a model is an optimizer? Is it important whether it creates plans through an explicit search process or through a clever collection of heuristics? A poor search algorithm cannot plan much, and sufficiently clever heuristics can take you to any goal (see the first sketch after this list). What’s the important metric?

  2. Sometimes a model has great capacity to shape its environment but little inclination to do so. How do we divide capacity from inclination in a way that closely corresponds to agents and models as we actually observe them? (One could say that capacity and inclination cannot be separated, but the right definitions would split them cleanly apart.)

  3. When we specify what we want a model to do in code, what is the central difficulty? Is there a common risk or error, one we can name, shared by giving examples and by giving loss/reward/value functions? (See the second sketch after this list.)

  4. Is there a clear, accepted term for when models do not maintain desired behavior under distribution shift?

  5. Should we distinguish between trained RL models that optimize and spontaneous agents that emerge in dynamical systems? One might expect the first to happen almost always and the second almost never. What’s the key difference?
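
To make question 1 concrete, here is a minimal sketch. The toy task and both agents are invented for illustration, not a claim about how real models work internally: an agent built around an explicit search over action sequences and an agent that is a single reflex heuristic produce identical behavior on a small navigation task, which is exactly why “is it an optimizer?” is hard to cash out behaviorally.

```python
# Two agents with identical behavior on a toy navigation task:
# one searches explicitly over plans, the other is a single hard-coded rule.
from itertools import product

ACTIONS = {"left": -1, "right": +1}


def search_agent(position: int, goal: int, horizon: int = 8) -> str:
    """Enumerates action sequences and returns the first step of a shortest
    plan that reaches the goal -- a (very small) explicit optimizer."""
    for length in range(1, horizon + 1):
        for plan in product(ACTIONS, repeat=length):
            pos = position
            for action in plan:
                pos += ACTIONS[action]
            if pos == goal:
                return plan[0]
    return "right"  # fallback if no plan is found within the horizon


def heuristic_agent(position: int, goal: int) -> str:
    """A single reflex rule: no search, no plan, no internal objective."""
    return "right" if goal > position else "left"


if __name__ == "__main__":
    # The two agents are behaviorally indistinguishable on this task,
    # which is the puzzle: what property are we actually trying to detect?
    for pos, goal in [(0, 3), (5, 2), (-1, -4)]:
        assert search_agent(pos, goal) == heuristic_agent(pos, goal)
    print("identical behavior from an explicit searcher and a reflex rule")
```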

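A second minimal sketch, this time for question 3. The task, the loss, and the degenerate “model” below are invented purely for illustration: the objective we actually write down (a loss over outputs) and the behavior we intended can come apart, and the written-down loss alone cannot tell the difference.

```python
# Toy specification gap: a proxy loss that the intended behavior minimizes,
# but that a degenerate non-solution minimizes just as well.

def inversion_loss(output: list[int]) -> int:
    """Proxy objective: count adjacent pairs that are out of order."""
    return sum(1 for a, b in zip(output, output[1:]) if a > b)


def intended_behavior(xs: list[int]) -> list[int]:
    """What we actually wanted: a sorted permutation of the input."""
    return sorted(xs)


def degenerate_model(xs: list[int]) -> list[int]:
    """Drives the written-down loss to zero without doing the task at all."""
    return []


if __name__ == "__main__":
    data = [3, 1, 2]
    print(inversion_loss(intended_behavior(data)))  # 0 -- the behavior we meant
    print(inversion_loss(degenerate_model(data)))   # also 0 -- the loss can't tell
```
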
I’ll post my answers to these questions in a couple of days, but I’m curious how other people slice it. Does “inner alignment failure” mean anything, or do we need to point more directly?