I like all of this post except for the conclusion. I think this comment shows that the definition of inner alignment requires an explicit optimizer. Your broader definition of inner misalignment is equivalent to generalization error or robustness, which we already have a name for.
I currently see inner alignment problems as a superset of generalisation error and robustness. Furthermore, an AI being a mesa-optimiser with a misaligned objective can also be thought of as a generalisation error, seeing as this means we haven’t tested the AI in scenarios where its mesa-objective behaves differently from the base objective. The conclusion is meant to emphasise the possibility of extending the concept of inner misalignment to AIs that we do not model as optimisers. I am open to the claim that this is not useful and that we should only use the term when we think of the AI as an optimiser, in which case the definition involving mesa-objectives is sufficient.
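As a rough illustration of the point about untested scenarios, here is a toy sketch. Both objective functions and the input lists are hypothetical, chosen only so that the two objectives agree everywhere training looked and disagree elsewhere.

```python
def base_objective(x: float) -> float:
    """Hypothetical base objective: larger x scores higher."""
    return x


def mesa_objective(x: float) -> float:
    """Hypothetical learned (mesa) objective: larger |x| scores higher."""
    return abs(x)


def disagreements(inputs):
    """Return the inputs on which the two objectives give different scores."""
    return [x for x in inputs if base_objective(x) != mesa_objective(x)]


# Assumption: training only ever sampled non-negative inputs.
train_inputs = [0.0, 1.0, 2.5, 4.0]
# Deployment also contains negative inputs that training never covered.
deploy_inputs = [-3.0, -0.5, 2.0]

print(disagreements(train_inputs))   # no disagreement is visible in training
print(disagreements(deploy_inputs))  # the objectives come apart on unseen inputs
```

In this toy case the same failure can be described either as "the wrong objective was learned" or as "the training data did not cover negative inputs".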
I think any inner alignment problem can be thought of as a kind of generalisation error (this wouldn’t have happened if we had more data), including misaligned mesa-optimisers. So yes, you are correct: in my model they are different ways of looking at the same problem (in hindsight, superset was the wrong word to use). Is your opinion that inner misalignment should only be used in cases when a mesa-optimiser can be shown to exist (which is the original definition and the one stated in the comment you linked)? I agree that would also make sense. I was starting with the assumption that “that which is not outer misalignment should be inner misalignment”, but I notice that Evan mentions problems that are neither (e.g. mis-generalisations when there are no mesa-optimisers). This way of defining things only works if you commit to seeing the AI as an optimiser, which is indeed a useful framing, but not the only one. However, based on your (and Evan’s) comments I do see how having inner alignment as a subset of things-that-are-not-outer-alignment also works.
Hmm yeah, I like your edit; it breaks down the two definitions well. I definitely have a preference for the second one: I prefer confusing terms like this to have super specific definitions rather than broad vague ones, because it helps me to think about whether a proposed solution is actually solving the problem being pointed to. I have, like you, seen people using inner (mis)alignment to refer to other things outside of the original strict definition, but as far as I know the comment I linked to is the one that clarifies the definition most recently? I haven’t checked this. If there are more recent discussions involving the people who coined the term, I would defer to them.
Regarding the crux that you mention in the edit:
whether or not deciding whether an AI is an optimizer, and finding its objective, is a well-defined procedure for powerful AIs
If you mean precisely mathematically well-defined, then I think this is too high a standard. I think it is sufficient that we be able to point toward archetypal examples of optimizing algorithms and say “stuff like that”.
I think the main reason I care about this distinction is that generalization error without learned optimizers doesn’t seem to be a huge problem, whereas “sufficiently powerful optimizing algorithms with imperfectly aligned goals” seems like a world-ending level of problem. Do you agree with this?
Firstly, yes, I agree that it makes a lot of sense to defer to Evan, who coined the term, and as far as we both can tell he meant the narrow definition. I had actually read that comment before and misremembered its content, so I was originally under the impression that Evan had revised the definition to be broader, but then realized this is not the case.
I am still skeptical that there is any clear difference between optimizer and non-optimizer AIs. Any AI that does a task well is in some sense optimizing for good performance on that task. This is what makes it hard for me to clearly see a case of generalization error that is not inner misalignment.
However, I can see how this can just be a framing thing where depending on how you look at the problem it’s easier to describe as “this AI has the wrong objective” vs “this AI has the correct objective but pursues it badly due to generalization error”. In any case, both of these also seem equally dangerous to me.
The problem with distinguishing these is that for a sufficiently complex training objective, even a very powerful agent-y AI will have a “fuzzy” goal that isn’t an exact specification of what it should do (for example, humans don’t have clearly defined objectives that they consistently pursue). This fuzzy goal is like a cluster of possible worlds towards which the AI is causing our current world to tend, via its actions/outputs. Pursuing the goal badly means having an overly fuzzy goal where some of the possible convergent worlds are not what we want. Inner misalignment, or having the wrong goal, will also look very similar, although perhaps a distinction you could make is that with inner misalignment the fuzzy goal has to be in some sense miscentered.
Recently I’ve seen a bunch of high-status people using “inner alignment” in the more general sense, so I’m starting to think it might be too late to stick to the narrow definition. E.g. this post.
Any AI that does a task well is in some sense optimizing for good performance on that task.
I disagree with this. To me there are two distinct approaches: one is to memorize which actions did well in similar training situations, and the other is to predict the consequences of each action and somehow rank each consequence.
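A minimal sketch of these two approaches, assuming a toy one-dimensional world. Everything here (`GOAL`, `step`, `rank`, and both policy functions) is hypothetical and only meant to show the structural difference, not how any real system works.

```python
GOAL = 5            # hypothetical target position on a number line
ACTIONS = (-1, 1)   # move left or move right


def step(state, action):
    """Hypothetical world model: predict the position after taking an action."""
    return state + action


def rank(state):
    """Hypothetical ranking of consequences: closer to GOAL ranks higher."""
    return -abs(GOAL - state)


def build_table(train_states):
    """Approach 1, training: record which action did best in each seen state."""
    return {s: max(ACTIONS, key=lambda a: rank(step(s, a))) for s in train_states}


def memorizing_policy(table, state):
    """Approach 1, acting: copy the action from the most similar seen state."""
    nearest = min(table, key=lambda s: abs(s - state))
    return table[nearest]


def planning_policy(state):
    """Approach 2: predict each action's consequence and take the best-ranked."""
    return max(ACTIONS, key=lambda a: rank(step(state, a)))


# Assumption: training only covered states 0..4, all to the left of GOAL.
table = build_table(range(0, 5))

print(memorizing_policy(table, 3), planning_policy(3))  # in-distribution: same action
print(memorizing_policy(table, 8), planning_policy(8))  # past GOAL: the actions differ
```

On states like those seen in training the two policies are indistinguishable; past the goal, the memorizing policy keeps moving right while the planning policy turns back.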
for a sufficiently complex training objective, even a very powerful agent-y AI will have a “fuzzy” goal that isn’t an exact specification of what it should do (for example, humans don’t have clearly defined objectives that they consistently pursue).
I disagree with this, but I can’t put it into clear words. I’ll think more about it. It doesn’t seem true for model-based RL, unless we explicitly build in uncertainty over goals. I think it’s only true for humans for value-loading-from-culture reasons.
What would you include as an inner alignment problem that isn’t a generalization problem or robustness problem?