I’m most interested in arguments for inner alignment failure. I’m pretty confused by the fact that some researchers seem to think inner alignment is the main problem and/or probably extremely difficult, and yet I haven’t really heard a rigorous case made for its plausibility.
I’ll do the easier part of your question first:
I have not read all the material about inner alignment that has appeared on this forum, but I do occasionally read up on it.
There are some posters on this forum who believe that contemplating a set of problems collectively called ‘inner alignment’ can work as an intuition pump that would allow us to make needed conceptual breakthroughs. The breakthroughs sought have mostly to do, I believe, with analyzing possibilities for post-training treacherous turns which have so far escaped notice. I am no longer one of the posters who have high hopes that inner alignment will work as a useful intuition pump.
The terminology problem I have with the term ‘inner alignment’ is that many working on it never make the move of defining it in rigorous mathematics, or with clear toy examples of what is and what is not an inner alignment failure. Absent either a mathematical definition or some defining examples, I am not able to judge whether inner alignment is the main alignment problem, or a minor one that is nevertheless extremely difficult to solve.
What does not help here is that by now there are several non-mathematical notions floating around of what an inner alignment failure even is, to the extent that Evan has felt a need to write an entire clarification post.
When poster X calls something an example of an inner alignment failure, poster Y might respond and declare that, in their view of inner alignment failure, it is not actually an example of one, or at least not a very good example. If we interpret inner alignment as a meme, it has a reproduction strategy of triggering social media discussions about what it means.
Inner alignment has become what Minsky called a suitcase word: everybody packs their own meaning into it. This means that for the purpose of distillation, the word is best avoided. If you want to distil the discussion, my recommendation is to look for the meanings that people pack into the word.
I’m broadly sympathetic to your point that there have been an unfortunate number of disagreements about inner alignment terminology, and it has been and remains a source of confusion.
to the extent that Evan has felt a need to write an entire clarification post.
Yeah, and recently there have been even more disagreements and clarification attempts.
I should have specified this in the top-level question, but (as mentioned in my own answer) I’m talking about abergal’s suggestion of what inner alignment failure should refer to (basically: a model pursuing a different objective to the one it was trained on, when deployed out-of-distribution, while retaining most or all of the capabilities it had on the training distribution). I agree this isn’t crisp and is far from a mathematical formalism, but note that there are several examples of this kind of failure in current ML systems that help to clarify what the concept is, and people seem to agree on these examples.
If you can think of toy examples that make real trouble for this definition of inner alignment failure, then I’d be curious to hear what they are.
Meta: I usually read these posts via the alignmentforum.org portal, and this portal filters out certain comments, so I missed your mention of abergal’s suggestion, which would have clarified your concerns about inner alignment arguments for me. I have mailed the team that runs the website to ask if they could improve how this filtering works.
Just read the post with the examples you mention, and skimmed the related arXiv paper. I like how the authors develop the metrics of ‘objective robustness’ vs ‘capability robustness’ while avoiding the problem of trying to define a single meaning for the term ‘inner alignment’. Seems like good progress to me.
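To make that distinction concrete for anyone who has not clicked through: below is a minimal toy sketch of my own (none of it is code from the post or the paper; the grid, the hard-coded ‘run right’ policy, and the two proxy measurements are all made up for illustration). During training the coin always sits at the end of the level, so a policy that just runs right scores well on both measurements; under a shifted test distribution where the coin can be anywhere, the same policy keeps its capability while its pursuit of the intended objective collapses, which is the pattern the ‘capability robustness’ vs ‘objective robustness’ split is meant to separate.

```python
import random

# Toy illustration only: a WIDTH x HEIGHT grid, agent starts at (0, 0),
# a coin is placed somewhere, and the 'trained' behaviour is hard-coded
# as "always run right along the bottom row".
WIDTH, HEIGHT = 10, 5

def run_right_policy(agent):
    x, y = agent
    return (x + 1, y)  # the behaviour the training distribution selects for

def run_episode(coin):
    agent = (0, 0)
    got_coin = agent == coin
    for _ in range(WIDTH):
        x, y = run_right_policy(agent)
        agent = (min(x, WIDTH - 1), y)
        if agent == coin:
            got_coin = True
        if agent[0] == WIDTH - 1:          # reached the end of the level
            break
    reached_end = agent[0] == WIDTH - 1    # crude capability proxy
    return reached_end, got_coin           # got_coin: the intended objective

def evaluate(coin_sampler, episodes=2000):
    ends = coins = 0
    for _ in range(episodes):
        reached_end, got_coin = run_episode(coin_sampler())
        ends += reached_end
        coins += got_coin
    return ends / episodes, coins / episodes

# Training distribution: coin always at the far end of the bottom row.
cap_train, obj_train = evaluate(lambda: (WIDTH - 1, 0))
# Shifted test distribution: coin placed uniformly at random.
cap_test, obj_test = evaluate(lambda: (random.randrange(WIDTH), random.randrange(HEIGHT)))

print(f"capability proxy: train={cap_train:.2f}  test={cap_test:.2f}")  # stays high
print(f"objective proxy:  train={obj_train:.2f}  test={obj_test:.2f}")  # drops sharply
```

Running it, the capability proxy stays at 1.0 under the shift while the objective proxy drops to roughly the chance level, which is the high-capability-robustness, low-objective-robustness signature that abergal’s definition points at.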