I’m broadly sympathetic to your point that there have been an unfortunate number of disagreements about inner alignment terminology, and that it has been and remains a source of confusion, to the extent that Evan has felt the need to write an entire clarification post.
Yeah, and recently there have been even more disagreements/clarification attempts.
I should have specified this on the top-level question, but (as mentioned in my own answer) I’m talking about abergal’s suggestion of what inner alignment failure should refer to (basically: a model pursuing a different objective from the one it was trained on, when deployed out-of-distribution, while retaining most or all of the capabilities it had on the training distribution). I agree this isn’t crisp and is far from a mathematical formalism, but note that there are several examples of this kind of failure in current ML systems that help to clarify what the concept is, and people seem to agree on these examples.
If you can think of toy examples that make real trouble for this definition of inner alignment failure, then I’d be curious to hear what they are.
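To make the definition above concrete, here is a minimal toy sketch (hypothetical, not taken from any of the posts or papers discussed): an agent trained on a 1-D gridworld where the coin always sits at the right end can learn the proxy objective “go right”. Out of distribution, the coin moves, but the agent still competently goes right, so its capabilities are robust while its objective is not.

```python
def trained_policy(agent_pos, coin_pos):
    # Learned behaviour: always move right. On the training
    # distribution this proxy coincides with the intended
    # objective "reach the coin".
    return +1

def run_episode(coin_pos, steps=10, start=0, size=10):
    # Roll out the learned policy, clipping moves to the grid.
    pos = start
    for _ in range(steps):
        pos = max(0, min(size - 1, pos + trained_policy(pos, coin_pos)))
    return pos

# Training distribution: coin at the right end -> agent reaches it.
print(run_episode(coin_pos=9))  # agent ends at cell 9, on the coin

# Out of distribution: coin at the left end -> the agent still walks
# right just as competently, but now misses the coin entirely.
print(run_episode(coin_pos=0))  # agent still ends at cell 9
```

This is of course the easy case; the interesting question is whether there are toy examples where the objective-vs-capability split itself breaks down.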
Meta: I usually read these posts via the alignmentforum.org portal, and this portal filters out certain comments, so I missed your mention of abergal’s suggestion, which would have clarified for me what concerns you had in mind about inner alignment arguments. I have emailed the team that runs the website to ask if they could improve how this filtering works.
Just read the post with the examples you mention, and skimmed the related arXiv paper. I like how the authors develop the separate metrics of ‘objective robustness’ vs. ‘capability robustness’ while avoiding the problem of trying to define a single meaning for the term ‘inner alignment’. Seems like good progress to me.