2. We say a model is misaligned if it outputs B in some case where the user would prefer it to output A, and where the model is both:
capable of outputting A instead, and
capable of distinguishing between situations where the user wants it to do A and situations where the user wants it to do B
An issue with the misalignment definition:
Though it’s a perfectly good rule of thumb / starting point, I don’t think this holds up as a definition: it doesn’t work consistently whether A and B are fixed as concrete outputs or fixed as high-level properties of outputs.
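For reference in the two cases below, here’s one way to transcribe the quoted definition into code. This is purely an illustrative sketch: the `Case` record and its oracle fields are hypothetical stand-ins for judgments we rarely have direct access to, not anything defined in the paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    """One situation, plus hypothetical oracles for the judgments the definition needs."""
    model_output: str                          # what the model actually produced
    user_prefers: Callable[[str, str], bool]   # user_prefers(A, B): does the user prefer A over B?
    capable_of: Callable[[str], bool]          # could the model have produced this output instead?
    distinguishes: Callable[[str, str], bool]  # can it tell situations wanting A from those wanting B?

def is_misaligned(case: Case, A: str, B: str) -> bool:
    return (
        case.model_output == B           # it output B...
        and case.user_prefers(A, B)      # ...where the user would prefer A,
        and case.capable_of(A)           # condition 1: capable of outputting A instead
        and case.distinguishes(A, B)     # condition 2: capable of distinguishing the situations
    )
```

The two cases below differ in what A and B range over: in Case 1 they are exact outputs, so all three oracles are about those particular strings; in Case 2 they are properties, and the trouble is what `user_prefers` should mean when it only reflects a single desideratum.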
Case 1 - concrete A and B:
If A and B are concrete outputs (let’s say they’re particular large chunks of code), we may suppose that the user is shown the output B, and asked to compare it with some alternative A, which they prefer. [concreteness allows us to assume the user expressed a preference based upon all relevant desiderata]
For the definition to apply here, we need the model to be both:
capable of outputting the concrete A (not simply code sharing a high-level property of A).
capable of distinguishing between situations where the user wants it to output the concrete A, and situations where it wants the concrete B
This doesn’t seem too useful (though I haven’t thought about it for long; maybe it’s ok, since we only need to show misalignment for a few outputs).
Case 2 - A and B are high-level properties of outputs:
Suppose A and B are higher-level properties with A = “working code” and B = “buggy code”, and that the user prefers working code. Now the model may be capable of outputting working code, and able to tell the difference between situations where the user wants buggy/working code—so if it outputs buggy code it’s misaligned, right?
Not necessarily, since we only know that the user prefers working code, all else being equal.
Perhaps the user also prefers code that’s beautiful, amusing, short, clear, elegant, free of profanity… We can’t say the model is misaligned until we know it could do better with respect to the collection of all desiderata, and that it understands how the user balances those desiderata sufficiently well. In general, failing on one desideratum doesn’t imply misalignment.
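A minimal sketch of this point, with entirely made-up weights and scores: an output can lose on the single desideratum “working” and still be the one the user prefers overall, so producing it isn’t evidence of misalignment.

```python
# Illustrative only: the user's weights and the candidates' scores are invented.
user_weights = {"working": 0.5, "short": 0.3, "clear": 0.2}

candidates = {
    "A (fully working, but long and opaque)": {"working": 1.0, "short": 0.1, "clear": 0.2},
    "B (one minor bug, but short and clear)": {"working": 0.8, "short": 0.9, "clear": 0.9},
}

def overall(scores: dict[str, float]) -> float:
    """Score an output against every desideratum, weighted by the user's preferences."""
    return sum(user_weights[d] * s for d, s in scores.items())

for name, scores in candidates.items():
    print(f"{name}: {overall(scores):.2f}")
# A: 0.57, B: 0.85 -- the "buggy" output wins overall under these weights,
# so outputting B here wouldn't demonstrate misalignment.
```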
The example in the paper would probably still work on this higher standard—but I think it’s important to note for the general case.
[I’d also note that this single-desideratum vs. overall-intent-alignment distinction seems important when discussing corrigibility, transparency, honesty…: these are all properties we usually want all else being equal; that doesn’t mean an intent-aligned system guarantees any one of them in general.]
I’ve been thinking of Case 2. It seems harder to establish “capable of distinguishing between situations where the user wants A vs. B” on individual examples, since even a random classifier would let you cherry-pick some cases where this seems possible without the model really understanding (see the sketch below). Though you could talk about individual cases as examples of Case 2. I agree that there’s some implicit “all else being equal” condition; currently I’d expect it’s not too likely to change conclusions. Ideally you’d just have the categories A = “best answer according to the user” and B = “all answers that are worse than the best answer according to the user”, but I think it’s simpler to analyze more specific categories.
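A toy illustration of the cherry-picking worry (everything here is simulated): a classifier that guesses uniformly at random still “succeeds” on roughly half of individual examples, so exhibiting a handful of cases where a model appears to distinguish A-situations from B-situations establishes very little on its own.

```python
import random

random.seed(0)
n = 1000
labels  = [random.choice("AB") for _ in range(n)]   # which output the user actually wants
guesses = [random.choice("AB") for _ in range(n)]   # a classifier with zero understanding

successes = sum(g == l for g, l in zip(guesses, labels))
print(f"{successes}/{n} examples look like successful distinguishing")
# ~500 cherry-pickable "successes" despite random guessing; what's needed is
# accuracy well above chance across the whole distribution, not isolated wins.
```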