We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y
I like the overall idea—seems very worthwhile.
A query on the specifics:
Are you thinking that this is a helpful definition even when treating models as black boxes, or only based on some analysis of the model’s internals? To me it seems workable only in the latter case.
In particular, from a black-box perspective, I don’t think we ever know that task X is required for task Y. The most we can know is that some [task Z whose output logically entails the output of task X] is required for task Y (where of course Z may be X).
So this clause seems never to be satisfiable without highly specific knowledge of the internals of the model. (If we were to say that it’s satisfied when we know Y requires some Z entailing X, then it seems we’d be requiring logical omniscience for intent alignment.)
For example, the model may be doing something like Z → (Y & A) → Y, without knowing that Z ≡ (X & A), and that X → Y also works (A happening to be superfluous in this case).
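To spell out the worry (a minimal formalisation; reading tasks as propositions is my assumption, not anything in the original definition): the decomposition is trivial once you can see it, but a black box only ever exhibits the composite implication.

```lean
-- Toy sketch, assuming tasks can be read as propositions.
-- Z secretly decomposes as X ∧ A, and X alone already suffices for Y,
-- so Z → Y holds; but observing only Z → Y from outside tells us
-- nothing about whether X on its own is "required" for Y.
example (X A Y Z : Prop)
    (hZ : Z ↔ X ∧ A)  -- the hidden decomposition
    (hXY : X → Y) :   -- X alone suffices for Y (A is superfluous)
    Z → Y :=
  fun hz => hXY (hZ.mp hz).1
```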
Does that seem right, or am I confused somewhere?
Another way to put this is that for workable cases, I’d expect the first clause to cover things: if the model knows how to simply separate Z into X&A in the above, then I’d expect suitable prompt engineering, fine-tuning… to be able to get the model to do task X.
(EDIT—third time lucky :) :
If this isn’t possible for a given X, then I expect the model isn’t capable of task X (for non-deceptive models, at least).
For black boxes, the second clause only seems able to get us something like “the model contains sufficient information for task X to be performed”, which is necessary, but not sufficient, for capability.)
Yeah, I think you need some assumptions about what the model is doing internally.
I’m hoping you can handwave over cases like ‘the model might only know X&A, not X’ with something like ‘if the model knows X&A, that’s close enough to it knowing X for our purposes—in particular, if it thought about the topic or learned a small amount, it might well realise X’.
Where ‘our purposes’ are something like ‘might the model be able to use its knowledge of X in a plan in some way that outsmarts us if we don’t know X’?
It seems plausible to me that there are cases where you can’t get the model to do X by finetuning/prompt engineering, even if the model ‘knows’ X enough to be able to use it in plans. Something like: the part of its cognition that’s solving X isn’t ‘hooked up’ to the part that does output, but is hooked up to the part that makes plans. In humans, this would be any ‘knowledge’ that can be used to help you achieve stuff but which is subconscious: your linguistic self can’t report it directly (and, further, you can’t train yourself to be able to report it).
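As a deliberately crude picture of that separation (a toy sketch; the planner/output split is my cartoon assumption, not a claim about real model internals):

```python
# Toy sketch: internal "planning" uses arithmetic that the output
# channel has no way to emit. The planner computes a sum, but the
# only outputs available are items from a fixed vocabulary, so no
# prompt could elicit the number itself.
def plan_move(a: int, b: int) -> str:
    total = a + b  # arithmetic "done" internally...
    return "left" if total % 2 == 0 else "right"  # ...but never reportable

print(plan_move(17, 25))  # "left": the sum 42 was used, never exposed
```

Here the knowledge of the sum is ‘hooked up’ only to the decision, never to anything the system can say.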
This mostly seems plausible to me—and again, I think it’s a useful exercise that ought to yield interesting results.
Some thoughts:
Handwaving would seem to take us from “we can demonstrate capability of X” to “we have good evidence for capability of X”. In cases where we’ve failed to prompt/finetune the model into doing X, we also have some evidence against the model’s capability of X. It’s hard to be highly confident here.
Precision over the definition of a task seems important when it comes to output, since e.g. “do arithmetic” != “output arithmetic”.
This is relevant to the second clause of the capability definition, since the kinds of X you can show are necessary for Y aren’t (usually) behaviours, but rather internal processes. This doesn’t seem too useful in attempting to show misalignment, since knowing the model can do X doesn’t mean it can output the result of X.