Yeah, I think you need some assumptions about what the model is doing internally.
I’m hoping you can handwave over cases like ‘the model might only know X&A, not X’ with something like ‘if the model knows X&A, that’s close enough to it knowing X for our purposes—in particular, if it thought about the topic or learned a small amount, it might well realise X’.
Where ‘our purposes’ are something like ‘might the model be able to use its knowledge of X in a plan in some way that outsmarts us if we don’t know X’?
Another way to put this is that for workable cases, I’d expect the first clause to cover things: if the model knows how to simply separate Z into X&A in the above, then I’d expect suitable prompt engineering, fine-tuning… to be able to get the model to do task X.
It seems plausible to me that there are cases where you can’t get the model to do X by finetuning/prompt engineering, even if the model ‘knows’ X enough to be able to use it in plans. Something like—the part of its cognition that’s solving X isn’t ‘hooked up’ to the part that does output, but is hooked up to the part that makes plans. In humans, this would be any ‘knowledge’ that can be used to help you achieve things, but which is subconscious—your linguistic self can’t report it directly (and, further, you can’t train yourself to be able to report it).
This mostly seems plausible to me—and again, I think it’s a useful exercise that ought to yield interesting results.
Some thoughts:
Handwaving would seem to take us from “we can demonstrate the capability X” to “we have good evidence for the capability X”. In cases where we’ve failed to prompt/finetune the model into doing X, we also have some evidence against the model being capable of X. It’s hard to be highly confident here.
Precision over the definition of a task seems important when it comes to output, since e.g. “do arithmetic” != “output arithmetic”. This is relevant to the second clause of the capability definition, since the kinds of X you can show are necessary for Y aren’t (usually) behaviours, but rather internal processes. This doesn’t seem too useful in attempting to show misalignment, since knowing the model can do X doesn’t mean it can output the result of X.
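The evidential asymmetry in the first thought above can be made concrete with a toy Bayesian update. All the likelihoods here are illustrative assumptions, not numbers from the discussion—the point is only the shape of the asymmetry:

```python
# Toy Bayesian update: how much should one elicitation attempt move our
# credence that the model has capability X?
# All likelihoods below are illustrative assumptions.

def posterior(prior: float, p_obs_given_capable: float, p_obs_given_not: float) -> float:
    """Update P(capable) after one observation, via Bayes' rule."""
    num = prior * p_obs_given_capable
    return num / (num + (1 - prior) * p_obs_given_not)

prior = 0.5  # assumed prior credence that the model 'knows' X

# Successful elicitation is near-decisive: a model without the
# capability almost never performs X when prompted.
p_success = posterior(prior, p_obs_given_capable=0.9, p_obs_given_not=0.01)

# Failed elicitation is much weaker evidence: a capable model may still
# fail (e.g. the knowledge isn't 'hooked up' to the output channel).
p_failure = posterior(prior, p_obs_given_capable=0.4, p_obs_given_not=0.99)

print(f"P(capable | elicitation succeeds) = {p_success:.2f}")  # ~0.99
print(f"P(capable | elicitation fails)    = {p_failure:.2f}")  # ~0.29
```

Under these (made-up) numbers, one success takes you from 0.5 to ~0.99, while one failure only takes you down to ~0.29—which is the sense in which a failed prompt/finetune attempt is evidence against capability, but leaves it hard to be highly confident either way.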