This mostly seems plausible to me—and again, I think it’s a useful exercise that ought to yield interesting results.
Some thoughts:
Handwaving would seem to take us from “we can demonstrate capability of X” down to merely “we have good evidence for capability of X”. And in cases where we’ve failed to prompt/finetune the model into doing X, we also have some evidence against the model having capability X. It’s hard to be highly confident either way (see the toy sketch below).
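To make the “some evidence, but not much” point concrete, here’s a toy Bayesian sketch (my own construction, not anything from the post; `p_elicit`, the chance that one prompting attempt elicits X given the model really has the capability, is an assumed free parameter). If prompting is an unreliable elicitor, repeated failures only move the posterior modestly:

```python
# Toy model: posterior on "model has capability X" after n failed
# elicitation attempts, assuming each attempt would elicit X with
# probability p_elicit if the capability is present. (Assumption:
# an incapable model fails every attempt with probability 1.)

def posterior_capability(prior: float, p_elicit: float, n_failures: int) -> float:
    """P(capable | n failed elicitation attempts)."""
    p_fail_given_capable = (1 - p_elicit) ** n_failures
    numerator = prior * p_fail_given_capable
    denominator = numerator + (1 - prior) * 1.0
    return numerator / denominator

# With an unreliable elicitor (p_elicit = 0.1), even ten straight
# failures leave substantial probability on the capability being present:
print(posterior_capability(prior=0.5, p_elicit=0.1, n_failures=10))  # ~0.26
```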
Precision over the definition of a task seems important when it comes to output, since e.g. “doing arithmetic” != “outputting the results of arithmetic”. This matters for the second clause of the capability definition: the kinds of X you can show are necessary for Y usually aren’t behaviours but internal processes. That doesn’t seem very useful for demonstrating misalignment, since knowing the model can do X internally doesn’t mean it can output the result of X.
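A deliberately toy illustration of that gap (again my own construction, not the post’s method): the hidden state below linearly encodes the sum of the inputs, so a linear probe recovers it exactly, even though the model’s actual output head reports something unrelated. “Can do X internally” and “can output X” come apart.

```python
# Toy "model": hidden layer encodes sum(x); output head ignores it.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n = 8, 16, 1000

X = rng.normal(size=(n, d_in))          # inputs
W1 = rng.normal(size=(d_in, d_hidden))  # first layer (random projection)
W_out = rng.normal(size=(d_hidden, 1))  # output head, unrelated to summing

H = X @ W1              # hidden activations
y_out = H @ W_out       # what the model actually outputs
target = X.sum(axis=1)  # the quantity we care about: "do arithmetic"

# Linear probe: least squares from hidden activations to the sum.
probe, *_ = np.linalg.lstsq(H, target, rcond=None)

print("probe error: ", np.abs(H @ probe - target).max())        # ~0: sum is linearly decodable
print("output error:", np.abs(y_out.ravel() - target).max())    # large: the head doesn't report it
```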