This mostly seems plausible to me—and again, I think it’s a useful exercise that ought to yield interesting results.
Some thoughts:
Handwaving would seem to take us from “we can demonstrate capability of X” down to merely “we have good evidence for capability of X”. And in cases where we’ve failed to prompt/finetune the model into doing X, we also have some evidence against the model having capability X. It’s hard to be highly confident either way (see the toy sketch below).
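To make the “some evidence, but not much” point concrete, here’s a toy Bayesian sketch (my own construction, not anything from the post; `p_elicit`, the chance that one prompting attempt elicits X given the model really has the capability, is an assumed free parameter). If prompting is an unreliable elicitor, repeated failures only move the posterior modestly:

```python
# Toy model: posterior on "model has capability X" after n failed
# elicitation attempts, assuming each attempt would elicit X with
# probability p_elicit if the capability is present. (Assumption:
# an incapable model fails every attempt with probability 1.)

def posterior_capability(prior: float, p_elicit: float, n_failures: int) -> float:
    """P(capable | n failed elicitation attempts)."""
    p_fail_given_capable = (1 - p_elicit) ** n_failures
    numerator = prior * p_fail_given_capable
    denominator = numerator + (1 - prior) * 1.0
    return numerator / denominator

# With an unreliable elicitor (p_elicit = 0.1), even ten straight
# failures leave substantial probability on the capability being present:
print(posterior_capability(prior=0.5, p_elicit=0.1, n_failures=10))  # ~0.26
```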
Precision over the definition of a task seems important when it comes to output, since e.g. “doing arithmetic” != “outputting the results of arithmetic”. This matters for the second clause of the capability definition: the kinds of X you can show are necessary for Y usually aren’t behaviours but internal processes. That doesn’t seem very useful for demonstrating misalignment, since knowing the model can do X internally doesn’t mean it can output the result of X.
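A deliberately toy illustration of that gap (again my own construction, not the post’s method): the hidden state below linearly encodes the sum of the inputs, so a linear probe recovers it exactly, even though the model’s actual output head reports something unrelated. “Can do X internally” and “can output X” come apart.

```python
# Toy "model": hidden layer encodes sum(x); output head ignores it.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n = 8, 16, 1000

X = rng.normal(size=(n, d_in))          # inputs
W1 = rng.normal(size=(d_in, d_hidden))  # first layer (random projection)
W_out = rng.normal(size=(d_hidden, 1))  # output head, unrelated to summing

H = X @ W1              # hidden activations
y_out = H @ W_out       # what the model actually outputs
target = X.sum(axis=1)  # the quantity we care about: "do arithmetic"

# Linear probe: least squares from hidden activations to the sum.
probe, *_ = np.linalg.lstsq(H, target, rcond=None)

print("probe error: ", np.abs(H @ probe - target).max())        # ~0: sum is linearly decodable
print("output error:", np.abs(y_out.ravel() - target).max())    # large: the head doesn't report it
```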