> I think we at least do know how to do effective capabilities evaluations
This seems an overstatement to me: where the main risk is misuse, we’d need to know that those doing the testing have methods for eliciting capabilities that are as effective as anything people will come up with later (including the most artful AutoGPT 3.0 setups, etc.).
It seems reasonable to me to claim that “we know how to do effective [capabilities given SOTA elicitation methods] evaluations”, but that doesn’t answer the right question.
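To make the gap concrete, here is a toy sketch (all names hypothetical, not any lab’s actual eval harness): a fixed battery of elicitation strategies only lower-bounds a model’s capability, so “passing” such an eval says little about what a more artful future scaffold will elicit.

```python
from typing import Callable, Iterable

# An elicitation strategy maps a task to the score it manages to elicit.
ElicitationStrategy = Callable[[str], float]

def eval_capability(task: str, strategies: Iterable[ElicitationStrategy]) -> float:
    """Best score achieved across the elicitation strategies we know about.

    Crucially, this is only a LOWER bound on true capability: a later,
    more artful scaffold may elicit more than anything in `strategies`,
    which is exactly why passing a fixed eval doesn't settle the question.
    """
    return max(strategy(task) for strategy in strategies)
```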
Once the main risk isn’t misuse, we also have to worry about assumptions breaking down (no exploration hacking / no gradient hacking / [assumption we didn’t realize we were relying upon]). Obviously we don’t expect these to break yet, but I’d guess we’ll be surprised the first time one does. I expect your guess on when they will break to be more accurate than mine, but [I don’t have much of a clue, so I’m advocating extreme caution] may be the more reasonable policy.
> My concern with trying to put something like [understanding-based evals] into an RSP right now is that it’ll end up evaluating the wrong thing: since we don’t yet know how to effectively evaluate understanding, any evaluation we set up right now would probably be too game-able to actually be workable here.
We don’t know how to put the concrete eval in the RSP yet, but we can certainly require that an eval for understanding passes. We can write into the RSP what the test would be intended to achieve, along with conditions for approving the eval itself. E.g., [if at least two of David Krueger, Wei Dai, and Abram Demski agree that this meets the bar for this category of understanding eval, then it does] (or whatever other criteria you might want).
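A minimal sketch of that kind of meta-level condition, using the names and threshold from the illustrative example above (any real RSP would substitute its own reviewers and criteria):

```python
# Reviewers and threshold taken from the illustrative example above.
APPROVERS = {"David Krueger", "Wei Dai", "Abram Demski"}
APPROVAL_THRESHOLD = 2

def eval_is_approved(signoffs: set[str]) -> bool:
    """True iff enough designated reviewers agree the eval meets the bar."""
    return len(signoffs & APPROVERS) >= APPROVAL_THRESHOLD
```

The point is that the RSP can commit to *who decides* whether an understanding eval is adequate, even before anyone knows how to write the eval.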
Again, putting only well-understood, concretely specified targets in the RSP seems like a predictable way to fail to address poorly understood problems. Either the RSP needs to cover the poorly understood problems too, perhaps with a [you can’t pass this check without first coming up with a test and getting it approved] condition, or it needs a “THIS RSP IS INADEQUATE TO ENSURE SAFETY” warning in huge red letters on every page. (If the Anthropic RSP communicates this at all, it’s not emphasized nearly enough.)
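One way to encode that bracketed condition is a fail-closed gate: absent an approved test, the check simply cannot be passed. A minimal sketch (function and parameter names hypothetical):

```python
from typing import Callable, Optional

def understanding_check(approved_eval: Optional[Callable[[], bool]]) -> bool:
    """Fail closed: with no approved eval on file, the gate cannot be passed."""
    if approved_eval is None:
        return False  # No approved test exists yet, so scaling is not cleared.
    return approved_eval()
```

Failing closed is the design choice that matters here: the default for a poorly understood problem is “blocked”, not “pass by omission”.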