Within the appendix on control methods you mention preventing exploration hacking. I think a broader point which belongs in the list is: “Improve capability elicitation techniques”.
Agreed. We think improving capability elicitation techniques will also improve our ability to prevent exploration hacking and sandbagging, so work on improving capability elicitation in general will be useful.
Preventing (basically intentional) sandbagging is just a special case of capability elicitation, but it is pretty much the only case we care about for control.