What are the bottlenecks preventing 10x-100x scaling of Control Evaluations?
I'm not confident in the safety-margin estimates we get from internal-only evaluations. Eliciting strong subversion performance from models seems very hard, which makes it difficult to get satisfactory estimates of how capable models actually are at subverting control protocols.
I'd feel more confident if we had thousands of people trying to create red-team models while thousands of blue teams proposed different monitoring methods and control protocols.
The types of experiments described in (1, 2, 3) seem particularly amenable to this kind of crowdsourced competition.
This fits naturally into DefCon-esque (my catch-all for the hacker community) security competitions; I think it would be very popular, even without a large monetary prize.
Success would look like having strong confidence in lower and upper bound estimates of how well a given control protocol catches the best subversion attempts humanity can mount (with red-teamed models and humans), and ideally an estimate of the cost of running such a protocol.
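As a rough illustration of what such bound estimates could look like, here is a minimal sketch that treats each red-team attempt as a Bernoulli trial and puts an exact (Clopper-Pearson) confidence interval on a protocol's catch rate; the function name, tallies, and cost figure are all hypothetical, not results from any real evaluation.

```python
# A minimal sketch of what "lower and upper bound estimates" could mean in
# practice: treat each red-team subversion attempt as a Bernoulli trial and put
# an exact (Clopper-Pearson) confidence interval on the protocol's catch rate.
# The function name, tallies, and cost figure below are hypothetical.
from scipy.stats import beta

def catch_rate_bounds(caught: int, attempts: int, alpha: float = 0.05):
    """Exact two-sided confidence bounds on P(protocol catches a subversion attempt)."""
    lower = beta.ppf(alpha / 2, caught, attempts - caught + 1) if caught > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, caught + 1, attempts - caught) if caught < attempts else 1.0
    return lower, upper

# Hypothetical competition tallies: 1240 red-team attempts, 1188 caught.
lo, hi = catch_rate_bounds(caught=1188, attempts=1240)
est_cost_usd_per_1k_actions = 0.04  # placeholder cost of running the protocol, not a measurement
print(f"catch rate in [{lo:.3f}, {hi:.3f}] at 95% confidence, "
      f"~${est_cost_usd_per_1k_actions}/1k agent actions to run")
```

A real evaluation would presumably also need to track the false-positive rate and the protocol's impact on benign-task usefulness, since a blue team could trivially "catch everything" by flagging every action.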
I think the reasons this would fail are:
lack of access to the actual frontier models (perhaps a lab could partner with a competition organizer, or the competition could rely on strong open-source models).
lack of the data collection needed to have confidence in the results; meaningful metrics might be hard to define (see the sketch after this list for the kind of per-attempt record I have in mind).
would the competition reflect the local conditions of frontier labs well enough that labs can generalize from the results? Ideally a consortium effort would help ensure the setup reflects reality.
the general operational difficulty of coordinating lots of people.
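On the data-collection point: as a concrete (and entirely assumed) example of what I mean, the competition could require every run to emit a record like the following, so that catch rates, false-positive rates, and protocol costs can be recomputed and audited after the fact. All field names here are my own guesses, not an established schema.

```python
# My own guess (not an established schema) at the per-attempt record a
# competition would need to log so that catch rates, false-positive rates, and
# protocol costs remain auditable after the fact. All field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class AttemptRecord:
    attempt_id: str
    red_team: str                   # team or model that submitted the attempt
    blue_protocol: str              # monitoring / control protocol under test
    task_id: str                    # benign task the subversion was embedded in
    attack_attempted: bool          # distinguishes attack runs from benign baselines
    attack_succeeded: bool          # did the subversion achieve its goal?
    flagged_by_monitor: bool        # did the protocol flag the trajectory for review?
    monitor_suspicion_score: float  # raw score, kept for threshold sweeps later
    tokens_spent: int               # crude proxy for the protocol's cost
    transcript_uri: str = ""        # pointer to the full trajectory for audits
    metadata: dict = field(default_factory=dict)
```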
Are there other bottlenecks or failure modes I'm missing?