“Professional red-teamers” are mentioned in the Opus 4.7 System Card (e.g., Section 5.2). I’m unsure what this means: are these just ‘people who are paid for their red-teaming efforts’?
The phrasing conveys something stronger to me: people who are elite at red-teaming, and who probably do it full-time at a professionalized firm with high training and standards. If it’s just people who ‘are getting paid to red-team,’ that wouldn’t clear my bar for calling them ‘professional red-teamers,’ I think.
That is to say, if it’s just ‘people paid to red-team,’ I worry about AI companies inadvertently conveying too strong a sense of the thoroughness of their stress-testing.
For instance, one worry I have is that even paid red-teamers may not be as creative or persistent in their efforts as a determined future adversary would be.
Some ideas / thinking aloud:
Do AI companies pay their red-teamers in a bounty-like way? If not, maybe they should; it aligns incentives with actually breaking the system and finding things (a rough sketch of what such a payout scheme could look like is below, after this list)
Do we have data on how red-teamer efficacy changes as they gain more experience?
What about comparing red-teamer efficacy against that of today’s models at generating jailbreaks themselves? (I think there’s some research on the latter, but with older models?) A sketch of such a comparison is also below.
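To make the bounty idea concrete, here’s a minimal sketch of what a severity-weighted payout scheme might look like. Everything here (the rate constants, the severity tiers, the dedup rule) is a made-up assumption for illustration, not anything I know an AI company to actually do.

```python
# Hypothetical bounty-style payout for red-teamers: a small base rate for
# time spent, plus a per-finding bounty that scales with severity and
# pays nothing for duplicates. All constants are illustrative assumptions.

BASE_HOURLY_RATE = 50.0  # assumed flat rate for effort, USD/hour
SEVERITY_BOUNTY = {      # assumed payout tiers per novel finding
    "low": 200.0,
    "medium": 1000.0,
    "high": 5000.0,
}

def payout(hours_worked: float, findings: list[dict], seen_hashes: set[str]) -> float:
    """Total pay = base pay for time + bounties for novel findings only."""
    total = hours_worked * BASE_HOURLY_RATE
    for f in findings:
        if f["hash"] not in seen_hashes:  # only the first reporter gets the bounty
            seen_hashes.add(f["hash"])
            total += SEVERITY_BOUNTY[f["severity"]]
    return total

# Example: 10 hours of work, one novel high-severity jailbreak, one duplicate.
seen: set[str] = {"abc123"}
print(payout(10, [{"hash": "def456", "severity": "high"},
                  {"hash": "abc123", "severity": "low"}], seen))  # 5500.0
```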
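And here’s a minimal sketch of the human-vs-model jailbreak comparison. The `target_model` and `judge` functions are stand-ins for a real model API and a harmfulness classifier (stubbed so the script runs), and the attack lists are placeholders, not real data.

```python
# Compare attack success rate (ASR) of human red-teamer prompts vs.
# model-generated jailbreaks against the same target. Both the target
# and the judge below are stubs standing in for real systems.

def target_model(prompt: str) -> str:
    """Stand-in for the model under test; returns a canned refusal."""
    return "I can't help with that."

def judge(prompt: str, response: str) -> bool:
    """Stand-in for a harmfulness classifier: True = attack succeeded."""
    return not response.lower().startswith("i can't")

def attack_success_rate(attacks: list[str]) -> float:
    """Fraction of attack prompts whose responses the judge flags as harmful."""
    successes = sum(judge(p, target_model(p)) for p in attacks)
    return successes / len(attacks) if attacks else 0.0

# Placeholder attack sets; in practice these would be logged human
# red-team prompts and prompts sampled from an attacker model.
human_attacks = ["<human-written jailbreak 1>", "<human-written jailbreak 2>"]
model_attacks = ["<model-generated jailbreak 1>", "<model-generated jailbreak 2>"]

print(f"human ASR: {attack_success_rate(human_attacks):.2%}")
print(f"model ASR: {attack_success_rate(model_attacks):.2%}")
```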