I’m working on a research project at Rethink Priorities on this topic: whether and how to use bug bounties for advanced ML systems. I think your tl;dr is probably right, although I have a few questions I’m planning to get better answers to over the next month before advocating for/facilitating the creation of bounties in AI safety:
How subjective can prize criteria for AI safety bounties be while still incentivizing high-quality engagement?
If prize criteria need to be highly specific, can we specify unsafe behaviour that is relevant to long-term AI safety (and not just trivially exhibited by all existing AI models)?
How many valuable insights come from the general public (e.g. people on Twitter competing to make the model misbehave) vs. from internal red-teaming?
Might bounty hunters generate genuinely harmful behaviour in the process?
What is the usual career trajectory of bug bounty prize-winners?
What kind of community could a large, well-developed infrastructure of AI safety bounties foster?
How much would public and elite opinion of AI safety in general be affected by more examples of vulnerabilities?
If anyone has thoughts on this topic or these questions (including more important questions you’d like to see asked or answered), or wants more info on my research, I’d be keen to talk (here, via firstname@rethinkpriorities[dot]org, or at calendly.com/patrick-rethink).