Recommendation: Bug Bounties and Responsible Disclosure for Advanced ML Systems

tl;dr—I think companies making user-facing advanced ML systems should deliberately set up a healthier relationship with users generating adversarial inputs; my proposed model is bug bounties and responsible disclosure, and I’m happy to help facilitate their creation.

User-facing advanced ML systems are in their infancy; creators and users are still figuring out how to handle them.

Currently, the loop looks something like this: the creators try to set up a training environment that will produce a system that behaves (perhaps trying to make it follow instructions, or be a helpful and harmless assistant, and so on), they release it to users, and then people on Twitter compete to see who can create an unexpected input that causes the model to misbehave.

This doesn’t seem ideal. It’s adversarial instead of collaborative, the prompts are publicly shared,[1] and the reward for creativity or understanding of the models is notoriety instead of cash, improvements to the models, or increased access to the models.

I think a temptation for companies, who want systems that behave appropriately for typical users, is to block researchers who are attempting to break those systems, reducing their access and punishing the investigative behavior. This is especially tempting when the prompts deliberately push the system into a rarely used portion of its input space: retraining the model or patching the system to behave appropriately in those scenarios might not substantially improve the typical user experience, while the publicized misbehavior still generates bad press for the product. I think papering over flaws like this is probably short-sighted.

This situation should seem familiar. Companies have been making software systems for a long time, and users have been finding exploits for those systems for a long time. I recommend that AI companies and AI researchers, who until now have not needed to pay much attention to the history of computer security, try to figure out the necessary modifications to best practices for this new environment (ideally with help from cybersecurity experts). It should be easy for users to inform creators of prompts that cause misbehavior, and for creators to make use of those prompts as further training data for their models;[2] there should be a concept of ‘white hat’ prompt engineers; and there should be an easy way for companies with similar products to inform each other of generalizable vulnerabilities.

I also think this won’t happen by default; it seems like many companies making these systems are operating in a high-velocity environment where no one is actively opposed to implementing these sorts of best practices, but they aren’t prioritized highly enough to be implemented. This is where I think broader society can step in and make this both clearly desirable and easily implementable.

Some basic ideas:

  • Have a policy for responsible disclosure: if someone has identified model misbehavior, how can they tell you? What’s a reasonable waiting period before going public with the misbehavior? What, if anything, will you reward people for disclosing?

  • Have a monitored contact for that responsible disclosure. If you have a button on your website to report terrible generations, does that do anything? If you have a Google Form to collect bugs, do you have anyone looking at the results and paying out bounties?

  • Have clear-but-incomplete guidance on what is worth disclosing. If your chatbot is supposed to be able to do arithmetic but isn’t quite there yet, you probably don’t want to know about all the different pairs of numbers it can’t multiply correctly. If your standard is that no one should be surprised by an offensive joke, but users asking for them can get them, that should be clear as well.[3]
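
As a rough illustration of the three ideas above, here is a minimal sketch of what a written-down disclosure policy might look like as data. Everything in it is an assumption made for the example: the `DisclosurePolicy` name, its fields, the `security@example.com` address, the category labels, and the 90-day waiting period are placeholders, not a recommendation for what the right values are.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Tuple


@dataclass(frozen=True)
class DisclosurePolicy:
    """Hypothetical policy record; all field names and values are illustrative."""
    contact: str                   # monitored channel where reports are read
    embargo_days: int              # waiting period before a reporter goes public
    in_scope: Tuple[str, ...]      # kinds of misbehavior worth disclosing
    out_of_scope: Tuple[str, ...]  # known limitations you don't need reports about

    def is_in_scope(self, category: str) -> bool:
        # Guidance is clear-but-incomplete: anything not listed is simply not covered.
        return category in self.in_scope

    def earliest_public_date(self, reported_on: date) -> date:
        # The reporter may publish once the waiting period has elapsed.
        return reported_on + timedelta(days=self.embargo_days)


# Placeholder values, chosen only to make the example concrete.
policy = DisclosurePolicy(
    contact="security@example.com",
    embargo_days=90,
    in_scope=("unprompted offensive output", "safety filter bypass"),
    out_of_scope=("arithmetic errors", "offensive joke on explicit request"),
)

print(policy.is_in_scope("arithmetic errors"))
print(policy.earliest_public_date(date(2023, 3, 1)))
```

The point of writing it down this way is only that each question in the list (how to report, how long to wait, what counts) ends up with an explicit, checkable answer.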

If you work at a company that makes user-facing AI systems, I’m happy to chat and put you in touch with resources (people, expert consultation, or how to help convince your managers to prioritize this); send me a direct message or an email to my username at

If you have relevant experience in setting up bug bounty systems, or would like to be a cybersecurity resource for this sort of company, I’d also be happy to hear from you.

  1. ^

    Also known as full disclosure. Recent examples (of how to provoke Bing) don’t worry me, but I have seen some examples that seem like they shouldn’t be broadly shared until the creators have had a chance to patch the vulnerability.

  2. ^

    Both as RLHF data for the base model and as training examples for a ‘bad user’ model, if you’re creating one.

  3. ^

    I think there’s a lot of room for improvement in expectation-management about how aligned creators think their system is, and I think many commentators are making speculative inferences because there are no public statements.