TL;DR: Can we collect real-world examples of misalignment while preserving privacy?
Naively, startups that monitor agents for unintended behavior could publish the examples of misalignment they find, but I suspect almost no org would agree to this, since the examples might involve the org’s code or other private data.
Ideas:
Work with orgs/agents that don’t handle private info. Maybe open source repos?
Only share the scenarios with a few research centers (Redwood?)
Would orgs agree[1] to this?
When misalignment is found, ask the org if it’s OK to share it. If even 1% agree, it still might be nice.
Look up what created the culture of disclosing vulnerabilities in cyber security. Maybe that can be reproduced? (I do think the situation is very different here)
Sharing an anonymized/redacted version of the scenario
Seems a lot less useful to me (unless there is enough info to reproduce similar behavior), but it would be nice if we could get a lot of these
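To make the anonymized/redacted option concrete, here is a minimal Python sketch of what a first-pass automated redaction step could look like before a transcript is shared. The patterns and placeholder tags are illustrative assumptions, not any existing tool’s API:

```python
import re

# Illustrative patterns only -- a real redaction pass would need secret
# scanners, named-entity redaction, org-specific rules, and human review.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "API_KEY": re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9_]{16,}"),
    "FILE_PATH": re.compile(r"(?:/[\w.-]+){2,}"),
    "IP_ADDR": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
}

def redact(transcript: str) -> str:
    """Replace likely-private substrings with placeholder tags."""
    for label, pattern in PATTERNS.items():
        transcript = pattern.sub(f"[{label}]", transcript)
    return transcript

if __name__ == "__main__":
    example = (
        "Agent ran `rm -rf /srv/prod/db` after reading /home/alice/.env "
        "(key: sk_live_abcdef1234567890, contact: alice@acme.com, host: 10.0.0.12)"
    )
    print(redact(example))
    # -> Agent ran `rm -rf [FILE_PATH]` after reading [FILE_PATH]
    #    (key: [API_KEY], contact: [EMAIL], host: [IP_ADDR])
```

Even something this crude would probably need to be paired with the “ask the org” step above, since regex-style redaction will miss context-dependent identifiers.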
The Replit agent that deleted a production DB
I found this interesting because the user of Replit seems to want to disclose the problem.
I can imagine a norm (or a regulation?) where, after Replit fixes whatever problem they had, they release the LLM transcripts relevant to the incident.
(or maybe I’m wrong and this person wouldn’t want to share enough details for a full transcript)
Looking at Cursor’s Privacy Policy:
“We do not use Inputs or Suggestions to train our models unless: (1) they are flagged for security review (in which case we may analyze them to improve our ability to detect and enforce our Terms of Service)”
But who decides what gets flagged? And in the other cases, can Cursor do things with the data besides training their models?
Also, this is from their TOS:
“5.4. Usage Data. Anysphere may: (i) collect, analyze, and otherwise process Usage Data internally for its business purposes, including for security and analytics, to enhance the Service, and for other development and corrective purposes”
I’m not sure if this is a reliable hint for what orgs would agree to