If AI companies only do control-style monitoring of their internal deployments, they still need mitigations against their internal code getting subverted by their externally deployed AI.
For instance, if you only monitor your AI’s coding activities within the organization, you are vulnerable to an externally-deployed copy potentially supply-chain attacking itself via the outside world, such as by introducing vulnerabilities to packages that it believes are relied upon internally.
Then if the lab is not careful enough about what external code they rely upon, this is a way to get vulnerabilities into internal code without the coding agent writing them inside of the monitored channel.
One defense we have at our disposal: Even if the internal AI isn’t writing the vulnerabilities, it might still think about whether the vulns have been placed yet or generally think about the forthcoming schemes.
And so the scariest form is something like ‘the internal AI has an unvocalized policy like “let the external AI handle things,”’ where ‘CoT monitoring of the internal AI’ might not work.
This also applies to any sort of external attacker without necessarily requiring any collusion/coordination with the AI itself. (Though the AI itself doing it might make it easier for the AI to notice and exploit the issue.)
Another issue that shows up if you only have sufficient safeguards on internal deployment: the AI sets up an internal rogue deployment by calling to the external API or possibly even the external API of a differnet model provider. Like, the internal AI “misusing” publicly available AI APIs might be really helpful for evading monitoring by default.
If AI companies only do control-style monitoring of their internal deployments, they still need mitigations against their internal code getting subverted by their externally deployed AI.
For instance, if you only monitor your AI’s coding activities within the organization, you are vulnerable to an externally-deployed copy potentially supply-chain attacking itself via the outside world, such as by introducing vulnerabilities to packages that it believes are relied upon internally.
Then if the lab is not careful enough about what external code they rely upon, this is a way to get vulnerabilities into internal code without the coding agent writing them inside of the monitored channel.
One defense we have at our disposal: Even if the internal AI isn’t writing the vulnerabilities, it might still think about whether the vulns have been placed yet or generally think about the forthcoming schemes.
And so the scariest form is something like ‘the internal AI has an unvocalized policy like “let the external AI handle things,”’ where ‘CoT monitoring of the internal AI’ might not work.
This also applies to any sort of external attacker without necessarily requiring any collusion/coordination with the AI itself. (Though the AI itself doing it might make it easier for the AI to notice and exploit the issue.)
Another issue that shows up if you only have sufficient safeguards on internal deployment: the AI sets up an internal rogue deployment by calling to the external API or possibly even the external API of a differnet model provider. Like, the internal AI “misusing” publicly available AI APIs might be really helpful for evading monitoring by default.
You might be interested in this post How Unmonitored External Agents can Sabotage AI labs.
Thank you! I’m surprised that didn’t come up in my Claude-searching, very clearly related & I’m excited to read it