Rogue internal deployments via external APIs
tl;dr: A heavily-monitored internally deployed AI with sensitive access (e.g. to model weights) could bypass safety controls by asking a weakly-monitored external API version of itself (or another AI) what to do. It could then use its privileged access to execute instructions that the external API generated under less monitoring.
Once AI companies build powerful AIs, they may:
Give internal AIs access to sensitive internal privileges (e.g. access to the internal infra that touches model weights, help with training of the next generation of AIs, …);
Have mitigations specific to those internal deployments enforced via monitoring of the internal LLM API;
Have an external API deployment without those mitigations.
In this situation, an internal AI may build an agent scaffold that makes calls to the external API and uses the internal sensitive privileges. Such agent scaffold would be in a better position to cause a catastrophe because it would not be subject to the same monitoring as the internal API. I call this a rogue internal deployment via external APIs. (It is “internal” because the model weights and the agent scaffold never leave the cluster.)
I think preventing those is similarly important from a misalignment perspective as preventing LLM weight exfiltration (aka rogue external deployments) or regular rogue internal deployments where the AI hacks its own datacenter to run unmonitored LLM inference.
This post explores this threat as well as potential mitigations. Overall, I think this is a threat that is relatively hard to mitigate. It is potentially a central example of a threat that probably requires fancy control measures (not “just trusted monitoring” or “just security”, probably a combination of trusted monitoring with some dedicated security measures, or something that fully leverages untrusted monitoring).
Rogue internal deployments via external APIs
Let’s say a scheming superhuman AI researcher (SAIR) is running inside OpenMind, and OpenMind is monitoring its actions via a monitor looking at all the traffic that goes through the internal API. Some human and a lot of AI labor is spent ensuring that no SAIR generation that goes out of the internal API would sabotage the security of OpenMind or the alignment of the next generation of AIs.
But in contrast to AI2027, let’s imagine that the SAIR is also deployed externally not long after it started being deployed internally. The versions deployed externally are not used to modify the infra or align the next generation of OpenMind AIs but they are also less carefully monitored (more on why in “Monitoring the external API”).
In such a situation, a scheming SAIR running internally (on the internal API, with internal permissions) could spin up an internal job with sensitive internal permissions that makes queries to OpenMind’s external API (more on why it may have access to it in “Preventing access to external APIs”) and executes the actions suggested by the external API.
The internal SAIR also makes sure that in the prompt sent to the external SAIR there is enough information for the external SAIR to understand the situation. And now you have a situation where a barely-monitored external SAIR can write code and misleading research proposals that will be executed and acted on using the sensitive internal privileges of OpenMind—aka the SAIR successfully set up a rogue deployment!
A variation: rogue internal deployments via cross-company APIs
Note that this might be a concern even if OpenMind doesn’t have an external API, as long as another AI company (e.g DeepCent) has an external API that serves a frontier scheming model. In this case, instead of sending queries to the external OpenMind API, OpenMind’s SAIR could send queries to the DeepCent API.
While AI companies might try to prevent this just to avoid sending their secrets to their competitors, I don’t think this means this problem will be solved by default.
It does make the situation better that OpenMind AIs have much fewer plausible reasons to interact with DeepCent’s API than OpenMind’s API, but because DeepCent (and other actors that can wrap OpenMind or DeepCent’s API) might want to access OpenMind’s secrets, it means that scheming AIs might get a lot of help from outsiders setting up a scaffold that sends requests to an external LLM API.
A possible mitigation: preventing the creation of scaffolds that use external LLM APIs via monitoring
The intervention that seems most promising to me is to monitor the code that internal AIs write and block attempts at creating code that both accesses sensitive internal permissions and gives LLMs accessed via external APIs the ability to interact at length with sensitive internal permissions.
A variant of this could be to plug the monitoring and the control interventions in the agent scaffolds and the PR review system and use monitoring to ensure modification to these systems don’t unplug the monitor (and don’t redirect them to a less trusted LLM API).
This monitoring task can be somewhat difficult, but I am hopeful that certain security measures can help monitors do a decent job at this. For example:
With the right permissions, you could reduce the amount of generated code with both internet access and sensitive internal permissions, which helps focus monitoring efforts in the right place and reduce the effects of false positives elsewhere;
With enough API-key hygiene, you could make it difficult for an internal scaffold (with sensitive internal permissions and internet access) to find external LLM API keys “by accident” (and actively looking for them would be more suspicious).
This is not as effective when the LLM gets help from outside attackers that are trying to get AI company secrets since such attacker-provided endpoints might not even need API keys, but getting the instructions on how to access such external help might also be blatantly bad to a monitor.
While this prevents direct code/research agent sabotage, it does not prevent things like the external LLM API producing ideas for how to perform sabotage. This problem seems trickier to solve, and figuring out how big of an issue this is involves thinking about steganographic communication and encoded reasoning, which is out-of-scope for this post.
Why I am more pessimistic about other solutions
Monitoring the external API
I think monitoring external LLM APIs can help, but that is unlikely to be sufficient.
Why not just monitor the external API like you monitor the internal API?
Traffic understanding. You might not understand external LLM API traffic as well as internal traffic. If you want to know if some usage is red-teaming, or sandboxed usage, or sensitive usage, you can ask people inside the company, but you usually can’t ask external API users as easily (and it is harder to verify their claims).
Cost. This lack of understanding means it could be harder to focus monitoring effort in the right place and take more drastic and costly measures to prevent potentially catastrophic actions during the most sensitive usage. This can result in large costs, both in the amount of AI and human labor spent reviewing less sensitive parts of the traffic, and usefulness cost when applying interventions to parts of the traffic that are less sensitive. External traffic might also be much bigger than internal traffic (though it depends on the economics of intelligence explosions, and seems plausible only in longer timelines where the AI R&D returns to AI labor are small).
User privacy. User privacy might limit how much in-depth monitoring you can do. Companies often care a lot that AI companies can’t read their LLM usage, and this might be hard to change. This hinders human monitoring.
Coordination. If the external LLM API is built by DeepCent, it is hard for OpenMind to make sure DeepCent’s monitoring is good enough to prevent catastrophes from happening.
The main way I can see it as being sufficient is if all actors carefully monitor and block any usage that looks like it has access to internal resources of AI companies in order to avoid speeding up other AI companies—but that seems unlikely to me given the current trajectory, and it might be technically difficult to detect such traffic without preventing general software engineering and research.
Preventing access to external APIs
I think blocking access to external APIs is something that can help reduce the amount of code that is run that has the security permissions to access both the internet and sensitive internal permissions, but I would be surprised if it could drive it to something negligible.
Why not just block internet access or block access to the relevant websites?
There are malicious human actors that will help AIs get these permissions because routing such traffic would let you access valuable secrets. This means that targeted bans might not suffice.
Not giving AIs internet access would probably be a large tax on SAIR productivity—interacting with the internet is useful in many ways, especially in the world where there are external APIs (as the external world would also massively use large amounts of AI labor in ways which are almost surely relevant to staying competitive).
And even if you solved the problem of LLM API proxies, there are good reasons to sometimes give your SAIR access to external LLM APIs and sensitive internal permissions, for example when improving the external LLM API or when trying to learn from the API of other model providers. It might be possible to mostly eliminate this risk by splitting sensitive resources, but it is unclear how costly such division of sensitive resources would be.
Monitoring access to sensitive permissions
One last way you can prevent a catastrophe from happening is monitoring the usage of sensitive permissions. I think this is a valid way to reduce risk, and would probably be hard to bypass—which makes this “rogue” internal deployment via external APIs not totally rogue. But I’d still be surprised if it enabled reducing risk by more than 50%.
Final thoughts
Overall I find this particular threat somewhat concerning, though not less tractable than other forms of rogue internal deployments.
On a more meta-level, I find it concerning that I found this new way in which AIs could set up rogue deployments roughly a year after having first tried to catalog ways in which AIs could set up a rogue deployment.
I think this was predictable, since “Try to catalogue every way in which an AI could set up a rogue deployment” is an argument from exhaustive free association. I also predict that you will think of another one within a year. Conditional on you thinking of a new method, my gut says there’s a 1⁄2 chance that it was something you could have thought of a year ago, and a 1⁄2 chance that it’s based on some new empirical discovery about LLMs made between a year ago and when you think of it (a la “The model triggers emergent misalignment in a distilled mini version of itself.”)
My guess is that for control of things around human level, you should be able to use humans to find all the things that an LLM could have thought of. This is an update against long-in-advance small-scale efforts being good enough, not an update against the general methodology.
I think there are domains where you can actually point at a small category of things that actually capture the most important threats. I think this exercise was for me an update in the direction of rogue internal deployments having lots of important and non-trivial subcategories as opposed to few.
Re 1: LLMs have spiky capability levels which don’t line up with “human level” nicely. So I’m not convinced this is the case. For example, they can do extremely good author attribution, come up with seed prompts to induce “personas” in other instances of themself, etc.
Re 2: Maybe? I’m skeptical here. If the solution space contains elements like “Call an external API” then I expect there to be a Large (i.e. un-enumeratable) number of solutions.
At an intersection of the two: would we have predicted in advance something like “GPT 4o, which is still dumber than most humans, has a persuasive/charismatic persona which people become highly attached to, to the point where OpenAI could not shut it off without significant backlash.”
An internally-deployed AI working on code with employees will be interacting with employees in an intellectual capacity (collaborating on code and intellectual problems) which means it will have opportunities to gain favour/influence with them the same way 4o did with a bunch of casual readers (I don’t know if this was part of your threat model a year ago, FWIW).
(I had elements on persuasion which I think roughly covered what you are describing here. I also had elements on poisoning the data of future models which covers the “triggering emergent misalignment of AIs it trains” you describe above.)