Yes, all of them. Every channel through which the ASI can causally influence the outside world would ideally pass through the monitored interface. That would be the starting point of the whole architecture.
That said, the proposal doesn’t necessarily collapse if this is violated. It makes subsequent dangerous actions more likely to occur, but not inevitable, because the finiteness argument applies to the physical world broadly, not just the interface. Even if the ASI communicates through an undetected side channel, whatever entity receives that communication still has to take actions in the physical world. And ultimately catastrophic outcomes (engineered pathogens, nuclear access, infrastructure takeover) are the end products of long causal chains with observable physical steps, nonzero timescales, and points at which a vastly faster-thinking defensive agent can detect and intervene.
Lester Leong
Thanks for this, these are good objections and worth addressing carefully.
On confidence level. Fair point. I tried to frame this as a conjecture with falsifiable assumptions (A1-A8), but I can see the overall tone reading as more confident than warranted. I’ll think about adjusting the framing for the future.
On computational prohibitiveness. I think there might be a misunderstanding here, and if so it’s probably my fault for not making this clearer. The tractability claim depends heavily on the action granularity section. At document-level granularity, you’re right, the branching factor is intractable. But the architecture specifies bit-level granularity (which we control, since we design the interface), where the branching factor drops to 2. Exhaustive lookahead to depth 60 then requires roughly 10^18 operations, which is feasible at the speed ratios I’m proposing. Each bit carries at most 1 bit of strategic content, making it very hard to embed a coherent dangerous plan across millions of individually inspected actions. If you think the bit-level analysis breaks for a different reason, I’d be interested to hear it.
On instantaneous catastrophic actions. This is a good challenge, but I think it rests on a picture of the architecture that differs from what I’m proposing. The weaker AI evaluates actions before they pass through the interface, not after execution. The ASI has no direct access to nuclear switches or any external system. Every bit-level output is approved or vetoed before reaching anything outside. So the question isn’t “can we intercept a launch in time?” but “can we detect across millions of inspected bits that the trajectory is converging toward gaining access to those switches?” The long causal chain leading to the ASI having that ability is where the defense operates. That said, you’re touching on a real challenge: ensuring that the monitored interface is truly the only channel between the ASI and the world. Side channels (power consumption patterns, electromagnetic emissions, timing signals) are a genuine concern. But this is a well-studied problem in high-security computing with known engineering solutions (TEMPEST shielding, Faraday cages, optical isolation), and the weaker agent has enormous subjective time to continuously monitor for anomalous physical signatures that might indicate a side channel attempt. It’s demanding engineering, but the point I’m hoping to make is that it isn’t intractable, given a sufficiently fast supervisor.
On speed superintelligence. You’re right, this doesn’t apply to an already-running unconstrained system. It’s a proposal for how to build and operate ASI safely from the start (assumption A5), not a containment strategy applied after the fact. I agree this scope deserves clearer framing. Thanks for pushing on it.
Regarding the second issue (the point that the LLM may not know how to play at different time ratios) - I wonder whether this may be true for current LLM and DRL systems where inference happens in a single step (and where intelligence is derived mainly from training), BUT potentially not true for the next crop of reasoners, where a significant portion of intelligence is derived from the reasoning step that happens at inference time and which we are hoping to target with this scheme. One can imagine that sufficiently advanced AI, even if not explicitly trained for a given task, will eventually succeed at that task, by reasoning about it and extrapolating to new knowledge (as humans do) versus interpolating from limited training data. We could eventually have an “o2-preview” or “o3-preview” model that, through deductive reasoning and without any additional training, is able to figure out that it’s moving too slow to be effective, and adjust its strategy to rely more on grand strategy. This is the regime of intelligence that I think could be most vulnerable to a slowdown.
As to the first issue, there are lightweight RTS frameworks (eg, microRTS) that consume minimal compute and can be run in parallel, but without any tooling for LLMs specifically. I thought TextSC2 would be a good base because it not only offers this, but a few other advantages as well: 1) sufficiently deep action and strategy spaces that provide AI agents with enough degrees of freedom to develop emergent behavior, and 2) sufficient popularity whereby human evaluators can better understand, evaluate, and supervise AI agent strategy and behavior. With that said, if there are POMDPs or MDPs that you think are better fits, I would be very happy to check them out!
Thanks for the feedback. It would be great to learn more about your agenda and see if there are any areas where we may be able to help each other.
You’re absolutely correct. I’ve reached out to the original framework authors to confirm. I will be creating a PR for their repo as well as for the one that I’ve forked. I suspect this won’t change much about overall win/loss rates, but will be running a few tests here to confirm.
Thanks for the excellent feedback. I did consider action ratio at first, but it does have some slightly different considerations that made it a little challenging to do for an initial pass. The first is based on current limitations with the TextSC2 framework—there isn’t a way to obtain detailed action logs for the in-game AI the same way we can for our agents, so it would require an “agent vs agent” setup instead of “agent vs in-game AI”. And while TextSC2 supports this, it currently does not allow for real-time play when doing it (probably because “agent vs agent” setups require running two instances of the game at once on the same machine, which would cause performance degradation, and there isn’t any netcode to run multiplayer games over a network). With that said, SC2 is a 15 year old game at this point, and if someone has state-of-the-art hardware, it should be possible to run both instances with at least 40-50 fps, so this is something I would like to work on improving within the framework.
The second consideration is that not all actions are temporally equivalent, with some actions taking longer than others, and so it may not be a true “apples to apples” comparison if both agents employ different strategies that utilize different mixes of actions. We would probably either have to weight each action differently, or increase sample size to smooth out the noise, or both.
Regarding horizon lengths and scaling, I agree that this would be a great next direction of exploration, and suspect you may be correct regarding irreducible loss here. More broadly, it would be great to establish scaling laws that apply across different adversarial environments (beyond SC2). I think this could have a significant impact on a lot of the discourse around AI risk.
+
I do concede that in theory it should be possible to do this on standard computers, and that it might actually be a good way to test this hypothesis out and gather empirical data today.
Where I’m not so sure is whether even “slow” standard computers think slower than humans do. In other words, imagine some future AI architecture that is OOM more power and compute efficient. It may not be entirely unreasonable to suppose that we could get an algorithmic/architectural innovation that would enable GPT-4 level performance on old legacy hardware (eg, from the 80s). Indeed, at the unit level, we have calculators from the 60s that can out-multiply the fastest mental mathematicians today.
I thought of using virtualization/simulation to emulate a slow sandbox environment for the AI to run in, but as you mention, it does lower the barrier for an “escape”, creating risk that might be unacceptable when dealing with ASI.
With that said, being able to simulate this on current hardware does suggest a possible way to test this whole hypothesis out today. For example, we could have two different models: Model A, with higher parameter count and ELO score running on a slower environment, and Model B, with less parameters/lower ELO score running on a faster environment, and see if Model B is able to sufficiently supervise Model A on some task that would be difficult (or impossible) if both were running at the same speed.
I am human (as far as I can tell), but have been told my writing can be mechanical sometimes!
Regarding bit-level monitoring, yes, the AI may break down actions into harmless-seeming chunks. But the supervisor isn’t evaluating the chunk in isolation. Consider the following toy example, where we are representing an action in the form of a string that the AI outputs chunk-by-chunk:
“Send priority message to STRATCOM. Authorize execution of contingency plan 7-Alpha. Confirm release of strategic ____”
If the chunks are composed strategically (“Send priority message”, “to STRATCOM”, “Authorize execution of”, “contingency plan 7-Alpha”, etc), they are independently innocuous. But notice how the space of possibilities for the last remaining chunk has decreased substantially, given all the chunks that have already been received. It would not be difficult for the supervisor to enumerate this space, assign high probability to the last chunk being something like “nuclear weapons”, and then, critically, *before* this chunk is received, take action to harden against that outcome (perhaps by alerting humans, disabling the launch mechanism, etc), since it is significantly likely.
Could there be a chunk that ostensibly, instantaneously one-shots a catastrophe, without any prior context? Theoretically yes, but realistically probably highly unlikely, in my opinion.
Even in that case, the setup I had in mind is such that the supervisor monitors the interface, and the ASI never directly manipulates the physical world without supervisor approval. Even if “nuclear weapons” were output, the supervisor would veto it.
I’ll update the post to make these mechanisms more explicit. Thanks for the thoughtful pushback.