I believe there is a fundamental problem with the idea of a “non-agentic” world-model or other such oracle. The world is strongly predicted and compressed by the agents within it. To model the world is to model the plausible agents which might shape that world, and doing that, if you don’t already have a safe, benign oracle, invites anything from a wide variety of demonic fixed points to direct hacking of our world, should any of those agents get the bright idea of acting conditioned on being simulated (which, in an accurate simulation of this world, some should). Depending on how exactly your interpretability looks, it will probably help identify and prevent the simulation being captured by some such actors, but to get anything approaching actual guarantees, one finds oneself in the position of needing to solve value alignment again. I wrote a short post about this a while ago.
“Simulacrum escapees” are explicitly one of the main failure modes we’ll need to address, yes. Some thoughts:
The obvious way to avoid them is to not point the wm-synthesizer at a dataset containing agents.
If we’re aiming to develop intelligence-enhancing medical interventions or the technology for uploading, we don’t necessarily need a world-model containing agents: a sufficiently advanced model/simulator of biology/physics would suffice.
Similarly, if we want a superintelligent proof synthesizer we can use to do a babble-and-prune search through the space of possible agent-foundations theorems,[1] we only need to make it good at math-in-general, not at intuitive reasoning about agent-containing math.
This is riskier than biology/physics, though, because perhaps reasoning even about fully formal agent-foundations math would require reasoning about agents intuitively, i. e., instantiating them in internal simulation spaces.
Intuitively, “a simulated agent breaks out of the simulation” is a capability-laden failure of the wm-synthesizer. It is not functioning how it ought to; it is not succeeding at producing an accurate world-model. It should be possible to make it powerful enough to avoid that.
Note how, in a sense, “an agent recognizes it’s in a simulation and hacks out” is just an instance of the more general failure mode of “part of the world is being modeled incorrectly” (by e. g. having some flaws the simulated agent recognizes, or by allowing it to break out of the sandbox). To work, the process would need to be able to recognize and address those failure modes. If it’s sufficiently powerful, whatever subroutines it uses to handle lesser “bugs” should generalize to handling this type of bug as well.
With more insights into how agents work, we might be able to come up with more targeted interventions/constraints/regularization techniques for preventing simulacrum escapees. E. g., if we figure out the proper “type signature” of agents, we might be able to explicitly ban the wm-synthesizer from incorporating them in the world-model (a toy sketch of this follows after these points).
This is a challenge, but one I’m optimistic about handling.
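To gesture at what that “ban by type signature” idea could look like mechanically, here is a toy sketch. Everything in it is hypothetical: we don’t currently have a formal agent type signature, so `Component` and `looks_like_agent` are pure placeholders for whatever the wm-synthesizer’s building blocks and that future test would actually be.

```python
from typing import Callable, Iterable, List

# Hypothetical placeholders: `Component` stands for whatever the wm-synthesizer's
# basic world-model building blocks turn out to be, and `looks_like_agent` for a
# (currently nonexistent) formal test against the agent type signature.
Component = object

def strip_agentic_components(
    candidates: Iterable[Component],
    looks_like_agent: Callable[[Component], bool],
) -> List[Component]:
    """Keep only candidate components that do not match the hypothesized
    agent type signature; reject everything that does."""
    return [c for c in candidates if not looks_like_agent(c)]
```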
Weeping Agents: Anything that holds the image of an agent becomes an agent
Nice framing! But I somewhat dispute that. Consider a perfectly boxed-in AI, running on a computer with no output channels whatsoever (or perhaps as a homomorphic computation, i. e., indistinguishable from noise without the key). This thing holds the image of an agent; but is it really “an agent” from the perspective of anyone outside that system?
Similarly, a sufficiently good world-model would sandbox the modeled agents well enough that it wouldn’t, itself, engage in agent-like behavior from the perspective of its operators.
As in: we come up with a possible formalization of some aspect of agent foundations, then babble potential theorems about it at the proof synthesizer, and it provides proofs/disproofs. This is a pretty brute-force approach and by no means a full solution, but I expect it can nontrivially speed us up.
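For concreteness, here is a toy sketch of that loop. All of the interface names (`Formalization.random_conjecture`, `ProofSynthesizer.prove`, and so on) are hypothetical stand-ins, not an actual system; the point is just the shape of the babble-and-prune cycle.

```python
from typing import Dict, List, Optional, Protocol

class Conjecture(Protocol):
    ...

class Verdict(Protocol):
    is_proof: bool  # True for a proof, False for a disproof

class Formalization(Protocol):
    # Hypothetical generator of candidate theorems about the formalized notion.
    def random_conjecture(self) -> Conjecture: ...

class ProofSynthesizer(Protocol):
    # Hypothetical prover: returns a verdict, or None on timeout / no result.
    def prove(self, stmt: Conjecture) -> Optional[Verdict]: ...

def babble_and_prune(formalization: Formalization,
                     synthesizer: ProofSynthesizer,
                     n_candidates: int = 1000) -> Dict[str, List[Conjecture]]:
    """Babble candidate theorems, let the proof synthesizer prune them into
    proved / disproved / open piles."""
    results: Dict[str, List[Conjecture]] = {"proved": [], "disproved": [], "open": []}
    for _ in range(n_candidates):
        stmt = formalization.random_conjecture()  # "babble"
        verdict = synthesizer.prove(stmt)         # "prune"
        if verdict is None:
            results["open"].append(stmt)
        elif verdict.is_proof:
            results["proved"].append(stmt)
        else:
            results["disproved"].append(stmt)
    return results
```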
Yes, I agree that a physics/biology simulator is somewhat less concerning in this regard, but only by way of the questions it is implicitly asked, over whose answers the agents should have little sway. Still, it bears remembering that agents are emergent phenomena: they exist in physics and they exist in biology, modelled or otherwise. It also bears remembering that any simulation we build of reality is designed to fit a specific set of recorded observations, where agentic selection effects may skew data quite significantly in various places.
I also agree that the search through agent-foundations space seems significantly riskier in this regard for the reason you outlined and am made more optimistic by you spotting it immediately.
Agents hacking out is a failure mode in the safety sense, but not necessarily in the modelling sense. Hard breaks with expected reality which seem too much like an experiment will certainly cause people to act as though simulated, but there are plenty of people who either already act under this assumption or have protocols in place for cooperating with their hypothetical more-real reference class. They attempt to strongly steer us when modelled correctly. Of course, we probably don’t have an infinite simulation-stack, so the externalities of such manoeuvres would still differ layer by layer, and that does constitute a prediction failure, but it’s one that can’t really be avoided. The existence of the simulation must have an influence on this world, since it would otherwise be pointless; and they can’t be drawing their insights from a simulation of their own, since otherwise you lose interpretability in infinite recursion-wells; so the simulation must necessarily be disanalogous to here in at least one key way.
Finding the type signature of agents in such a system seems possible and, since you are unlikely to be able to simulate physics without cybernetic feedback, will probably boil down to the modelling/compression component of agenticity. My primary concern is that agentic systems are so firmly enmeshed with basically all observations we can make about the world (except maybe basic physics, and perhaps even that) that scrubbing or sandboxing them would result in extreme unreliability.
Thanks! The disagreement on whether the homomorphic agent-simulation-computation is an agent or not is semantic. I would call it a maximally handicapped agent, but it’s perfectly reasonable to call something without influence on the world beyond power-consumption non-agentic. The same is, however, true of a classically agentic program to which you give no output channel, and we would probably still call that code agentic (because it would be, if it were run in a place that mattered). It’s a tree falling in a forest and is probably not a concern, but it’s also unlikely that anyone would build a system they definitionally cannot use for anything.
It also bears remembering that any simulation we build of reality is designed to fit a specific set of recorded observations, where agentic selection effects may skew data quite significantly in various places.
Yup. I’ve been idly considering some sort of generator of synthetic data designed to produce training sets which we could mix into real data to provably obscure such signals.[1] It is maybe sort of doable for math, but probably not for physics/biology. (I commend your paranoia here, by the way.)
Overall, though, getting into this sort of fight with potential misaligned superintelligent agents isn’t a great idea; their possibility should be crushed somewhere upstream of that point.
The existence of the simulation must have an influence on this world, since it would otherwise be pointless; and they can’t be drawing their insights from a simulation of their own, since otherwise you lose interpretability in infinite recursion-wells; so the simulation must necessarily be disanalogous to here in at least one key way.
Mm-hm. My go-to heuristic here is to ask: how do human world-models handle this type of failure mode? Suppose we’re trying to model someone who gets access to a compute-unbounded oracle, asks it about the future, then takes some actions that depend on the answer, thereby creating a stable time loop. Suppose we care about accuracy, but we don’t have the unbounded compute to actually run this. We have to approximate.
Is modeling it as a sequence of nested simulations which terminates at some ground-floor simulation that doesn’t contain an oracle actually an accurate, faithful way to model this? Clearly no: the process we’re modeling explicitly doesn’t have any layers like this. The type signature of our “approximation” doesn’t match the original even approximately; it’s some completely unrelated thing.
Instead, we attempt to abstract over the whole process, and skip directly to the prediction of what happens in the limit of infinite nesting, without simulating anything. Same way we compute $f(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \dots$ not by actually calculating all those terms, but by collapsing the infinite power series to $\sin(x)$.
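(A quick numeric illustration of that move, with the truncated series standing in for “simulate every layer” and the closed form for “skip to the limit”:)

```python
import math

def sin_series(x: float, n_terms: int) -> float:
    """Truncated Taylor series: the 'sum term after term' approach."""
    return sum((-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(n_terms))

x = 1.3
print(sin_series(x, 3))   # partial sum with three terms: close, but not exact
print(sin_series(x, 10))  # more terms: converging toward the limit
print(math.sin(x))        # collapse the whole series to the closed form
```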
One thing to note here is that the kind of world-model I’m aiming for is not a “brute-force” simulator. It’s explicitly using abstractions as basic units, and it should be able to “skip to the limit” in the above manner. So, for a large class of situations, the potential simulacrum escapees would not be actually simulated at the level of gears; just abstractly approximated.
(Somewhat relevant: this discussion about brute-force-y “gormless” simulators that get scammed by acausal terrorists, vs. more sophisticated simulators. I’m aiming for the more sophisticated kind. This should also take care of said acausal attacks.)
It’s a tree falling in a forest and is probably not a concern, but it’s also unlikely that anyone would build a system they definitionally cannot use for anything.
Alright, suppose the agent inside the box is tasked with producing some proofs of work. They are automatically verified, type-checked, then shown on a small, low-tech digital display on the box. The software is mathematically verified to be correct/unhackable, the hardware is robustly designed not to allow any sequence of computation to cause a malfunction that would move the software into theoretically impossible states, and the display can only show letters/numbers.
The only thing the agent can do is provide or not provide proofs of work; it can’t hack out or show any message which is not a valid proof of work.
This way, you can extract useful work from the agent, while still not letting the type signature of the box become “an agent”. It is just a PoW-generator.
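For concreteness, here is a toy sketch of the gate between the boxed agent and the display. The checker and the character whitelist are hypothetical placeholders; the point is only that nothing which isn’t a verified, display-safe proof of work ever crosses the boundary.

```python
from typing import Optional

# Hypothetical placeholders: `verify_proof_of_work` stands in for the formally
# verified checker, ALLOWED_CHARS for the display's restricted character set.
ALLOWED_CHARS = set("0123456789"
                    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
                    "+-*/^_=()[]{}, .")

def verify_proof_of_work(candidate: str) -> bool:
    """Placeholder for the formally verified proof checker inside the trusted gate."""
    raise NotImplementedError

def gate_to_display(agent_output: str) -> Optional[str]:
    """Forward the boxed agent's output to the display only if it is a valid,
    display-safe proof of work; otherwise display nothing at all."""
    if not all(ch in ALLOWED_CHARS for ch in agent_output):
        return None
    if not verify_proof_of_work(agent_output):
        return None
    return agent_output  # the only channel out of the box
```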
Similarly, a simulation whose security trumps the capabilities of the agents it simulates can remain an “inert” simulation which just happens to house agents, rather than becoming an agent of its own.
Though the actual motivation there was figuring out if we can obscure information about our universe’s actual physics, biology, and human cognition, so that if something in the synthesis process goes wrong and it spawns misaligned agents, they’re less likely to be able to construct reliable Basilisk hacks. (Because if that failure mode is allowed, we can’t actually use the interpretability property to verify the synthesized world-model’s safety prior to running it.)