Thanks for the review. I think you’ve summarized the post fairly well; I’ll try to clarify some parts of the underlying model and claims that may have been unclear in the post.
A properly designed simbox will generate clear evidence about whether an AGI will help other agents, or whether it will engage in world conquest.
These two are not mutually exclusive: a fully aligned agent may still ‘conquer the world’ to prevent unaligned agents from doing the same. The real test is then what the agent does after conquering.
There’s a wide variety of threat models that we could worry about. A project that tries to defend against all of them is likely to be too slow for its results to matter.
Right—and the goal here isn’t to simulate all threats, but to create a testbed for alignment. If you have solved alignment-in-general then specific threat models no longer matter.
There seems to be a contradiction between these statements:
I’m fairly confident that toy worlds can be rich enough to train human-level AGI. I’ll assume that failures to reach human levels of generality imply a need for different AGI designs, and not much change to the simboxes.
My main concern is a trade-off between the realism needed to evaluate alignment, and the goal of ensuring that the AGI can’t break out of the simbox.
My biggest concern with simboxes is that they won’t get enough compute to simulate a sufficiently realistic environment.
I don’t see why alignment inherently needs more realism than intelligence, but I think your main point is that testing alignment requires sufficient quantity, quality, and perhaps diversity of other agents to compete/cooperate with. I agree with that.
How big should the simulated universes look to the AGI? Some early tests could be done on a universe whose physics only supports village-sized societies. But I’m worried that simulations with few apparent resources would cause the AGI to be less ambitious than it will be in our world.
If you accept that the main cost for a sim is compute (with intelligence obviously requiring compute), and that we are explicitly designing sims with dualistic physics (mind is a fully separate, non-matter-based phenomenon) where in-sim compute is not supported, for safety reasons, then most of the compute cost of the sim is for simulating intelligent agents, i.e. for intelligence in general.
The computational physics of GPUs are such that a single high-end GPU has roughly enough compute for human-brain-level intelligence, but comes up short on RAM by a few OOM. As a result, the practical efficient training strategies involve model-parallelism, where you train a large population of agents in parallel that share much of their params/weights. If new techniques allow us to use much less VRAM, that doesn’t change the picture much either; it just makes it easier to train large populations in parallel. So it doesn’t seem that much extra effort is required to get reasonably sized populations of agents; that’s just a natural outcome.
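To make those rough numbers concrete, here is a back-of-the-envelope sketch (my illustration, using commonly cited order-of-magnitude estimates that could easily be off by an OOM in either direction):

```python
import math

# Rough order-of-magnitude comparison: one high-end GPU vs. one human brain.
# All figures are coarse, commonly cited estimates (assumptions, not measurements).
brain_synapses = 1e14        # ~100 trillion synapses
brain_ops_per_s = 1e15       # synaptic events per second, order of magnitude
gpu_flops_per_s = 3e14       # ~300 TFLOP/s at low precision for a high-end GPU
gpu_vram_bytes = 8e10        # ~80 GB of on-board memory
bytes_per_param = 1          # assume ~1 byte per parameter / synapse-equivalent

ram_needed = brain_synapses * bytes_per_param
print(f"compute: GPU is ~{gpu_flops_per_s / brain_ops_per_s:.1f}x brain (same order of magnitude)")
print(f"RAM: need ~{ram_needed / 1e12:.0f} TB vs ~{gpu_vram_bytes / 1e9:.0f} GB on the GPU"
      f" -> short by ~{math.log10(ram_needed / gpu_vram_bytes):.0f} OOM")

# The natural workaround: keep one shared copy of the weights in VRAM and run a
# large batch of agents against it, so each extra agent only costs activation/state
# memory rather than another full copy of the weights.
agents_in_parallel = 1024
per_agent_state_bytes = 1e7   # assumed per-agent activation/state footprint
extra = agents_in_parallel * per_agent_state_bytes
print(f"{agents_in_parallel} agents sharing weights add only ~{extra / 1e9:.0f} GB of state")
```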
Also, the NPC population is rather unbounded (and this is already true in games). So creating simulated universes that appear as large as ours is possible without much extra compute (and there is a long tradition of games with vast procedurally generated universes).
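As a toy illustration of why apparent size is nearly free (a sketch of the standard procedural-generation trick, with made-up numbers): everything is derived deterministically from a seed, so only the regions an agent actually observes are ever computed.

```python
import hashlib

def star_system(universe_seed: int, x: int, y: int, z: int) -> dict:
    """Deterministically derive a star system's properties from its coordinates.

    Nothing is stored; any of an astronomically large number of coordinates can
    be generated on demand, so the apparent size of the universe adds no cost
    until somewhere is actually visited.
    """
    h = hashlib.sha256(f"{universe_seed}:{x},{y},{z}".encode()).digest()
    return {
        "has_star": h[0] < 16,                   # ~6% of cells contain a star
        "planets": h[1] % 9,                     # 0-8 planets
        "habitable": h[2] < 8 and h[1] % 9 > 0,  # rare habitable worlds
    }

# Only the handful of systems the agents actually look at are ever computed.
visited = [star_system(42, x, 0, 0) for x in range(5)]
print(visited)
```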
I also see some assumptions about the feasibility of modifying the AGI’s utility function. The utility function seems unlikely to be implemented in a way that makes this straightforward. Also, I expect the AGI will resist such changes if developers wait too long to try them.
The AGI in simboxes shouldn’t be aware of concepts like utility functions, let alone aware of the developers, or of the developers modifying their utility functions (or minds).
A key test of whether “LOVE in a Simbox” succeeds is how it handles mind uploading.
I agree—and there may be interesting ways to model analogs of uploading in later simboxes, but doing that without compute-capable physics may be more difficult.
To the extent that the AGI(s) replicate human-style virtue signaling, they’ll be inclined to manipulate humans into believing that the AGI(s) are making the right choices about uploading, and it will be hard to predict whether they’re getting it right, or whether they’re doing the equivalent of tiling the universe with smiling faces.
I’m somewhat optimistic that this problem can be solved, but I don’t have a satisfactory explanation of how to solve it.
I’m also fairly optimistic that it can be solved, at least to a degree sufficient that we shouldn’t differentially worry about aligned-in-simboxes AGI creating uploading technology and “tiling the universe with smiling faces” any more than we worry about humans creating that technology.
I don’t see why alignment inherently needs more realism than intelligence
I was focused solely on testing alignment. I’m pretty confused about how much realism is needed to produce alignment.
we are explicitly designing sims with dualistic physics (mind is a fully separate, non-matter-based phenomenon)
I guess I should have realized that, but it did not seem obvious enough for me to notice.
So it doesn’t seem that much extra effort is required to get reasonably sized populations of agents
I’m skeptical. I don’t know whether I’ll manage to quantify my intuitions well enough to figure out how much we disagree here.
The AGI in simboxes shouldn’t be aware of concepts like utility functions, let alone aware of the developers, or of the developers modifying their utility functions (or minds).
It seems likely that some AGIs would notice changes in behavior. I expect it to be hard to predict what they’ll infer.
But now that I’ve thought a bit more about this, I don’t see any likely path to an AGI finding a way to resist the change in utility function.