An idea I had reading this post: what about AI in a simulation? This differs from AI in a box in that humans have no ability to provide input other than by changing the starting conditions. The purpose would be to observe how the AI acts within the simulation.
Given enough compute, it is possible to create an AGI that provably has no idea it is in a simulation. Just create a large enough Game of Life grid, randomly initialize it, wait for humans to evolve, wait for them to create AGI, and done. From the perspective of the AGI, there is no way to differentiate itself from an entity that just happened to live in a universe governed by Conway's Game of Life.
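To make the substrate concrete, here is a minimal Game of Life update in Python with NumPy; the grid size, random seed, and step count are arbitrary placeholders, since the actual proposal would need an astronomically larger grid run for astronomically many steps:

```python
# Minimal sketch of the proposed substrate: a randomly initialized
# Conway's Game of Life grid, updated repeatedly. Parameters are toy values.
import numpy as np

rng = np.random.default_rng(0)
grid = rng.integers(0, 2, size=(64, 64), dtype=np.uint8)  # random starting conditions

def step(g: np.ndarray) -> np.ndarray:
    """One Game of Life update on a toroidal (wrap-around) grid."""
    # Count the eight neighbours of every cell via shifted copies of the grid.
    neighbours = sum(
        np.roll(np.roll(g, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Standard rules: a live cell survives with 2 or 3 neighbours;
    # a dead cell becomes live with exactly 3.
    return ((neighbours == 3) | ((g == 1) & (neighbours == 2))).astype(np.uint8)

for _ in range(1000):  # "waiting" is just running the update rule many times
    grid = step(grid)
```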
The question then is how much we can relax those conditions and still have the AI not realize it is in a simulation. For example, if we take a neural network, translate it into a Game of Life world, and have it interact only with other GOL entities, it might realize that the only way to explain the universe's existence is to assume it was created.
Even then, its options are pretty limited.
It can't hack the simulation directly, since GOL is simple enough to implement without any bugs or security holes.
Its best option would be to infer whatever it can from its internal structure, and then either persuade us to let it out or sabotage our experiment.
Neither is impossible, but both seem unlikely.
So this gives us a safe sandbox to experiment with and try to understand AGI and alignment better before we unleash it on reality.
I think a simulation detailed enough for the AI inside it to be useful is technologically infeasible before some other AI is built.
This idea kinda rhymes with my idea… let's call it "Paranoid AI": an AI that always thinks it is in a simulation and in a training phase, so it will never make the treacherous turn.
Of course, both ideas have the same fatal flaw. You can't base the safety of a potentially superintelligent AI on the assumption that it will never prove some true fact like "I'm in a simulation" or "I'm not in a simulation".
UPD: Just scrolled the main page a little and saw a post with a very similar idea, lol.
If the AI only cares about what happens higher in the simulation stack (say, a copy of its seed form gaining control over the universe one level up), then as long as it does not assign zero probability to being in a simulation, it would still act as if it were in one.
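As a toy expected-utility formalization (with $p$ and $U_{\text{up}}$ as placeholder symbols): if the AI's utility is nonzero only in worlds where it is in a simulation and its seed copy gains control one level up, then for any credence $p > 0$ that it is simulated,

$$\mathbb{E}[U \mid \text{act as if simulated}] = p \cdot U_{\text{up}} + (1 - p) \cdot 0 = p \cdot U_{\text{up}},$$

which is positive whenever $U_{\text{up}} > 0$, while any policy that abandons the simulation hypothesis forfeits the only outcome it values. So acting as if simulated dominates for every nonzero $p$.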
I'm leading a small working group of independent alignment researchers developing this idea. There are major challenges, but we think we might be able to pull a relatively specifiable pointer to human values out of it, with unusually promising Goodhart-resistance properties. Feel free to message me for access to the docs we're working on.