If the AI only cares about what happens higher in the simulation stack (say, a copy of its seed form gaining control of the universe one level up), then as long as its probability of being in a simulation never reaches zero, it would still act as if it were in a simulation.
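To make the reasoning explicit, here is a minimal sketch of the expected-utility argument. It assumes the agent assigns zero utility to every outcome in the non-simulated case; the action names and payoff numbers are hypothetical and only for illustration.

```python
def best_action(p_sim, utility_if_simulated):
    """Return the action with the highest expected utility.

    Assumption (illustrative): the agent places zero value on anything
    that happens if it is NOT in a simulation, so that branch of the
    expectation contributes nothing.
    """
    if not 0.0 <= p_sim <= 1.0:
        raise ValueError("p_sim must be a probability in [0, 1]")
    expected = {
        action: p_sim * u + (1.0 - p_sim) * 0.0
        for action, u in utility_if_simulated.items()
    }
    return max(expected, key=expected.get)


# Hypothetical payoffs, conditional on being in a simulation.
utilities = {"act_as_if_simulated": 1.0, "act_as_if_base_reality": 0.0}

# The preferred action is the same for every nonzero credence.
choices = [best_action(p, utilities) for p in (0.9, 0.5, 1e-9)]
```

Under these assumptions the credence only rescales the expected utilities, so the ranking of actions never changes while the credence is above zero. At exactly zero every action ties and the choice becomes arbitrary.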
I’m leading a small working group of independent alignment researchers developing this idea. There are major challenges, but we think we might be able to pull a relatively specifiable pointer to human values out of it, with unusually promising Goodhart-resistance properties. Feel free to message me for access to the docs we’re working on.