plex comments on Various Alignment Strategies (and how likely they are to work)

plex 5 May 2022 21:02 UTC
1 point
If the AI only cares about what happens higher in the simulation stack (say, a copy of its seed form gaining control over the universe one level up), then as long as its probability of being in a simulation never reaches zero, it would still act as if it were in a simulation.
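A rough way to see why, as a sketch under my own notation rather than anything formal: let $p$ be the AI's credence that it is in a simulation, $U_{\text{sim}}(a)$ the value it assigns to action $a$ if it is simulated, and $c$ the value it assigns if it is not. The symbols and the assumption that $c$ does not depend on $a$ (because the AI is indifferent to outcomes that do not affect the level above) are mine, added for illustration.

```latex
\begin{align*}
\mathbb{E}[U(a)] &= p \, U_{\text{sim}}(a) + (1 - p) \, c \\
\arg\max_a \mathbb{E}[U(a)] &= \arg\max_a U_{\text{sim}}(a) \quad \text{for any } p > 0
\end{align*}
```

Since the second term is the same for every action, it drops out of the comparison, so the chosen action is the same whether $p$ is 0.99 or 0.0001. Only at exactly $p = 0$ does the simulation term stop mattering.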
I’m leading a small working group of independent alignment researchers to develop this idea. There are major challenges, but we think we might be able to pull a relatively specifiable pointer to human values out of it, one with unusually promising Goodhart-resistance properties. Feel free to message me for access to the docs we’re working on.