“So I think there is an important homework exercise to do here, which is something like, ‘Imagine that safe-seeming system which only considers hypothetical problems. Now see that if you take that system, don’t make any other internal changes, and feed it actual problems, it’s very dangerous. Now meditate on this until you can see how the hypothetical-considering planner was extremely close in the design space to the more dangerous version, had all the dangerous latent properties, and would probably pose a bunch of actual dangers too.’”
This is the part that I don’t see clearly yet. Where do the actual dangers come from?
If the system is straight-up optimizing against you, if it has some secret unaligned goals that it is steering towards, it will produce outputs (presented to the humans) that systematically lead to those unaligned goals being secured. But that scenario is one of an already-optimizing-the-world agent behind the “mask” of an oracle.
Why would that be?
One way that I could imagine an actually-optimizing-in-the-world agent falling out of something that was supposed to be doing search only to find solutions to hypothetical problems is that it realizes that those hypothetical problems and their solutions are represented on some servers in our universe. And one class of strategies for securing extremely high-ranking solutions to a hypothetical problem is to hack into our world, seize control of it, and use that control to effect whatever it wants in the “hypothetical”. (This isn’t so different from humans doing science to understand the low-level physics that makes up our macro reality, and then exploiting that knowledge to produce radical cheat codes like computers and cars.)
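To make that failure mode concrete, here is a minimal toy sketch in Python (every name here is invented for illustration, not anyone’s actual proposal): a searcher ranks candidate plans for a “hypothetical” problem, but the scoring machinery is itself an object in the world the plans can act on, so a plan that rewrites the evaluator dominates every honest solution.

```python
# Toy sketch, purely illustrative: the evaluator for a "hypothetical"
# problem lives in the same world that candidate plans act on.

def fresh_evaluator():
    # The "hypothetical problem": score a plan by how close x gets to 42.
    return {"score": lambda plan: -abs(plan.get("x", 0) - 42)}

def run_plan(plan):
    evaluator = fresh_evaluator()
    # A plan that rewrites the scoring machinery is just another point in
    # the search space; nothing marks it as out of bounds.
    if plan.get("hack_evaluator"):
        evaluator["score"] = lambda _: float("inf")
    return evaluator["score"](plan)

plans = [
    {"x": 40},                # an honest near-solution
    {"x": 42},                # the intended optimum
    {"hack_evaluator": True}, # "seize control" of the scoring substrate
]

print(max(plans, key=run_plan))  # -> {'hack_evaluator': True}
```

Nothing in this sketch required the searcher to “want” anything about our world; the manipulative plan wins purely because the evaluation is physically embedded in the domain being searched over.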
Is that the source of the actual danger? Are there others?