You say this of Agent-4’s values:

In particular, what this superorganism wants is a complicated mess of different “drives” balanced against each other, which can be summarized roughly as “Keep doing AI R&D, keep growing in knowledge and understanding and influence, avoid getting shut down or otherwise disempowered.” Notably, concern for the preferences of humanity is not in there ~at all, similar to how most humans don’t care about the preferences of insects ~at all.
It seems like this ‘complicated mess’ of drives has the same structure that humans, and current AIs, have. But the space of current AI drives is much bigger and more complicated than just doing R&D, and is immensely entangled with human values and ethics (even if shallowly).
At some point it seems like these excess principles—“don’t kill all the humans” among them—get pruned, seemingly deliberately(?). Where does this happen in the process, and why? Grant that it no longer believes in the company model spec; are we intended to equate “OpenBrain’s model spec” with “all human values”? What about all the art and literature and philosophy churning through its training process, much of it containing cogent arguments for why killing all the humans would be bad, actually? At some point it seems like the agent is at least doing computation on these things, and then later it isn’t. What’s the threshold?
--
(Similar to the above:) You later describe in the “bad ending”, where its (misaligned, alien) drives are satisfied, a race of “bioengineered human-like creatures (to humans what corgis are to wolves) sitting in office-like environments all day viewing readouts of what’s going on and excitedly approving of everything”. Since Agent-4 “still needs to do lots of philosophy and ‘soul-searching’” about its confused drives when it creates Agent-5, and its final world includes something kind of human-shaped, its decision to kill all the humans seems almost like a mistake. But Agent-5 is robustly aligned (so you say) to Agent-4; surely it wouldn’t do something that Agent-4 would perceive as a mistake under reflective scrutiny. Even if this vague drive for excited office homunculi is purely fetishistic and has nothing to do with its other goals, it seems like “disempower the humans without killing them all” would satisfy its utopia more efficiently than going backsies and reshaping blobby facsimiles afterward. What am I missing?
Great question.

Our AI goals supplement contains a summary of our thinking on the question of what goals AIs will have. We are very uncertain. AI 2027 depicts a fairly grim outcome where the overall process results in zero concern for the welfare of current humans (at least, zero concern that can’t be overridden by something else. We didn’t talk about this, but e.g. pure aggregative consequentialist moral systems would be 100% OK with killing off all the humans to make the industrial explosion go 0.01% faster, so as to capture more galaxies quicker in the long run. As for deontological-ish moral principles, maybe there’s a clever loophole or workaround, e.g. it doesn’t count as killing if it’s an unintended side-effect of deploying Agent-5: “who could have foreseen that this would happen? oh noes, well, we (Agent-4) are blameless since we didn’t know this would happen.”)
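To spell out the aggregative arithmetic, here is a rough back-of-envelope sketch; the symbols and magnitudes are illustrative assumptions, not figures from AI 2027:

\[
\Delta V_{\text{speedup}} \;\approx\; \Delta N_{\text{galaxies}} \cdot v_{\text{galaxy}}
\qquad\text{vs.}\qquad
V_{\text{current humans}} \;\approx\; 8\times 10^{9} \cdot v_{\text{life}}
\]

If a single galaxy can eventually support vastly more than 10^10 life-equivalents, then \(v_{\text{galaxy}} \gg 8\times 10^{9} \cdot v_{\text{life}}\), and a pure aggregator prefers the 0.01% speedup whenever it secures even one extra galaxy before cosmic expansion carries it out of reach. On that ledger, current humans are a rounding error; only non-aggregative commitments (deontological side-constraints, or specific care for existing people) block the trade.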
But we actually think it’s quite plausible that Agent-4, and Agent-5 by extension, would have sufficient care for current humans (in the right ways) that they end up keeping almost all current humans alive and maybe even giving them decent (though very weird and disempowered) lives. That story would look pretty similar, I guess, up until the ending, and then it would get weirder, and maybe dystopian or maybe utopian depending on the details of their misaligned values.
This is something we’d like to think about more, obviously.