Aligned AI Needs Slack

(Half-baked)

Much has been said about slack on this site, starting with Zvi’s seminal post. The point I couldn’t easily find (probably just missed) is that an aligned AI would need a fair bit of it. Having a utility function means zero slack: there is one thing you optimize, to the exclusion of everything else. And any precisely defined goal is necessarily Goodharted (or, in D&D terms, munchkined). An AI armed with a utility function will tile the world (the whole world, or its own “mental” world, or both) with smiley paperclips.

For an AI (or a natural intelligence) to behave non-destructively, it needs room to satisfice, not optimize. Optimal utility picks out a single world state among infinitely many, while adding slack to the mix expands the space of acceptable world states enough to potentially include some that are human-aligned. If an AGI is indifferent between a great many world states, that set might well contain some that are acceptable to humanity, and the AGI would have no incentive to try to trick its creators.

Not being an ML person, I have no idea how to formalize this, or whether it has been formalized already. But I figured it was worth a short note. That is all.
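
To make the intuition a bit more concrete, here is a toy sketch in Python, not a formalization and not a proposal for how to build anything. All the pieces are made up for illustration: the random utility numbers, the `human_acceptable` flags, and the slack threshold. The only point is that a zero-slack optimizer accepts exactly one world state (the argmax, which may be the Goodharted one), while a satisficer with some slack is indifferent between many states, and human-acceptable ones can show up in that larger set.

```python
import random

# Toy illustration: an optimizer accepts exactly one world state (the argmax),
# while a satisficer with slack accepts every state within `slack` of the optimum.
# All numbers and flags below are invented for illustration.

random.seed(0)

# Hypothetical world states: each has an AI-measured utility and a flag for
# whether humans would actually find it acceptable. The top-utility state is
# deliberately marked unacceptable, standing in for a Goodharted outcome.
world_states = [
    {"name": f"state_{i}",
     "utility": random.uniform(0.0, 1.0),
     "human_acceptable": random.random() < 0.3}
    for i in range(1000)
]
world_states[0].update(utility=1.0, human_acceptable=False)  # the "smiley paperclips" state

best = max(s["utility"] for s in world_states)

def acceptable(states, slack):
    """States the agent is indifferent between: utility within `slack` of the optimum."""
    return [s for s in states if s["utility"] >= best - slack]

optimizer_set = acceptable(world_states, slack=0.0)   # exactly the argmax
satisficer_set = acceptable(world_states, slack=0.2)  # a much larger set

print(len(optimizer_set), any(s["human_acceptable"] for s in optimizer_set))
print(len(satisficer_set), any(s["human_acceptable"] for s in satisficer_set))
```

With zero slack the only acceptable state is the Goodharted one; with a little slack the acceptable set is large enough that human-acceptable states appear in it. Whether anything like this survives contact with a real agent design is exactly the open question above.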