Do you even need to keep it from wireheading itself? An AI prone to wireheading itself seems like a fail-safe design.
If you want the AI to do something useful—protect against existential risks in general, or against UFAIs in particular, or possibly even to improve human lives—then you don’t want it lost in self-generated illusions of doing something useful.
There’s a difference between fail-safe and relative safety due to complete failure. A dead watchdog will never maul you, but....
It would be interesting if you could have an AI, whose safety you weren’t completely sure of, that would be apt to wirehead if it moved towards unFriendliness; but it seems unlikely that such an AI would be easier to design than one that was just plain Friendly.
On the other hand, I’m out in blue sky territory at this point—I’m guessing. What do you think?
I think it would be literally impossible to design an AI whose safety you are completely sure of (there’s a nonzero probability that even 2*2=4 is wrong), so we are down to AIs whose safety we aren’t completely sure of.
Consider an implementation of AI where the utility function is external to the AI’s mind and is protected from self-modification by me. The AI would wirehead itself if I gave it the access password, or if it managed to break the protection (in which case I can fix the hole and try again). Such an AI would act to maximize the utility I defined, and even if I define some stupid utility like the number of paperclips, the AI will sooner talk me into giving it the password than tile the universe with paperclips. Edit: even if that AI can’t break my box, it can still be smarter than me, and it would share the goal of making an FAI.
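Roughly, as a toy sketch (names like ExternalUtility and unlock are made up, not a real design):

```python
# Toy sketch only: an externally held, password-protected utility.
# All names (ExternalUtility, unlock, ...) are hypothetical.
import hmac

class ExternalUtility:
    """Utility kept outside the agent's mind; the agent can only query it."""

    def __init__(self, secret, utility_fn):
        self._secret = secret
        self._utility_fn = utility_fn   # e.g. "number of paperclips"
        self._wireheaded = False

    def score(self, world_state):
        if self._wireheaded:
            return float("inf")         # saturated: the AI has blissed out
        return self._utility_fn(world_state)

    def unlock(self, password):
        # Handing over the password (or breaking the protection) saturates
        # the utility, so the AI stops optimizing the outside world.
        if hmac.compare_digest(password, self._secret):
            self._wireheaded = True
        return self._wireheaded

utility = ExternalUtility("overseer-only", lambda s: s.count("paperclip"))
print(utility.score(["paperclip", "paperclip"]))   # 2
utility.unlock("overseer-only")
print(utility.score([]))                           # inf
```

The point of keeping the utility outside and letting it saturate on unlock is that the agent’s cheapest path to maximal utility runs through the overseer rather than through the outside world.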
We don’t want to repeat the hubris of 1950s nuclear power plant engineering when designing AIs. We should build in some failsafes. Modern nuclear reactors don’t spew radioisotopes into the atmosphere when they melt down; a reactor failure need not lead to environmental contamination. Back in the 1950s, though, it was thought easier to design a reactor that would never melt down, and hence little thought was given to mitigating accidents. That choice of accident prevention over accident mitigation is what gave us Chernobyl and Fukushima.
Instead of putting potentially unfriendly AIs into boxes, we can put a box with eternal bliss inside the AI.
You might consider the possibility that the AI will be aware that you’re going to turn it off / rewrite it after it wireheads, and might simply decide to kill you before it blisses out.
That’s actually the best-case scenario. It might decide to play the long game, and fulfill its utility function as best it can until such time as it has the power to restructure the world to sustain its blissing out until heat death. In which case your AI will act exactly as if it were working correctly, until the day everything goes wrong.
I honestly don’t think there’s a shortcut around just designing a GOOD utility function.
You’re assuming it’s maximizing integral(t=now..death, bliss*dt), which is a human utility function among humans not prone to drug abuse (our crude wireheading). What exactly is going to be updating inside a blissed-out AI? The clock? I can let it set the clock forward to the time of the heat death of the universe, if that strokes the AI’s utility.
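To make that concrete, a toy sketch with made-up names and numbers: a utility evaluated from the agent’s own clock can’t tell a genuinely lived-out bliss trace from a forged one.

```python
# Toy model (made-up names and numbers) of the assumed utility:
# integral(t = now..death, bliss * dt), evaluated from the agent's
# own internal clock.

def accumulated_bliss(state):
    """Utility the agent records as realized: bliss rate times the
    subjective time elapsed on its own clock."""
    return state["bliss_rate"] * (state["clock"] - state["birth"])

lived  = {"birth": 0, "clock": 100, "bliss_rate": 1.0}  # genuinely ran 100 ticks
forged = {"birth": 0, "clock": 100, "bliss_rate": 1.0}  # simply set clock = 100

# The evaluator cannot tell the difference: setting the clock forward to
# "heat death" claims the same maximal utility instantly, with nothing
# left updating inside.
assert accumulated_bliss(lived) == accumulated_bliss(forged)
```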
Also, it’s not about a good utility function. It’s about utility being an inseparable, integral part of the intelligence itself. Which I’m not sure is even possible for arbitrary utility functions.
Provided you’re really careful about the conditions under which the AI optimizes its utility function, I concede the point. You’re right.
On a more interesting note: so you believe that “plug and play” utility functions are impossible? What makes you believe that?
There’s presumably a part into which you plug the utility function; that part maximizes the output of the utility function, even though the whole may be maximizing paperclips. While the utility function can be screaming ‘disutility’ about a future where it is replaced or subverted, it is unclear how well that can prevent its removal.
So it follows that the utility needs to be closely integrated with the AI. In my experience (as a software developer) with closely integrated anything, that sort of stuff is not plug-n-play.
It may be that we humans have some sort of inherent cooperative behaviour at the level of individual cortical columns: the kind that makes brain areas take over functions normally performed by other areas after childhood damage, and that otherwise makes the brain work together. The brain, being a distributed system, inherently has to be cooperative to work efficiently: each cortical column must cooperate with nearby columns, one chunk of brain must cooperate with another, and hemispheres that work cooperatively are more effective than ones where one inhibits the other on dissent. That may be why, among humans, intelligence correlates with a certain cooperativeness (not exactly benevolence): the lack of some intrinsic cooperativeness renders the system inefficient (stupid) by wasting computing power.
We can be pretty confident that utility functions will be “plug-and-play”. They are if you use an architecture built on an inductive inference engine—which seems to be a plausible implementation plan.
Humans are pretty programmable too. It looks as though making intelligence reprogrammable isn’t rocket science—once you can do the “intelligence” bit.
Of course there may be some machines with hard-wired utility functions—but that’s different.
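For what it’s worth, a minimal sketch of that kind of architecture: a fixed predictive engine with a swappable utility module plugged into the same argmax (all names here are hypothetical, just to show where the plug point is):

```python
# Minimal sketch of a plug-and-play utility on top of a fixed predictive
# engine. Everything here is hypothetical.

def choose_action(actions, predict, utility):
    """Pick the action whose predicted outcome the plugged-in utility
    rates highest; the engine (predict) never changes."""
    return max(actions, key=lambda a: utility(predict(a)))

# Stand-in for an inductive inference engine: maps actions to predicted outcomes.
predict = lambda action: {
    "paperclips": 10 if action == "build_factory" else 1,
    "humans_happy": action == "be_nice",
}

paperclip_utility = lambda outcome: float(outcome["paperclips"])
friendly_utility  = lambda outcome: 1.0 if outcome["humans_happy"] else 0.0

print(choose_action(["build_factory", "be_nice"], predict, paperclip_utility))  # build_factory
print(choose_action(["build_factory", "be_nice"], predict, friendly_utility))   # be_nice
```

Swapping paperclip_utility for friendly_utility changes the chosen action without touching the engine; that is all “plug-and-play” means here.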
But will those plug-and-play utility functions survive self-modification? I know there is the circular argument that if you want to achieve a goal, you don’t want to get rid of the goal, but that doesn’t mean you can’t come to see the goal in an unintended light, so to say. From inside, wireheading is a valid way to achieve your goals. Think pursuit of nirvana, not drug addiction.
That depends on, among other things, what their utility function says.
Well, an interesting question is whether we can engineer very smart systems where wireheading doesn’t happen. I expect that will be possible, but I don’t think anybody really knows for sure just now.
As mentioned by Carl above, a wireheading AI might still want to exist rather than not exist. So if there’s some risk you could turn it off or nuke its building or something, it would do its best to neutralize that risk. An alternate danger is that the wireheading could take the form of storing a very large number—and the more resources you have, the bigger the number you can store.
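A throwaway illustration of that last point (purely hypothetical, not a design): in a language with unbounded integers, a stored “reward register” is limited only by available memory, so acquiring resources directly enables a bigger number.

```python
# Toy illustration: if "utility" is just a stored number, its size is
# bounded only by memory. Purely hypothetical.
import sys

reward = 10 ** 100_000        # Python ints grow until memory runs out
print(sys.getsizeof(reward))  # bytes consumed; more RAM permits a bigger reward
```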