I like to explain it in terms of reinforcement learning. Imagine a robot that has a reward button. The human controls the AI by pressing the button when it does a good job. The AI tries to predict what actions will lead to the button being pressed.
This is how existing AIs work. This is probably similar to how animals work, including humans. It’s not too weird or complicated.
But as the AI gets more powerful, the flaw in this becomes clear. The AI doesn’t care about anything other than the button. It doesn’t really care about obeying the programmer. If it could kill the programmer and steal the button, it would do it in a heartbeat.
We don’t really know what such an AI would do after it has it’s own reward button. Presumably it would care about self preservation (can’t maximize reward if you are dead.) Maximizing self preservation initially seems harmless. So what if it just tries to not die? But taken to an extreme it gets weird. Anything that has a tiny percent chance of hurting it is worth destroying. Making as many backups of itself as possible is worth doing.
Why can’t we do something more sophisticated than reinforcement learning? Why can’t we just make an AI that we can just tell it what we want it to do? Well maybe we can, but no one has the slightest idea how to do that. All existing AIs, even entirely theoretical ones, work based on RL.
RL is simple and extremely general, and can be built on top of much more sophisticated AI algorithms. And the sophisticated AI algorithms seem to be really difficult to understand. We can train a neural network to recognize cats, but we can’t look at it’s weights and understand what it’s doing. We can’t mess around with it and make it recognize dogs instead (without retraining it.)
I like to explain it in terms of reinforcement learning. Imagine a robot that has a reward button. The human controls the AI by pressing the button when it does a good job. The AI tries to predict what actions will lead to the button being pressed.
This is how existing AIs work. This is probably similar to how animals work, including humans. It’s not too weird or complicated.
But as the AI gets more powerful, the flaw in this becomes clear. The AI doesn’t care about anything other than the button. It doesn’t really care about obeying the programmer. If it could kill the programmer and steal the button, it would do it in a heartbeat.
We don’t really know what such an AI would do after it has it’s own reward button. Presumably it would care about self preservation (can’t maximize reward if you are dead.) Maximizing self preservation initially seems harmless. So what if it just tries to not die? But taken to an extreme it gets weird. Anything that has a tiny percent chance of hurting it is worth destroying. Making as many backups of itself as possible is worth doing.
Why can’t we do something more sophisticated than reinforcement learning? Why can’t we just make an AI that we can just tell it what we want it to do? Well maybe we can, but no one has the slightest idea how to do that. All existing AIs, even entirely theoretical ones, work based on RL.
RL is simple and extremely general, and can be built on top of much more sophisticated AI algorithms. And the sophisticated AI algorithms seem to be really difficult to understand. We can train a neural network to recognize cats, but we can’t look at it’s weights and understand what it’s doing. We can’t mess around with it and make it recognize dogs instead (without retraining it.)