What are your thoughts on this AI failure mode? Assume an AI works by rewarding itself when it improves its model of the world (roughly Schmidhuber’s curiosity-driven reinforcement learning approach to AI). However, the AI figures out that it can also receive reward by turning this sort of learning on its head: instead of changing its model to make it better fit the world, it starts changing the world to make it better fit its model.
Has this been considered before? Can we see this occurring in natural intelligence?
One might call this ‘cleaning’ or ‘homogenizing’ the world; instead of trying to get better at predicting the variation, you try to reduce the variation so that prediction is easier.
I don’t think I’ve seen much mathematical work on this, and very little that discusses it as an AI failure mode. Most of the discussions I see of it as a failure mode have to do with markets, globalization, agriculture, and pandemic risk.
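One way to make “reduce the variation” precise (my own framing, not something from the posts above): write the expected log-loss of the agent’s predictive model $q$ against the world’s distribution $p$ as

$$\mathbb{E}_{x \sim p}\big[-\log q(x)\big] \;=\; H(p) \;+\; D_{\mathrm{KL}}(p \,\|\, q).$$

Learning can only drive the KL term to zero; the entropy $H(p)$ is a floor on how predictable the world can ever be. An agent that can act on the world has a second lever: lower the floor itself by collapsing $p$ toward a point mass, which is exactly the ‘cleaning’ described above.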
Isn’t it basically the definition of agency? Steering the world state toward the one you want?
The problem is that in this specific case “the world state you want” is more or less defined as something that is easy to model (because you are rewarded when your model of the world improves), which may give you an incentive to destroy exceptionally complicated things… such as life.
It would be a form of agency, but probably not the definition of it. In the curiosity-driven approach, the agent is thought to choose actions such that it can gain reward from learning new things about the world, thereby compressing its knowledge of the world further (possibly overlooking that the same reward could also be gained by making the world better fit the current model of it).
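To make that concrete, here is a deliberately minimal toy sketch (a simplification I’m assuming, not Schmidhuber’s actual compression-progress measure): reward any reduction in model–world mismatch, and notice that the reward signal is identical whether the model moves toward the world or the world is pushed toward the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the world is 20 varying quantities, the agent
# holds a prediction for each, and reward is any drop in mismatch.
world = rng.normal(3.0, 2.0, size=20)  # the world's actual values
model = np.zeros(20)                   # the agent's (bad) predictions

def mismatch(model, world):
    """Mean squared model-world disagreement."""
    return float(np.mean((model - world) ** 2))

def learn_step(model, world, lr=0.5):
    """Intended route: move the model toward the world."""
    return model + lr * (world - model), world

def homogenize_step(model, world, lr=0.5):
    """Degenerate route: move the world toward the model."""
    return model, world + lr * (model - world)

for route in (learn_step, homogenize_step):
    m, w = route(model.copy(), world.copy())
    reward = mismatch(model, world) - mismatch(m, w)
    print(route.__name__, "reward:", round(reward, 3))
```

Both routes print the same reward, which is the whole problem: the scalar objective by itself cannot distinguish curiosity from cleaning, so any asymmetry between the two has to come from constraints outside the reward.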
The best illustrative example I can think of right now is an AI that falsely assumes that the Earth is spherical and decides to flatten the equator instead of updating its model.