What if an AI was rewarded for being more predictable to humans? Give it a primary goal (make more paperclips!) but also a secondary goal: minimize the prediction error of its human overseers, with its overall utility defined as the minimum of these two utilities. This is almost certainly horribly wrong somehow, but I don't know how. The idea, though, is that the AI would never take an action that a human couldn't predict it would take. Though, if the humans predicted it would try to take over the world, that's kind of a problem… this idea is more like a quarter baked than a half lol.
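
Roughly, a toy sketch of that min-of-two-utilities idea might look like the snippet below. The normalization of both terms to [0, 1] and the reading of "predictability" as 1 minus the overseers' prediction error are my own assumptions for illustration, not anything worked out:

```python
def combined_utility(task_utility: float, predictability: float) -> float:
    """Toy sketch: the agent's score is capped by how predictable it is.

    Assumed (my assumption) that both inputs are normalized to [0, 1]:
      task_utility   -- how well the action serves the primary goal (paperclips)
      predictability -- the overseers' predicted probability that the AI
                        would take this action (i.e. 1 - prediction error)
    """
    # The agent only scores as well as the worse of its two objectives,
    # so a high-paperclip but surprising action is worth very little.
    return min(task_utility, predictability)


# A very productive but surprising action scores lower than a
# predictable, moderately productive one.
print(combined_utility(task_utility=0.9, predictability=0.1))  # 0.1
print(combined_utility(task_utility=0.6, predictability=0.8))  # 0.6
```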