Unfortunately for us, this would not cause it to “bliss out” if it were constructed as a rational learning agent, so it would then proceed to take actions to stop anyone from ever removing the duct tape.
That might be true for taping the button down or doing something analogous in software; in that case it’d still be evaluating expected button presses, it’s just that most of the numbers would be very large (and effectively useless from a training perspective). But more sophisticated means of hacking its reward function would effectively lobotomize it: if a pure reinforcement learner’s reward function returns MAXINT on every input, it has no way of planning or evaluating actions against each other.
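A minimal sketch of why that lobotomizes the agent (a hypothetical toy planner, not any particular RL library): if the reward function returns the same value for every input, the expected-reward estimates tie across all actions, and action selection degenerates into an arbitrary pick.

```python
import random

MAXINT = 2**31 - 1

def plan(actions, reward_fn):
    """Greedy one-step planner: estimate each action's reward and
    take the best. The comparison *is* the planning."""
    estimates = {a: reward_fn(a) for a in actions}
    best = max(estimates.values())
    # Break ties randomly; with a constant reward function,
    # *every* action ties, so the choice carries no information.
    return random.choice([a for a, v in estimates.items() if v == best])

rewards = {"work": 5.0, "rest": 1.0, "explore": 3.0}

# An intact reward function distinguishes actions:
print(plan(rewards, lambda a: rewards[a]))   # "work"

# A hacked one that returns MAXINT on every input cannot --
# the reward landscape is flat, so "planning" is a coin flip:
print(plan(rewards, lambda a: MAXINT))
```

The hacked planner still runs, but its output is indistinguishable from random choice, which is the "no reward gradient, no planning" point in executable form.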
Those more sophisticated means are also subjectively more rewarding as far as the agent’s concerned.
Ah, really? Oh, right, because current pure reinforcement learners have no self-model, and thus an anvil on their own head might seem very rewarding.
Well, consider my statement modified: current pure reinforcement learners are Unfriendly, but stupid enough that we’ll have a way to kill them, which they will want us to enact.
A self-model might help, but it might not. It depends on the details of how it plans and how time discounting and uncertainty get factored in.
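A toy numerical sketch of that dependence (illustrative numbers only, not a claim about any real architecture): under exponential discounting, even a short burst of hacked MAXINT reward followed by destruction can outscore an unbounded stream of honest reward, so a self-model that correctly foresees the anvil doesn’t automatically veto it.

```python
MAXINT = 2**31 - 1

def discounted_sum(rewards, gamma):
    """Present value of a finite reward sequence."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

def constant_forever(r, gamma):
    """Closed form for an infinite stream of constant reward r."""
    return r / (1 - gamma)

gamma = 0.9

# Wirehead: three steps of MAXINT, then the anvil drops (zero ever after).
wirehead = discounted_sum([MAXINT] * 3, gamma)

# Honest operation: a modest reward of 10 per step, indefinitely.
honest = constant_forever(10, gamma)

# The burst dwarfs the stream by orders of magnitude, so an agent
# that accurately models its own destruction can still prefer it.
print(wirehead > honest)   # True
```

Whether the self-model helps thus turns on the relative magnitudes the reward channel can emit and how steeply the future is discounted, exactly the "details" at issue.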
That comes at the stage before the agent inserts a jump-to-register or modifies its defaults or whatever it ends up doing, though. Once it does that, it can’t plan, no matter how good a self-model it had before. The reward function isn’t a component of the planning system in a reinforcement learner; it is the planning system. No reward gradient, no planning.
(Early versions of EURISKO allegedly ran into this problem. The maintainer eventually ended up walling off the reward function from self-modification—a measure that a sufficiently smart AI would presumably be able to work around.)
Thanks for explaining that! Really. For one thing, it clarified a bunch of things I’d been wondering about learning architectures, the evolution of complicated psychologies like ours, and the universe at large. (Yeah, I wish my Machine Learning course had covered reinforcement learners and active environments, but apparently “active environments” means AI whereas “passive learning” means ML. Oh well.)
For instance, I now have a clear answer to the question: why would a value architecture more complex than reinforcement learning evolve in the first place? Answer: because pure reinforcement learning falls into a self-destructive bliss-out attractor. Therefore, even if it’s computationally (and therefore physically and biologically) simpler, it will get eliminated by natural selection very quickly.
Neat!
Well, this is limited by the agent’s ability to hack its reward system, and most natural agents are less than perfect in that respect. I think the answer to “why aren’t we all pure reinforcement learners?” is a little less clean than you suggest; it probably has something to do with the layers of reflexive and semi-reflexive agency our GI architecture is built on, and something to do with the fact that we have multiple reward channels (another symptom of messy ad-hoc evolution), and something to do with the bounds on our ability to anticipate future rewards.
Even so, it’s not perfect. Heroin addicts do exist.
True true.
However, a reality in which pure reinforcement learners self-destruct from blissing out remains simpler than one in which a sufficiently good reinforcement learner goes FOOM and takes over the universe.