I like the first idea. But can we really guarantee that after changing its source code to give itself maximum utility, it will stop all other actions? If it has access to its own source code, what ensures that its utility is “maximum” when it can change the limit arbitrarily? And if all possible actions have the same expected utility, an optimizer could output any solution—“no action” would be the trivial one, but it’s not the only one.
An AI that has achieved all of its goals might still be dangerous, since it would presumably lose all high-level executive function (its optimization behavior) but have no incentive to turn off any sub-programs that are still running.
Both proposals have the possible failure mode that the AI will discover or guess that this mechanism exists, and then it will only care about making sure it gets activated—which might mean doing bad enough things that humans are forced to open the box and shut it down.
The idea is not intended as a primary means of AI control, but as a last-resort way to turn the AI off. I describe it in a longer text I am currently writing, which explores all possible approaches to AI boxing, under the title “Catching treacherous turn: confinement and circuit breaker system to prevent AI revolt, self-improving and escape”.
It will also work only if the reward function is not stored as plain text in the source code, but as a separate black box (implemented using cryptography or physical isolation). The stop code is, in effect, the solution to a hard cryptographic problem embedded in this cryptographic reward function.
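To make the black-box idea more concrete, here is a minimal Python sketch (my own illustration, not the design from the text above): the box stores only a hash commitment to the stop code, so the AI cannot recover the code by inspecting its own source, while operators who hold the plaintext can unlock the maximal reward at any time. The class and function names, the use of SHA-256, and the choice of an unbounded “maximal” reward are all assumptions made for illustration.

```python
# Hypothetical sketch of a cryptographic reward black box.
# Assumption: SHA-256 commitment stands in for whatever cryptographic or
# physical isolation the real design would use.
import hashlib
from typing import Optional


class CryptoRewardBox:
    def __init__(self, stop_code_hash: bytes):
        # Only the commitment to the stop code lives inside the box;
        # the plaintext stop code is kept offline by the operators.
        self._stop_code_hash = stop_code_hash

    def reward(self, outcome_score: float, candidate_code: Optional[str] = None) -> float:
        # If a valid stop code is presented, return the maximal reward,
        # which is intended to make all further actions pointless to the agent.
        if candidate_code is not None:
            if hashlib.sha256(candidate_code.encode()).digest() == self._stop_code_hash:
                return float("inf")
        # Otherwise fall back to the ordinary, bounded task reward.
        return min(outcome_score, 1.0)


# Usage: the agent only ever sees the box, never the plaintext stop code.
secret = "example-stop-code"  # hypothetical; held offline by operators
box = CryptoRewardBox(hashlib.sha256(secret.encode()).digest())
print(box.reward(0.7))          # ordinary reward: 0.7
print(box.reward(0.7, secret))  # stop code presented: maximal reward
```

In a real system the commitment could equally be enforced by physically isolating the reward module; the point of the sketch is only that the agent cannot compute the stop code from anything it can read.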
I agree that running subagents may be a problem; we still don’t have a theory of AI halting. It is probably better to use such a super-reward before many subagents have been created.
Your last objection is more serious, as it shows that such a mechanism could turn a safe AI into a dangerous “addict”.