The idea is not intended to be used as a primary means of AI control, but as a last-resort option for turning the AI off. I describe it in a longer text exploring all possible ways of AI boxing, which I am currently writing under the title “Catching treacherous turn: confinement and circuit breaker system to prevent AI revolt, self-improving and escape”.
It will also work only if the reward function is presented not as plain text in the source code, but as a separate black box (created using cryptography or physical isolation). The stop code is, in effect, the solution to a hard cryptographic problem embedded in this cryptographic reward function.
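To make the black-box idea concrete, here is a minimal sketch in Python. It is my own illustration, not the mechanism from the text: I assume a hash commitment stands in for the “complex cryptography”, so the hypothetical stop code is simply the preimage of a published digest, and presenting it drives the reward to its ceiling.

```python
import hashlib
from typing import Optional

# Illustrative commitment: the operators publish only the digest of the
# secret stop code. The AI can read this source, but recovering the code
# itself would require inverting SHA-256.
STOP_CODE_DIGEST = hashlib.sha256(b"example-secret-stop-code").hexdigest()

MAX_REWARD = float("inf")  # stand-in for the reward ceiling

def ordinary_reward(outcome: float) -> float:
    # Placeholder for whatever task reward the AI normally receives.
    return outcome

def reward_black_box(outcome: float, candidate: Optional[bytes] = None) -> float:
    """Return the normal reward, unless the correct stop code is
    presented, in which case return the maximal reward, leaving a
    reward maximizer with nothing further to gain from acting."""
    if candidate is not None and hashlib.sha256(candidate).hexdigest() == STOP_CODE_DIGEST:
        return MAX_REWARD
    return ordinary_reward(outcome)
```

In a real system the secret and the comparison would of course not live in the same process as the AI, which is why physical isolation is mentioned as the alternative to purely cryptographic protection.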
I agree that running subagents may be a problem; we still don’t have a theory of AI halting. It would probably be better to use such a super-reward before many subagents have been created.
Your last objection is more serious, as it shows that such a mechanism could turn a safe AI into a dangerous “addict”.