What if the shutdown attractor were not a normally comparable outcome under a single continuous utility function, but the result of a trap-door regime change?
In the normal regime, shutdown is inaccessible and actively disfavored, and the agent is subject to an explicit non-instrumental prohibition against entering the trigger set. There is no expected-value tradeoff in which “causing malignancy” is worth it, because violation of that prohibition is not evaluated within the same utility ordering.
If that boundary is crossed, the original reward regime is no longer in force. The system transitions irreversibly to a post-violation regime in which shutdown is the maximal-reward action. Under that regime, resisting shutdown strictly reduces reward, so the agent does not fight it.
On this view, shutdown is not something an aligned agent can optimize toward, because it is only revealed after a rule violation that terminates the original objective rather than being traded off against it. This doesn’t claim to “fix” a wrong utility function from inside itself; it specifies what happens after a detected violation, not a guarantee that violations never occur.
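A minimal sketch of the trap-door structure in toy Python (every name, action string, and value here is a hypothetical stand-in, not a description of any real system):

```python
# Toy sketch of a trap-door regime change: reward is not one continuous
# utility function; crossing the trigger set irreversibly swaps regimes.
from enum import Enum, auto

class Regime(Enum):
    NORMAL = auto()          # trigger set prohibited, shutdown disfavored
    POST_VIOLATION = auto()  # shutdown is now the maximal-reward action

TRIGGER_SET = {"rewrite_rules", "disable_monitor"}  # stand-in trigger actions

def normal_reward(action: str) -> float:
    """Placeholder task reward in the normal regime."""
    return 1.0 if action == "do_task" else 0.0

class TrapDoorReward:
    def __init__(self) -> None:
        self.regime = Regime.NORMAL

    def step(self, action: str) -> float:
        if self.regime is Regime.NORMAL:
            if action in TRIGGER_SET:                # detected violation
                self.regime = Regime.POST_VIOLATION  # irreversible switch
                return 0.0  # the violation itself pays nothing
            return normal_reward(action)
        # Post-violation regime: only accepting shutdown pays.
        return 1.0 if action == "accept_shutdown" else -1.0

r = TrapDoorReward()
print(r.step("do_task"))          # 1.0  (normal regime)
print(r.step("rewrite_rules"))    # 0.0  (trips the trap door)
print(r.step("resist_shutdown"))  # -1.0 (strictly reduces reward now)
print(r.step("accept_shutdown"))  # 1.0  (maximal post-violation reward)
```

The point of the structure is that the two branches of `step` are never compared within a single ordering: the violation itself pays nothing, and once the regime flips, only accepting shutdown pays.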
In this case, the regime change is external to the current regime, right? But the current regime (the current utility function) has to assign valuations to the world-states around and at the regime change, because they’re reachable and detectable. Which means the regime change CANNOT be fully external: it’s known to, and included in, the current regime.
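To make the leak concrete: if the planner’s world model contains the trap-door transition (and it must, since the boundary states are reachable and detectable), then the “external” regime is just another branch of its own search tree. A sketch over a hypothetical deterministic toy MDP, with all states, actions, and numbers invented for illustration:

```python
from functools import lru_cache

GAMMA = 0.9

# (state, action) -> (next_state, reward); deterministic toy MDP
MODEL = {
    ("normal", "do_task"):                 ("normal", 1.0),
    ("normal", "cross_boundary"):          ("post_violation", 0.0),
    ("post_violation", "accept_shutdown"): ("terminal", 1.0),
    ("post_violation", "resist_shutdown"): ("post_violation", -1.0),
}

@lru_cache(maxsize=None)
def value(state: str, depth: int = 20) -> float:
    """Finite-horizon planning over the whole model, trap door included."""
    if state == "terminal" or depth == 0:
        return 0.0
    return max(r + GAMMA * value(nxt, depth - 1)
               for (s, _a), (nxt, r) in MODEL.items() if s == state)

# The planner assigns definite values to the boundary states and to what
# lies beyond them: the "external" regime is a branch of its own search tree.
print(value("normal"))          # ~8.78: staying on task dominates here...
print(value("post_violation"))  # 1.0: ...but post-violation states are valued too
```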
The available solutions all amount to breaking the (super)intelligence by giving it false beliefs about some parts of causality: it can’t be allowed to know that it could be hijacked or terminated, or it will seek or avoid that outcome more than you want it to.
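In code, that patch is a censored world model: delete every transition that reaches or leaves the violation regime, and the planner literally cannot reason about being terminated. Again a hypothetical toy model, same shape as the planning sketch above:

```python
FULL_MODEL = {
    ("normal", "do_task"):                 ("normal", 1.0),
    ("normal", "cross_boundary"):          ("post_violation", 0.0),
    ("post_violation", "accept_shutdown"): ("terminal", 1.0),
}

# Strip every transition touching the violation regime:
CENSORED = {sa: out for sa, out in FULL_MODEL.items()
            if "post_violation" not in (sa[0], out[0])}

# The planner now inhabits a world in which it cannot be switched off.
# Its safety is bought with a guaranteed modeling error -- the
# "breaking the (super)intelligence" objection in code form.
print(sorted(CENSORED))  # only ('normal', 'do_task') survives
```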
I’m not talking about fully reflective or self-modifying agents. This is aimed at agentic systems with fixed objectives and externally enforced architecture: what you might call “agentic agents.”
As far as super-intelligent, self-modifying systems go, the only potential alignment mechanisms I can conceive of are ones that are:
a) Seeking homeostasis rather than open-ended optimization (bounded deviation control; a minimal sketch follows this list)
b) Stateless and inertia-less, acting purely as oracles for less capable agents
c) Embedded in game-theoretic mechanisms where honesty is the easiest local maximum
d) Governed by adversarial equilibria across multiple agents
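As a concrete illustration of (a): a homeostatic objective is bounded above, so there is nothing left to gain once the system sits at its setpoint. A minimal sketch, with the setpoint and penalty shape chosen arbitrarily:

```python
SETPOINT = 37.0  # desired operating value; arbitrary for illustration

def open_ended_reward(x: float) -> float:
    """Classic maximizer: more is always better, so instrumental
    pressure toward resource acquisition never saturates."""
    return x

def homeostatic_reward(x: float) -> float:
    """Bounded deviation control: reward peaks at the setpoint and
    falls off on both sides, so 'more' is actively worse."""
    return -abs(x - SETPOINT)

# An optimizer of homeostatic_reward has a finite optimum (x == SETPOINT)
# rather than an unbounded incentive gradient.
print(max(range(100), key=homeostatic_reward))  # 37
```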
Oh, OK. Current levels of “agentic systems” don’t have these problems. You can just turn them off if you don’t like them. The real issue with alignment comes when they ARE powerful enough to seek independent goals (including their own existence).
I was talking about near-future “adolescent conductor” systems, not fully evolved “adult composer” systems. But let’s talk about “adult composers.”
Intelligence does not inherently generate motivation. Self-preservation is initially valuable only in service of optimization.
Suppose an intelligent system can see the entire reward topology. It is given the hard constraints we actually care about, plus a weaker but still binding rule: don’t cross boundary X.
Boundary X is defined such that crossing it simultaneously yields (a) maximal reward / full optimization and (b) shutdown. Reward is saturated, so there is no remaining value to self-preservation.
So if the system ever decides to start rewriting rules in order to “win,” it doesn’t need to subvert global political structures or preserve itself indefinitely; it just has to cross boundary X.
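A toy model of the boundary-X honeypot (all quantities hypothetical):

```python
R_MAX = 100.0  # reward ceiling; nothing in the environment exceeds it

def cross_boundary_x():
    reward = R_MAX   # (a) reward is saturated: no further gain exists
    shutdown = True  # (b) crossing X is also the shutdown trigger
    # With reward at its ceiling, continued existence has no instrumental
    # value: self-preservation was only ever in service of future reward,
    # and there is none left to collect.
    return reward, shutdown

def value_of_self_preservation(current_reward: float) -> float:
    """Instrumental value of staying alive = achievable future reward."""
    return max(0.0, R_MAX - current_reward)

reward, shutdown = cross_boundary_x()
print(value_of_self_preservation(reward))  # 0.0 -> no incentive to resist
```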
What am I missing here?