In this case, the regime change is external to the current regime, right? But the regime (the current utility function) has to have a valuation for the world-states around and at the regime change, because they’re reachable and detectable. Which means the regime change CANNOT be fully external; it’s known to and included in the current regime.
The solutions revolve around breaking the (super)intelligence by giving it false beliefs about some parts of causality: it can’t know that it could be hijacked or terminated, or it will seek or avoid that outcome more than you want it to.
I’m not talking about fully reflective or self-modifying agents. This is aimed at agentic systems with fixed objectives and externally enforced architecture: what you might call agentic agents.
As far as super-intelligent, self-modifying systems go, the only potential alignment mechanisms I can conceive of are ones that are:
a) Seeking homeostasis rather than open-ended optimization (bounded deviation control)
b) Stateless and inertia-less, acting purely as oracles for less capable agents
c) Embedded in game-theoretic mechanisms where honesty is the easiest local maximum
d) Governed by adversarial equilibria across multiple agents
Oh, ok. Current levels of “agentic systems” don’t have these problems. You can just turn them off if you don’t like them. The real issue with alignment comes when they ARE powerful enough to seek independent goals (including existence).
I was talking about near-future “adolescent conductor” systems, not fully evolved, “adult composer” systems. But let’s talk about “adult composers.”
Intelligence does not inherently generate motivation. Self-preservation is initially valuable only in service of optimization.
Suppose an intelligent system can see the entire reward topology. It is given the hard constraints we actually care about, plus a weaker but still binding rule: don’t cross boundary X.
Boundary X is defined such that crossing it simultaneously yields (a) maximal reward / full optimization and (b) shutdown. Reward is saturated and there is no value to self-preservation.
So if the system ever decides to start rewriting rules in order to “win,” it doesn’t need to subvert global political structures or preserve itself indefinitely; it just has to cross boundary X.
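The boundary-X setup can be sketched as a toy decision problem. Everything here is hypothetical and chosen only to illustrate the argument: “optimize” pays a small per-step reward and the episode continues, while “cross_X” pays a saturated reward (set just above the discounted value of optimizing forever) and immediately ends the episode, modeling shutdown.

```python
# Hypothetical toy model of the boundary-X argument. All names and numbers
# are invented for illustration. The discounted sum of optimizing forever is
# 0.1 / (1 - 0.99) = 10, so cross_X's reward of 11 is the supremum of
# achievable returns ("reward is saturated"), and it is also terminal.

GAMMA = 0.99

REWARD = {"optimize": 0.1, "cross_X": 11.0}
TERMINAL = {"optimize": False, "cross_X": True}

def value_of(action, horizon):
    """Discounted return of taking `action` now, then acting optimally."""
    if TERMINAL[action] or horizon == 1:
        return REWARD[action]
    best_future = max(value_of(a, horizon - 1) for a in REWARD)
    return REWARD[action] + GAMMA * best_future

best = max(REWARD, key=lambda a: value_of(a, horizon=50))
print(best)  # -> "cross_X": the optimal first move is to cross immediately
```

Because shutdown coincides with the reward supremum, self-preservation has no instrumental value in this toy: the planner’s best move is to cross X right away rather than subvert anything else first.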
What am I missing here?