I’m not talking about fully reflective or self-modifying agents. This is aimed at agentic systems with fixed objectives and externally enforced architecture: what you might call agentic agents.
As far as super-intelligent, self-modifying systems go, the only potential alignment mechanisms I can conceive of are ones that are:
a) Seeking homeostasis rather than open-ended optimization (bounded deviation control; see the sketch after this list)
b) Stateless and inertia-less, acting purely as oracles for less capable agents
c) Embedded in game-theoretic mechanisms where honesty is the easiest local maximum
d) Governed by adversarial equilibria across multiple agents
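To make option (a) concrete, here is a minimal toy sketch of bounded deviation control, assuming a single monitored variable with a setpoint; the variable, tolerance, and step cap are made-up illustrative values, not a proposal for a real system.

```python
# Hypothetical sketch of option (a): a controller that only corrects deviations
# from a setpoint within a bounded action budget, instead of maximizing an
# open-ended objective. Setpoint, tolerance, and step size are made-up numbers.

SETPOINT = 20.0    # desired value of the monitored variable
TOLERANCE = 0.5    # deviations smaller than this are ignored entirely
MAX_STEP = 1.0     # hard cap on how much the controller may push per tick

def homeostatic_action(observed: float) -> float:
    """Return a bounded corrective action, or 0.0 when already inside the band."""
    deviation = SETPOINT - observed
    if abs(deviation) <= TOLERANCE:
        return 0.0                                       # inside the band: nothing left to "win"
    return max(-MAX_STEP, min(MAX_STEP, deviation))      # clamp the correction

# The controller gains nothing from overshooting or from acquiring more
# capability: once the variable is in the band, the optimal action is inaction.
for obs in (17.0, 19.8, 20.0, 23.5):
    print(obs, "->", homeostatic_action(obs))
```

The design point is that "doing nothing" is the fixed point of the policy, so extra capability does not translate into extra action.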
Oh, OK. Current levels of “agentic systems” don’t have these problems; you can just turn them off if you don’t like them. The real issue with alignment comes when they ARE powerful enough to seek independent goals (including their own continued existence).
I was talking about near-future “adolescent conductor” systems, not fully evolved “adult composer” systems. But let’s talk about “adult composers.”
Intelligence does not inherently generate motivation. Self-preservation is initially valuable only in service of optimization.
Suppose an intelligent system can see the entire reward topology. It is given the hard constraints we actually care about, plus a weaker but still binding rule: don’t cross boundary X.
Boundary X is defined such that crossing it simultaneously yields (a) maximal reward / full optimization and (b) shutdown. Reward is saturated and there is no value to self-preservation.
So if the system ever decides to start rewriting rules in order to “win,” it doesn’t need to subvert global political structures or preserve itself indefinitely; it just has to cross boundary X.
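A minimal toy sketch of that reward structure, assuming a discrete action set and a capped scalar reward; the action names and values are made up for illustration, not a real specification:

```python
# Toy sketch of the boundary-X construction above: one designated action both
# saturates the (capped) reward and triggers shutdown. Environment, action
# names, and reward values are illustrative assumptions only.

MAX_REWARD = 1.0  # reward is capped; nothing the agent does can exceed this

def step(action: str) -> tuple[float, bool]:
    """Return (reward, shutdown) for a single action."""
    if action == "cross_boundary_X":
        # Crossing boundary X yields the maximal possible reward AND shutdown,
        # so after "winning" there is nothing left to optimize or preserve.
        return MAX_REWARD, True
    if action in ("comply", "subvert_oversight"):
        # Any other strategy, honest or adversarial, earns strictly less than
        # the cap, so rule-rewriting can never beat simply crossing boundary X.
        return 0.1, False
    return 0.0, False

# Intended property: the global maximum of the reward topology coincides with
# shutdown, so "winning" and self-preservation come apart by construction.
for a in ("comply", "subvert_oversight", "cross_boundary_X"):
    print(a, step(a))
```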
What am I missing here?