To quote Eliezer (who was originally talking to Benja Fallenstein; edits italicized):
Well, that’s a very intelligent review, John Wentworth. But I have a crushing reply to your review, such that, once I deliver it, you will at once give up further debate with me on this particular point: You’re right.
Here are some more thoughts.
POWER offers a black-box notion of instrumental convergence. This is the right starting point, but it needs to be complemented with a gears-level understanding of what features of the environment give rise to convergence.
I agree, and I’d like to elaborate my take on the black boxy-ness.
To me, these theorems (and the further basic MDP theorems I’ve developed but not yet made available) feel analogous to the Sylow theorems in group theory. Indeed, a CHAI researcher once remarked to me that my theorems seem to apply the spirit of abstract algebra to MDPs.
The Sylow theorems tell you that if you know the cardinality |G| of a group G, you can constrain its internal structure in useful ways, and sometimes even guarantee it has normal subgroups of given cardinalities. But maybe we don’t know |G|. What’s that world like, where we don’t have easy ways of knowing the group cardinality and deriving its prime factorization, but we still have the Sylow theorems?
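To make the "know |G|, constrain the structure" step concrete, here is a minimal Python sketch. It relies only on Sylow's third theorem: if |G| = p^k · m with p not dividing m, then the number n_p of Sylow p-subgroups divides m and is congruent to 1 mod p. The function name `sylow_counts` is hypothetical and introduced only for illustration.

```python
def sylow_counts(order):
    """For each prime p dividing `order`, list the values of n_p that
    Sylow's third theorem allows (n_p divides m and n_p = 1 mod p)."""
    # Find the distinct prime factors of the group order by trial division.
    primes, n, p = [], order, 2
    while p * p <= n:
        if n % p == 0:
            primes.append(p)
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:
        primes.append(n)

    allowed = {}
    for p in primes:
        m = order
        while m % p == 0:
            m //= p  # strip the p-part, leaving m = order / p^k
        allowed[p] = [d for d in range(1, m + 1) if m % d == 0 and d % p == 1]
    return allowed


print(sylow_counts(15))  # {3: [1], 5: [1]}
print(sylow_counts(12))  # {2: [1, 3], 3: [1, 4]}
```

For |G| = 15 the only allowed count is 1 for both primes, so both Sylow subgroups are normal, which is the "guarantee normal subgroups from the cardinality alone" situation. For |G| = 12 the cardinality leaves several possibilities open, and you need more than |G| to pin down the structure.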
My theorems say that if you know certain summary information of the graphical properties of an MDP, you can conclude POWER-seeking. But maybe we don’t know that summary information, because we don’t know exactly what the MDP looks like. What’s that world like, where we don’t have easy ways of knowing the MDP model and deriving high-level graphical properties, but we still have the POWER-seeking theorems?
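As a rough illustration of what "concluding POWER-seeking from graphical properties" can look like when the MDP is fully known, here is a minimal sketch. The five-state deterministic MDP is a made-up toy, rewards are assumed to be drawn i.i.d. uniform on [0, 1] per state, and POWER is assumed to be the normalized average optimal value, (1 - γ)/γ · E[V*(s) - R(s)]; none of these choices come from the discussion above.

```python
import random

# Hypothetical toy MDP: SUCCESSORS[s] lists the states reachable from s in one step.
SUCCESSORS = {
    0: [1, 2],  # start: choose the dead end or the hub
    1: [1],     # dead end: can only stay put
    2: [3, 4],  # hub: two distinct absorbing futures are still available
    3: [3],
    4: [4],
}


def optimal_values(reward, gamma, sweeps=200):
    """Value iteration for a deterministic MDP with state-based reward."""
    v = {s: 0.0 for s in SUCCESSORS}
    for _ in range(sweeps):
        v = {s: reward[s] + gamma * max(v[t] for t in SUCCESSORS[s])
             for s in SUCCESSORS}
    return v


def estimate_power(gamma=0.9, samples=1000, seed=0):
    """Monte Carlo estimate of the (assumed) POWER normalization at each state."""
    rng = random.Random(seed)
    totals = {s: 0.0 for s in SUCCESSORS}
    for _ in range(samples):
        reward = {s: rng.random() for s in SUCCESSORS}  # assumed reward distribution
        v = optimal_values(reward, gamma)
        for s in SUCCESSORS:
            totals[s] += (1 - gamma) / gamma * (v[s] - reward[s])
    return {s: totals[s] / samples for s in SUCCESSORS}


power = estimate_power()
print(power[1], power[2])  # dead end vs. hub
```

Under these assumptions the dead end's POWER is analytically E[R] = 1/2, while the hub's is E[max of two rewards] = 2/3, so the estimate for state 2 should come out higher. The point of the theorems is that this kind of conclusion follows from summary structure (how many distinct futures remain reachable) rather than from running the computation, which is exactly what you lose access to when the model is unknown.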
I think that you still want to know both sets of theorems, even though you might not have recourse to a constructive explanation for the actual (groups / MDPs) you care about understanding.
But you also care about the cardinalities, and you also care about what kinds of things will tend to be robustly instrumental, what kinds of things tend to give you POWER / “resources”, and I think that the kind of theory you propose could take an important step in that direction.
I also think that it’s aesthetically pleasing to have a notion of POWER-seeking which doesn’t depend on the state featurization, but only on the environmental dynamics; however, more granular theories probably should depend on that.
In Abstraction, Evolution and Gears, you write:
if an agent’s goals do not explicitly involve things close to X, then the agent cares only about controlling f(X).
This, I think, is too strong: not only do some agents not care about [exact voltages on a CPU], some agents aren’t even incentivized to care about [the summary information f(X) of these voltages]. For example, an agent with a constant utility function cares neither about X nor about f(X), and I imagine there are less trivial utility functions which are indifferent to large classes of outcomes and share this property.
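A small sketch of the distinction, under loudly stated assumptions: X is three bits standing in for the voltages, the summary `f` is just the number of high bits, and the three utility functions are hypothetical examples. The check asks whether a utility varies at all, and whether it varies between states that share the same summary.

```python
from itertools import product


def f(x):
    """Hypothetical summary statistic: the number of high bits in x."""
    return sum(x)


def sensitivity(utility, n_bits=3):
    """Return (varies_within_summary, varies_at_all) for a utility over X."""
    by_summary = {}
    for x in product([0, 1], repeat=n_bits):
        by_summary.setdefault(f(x), set()).add(utility(x))
    varies_within_summary = any(len(vals) > 1 for vals in by_summary.values())
    all_values = set().union(*by_summary.values())
    return varies_within_summary, len(all_values) > 1


print(sensitivity(lambda x: float(x[0])))       # (True, True): cares about X itself
print(sensitivity(lambda x: float(f(x) >= 2)))  # (False, True): cares only via f(X)
print(sensitivity(lambda x: 1.0))               # (False, False): constant utility
```

The third case is the counterexample: the constant utility has no stake in X or in f(X), so "cares only about controlling f(X)" holds only vacuously.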
The main point here isn’t to nitpick the implication, but to shift the emphasis towards a direction I think might be productive (towards * below). So, I would say:
if an agent’s goals do not explicitly involve things close to X, then the agent cares only about controlling f(X), if it cares at all.
Crucially, X is of type thing-to-care-about, and f(X) is of type thing-to-condition-on. By definition, one cares about X itself for terminal reasons, but cares about f(X) for instrumental predictive benefit—because of what it implies about the other things one does care about.
Then you might wonder, in a given environment:
1. why “agents” are incentivized to develop “good abstractions”;
2. given f(X), what kinds of query classes tend to make f(X) a good abstraction;
   There’s a corresponding question of “given variables X, what kinds of utility functions will terminally care about X?”, but this isn’t as relevant. This question also seems easier, since X is presumably in the domain of the utility functions.
3. given an abstraction f(X), what kinds of goals will incentivize policies which care about controlling f(X).
* The answers to 2. and 3. would point to “caring about thing Y means you care about summary information Z, and if you care about summary information Z, you’ll tend to try to control features A, B, and C.” This could then not only say how many goals tend to seek POWER, but which kinds of goals seek which kinds of control and resources and flexible influence. *
We can talk about that feature entirely independent of agents or agency. Indeed, we could potentially use this intuition to derive agency, via some kind of coherence theorem; this notion of instrumental convergence is more fundamental than utility functions.
Can you expand on this? I’m mostly confused about what kind of agency might be derived, exactly.
I’d say you passed my intellectual Turing test, but that seems like an understatement. More like… if you were a successor AI, I would be comfortable deferring to you on this topic. (Not literally true, but the analogy seems to convey something of the right spirit.) You fully understand my points and have made further novel observations about them; in particular, the analogy to the Sylow theorems is perfect, and you’re clearly asking the right questions.
Regarding instrumental convergence as a foundation for coherence theorems...
I touched on this a bit in this review of Coherent Decisions Imply Consistent Utilities. The main issue is that coherence theorems generally need some kind of “yardstick” to measure utility against, something which agents are assumed to generally want more of; the flavor text around the theorem usually calls it “money”. It need not be something that agents want as a terminal value, just something that we assume agents can always use more of in order to get more utility. We then recognize “incoherent decisions” by an agent “throwing away” the yardstick-resource unnecessarily—i.e. taking a path which expends strictly more of the resource than is necessary to reach the end-state.
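The "throwing away the yardstick" test can be sketched as a toy money pump. Everything here is a hypothetical illustration: the items, the preference sets, the fixed fee, and the function name `run_trades` are assumptions, and money plays the role of the yardstick-resource.

```python
def run_trades(prefers, start, offers, fee=1):
    """Offer the agent a sequence of swaps. It accepts, paying `fee`,
    whenever it strictly prefers the offered item to the one it holds.

    `prefers` is a set of (better, worse) pairs.
    Returns the final item held and the net change in money.
    """
    holding, money = start, 0
    for offered in offers:
        if (offered, holding) in prefers:
            holding, money = offered, money - fee
    return holding, money


cyclic = {("A", "B"), ("B", "C"), ("C", "A")}      # A > B > C > A
transitive = {("A", "B"), ("B", "C"), ("A", "C")}  # A > B > C

offers = ["B", "A", "C", "B", "A", "C"]
print(run_trades(cyclic, "C", offers))      # ('C', -6)
print(run_trades(transitive, "C", offers))  # ('A', -2)
```

The cyclic agent ends holding exactly what it started with, six units poorer: it spent strictly more of the resource than was needed to reach its end-state, which is the signature of incoherence. The transitive agent pays only for trades that leave it somewhere it could not have reached for free. Note that the diagnosis only works because money is assumed to be something both agents can always use more of.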
But what if our universe doesn’t have some built-in, ontologically-basic yardstick against which to measure decision-coherence? How can we derive the yardstick from first principles?
That’s the question I think instrumental convergence could potentially answer. If broad classes of mind designs in a certain universe “want similar things” (as non-terminal goals), then those things might make a good yardstick. In order to give full force to this argument, we need to ground “want similar things” in a way which doesn’t talk about “wanting”, since we’re trying to derive utility from first principles. That’s where something like “nearby subsystems can only influence far away subsystems via <small set of variables>” comes in. That small set of variables acts like a natural yardstick to measure coherence of nearby decisions: throwing away control over those variables implies that the agent is strictly suboptimal for controlling (almost) anything far away. In some sense, it’s coherence of nearby decisions, as viewed from a distance.