Measuring intelligence and reverse-engineering goals

It is analytically useful to define intelligence in the context of AGI. One intuitive notion is epistemology: an agent’s intelligence is how good its epistemology is, how good it is at knowing things and making correct guesses. But “intelligence” in AGI theory often means more than epistemology. An intelligent agent is supposed to be good at achieving some goal, not just knowing a lot of things.

So how could we define intelligent agency? Marcus Hutter’s universal intelligence measures an agent’s ability to achieve observable reward across a distribution of environments; AIXI maximizes this measure. Testing across a distribution makes sense: it avoids penalizing “unlucky” agents that fail in the actual world despite using effective strategies that succeed most of the time. However, maximizing observable reward is a sort of fixed goal function; it can’t accommodate intelligent agents that effectively achieve goals other than reward-maximization. This relates to inner alignment: an agent may not be “inner aligned” with AIXI’s reward-maximization objective, yet still be intelligent in the sense of effectively accomplishing something else.
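
For reference, the Legg–Hutter formalization of this measure has roughly the form Υ(π) = Σ_{μ ∈ E} 2^(−K(μ)) · V_μ^π (paraphrasing their definition), where E is a class of computable reward-emitting environments, K(μ) is the Kolmogorov complexity of environment μ, and V_μ^π is the expected cumulative reward the policy π obtains in μ. The 2^(−K(μ)) weighting is what implements testing across a distribution of environments; the reward term V_μ^π is where the fixed objective enters.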

To generalize, it is problematic to score an agent’s intelligence on the basis of a fixed utility function. It is fallacious to imagine a paperclip maximizer and say “it is not smart, it doesn’t even produce a lot of staples!” (or happiness for conscious beings, or whatever). Hopefully, the confusion of a relativist pluralism of intelligence measures (one measure per choice of utility function) can be avoided.

Of practical import is the agent’s “general effectiveness”. Both a paperclip maximizer and a staple maximizer would harness energy effectively, e.g. nuclear energy from stars. A generalization is Omohundro’s basic AI drives, or convergent instrumental goals: these are what effective utility-maximizing agents would tend to pursue almost regardless of the utility function.

So a proposed rough definition: An agent is intelligent to the extent it tends to achieve convergent instrumental goals. This is not meant to be a final definition; it might have conceptual problems (e.g. dependence on the VNM notion of intelligent agency), but it at least adds some specificity. “Tends to” here is similar to Hutter’s idea of testing an agent across a distribution of environments: an agent can tend to achieve value even when it actually fails (unluckily).
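
One very rough way to make this schematic (an illustrative formalization, not a standard one): Int(π) = E_{e ~ D}[Σ_i w_i · G_i(π, e)], where D is a distribution over environments (as in Hutter’s setup), each G_i scores achievement of some convergent instrumental goal in environment e (energy harnessed, self-preservation, option value, and so on), and the w_i are weights. The choices of D, G_i, and w_i are doing a lot of work, but note that no final goal appears anywhere in the measure.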

To cite prior work, Nick Land writes (in “What is intelligence”, Xenosystems):

Intelligence solves problems, by guiding behavior to produce local extropy. It is indicated by the avoidance of probable outcomes, which is equivalent to the construction of information.

This amounts to something similar to the convergent instrumental goal definition; achieving sufficiently specific outcomes involves pursuing convergent instrumental goals.

The convergent instrumental goal definition of intelligence may help study the Orthogonality Thesis. In Superintelligence, Bostrom states the thesis as:

Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.

(Previously I argued against a strong version of the thesis.)

Clearly, having a definition of intelligence helps clarify what the orthogonality thesis is stating. But the thesis also refers to “final goals”; how can that be defined? For example, what are the final goals of a mouse brain?

In some idealized cases, like a VNM-based agent that explicitly optimizes a defined utility function over universe trajectories, “final goal” is well-defined. However, it’s unclear how to generalize to less idealized cases. In particular, a given idealized optimization architecture has a type signature for goals, e.g. a Turing machine assigning a real number to universe trajectories, which themselves have some type signature (e.g. based on the physics model). But different type signatures for goals across different architectures, even idealized ones, make identification of final goals more difficult.
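
As a loose illustration of the type-signature issue, here is a sketch in Python-style type annotations; the names and types are invented for the example, not taken from any existing formalism:

```python
from typing import Callable, Dict, List

# Hypothetical goal type signatures for two idealized architectures.
NewtonianState = Dict[str, float]      # e.g. particle positions and velocities
Trajectory = List[NewtonianState]      # a universe history under one physics model

# Architecture 1: an explicit utility function over whole universe trajectories.
TrajectoryUtility = Callable[[Trajectory], float]

# Architecture 2: a Bellman-style value function over states, with no explicit
# "final goal" slot at all.
StateValue = Callable[[NewtonianState], float]

# The two "goal slots" have different types; reading off "the final goal" across
# architectures requires a translation between type signatures that may not exist.
```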

A different approach: what are the relevant effective features of an agent other than its intelligence? This doesn’t bake in a “goal” concept but asks a natural left-over question after defining intelligence. In an idealized case like paperclip maximizer vs. staple maximizer (with the same cognitive architecture and so on), while the agents behave fairly similarly (harnessing energy, expanding throughout the universe, and so on), there is a relevant effective difference in that they manufacture different objects towards the latter part of the universe’s lifetime. The difference in effective behavior, here, does seem to correspond with the differences in goals.

To provide some intuition for alternative agent architectures, I’ll give a framework inspired by the Bellman equation. To simplify, assume we have an MDP with S being a set of states, A being a set of actions, t(s’ | s, a) specifying the distribution over next states given the previous state and an action, and s₀ being the initial state. A value function V on states satisfies:

V(s) = max_{a ∈ A} E_{s’ ~ t(· | s, a)}[V(s’)]

This is a recurrent relationship in the sense that the values of states “depend on” the values of other states; the value function is a sort of fixed point. A valid policy for a value function must always select an action that maximizes the expected value of the following state. A difference with the usual Bellman equation is that there is no time discounting and no reward. (There are of course interesting modifications to this setup, such as relaxing the equality to an approximate equality, or having partial observability as in a POMDP; I’m starting with something simple.)

Now, what does the space of valid value functions for an MDP look like? As a very simple example, consider three states {start, left, right} and two actions {L, R}: ‘start’ is the starting state; ‘left’ always transitions to ‘left’ and ‘right’ always transitions to ‘right’; ‘start’ transitions to ‘left’ if the ‘L’ action is taken, and to ‘right’ if the ‘R’ action is taken. The value function can take on arbitrary values for ‘left’ and ‘right’, but the value of ‘start’ must be the maximum of the two.
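
Here is a minimal sketch of this example in code (the representation is just for illustration):

```python
states = ["start", "left", "right"]
actions = ["L", "R"]

def transition(state, action):
    """Deterministic transition function for the toy MDP."""
    if state == "start":
        return "left" if action == "L" else "right"
    return state  # 'left' and 'right' always transition to themselves

def is_valid_value_function(V):
    """Check the undiscounted, reward-free Bellman-style condition:
    V(s) must equal the max over actions of V(next state)."""
    return all(
        V[s] == max(V[transition(s, a)] for a in actions)
        for s in states
    )

# The absorbing states are free parameters; 'start' is then forced to be their max.
print(is_valid_value_function({"left": 3.0, "right": 7.0, "start": 7.0}))  # True
print(is_valid_value_function({"left": 3.0, "right": 7.0, "start": 3.0}))  # False
```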

We could say something like: the agent’s utility function is only over ‘left’ and ‘right’, and the value function can be derived from the utility function. This took some work, though; the utility function isn’t directly written down. It’s a way of interpreting the agent architecture and value function. We figure out what the “free parameters” are, and derive the rest of the value function from them.

It of course gets more complex in cases where we have infinite chains of different states, or cycles between more than one state; it would be less straightforward to say something like “you can assign any values to these states, and the values of other states follow from those”.

In “No Universally Compelling Arguments”, Eliezer Yudkowsky writes:

If you switch to the physical perspective, then the notion of a Universal Argument seems noticeably unphysical. If there’s a physical system that at time T, after being exposed to argument E, does X, then there ought to be another physical system that at time T, after being exposed to environment E, does Y. Any thought has to be implemented somewhere, in a physical system; any belief, any conclusion, any decision, any motor output. For every lawful causal system that zigs at a set of points, you should be able to specify another causal system that lawfully zags at the same points.

The switch from “zig” to “zag” is a hypothetical modification to an agent. In the case of the studied value functions, not all modifications to a value function (e.g. changing the value of a particular state) lead to another valid value function. The modifications we can make are more restricted: for example, perhaps we can change the value of a “cyclical” state (one that always transitions to itself), and then back-propagate the value change to preceding states.
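
Continuing the toy example above, here is a sketch of what a restricted, validity-preserving modification might look like; the repair procedure is illustrative, not a general algorithm:

```python
def repair(V, sweeps=10):
    """Re-impose V(s) = max_a V(next(s, a)) after a modification, by sweeping over
    states a few times. Absorbing states are left alone (they are the free
    parameters); every other state is recomputed from its successors."""
    for _ in range(sweeps):
        for s in states:
            successors = [transition(s, a) for a in actions]
            if any(t != s for t in successors):
                V[s] = max(V[t] for t in successors)
    return V

V = {"left": 3.0, "right": 7.0, "start": 7.0}
V["right"] = 1.0        # change the value of a cyclical (absorbing) state...
repair(V)
print(V["start"])       # ...and the change back-propagates: max(3.0, 1.0) = 3.0
```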

A more general statement: Changing a “zig” to a “zag” in an agent can easily change its intelligence. For example, perhaps the modification is to add a “fixed action pattern” where the modified agent does something useless (like digging a ditch and filling it) under some conditions. This modification to the agent would negatively impact its tendency to achieve convergent instrumental goals, and accordingly its intelligence according to our definition.

This raises the question: for a given agent, keeping its architecture fixed, what are the valid modifications that don’t change its intelligence? The results of such modifications are a sort of “level set” in the function mapping from agents within the architecture to intelligence. The Bellman-like value function setup makes the point that specifying the set of such modifications may be non-trivial; a careless modification could easily result in an invalid value function, leading to unintelligent, wasteful behavior.

A general analytical approach:

  • Consider some agent architecture, a set of programs.

  • Consider an intelligence function on this set of programs, based on something like “tendency to achieve convergent instrumental goals”.

  • Consider differences within some set of agents with equivalent intelligence; do they behave differently?

  • Consider whether the effective differences between agents with equivalent intelligence can be parametrized with something like a “final goal” or “utility function”.

Whereas classical decision theory assumes the agent architecture is parameterized by a utility function, this is more of a reverse-engineering approach: can we first identify an intelligence measure on agents within an architecture, then look for relevant differences between agents of a given intelligence, perhaps parametrized by something like a utility function?
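
Here is a toy instance of this reverse-engineering approach, reusing the three-state MDP from earlier. The “intelligence” score below is a crude stand-in (made-up instrumental resources secured at the end state), not a serious measure of convergent instrumental goal achievement:

```python
ENERGY = {"left": 10.0, "right": 10.0}   # made-up instrumental payoff at each end state

def policy_from_values(V):
    """The valid policy for a value function: from 'start', head for the more valued state."""
    return "L" if V["left"] >= V["right"] else "R"

def intelligence_score(V):
    """Stand-in intelligence: instrumental resources secured by following the policy."""
    return ENERGY[transition("start", policy_from_values(V))]

paperclipper = {"left": 1.0, "right": 0.0, "start": 1.0}   # values 'left' (paperclips, say)
stapler      = {"left": 0.0, "right": 1.0, "start": 1.0}   # values 'right' (staples, say)

print(intelligence_score(paperclipper), intelligence_score(stapler))  # 10.0 10.0
print(policy_from_values(paperclipper), policy_from_values(stapler))  # L R
# Equal intelligence, different effective behavior: the residual difference is the
# "final-goal-like" parameter the reverse-engineering approach looks for.
```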

There’s not necessarily a utility function directly encoded in an intelligent system such as a mouse brain; perhaps what is encoded directly is more like a Bellman state value function learned from reinforcement learning, influenced by evolutionary priors. In that case, it might be more analytically fruitful to identify relevant motivational features other than intelligence and see how final-goal-like they are, rather than starting from the assumption that there is a final goal.

Let’s consider orthogonality again, and take a somewhat different analytical approach. Suppose that agents in a given architecture are well-parametrized by their final goals. How could intelligence vary depending on the agent’s final goal?

As an example, suppose the agents have utility functions over universe trajectories, which vary both in what sort of states they prefer, and in their time preference (how much they care more about achieving valuable states soon). An agent with a very high time preference (i.e. very impatient) would probably be relatively unintelligent, as it tries to achieve value quickly, neglecting convergent instrumental goals such as amassing energy. So intelligence should usually increase with patience, although maximally patient agents may behave unintelligently in other ways, e.g. investing too much in unlikely ways of averting the heat death of the universe.
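
A minimal numerical illustration with made-up payoffs: an impatient agent (low discount factor) consumes immediately, while a patient agent first takes the instrumental step that pays off later.

```python
def discounted_value(payoff, delay, gamma):
    """Present value of a payoff received after `delay` time steps, with discount factor gamma."""
    return payoff * gamma ** delay

for gamma in (0.5, 0.99):                                   # impatient vs patient agent
    consume_now = discounted_value(1.0, delay=0, gamma=gamma)
    amass_energy_first = discounted_value(100.0, delay=20, gamma=gamma)
    better = "amass energy first" if amass_energy_first > consume_now else "consume now"
    print(f"gamma={gamma}: {better}")
# gamma=0.5  -> consume now          (100 * 0.5**20  is about 1e-4, less than 1)
# gamma=0.99 -> amass energy first   (100 * 0.99**20 is about 82, more than 1)
```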

There could also be especially unintelligent goals, such as the goal of dying as fast as possible. An agent pursuing this goal would of course tend to fail to achieve convergent instrumental goals. (Bostrom and Yudkowsky would agree that such cases exist, and that they require putting some conditions on the orthogonality thesis.)

A more interesting question is whether there are especially intelligent goals, ones whose pursuit leads to especially high convergent instrumental goal achievement relative to “most” goals. A sketch of an example: Suppose we are considering a class of agents that assume Newtonian physics is true, and have preferences over Newtonian universe configurations. Some such agents have the goal of building Newtonian configurations that are (in fact, unknown to them) valid quantum computers. These agents might be especially intelligent, as they pursue the convergent instrumental goal of building quantum computers (thus unleashing even more intelligent agents, which build more quantum computers), unlike most Newtonian agents.

This is a bit of a weird case because it relies on the agents having a persistently wrong epistemology. More agnostically, we could also consider Newtonian agents that tend to want to build “interesting”, varied matter configurations, and are thereby more likely to stumble on esoteric physics like quantum computation. There are some complexities here (does it count as achieving convergent instrumental goals to create more advanced agents with “default” random goals, compared to the baseline of not doing so?) but at the very least, Newtonian agents that build interesting configurations seem to be more likely to have big effects than ones that don’t.

Generalizing a bit, different agent architectures could have different ontologies for the world model and utility function, e.g. Newtonian or quantum mechanical. If a Newtonian agent looks at a “random” quantum mechanical agent’s behavior, it might guess that it has a strong preference for building certain Newtonian matter configurations, e.g. ones that (in fact, unknown to it) correspond to quantum computers. More abstractly, a “default” / max-entropy measure on quantum mechanical utility functions might lead to behaviors that, projected back into Newtonian goals, look like having very specific preferences over Newtonian matter configurations. (Even more abstractly, see the Bertrand paradox showing that max-entropy distributions depend on parameterization.)

Maybe there is such a thing as a “universal agent architecture” in which there are no especially intelligent goals, but finding such an architecture would be difficult. This goes to show that identifying truly orthogonal goal-like axes is conceptually difficult; just because something seems like a final goal parameter doesn’t mean it is really orthogonal to intelligence.

Unusually intelligent utility functions relate to Nick Land’s idea of intelligence optimization. Quoting “Intelligence and the Good” (Xenosystems):

From the perspective of intelligence optimization (intelligence explosion formulated as a guideline), more intelligence is of course better than less intelligence… Even the dimmest, most confused struggle in the direction of intelligence optimization is immanently “good” (self-improving).

My point here is not to opine on the normativity of intelligence optimization, but rather to ask whether some utility functions within an architecture lead to more intelligence-optimization behavior. A rough guess is that especially intelligent goals within an agent architecture will tend to terminally value achieving conditions that increase intelligence in the universe.

Insurrealist, expounding on Land in “Intro to r/acc (part 1)”, writes:

Intelligence for us is, roughly, the ability of a physical system to maximize its future freedom of action. The interesting point is that “War Is God” seems to undermine any positive basis for action. If nothing is given, I have no transcendent ideal to order my actions and cannot select between them. This is related to the is-ought problem from Hume, the fact/value distinction from Kant, etc., and the general difficulty of deriving normativity from objective fact.

This class of problems seems to be no closer to resolution than it was a century ago, so what are we to do? The Landian strategy corresponds roughly to this: instead of playing games (in a very general, abstract sense) in accordance with a utility function predetermined by some allegedly transcendent rule, look at the collection of all of the games you can play, and all of the actions you can take, then reverse-engineer a utility function that is most consistent with your observations. This lets one not refute, but reject and circumvent the is-ought problem, and indeed seems to be deeply related to what connectionist systems, our current best bet for “AGI”, are actually doing.

The general idea of reverse-engineering a utility function suggests a meta-utility function, and a measure of intelligence is one such candidate. My intuition is that in the Newtonian agent architecture, a reverse-engineered utility function looks something like “exploring varied, interesting matter configurations of the sort that (in fact, perhaps unknown to the agent itself) tend to create large effects in non-Newtonian physics”.

To summarize main points:

  • Intelligence can be defined in a way that is not dependent on a fixed objective function, such as by measuring tendency to achieve convergent instrumental goals.

  • Within an agent architecture, effective behavioral differences other than intelligence can be identified, which for at least some architectures correspond with “final goals”, although finding the right orthogonal parameterization might be non-trivial.

  • Within an agent architecture already parameterized by final goals, intelligence may vary between final goals; especially unintelligent goals clearly exist, but especially intelligent goals would be more notable in cases where they exist.

  • Given an intelligence measure and agent architecture parameterized by goals, intelligence optimization could possibly correspond with some goals in that architecture; such reverse-engineered goals would be candidates for especially intelligent goals.