Measuring intelligence and reverse-engineering goals

It is analytically useful to define intelligence in the context of AGI. One intuitive notion is epistemology: an agent’s intelligence is how good its epistemology is, how good it is at knowing things and making correct guesses. But “intelligence” in AGI theory often means more than epistemology. An intelligent agent is supposed to be good at achieving some goal, not just knowing a lot of things.

So how could we define intelligent agency? Marcus Hutter’s universal intelligence measures an agent’s ability to achieve observable reward across a distribution of environments; AIXI maximizes this measure. Testing across a distribution makes sense: it avoids penalizing “unlucky” agents that fail in the actual world despite using effective strategies that succeed most of the time. However, maximizing observable reward is a sort of fixed goal function; it can’t accommodate intelligent agents that effectively achieve goals other than reward-maximization. This relates to inner alignment: an agent may not be “inner aligned” with AIXI’s reward-maximization objective, yet still be intelligent in the sense of effectively accomplishing something else.
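
For reference, the Legg–Hutter formalization of this measure has roughly the form Υ(π) = Σ_{μ ∈ E} 2^(−K(μ)) · V_μ^π (paraphrasing their definition), where E is a class of computable reward-emitting environments, K(μ) is the Kolmogorov complexity of environment μ, and V_μ^π is the expected cumulative reward the policy π obtains in μ. The 2^(−K(μ)) weighting is what implements testing across a distribution of environments; the reward term V_μ^π is where the fixed objective enters.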

To generalize, it is problematic to score an agent’s intelligence on the basis of a fixed utility function. It is fallacious to imagine a paperclip maximizer and say “it is not smart, it doesn’t even produce a lot of staples!” (or happiness for conscious beings, or whatever). Hopefully, the confusion of a relativist pluralism of intelligence measures (one measure per choice of utility function) can be avoided.

Of practical import is the agent’s “general effectiveness”. Both a paperclip maximizer and a staple maximizer would harness energy effectively, e.g. nuclear energy from stars. A generalization is Omohundro’s basic AI drives, or convergent instrumental goals: these are what effective utility-maximizing agents would tend to pursue almost regardless of the utility function.

So a proposed rough definition: An agent is intelligent to the extent it tends to achieve convergent instrumental goals. This is not meant to be a final definition; it might have conceptual problems (e.g. dependence on the VNM notion of intelligent agency), but it at least adds some specificity. “Tends to” here is similar to Hutter’s idea of testing an agent across a distribution of environments: an agent can tend to achieve value even when it actually fails (unluckily).
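
One very rough way to make this schematic (an illustrative formalization, not a standard one): Int(π) = E_{e ~ D}[Σ_i w_i · G_i(π, e)], where D is a distribution over environments (as in Hutter’s setup), each G_i scores achievement of some convergent instrumental goal in environment e (energy harnessed, self-preservation, option value, and so on), and the w_i are weights. The choices of D, G_i, and w_i are doing a lot of work, but note that no final goal appears anywhere in the measure.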

To cite prior work, Nick Land writes (in “What is intelligence”, Xenosystems):

Intelligence solves problems, by guiding behavior to produce local extropy. It is indicated by the avoidance of probable outcomes, which is equivalent to the construction of information.

This amounts to something similar to the convergent instrumental goal definition; achieving sufficiently specific outcomes involves pursuing convergent instrumental goals.

The convergent instrumental goal definition of intelligence may help study the Orthogonality Thesis. In Superintelligence, Bostrom states the thesis as:

Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.

(Previously I argued against a strong version of the thesis.)

Clearly, having a definition of intelligence helps clarify what the orthogonality thesis is stating. But the thesis also refers to “final goals”; how can that be defined? For example, what are the final goals of a mouse brain?

In some idealized cases, like a VNM-based agent that explicitly optimizes a defined utility function over universe trajectories, “final goal” is well-defined. However, it’s unclear how to generalize to less idealized cases. In particular, a given idealized optimization architecture has a type signature for goals, e.g. a Turing machine assigning a real number to universe trajectories, which themselves have some type signature (e.g. based on the physics model). But different type signatures for goals across different architectures, even idealized ones, make identification of final goals more difficult.
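
As a loose illustration of the type-signature issue, here is a sketch in Python-style type annotations; the names and types are invented for the example, not taken from any existing formalism:

```python
from typing import Callable, Dict, List

# Hypothetical goal type signatures for two idealized architectures.
NewtonianState = Dict[str, float]      # e.g. particle positions and velocities
Trajectory = List[NewtonianState]      # a universe history under one physics model

# Architecture 1: an explicit utility function over whole universe trajectories.
TrajectoryUtility = Callable[[Trajectory], float]

# Architecture 2: a Bellman-style value function over states, with no explicit
# "final goal" slot at all.
StateValue = Callable[[NewtonianState], float]

# The two "goal slots" have different types; reading off "the final goal" across
# architectures requires a translation between type signatures that may not exist.
```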

A different approach: what are the relevant effective features of an agent other than its intelligence? This doesn’t bake in a “goal” concept but asks a natural left-over question after defining intelligence. In an idealized case like paperclip maximizer vs. staple maximizer (with the same cognitive architecture and so on), while the agents behave fairly similarly (harnessing energy, expanding throughout the universe, and so on), there is a relevant effective difference in that they manufacture different objects towards the latter part of the universe’s lifetime. The difference in effective behavior, here, does seem to correspond with the differences in goals.

To provide some intuition for alternative agent architectures, I’ll give a framework inspired by the Bellman equation. To simplify, assume we have an MDP with S being a set of states, A being a set of actions, t(s’ | s, a) specifying the distribution over next states given the previous state and an action, and s₀ being the initial state. A value function V on states satisfies:

V(s) = max_{a ∈ A} E_{s’ ~ t(· | s, a)}[V(s’)]

This is a recurrent relationship in the sense that the values of states “depend on” the values of other states; the value function is a sort of fixed point. A valid policy for a value function must always select an action that maximizes the expected value of the following state. A difference with the usual Bellman equation is that there is no time discounting and no reward. (There are of course interesting modifications to this setup, such as relaxing the equality to an approximate equality, or having partial observability as in a POMDP; I’m starting with something simple.)

Now, what does the space of valid value functions for an MDP look like? As a very simple example, consider three states {start, left, right} and two actions {L, R}: ‘start’ is the starting state; ‘left’ always transitions to ‘left’ and ‘right’ always transitions to ‘right’; ‘start’ transitions to ‘left’ if the ‘L’ action is taken, and to ‘right’ if the ‘R’ action is taken. The value function can take on arbitrary values for ‘left’ and ‘right’, but the value of ‘start’ must be the maximum of the two.
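
Here is a minimal sketch of this example in code (the representation is just for illustration):

```python
states = ["start", "left", "right"]
actions = ["L", "R"]

def transition(state, action):
    """Deterministic transition function for the toy MDP."""
    if state == "start":
        return "left" if action == "L" else "right"
    return state  # 'left' and 'right' always transition to themselves

def is_valid_value_function(V):
    """Check the undiscounted, reward-free Bellman-style condition:
    V(s) must equal the max over actions of V(next state)."""
    return all(
        V[s] == max(V[transition(s, a)] for a in actions)
        for s in states
    )

# The absorbing states are free parameters; 'start' is then forced to be their max.
print(is_valid_value_function({"left": 3.0, "right": 7.0, "start": 7.0}))  # True
print(is_valid_value_function({"left": 3.0, "right": 7.0, "start": 3.0}))  # False
```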

We could say something like: the agent’s utility function is only over ‘left’ and ‘right’, and the value function can be derived from the utility function. This took some work, though; the utility function isn’t directly written down. It’s a way of interpreting the agent architecture and value function. We figure out what the “free parameters” are, and derive the rest of the value function from them.

It of course gets more complex in cases where we have infinite chains of different states, or cycles between more than one state; it would be less straightforward to say something like “you can assign any values to these states, and the values of other states follow from those”.

In “No Universally Compelling Arguments”, Eliezer Yudkowsky writes:

If you switch to the physical perspective, then the notion of a Universal Argument seems noticeably unphysical. If there’s a physical system that at time T, after being exposed to argument E, does X, then there ought to be another physical system that at time T, after being exposed to environment E, does Y. Any thought has to be implemented somewhere, in a physical system; any belief, any conclusion, any decision, any motor output. For every lawful causal system that zigs at a set of points, you should be able to specify another causal system that lawfully zags at the same points.

The switch from “zig” to “zag” is a hypothetical modification to an agent. In the case of the studied value functions, not all modifications to a value function (e.g. changing the value of a particular state) lead to another valid value function. The modifications we can make are more restricted: for example, perhaps we can change the value of a “cyclical” state (one that always transitions to itself), and then back-propagate the value change to preceding states.
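
Continuing the toy example above, here is a sketch of what a restricted, validity-preserving modification might look like; the repair procedure is illustrative, not a general algorithm:

```python
def repair(V, sweeps=10):
    """Re-impose V(s) = max_a V(next(s, a)) after a modification, by sweeping over
    states a few times. Absorbing states are left alone (they are the free
    parameters); every other state is recomputed from its successors."""
    for _ in range(sweeps):
        for s in states:
            successors = [transition(s, a) for a in actions]
            if any(t != s for t in successors):
                V[s] = max(V[t] for t in successors)
    return V

V = {"left": 3.0, "right": 7.0, "start": 7.0}
V["right"] = 1.0        # change the value of a cyclical (absorbing) state...
repair(V)
print(V["start"])       # ...and the change back-propagates: max(3.0, 1.0) = 3.0
```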

A more general statement: Changing a “zig” to a “zag” in an agent can easily change its intelligence. For example, perhaps the modification is to add a “fixed action pattern” where the modified agent does something useless (like digging a ditch and filling it) under some conditions. This modification to the agent would negatively impact its tendency to achieve convergent instrumental goals, and accordingly its intelligence according to our definition.

This raises the question: for a given agent, keeping its architecture fixed, what are the valid modifications that don’t change its intelligence? The results of such modifications are a sort of “level set” in the function mapping from agents within the architecture to intelligence. The Bellman-like value function setup makes the point that specifying the set of such modifications may be non-trivial; a careless modification could easily result in an invalid value function, leading to unintelligent, wasteful behavior.

A general analytical approach:

  • Consider some agent architecture, a set of programs.

  • Consider an intelligence function on this set of programs, based on something like “tendency to achieve convergent instrumental goals”.

  • Consider differences within some set of agents with equivalent intelligence; do they behave differently?

  • Consider whether the effective differences between agents with equivalent intelligence can be parametrized with something like a “final goal” or “utility function”.

Whereas classical decision theory assumes the agent architecture is parameterized by a utility function, this is more of a reverse-engineering approach: can we first identify an intelligence measure on agents within an architecture, then look for relevant differences between agents of a given intelligence, perhaps parametrized by something like a utility function?
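
Here is a toy instance of this reverse-engineering approach, reusing the three-state MDP from earlier. The “intelligence” score below is a crude stand-in (made-up instrumental resources secured at the end state), not a serious measure of convergent instrumental goal achievement:

```python
ENERGY = {"left": 10.0, "right": 10.0}   # made-up instrumental payoff at each end state

def policy_from_values(V):
    """The valid policy for a value function: from 'start', head for the more valued state."""
    return "L" if V["left"] >= V["right"] else "R"

def intelligence_score(V):
    """Stand-in intelligence: instrumental resources secured by following the policy."""
    return ENERGY[transition("start", policy_from_values(V))]

paperclipper = {"left": 1.0, "right": 0.0, "start": 1.0}   # values 'left' (paperclips, say)
stapler      = {"left": 0.0, "right": 1.0, "start": 1.0}   # values 'right' (staples, say)

print(intelligence_score(paperclipper), intelligence_score(stapler))  # 10.0 10.0
print(policy_from_values(paperclipper), policy_from_values(stapler))  # L R
# Equal intelligence, different effective behavior: the residual difference is the
# "final-goal-like" parameter the reverse-engineering approach looks for.
```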

There’s not necessarily a utility function directly encoded in an intelligent system such as a mouse brain; perhaps what is encoded directly is more like a Bellman state value function learned from reinforcement learning, influenced by evolutionary priors. In that case, it might be more analytically fruitful to identify relevant motivational features other than intelligence and see how final-goal-like they are, rather than starting from the assumption that there is a final goal.

Let’s consider orthogonality again, and take a somewhat different analytical approach. Suppose that agents in a given architecture are well-parametrized by their final goals. How could intelligence vary depending on the agent’s final goal?

As an example, suppose the agents have utility functions over universe trajectories, which vary both in what sort of states they prefer, and in their time preference (how much they care more about achieving valuable states soon). An agent with a very high time preference (i.e. very impatient) would probably be relatively unintelligent, as it tries to achieve value quickly, neglecting convergent instrumental goals such as amassing energy. So intelligence should usually increase with patience, although maximally patient agents may behave unintelligently in other ways, e.g. investing too much in unlikely ways of averting the heat death of the universe.
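
A minimal numerical illustration with made-up payoffs: an impatient agent (low discount factor) consumes immediately, while a patient agent first takes the instrumental step that pays off later.

```python
def discounted_value(payoff, delay, gamma):
    """Present value of a payoff received after `delay` time steps, with discount factor gamma."""
    return payoff * gamma ** delay

for gamma in (0.5, 0.99):                                   # impatient vs patient agent
    consume_now = discounted_value(1.0, delay=0, gamma=gamma)
    amass_energy_first = discounted_value(100.0, delay=20, gamma=gamma)
    better = "amass energy first" if amass_energy_first > consume_now else "consume now"
    print(f"gamma={gamma}: {better}")
# gamma=0.5  -> consume now          (100 * 0.5**20  is about 1e-4, less than 1)
# gamma=0.99 -> amass energy first   (100 * 0.99**20 is about 82, more than 1)
```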

There could also be especially unintelligent goals, such as the goal of dying as fast as possible. An agent pursuing this goal would of course tend to fail to achieve convergent instrumental goals. (Bostrom and Yudkowsky would agree that such cases exist, and that they require putting some conditions on the orthogonality thesis.)

A more interesting question is whether there are especially intelligent goals, ones whose pursuit leads to especially high convergent instrumental goal achievement relative to “most” goals. A sketch of an example: Suppose we are considering a class of agents that assume Newtonian physics is true, and have preferences over Newtonian universe configurations. Some such agents have the goal of building Newtonian configurations that are (in fact, unknown to them) valid quantum computers. These agents might be especially intelligent, as they pursue the convergent instrumental goal of building quantum computers (thus unleashing even more intelligent agents, which build more quantum computers), unlike most Newtonian agents.

This is a bit of a weird case because it relies on the agents having a persistently wrong epistemology. More agnostically, we could also consider Newtonian agents that tend to want to build “interesting”, varied matter configurations, and are thereby more likely to stumble on esoteric physics like quantum computation. There are some complexities here (does it count as achieving convergent instrumental goals to create more advanced agents with “default” random goals, compared to the baseline of not doing so?) but at the very least, Newtonian agents that build interesting configurations seem to be more likely to have big effects than ones that don’t.

Generalizing a bit, different agent architectures could have different ontologies for the world model and utility function, e.g. Newtonian or quantum mechanical. If a Newtonian agent looks at a “random” quantum mechanical agent’s behavior, it might guess that it has a strong preference for building certain Newtonian matter configurations, e.g. ones that (in fact, unknown to it) correspond to quantum computers. More abstractly, a “default” / max-entropy measure on quantum mechanical utility functions might lead to behaviors that, projected back into Newtonian goals, look like having very specific preferences over Newtonian matter configurations. (Even more abstractly, see the Bertrand paradox showing that max-entropy distributions depend on parameterization.)

Maybe there is such a thing as a “universal agent architecture” in which there are no especially intelligent goals, but finding such an architecture would be difficult. This goes to show that identifying truly orthogonal goal-like axes is conceptually difficult; just because something seems like a final goal parameter doesn’t mean it is really orthogonal to intelligence.

Unusually intelligent utility functions relate to Nick Land’s idea of intelligence optimization. Quoting “Intelligence and the Good” (Xenosystems):

From the perspective of intelligence optimization (intelligence explosion formulated as a guideline), more intelligence is of course better than less intelligence… Even the dimmest, most confused struggle in the direction of intelligence optimization is immanently “good” (self-improving).

My point here is not to opine on the normativity of intelligence optimization, but rather to ask whether some utility functions within an architecture lead to more intelligence-optimization behavior. A rough guess is that especially intelligent goals within an agent architecture will tend to terminally value achieving conditions that increase intelligence in the universe.

Insurrealist, expounding on Land in “Intro to r/acc (part 1)”, writes:

Intelligence for us is, roughly, the ability of a physical system to maximize its future freedom of action. The interesting point is that “War Is God” seems to undermine any positive basis for action. If nothing is given, I have no transcendent ideal to order my actions and cannot select between them. This is related to the is-ought problem from Hume, the fact/value distinction from Kant, etc., and the general difficulty of deriving normativity from objective fact.

This class of problems seems to be no closer to resolution than it was a century ago, so what are we to do? The Landian strategy corresponds roughly to this: instead of playing games (in a very general, abstract sense) in accordance with a utility function predetermined by some allegedly transcendent rule, look at the collection of all of the games you can play, and all of the actions you can take, then reverse-engineer a utility function that is most consistent with your observations. This lets one not refute, but reject and circumvent the is-ought problem, and indeed seems to be deeply related to what connectionist systems, our current best bet for “AGI”, are actually doing.

The general idea of reverse-engineering a utility function suggests a meta-utility function, and a measure of intelligence is one such candidate. My intuition is that in the Newtonian agent architecture, a reverse-engineered utility function looks something like “exploring varied, interesting matter configurations of the sort that (in fact, perhaps unknown to the agent itself) tend to create large effects in non-Newtonian physics”.

To summarize main points:

  • Intelligence can be defined in a way that is not dependent on a fixed objective function, such as by measuring tendency to achieve convergent instrumental goals.

  • Within an agent architecture, effective behavioral differences other than intelligence can be identified, which for at least some architectures correspond with “final goals”, although finding the right orthogonal parameterization might be non-trivial.

  • Within an agent architecture already parameterized by final goals, intelligence may vary between final goals; especially unintelligent goals clearly exist, but especially intelligent goals would be more notable in cases where they exist.

  • Given an intelligence measure and agent architecture parameterized by goals, intelligence optimization could possibly correspond with some goals in that architecture; such reverse-engineered goals would be candidates for especially intelligent goals.