This idea was inspired by a correspondence with Adam Shimi.

It seem very interesting and important to understand to what extent a purely “behaviorist” view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what are its goals, without any additional information?

Consider a general reinforcement learning settings: we have a set of actions A, a set of observations O, a policy is a mapping π:(A×O)∗→ΔA, a reward function is a mapping r:(A×O)∗→[0,1], the utility function is a time discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)

The simplest attempt at defining “goal-directed intelligence” is requiring that the policy π in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows π, or the prior can believe that behavior not according to π leads to some terrible outcome.

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are “contrived”. However, description complexity is only naturally well-defined up to an additive constant. So, if we want to have a crisp concept, we need to consider an asymptotic in which the complexity of something goes to infinity. Indeed, it seems natural to ask that the complexity of the policy should be much higher than the complexity of the prior and the reward function: in this case we can say that the “intentional stance” is an efficient description. However, this doesn’t make sense with description complexity: the description “optimal policy for U and ζ” is of size K(U)+K(ζ)+O(1) (K(x) stands for “description complexity of x”).

To salvage this idea, we need to take not only description complexity but also computational complexity into account. [EDIT: I was wrong, and we can get a well-defined concept in the unbounded setting too, see child comment. The bounded concept is still interesting.] For the intentional stance to be non-vacuous we need to demand that the policy does some “hard work” in order to be optimal. Let’s make it formal. Consider any function of the type f:Σ∗→ΔΞ where Σ and Ξ are some finite alphabets. Then, we can try to represent it by a probabilistic automaton T:S×Σ→Δ(S×Ξ), where S is the finite set space, T is the transition kernel, and we’re feeding symbols into the automaton one by one. Moreover, T can be represented as a boolean circuit R and this circuit can be the output of some program P executed by some fixed universal Turing machine. We can associate with this object 5 complexity parameters:

The description complexity, which is the length of P.

The computation time complexity, which is the size of R.

The computation space complexity, which is the maximum between the depth of R and log|S|.

The precomputation time complexity, which is the time it takes P to run.

The precomputation space complexity, which is the space P needs to run.

It is then natural to form a single complexity measure by applying a logarithm to the times and taking a linear combination of all 5 (we apply a logarithm so that a brute force search over n bits is roughly equivalent to hard-coding n bits). The coefficients in this combination represent the “prices” of the various resources (but we should probably fix the price of description complexity to be 1). Of course not all coefficients must be non-vanishing, it’s just that I prefer to keep maximal generality for now. We will denote this complexity measure C.

We can use such automatons to represent policies, finite POMDP environments and reward functions (ofc not any policy or reward function, but any that can be computed on a machine with finite space). In the case of policies, the computation time/space complexity can be regarded as the time/space cost of applying the “trained” algorithm, whereas the precomputation time/space complexity can be regarded as the time/space cost of training. If we wish, we can also think of the boolean circuit as a recurrent neural network.

We can also use C to define a prior ζ0, by ranging over programs P that output a valid POMDP and assigning probability proportional to 2−C to each instance. (Assuming that the environment has a finite state space might seem restrictive, but becomes quite reasonable if we use a quasi-Bayesian setting with quasi-POMDPs that are not meant to be complete descriptions of the environment; for now we won’t go into details about this.)

Now, return to our policy π. Given g>0, we define that ”π has goal-directed intelligence (at least) g” when there is a suitable prior ζ and utility function U s.t. for any policy π′, if Eζπ′[U]≥Eζπ[U] then C(π′)≥DKL(ζ0||ζ)+C(U)+g. When g=+∞ (i.e. no finite automaton can match the expected utility of π; in particular, this implies π is optimal since any policy can be approximated by a finite automaton), we say that π is “perfectly goal-directed”. Here, DKL(ζ0||ζ) serves as a way to measure the complexity of ζ, which also ensures ζ is non-dogmatic in some rather strong sense.

[EDIT: if we fix U and ζ then g is essentially the same as Yudkowsky’s definition of optimization power if we regard the policy as the “outcome” and use 2−C as our measure on the space of outcomes.]

With this definition we cannot “cheat” by encoding the policy into the prior or into the utility function, since that would allow no complexity difference. Therefore this notion seems like a non-trivial requirement on the policy. On the other hand, this requirement does hold sometimes, because solving the optimization problem can be much more computationally costly than just evaluating the utility function or sampling the prior.

Actually, as opposed to what I claimed before, we don’t need computational complexity bounds for this definition to make sense. This is because the Solomonoff prior is made of computable hypotheses but is uncomputable itself.

Given g>0, we define that ”π has (unbounded) goal-directed intelligence (at least) g” when there is a prior ζ and utility function U s.t. for any policy π′, if Eζπ′[U]≥Eζπ[U] then K(π′)≥DKL(ζ0||ζ)+K(U)+g. Here, ζ0 is the Solomonoff prior and K is Kolmogorov complexity. When g=+∞ (i.e. no computable policy can match the expected utility of π; in particular, this implies π is optimal since any policy can be approximated by a computable policy), we say that π is “perfectly (unbounded) goal-directed”.

Compare this notion to the Legg-Hutter intelligence measure. The LH measure depends on the choice of UTM in radical ways. In fact, for some UTMs, AIXI (which is the maximum of the LH measure) becomes computable or even really stupid. For example, it can always keep taking the same action because of the fear that taking any other action leads to an inescapable “hell” state. On the other hand, goal-directed intelligence differs only by O(1) between UTMs, just like Kolmogorov complexity. A perfectly unbounded goal-directed policy has to be uncomputable, and the notion of which policies are such doesn’t depend on the UTM at all.

I think that it’s also possible to prove that intelligence is rare, in the sense that, for any computable stochastic policy, if we regard it as a probability measure over deterministic policies, then for any ϵ>0 there is g s.t. the probability to get intelligence at least g is smaller than ϵ.

Also interesting is that, for bounded goal-directed intelligence, increasing the prices can only decrease intelligence by O(1), and a policy that is perfectly goal-directed w.r.t. lower prices is also such w.r.t. higher prices (I think). In particular, a perfectly unbounded goal-directed policy is perfectly goal-directed for any price vector. Informally speaking, an agent that is very smart relatively to a context with cheap computational resources is still very smart relatively to a context where they are expensive, which makes intuitive sense.

If we choose just one computational resource, we can speak of the minimal price for which a given policy is perfectly goal-directed, which is another way to measure intelligence with a more restricted domain. Curiously, our bounded Solomonoff-like prior has the shape of a Maxwell-Boltzmann distribution in which the prices are thermodynamic parameters. Perhaps we can regard the minimal price as the point of a phase transition.

Some problems to work on regarding goal-directed intelligence. Conjecture 5 is especially important for deconfusing basic questions in alignment, as it stands in opposition to Stuart Armstrong’s thesis about the impossibility to deduce preferences from behavior alone.

Conjecture. Informally: It is unlikely to produce intelligence by chance. Formally: Denote Π the space of deterministic policies, and consider some μ∈ΔΠ. Suppose μ is equivalent to a stochastic policy π∗. Then, Eπ∼μ[g(π)]=O(C(π∗)).

Find an “intelligence hierarchy theorem”. That is, find an increasing sequence {gn} s.t. for every n, there is a policy with goal-directed intelligence in (gn,gn+1) (no more and no less).

What is the computational complexity of evaluating g given (i) oracle access to the policy or (ii) description of the policy as a program or automaton?

What is the computational complexity of producing a policy with given g?

Conjecture. Informally: Intelligent agents have well defined priors and utility functions. Formally: For every (U,ζ) with C(U)<∞ and DKL(ζ0||ζ)<∞, and every ϵ>0, there exists g∈(0,∞) s.t. for every policy π with intelligence at least g w.r.t. (U,ζ), and every (~U,~ζ) s.t.π has intelligence at least g w.r.t. them, any optimal policies π∗,~π∗ for (U,ζ) and (~U,~ζ) respectively satisfy Eζ~π∗[U]≥Eζπ∗[U]−ϵ.

re: #5, that doesn’t seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming. That is, assuming 5, we still cannot show that there isn’t some U1≠U2 such that π∗(U1,ζ)=π∗(U2,ζ).

(And as pointed out elsewhere, it isn’t Stuart’s thesis, it’s a well known and basic result in the decision theory / economics / philosophy literature.)

re: #5, that doesn’t seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming.

You misunderstand the intent. We’re talking about inverse reinforcement learning. The goal is not necessarily inferring the unknown U, but producing some behavior that optimizes the unknown U. Ofc if the policy you’re observing is optimal then it’s trivial to do so by following the same policy. But, using my approach we might be able to extend it into results like “the policy you’re observing is optimal w.r.t. certain computational complexity, and your goal is to produce an optimal policy w.r.t. higher computational complexity.”

(Btw I think the formal statement I gave for 5 is false, but there might be an alternative version that works.)

(And as pointed out elsewhere, it isn’t Stuart’s thesis, it’s a well known and basic result in the decision theory / economics / philosophy literature.)

I am referring to this and related work by Armstrong.

This idea was inspired by a correspondence with Adam Shimi.It seem very interesting and important to understand to what extent a purely “behaviorist” view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what are its goals, without any additional information?

Consider a general reinforcement learning settings: we have a set of actions A, a set of observations O, a policy is a mapping π:(A×O)∗→ΔA, a reward function is a mapping r:(A×O)∗→[0,1], the utility function is a time discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)

The simplest attempt at defining “goal-directed intelligence” is requiring that the policy π in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows π, or the prior can believe that behavior not according to π leads to some terrible outcome.

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are “contrived”. However, description complexity is only naturally well-defined up to an additive constant. So, if we want to have a crisp concept, we need to consider an asymptotic in which the complexity of

somethinggoes to infinity. Indeed, it seems natural to ask that the complexity of the policy should be much higher than the complexity of the prior and the reward function: in this case we can say that the “intentional stance” is an efficient description. However, this doesn’t make sense with description complexity: the description “optimal policy for U and ζ” is of size K(U)+K(ζ)+O(1) (K(x) stands for “description complexity of x”).To salvage this idea, we need to take not only description complexity but also

computationalcomplexity into account. [EDIT: I was wrong, and we can get a well-defined concept in the unbounded setting too, see child comment. The bounded concept is still interesting.] For the intentional stance to be non-vacuous we need to demand that the policy does some “hard work” in order to be optimal. Let’s make it formal. Consider any function of the type f:Σ∗→ΔΞ where Σ and Ξ are some finite alphabets. Then, we can try to represent it by a probabilistic automaton T:S×Σ→Δ(S×Ξ), where S is the finite set space, T is the transition kernel, and we’re feeding symbols into the automaton one by one. Moreover, T can be represented as a boolean circuit R and this circuit can be the output of some program P executed by some fixed universal Turing machine. We can associate with this object 5 complexity parameters:The description complexity, which is the length of P.

The computation time complexity, which is the size of R.

The computation space complexity, which is the maximum between the depth of R and log|S|.

The precomputation time complexity, which is the time it takes P to run.

The precomputation space complexity, which is the space P needs to run.

It is then natural to form a single complexity measure by applying a logarithm to the times and taking a linear combination of all 5 (we apply a logarithm so that a brute force search over n bits is roughly equivalent to hard-coding n bits). The coefficients in this combination represent the “prices” of the various resources (but we should probably fix the price of description complexity to be 1). Of course not all coefficients must be non-vanishing, it’s just that I prefer to keep maximal generality for now. We will denote this complexity measure C.

We can use such automatons to represent policies, finite POMDP environments and reward functions (ofc not

anypolicy or reward function, but any that can be computed on a machine with finite space). In the case of policies, the computation time/space complexity can be regarded as the time/space cost of applying the “trained” algorithm, whereas the precomputation time/space complexity can be regarded as the time/space cost of training. If we wish, we can also think of the boolean circuit as a recurrent neural network.We can also use C to define a prior ζ0, by ranging over programs P that output a valid POMDP and assigning probability proportional to 2−C to each instance. (Assuming that the environment has a finite state space might seem restrictive, but becomes quite reasonable if we use a quasi-Bayesian setting with quasi-POMDPs that are not meant to be complete descriptions of the environment; for now we won’t go into details about this.)

Now, return to our policy π. Given g>0, we define that ”π has goal-directed intelligence (at least) g” when there is a suitable prior ζ and utility function U s.t. for any policy π′, if Eζπ′[U]≥Eζπ[U] then C(π′)≥DKL(ζ0||ζ)+C(U)+g. When g=+∞ (i.e. no finite automaton can match the expected utility of π; in particular, this implies π is optimal since any policy can be

approximatedby a finite automaton), we say that π is “perfectly goal-directed”. Here, DKL(ζ0||ζ) serves as a way to measure the complexity of ζ, which also ensures ζ is non-dogmatic in some rather strong sense.[EDIT: if we fix U and ζ then g is essentially the same as Yudkowsky’s definition of optimization power if we regard the policy as the “outcome” and use 2−C as our measure on the space of outcomes.]

With

thisdefinition we cannot “cheat” by encoding the policy into the prior or into the utility function, since that would allow no complexity difference. Therefore this notion seems like a non-trivial requirement on the policy. On the other hand, this requirementdoeshold sometimes, because solving the optimization problem can be much more computationally costly than just evaluating the utility function or sampling the prior.Actually, as opposed to what I claimed before, we don’t need computational complexity bounds for this definition to make sense. This is because the Solomonoff prior is made of computable hypotheses but is uncomputable itself.

Given g>0, we define that ”π has (unbounded) goal-directed intelligence (at least) g” when there is a prior ζ and utility function U s.t. for any policy π′, if Eζπ′[U]≥Eζπ[U] then K(π′)≥DKL(ζ0||ζ)+K(U)+g. Here, ζ0 is the Solomonoff prior and K is Kolmogorov complexity. When g=+∞ (i.e. no computable policy can match the expected utility of π; in particular, this implies π is optimal since any policy can be

approximatedby a computable policy), we say that π is “perfectly (unbounded) goal-directed”.Compare this notion to the Legg-Hutter intelligence measure. The LH measure depends on the choice of UTM in radical ways. In fact, for some UTMs, AIXI (which is the maximum of the LH measure) becomes computable or even really stupid. For example, it can always keep taking the same action because of the fear that taking any other action leads to an inescapable “hell” state. On the other hand, goal-directed intelligence differs only by O(1) between UTMs, just like Kolmogorov complexity. A perfectly unbounded goal-directed policy has to be uncomputable, and the notion of which policies are such doesn’t depend on the UTM at all.

I think that it’s also possible to prove that intelligence is rare, in the sense that, for any computable stochastic policy, if we regard it as a probability measure over deterministic policies, then for any ϵ>0 there is g s.t. the probability to get intelligence at least g is smaller than ϵ.

Also interesting is that, for bounded goal-directed intelligence, increasing the prices can only decrease intelligence by O(1), and a policy that is perfectly goal-directed w.r.t. lower prices is also such w.r.t. higher prices (I think). In particular, a perfectly unbounded goal-directed policy is perfectly goal-directed for

anyprice vector. Informally speaking, an agent that is very smart relatively to a context with cheap computational resources is still very smart relatively to a context where they are expensive, which makes intuitive sense.If we choose just one computational resource, we can speak of the minimal price for which a given policy is perfectly goal-directed, which is another way to measure intelligence with a more restricted domain. Curiously, our bounded Solomonoff-like prior has the shape of a Maxwell-Boltzmann distribution in which the prices are thermodynamic parameters. Perhaps we can regard the minimal price as the point of a phase transition.

Some problems to work on regarding goal-directed intelligence. Conjecture 5 is especially important for deconfusing basic questions in alignment, as it stands in opposition to Stuart Armstrong’s thesis about the impossibility to deduce preferences from behavior alone.

Conjecture. Informally: It is unlikely to produce intelligence by chance. Formally: Denote Π the space of deterministic policies, and consider some μ∈ΔΠ. Suppose μ is equivalent to a stochastic policy π∗. Then, Eπ∼μ[g(π)]=O(C(π∗)).

Find an “intelligence hierarchy theorem”. That is, find an increasing sequence {gn} s.t. for every n, there is a policy with goal-directed intelligence in (gn,gn+1) (no more and no less).

What is the computational complexity of evaluating g given (i) oracle access to the policy or (ii) description of the policy as a program or automaton?

What is the computational complexity of producing a policy with given g?

Conjecture. Informally: Intelligent agents have well defined priors and utility functions. Formally: For every (U,ζ) with C(U)<∞ and DKL(ζ0||ζ)<∞, and every ϵ>0, there exists g∈(0,∞) s.t. for every policy π with intelligence at least g w.r.t. (U,ζ), and every (~U,~ζ) s.t.π has intelligence at least g w.r.t. them, any optimal policies π∗,~π∗ for (U,ζ) and (~U,~ζ) respectively satisfy Eζ~π∗[U]≥Eζπ∗[U]−ϵ.

re: #5, that doesn’t seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming. That is, assuming 5, we still cannot show that there isn’t some U1≠U2 such that π∗(U1,ζ)=π∗(U2,ζ).

(And as pointed out elsewhere, it isn’t Stuart’s thesis, it’s a well known and basic result in the decision theory / economics / philosophy literature.)

You misunderstand the intent. We’re talking about inverse reinforcement learning. The goal is not necessarily inferring the unknown U, but producing some behavior that optimizes the unknown U. Ofc if the policy you’re observing is optimal then it’s trivial to do so by following the same policy. But, using my approach we might be able to extend it into results like “the policy you’re observing is optimal w.r.t. certain computational complexity, and your goal is to produce an optimal policy w.r.t. higher computational complexity.”

(Btw I think the formal statement I gave for 5 is false, but there might be an alternative version that works.)

I am referring to this and related work by Armstrong.