TLDR: Systems with locally maximal influence can be approximated by VNM decision-makers.
There are at least 3 different motivations leading to the concept of “agent” in the context of AI alignment:
The sort of system we are concerned about (i.e. which poses risk)
The sort of system we want to build (in order to defend from dangerous systems)
The sort of systems that humans are (in order to meaningfully talk about “human preferences”)
Motivation #1 naturally suggests a descriptive approach, motivation #2 naturally suggests a prescriptive approach, and motivation #3 is sort of a mix of both: on the one hand, we’re describing something that already exists; on the other hand, the concept of “preferences” inherently comes from a normative perspective. There are also reasons to think these different motivations should converge on a single, coherent concept.
Here, we will focus on motivation #1.
A central reason why we are concerned about powerful unaligned agents is that they are influential. Agents are the sort of system that, when instantiated in a particular environment, is likely to heavily change this environment, potentially in ways inconsistent with the preferences of other agents.
Bayesian VNM
Consider a nice space[1] $X$ of possible “outcomes”, and a system that can choose[2] out of a closed set of distributions $D \subseteq \Delta X$. I propose that an influential system should satisfy the following desideratum:
The system cannot select a $\mu^* \in D$ which can be represented as a non-trivial lottery over other elements of $D$. In other words, $\mu^*$ has to be an extreme point of the convex hull of $D$.
Why? Because a system that selects a non-extreme point leaves something to chance. If the system can force outcome $\mu \in \Delta X$ or outcome $\nu \in \Delta X$, but instead chooses $p\mu + (1-p)\nu$ for some $p \in (0,1)$ and $\mu \neq \nu$, then the system has given up its ability to choose between $\mu$ and $\nu$ in favor of a $p$-biased coin. Such a system is not “locally[3] maximally” influential[4].
[EDIT: The original formulation was wrong, h/t @harfe for catching the error.]
The desideratum implies that there is a convergent sequence of utility functions $\{u_k : X \to \mathbb{R}\}_{k \in \mathbb{N}}$ s.t.
For every $k \in \mathbb{N}$, $\mathbb{E}_\mu[u_k]$ has a unique maximizer $\mu_k$ over $\mu \in D$.
The sequence $\mu_k$ converges to $\mu^*$.
In other words, such a system can be approximated by a VNM decision-maker to any precision. For finite $D$ we don’t need the sequence: there is some $u : X \to \mathbb{R}$ s.t. $\mu^*$ is the unique maximizer of $\mathbb{E}_\mu[u]$ over $D$. This observation is mathematically quite simple, but I haven’t seen it made elsewhere (though I would not be surprised if it appears somewhere in the decision theory literature).
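To make the finite case concrete, here is a minimal sketch (mine, not from the post; names are illustrative): for a finite outcome space and finite $D$, a utility $u$ making $\mu^*$ the unique maximizer exists exactly when the linear program below finds a strictly positive margin, i.e. when $\mu^*$ is an exposed point of the convex hull of $D$.

```python
import numpy as np
from scipy.optimize import linprog

def exposing_utility(D, i_star):
    """Search for u with u @ D[i_star] > u @ D[i] for every i != i_star.

    D: (m, n) array whose rows are distributions over n outcomes.
    Returns u (length n) if D[i_star] is an exposed point of conv(D),
    else None.
    """
    m, n = D.shape
    others = np.delete(D, i_star, axis=0)
    # Variables are (u, delta); maximize the margin delta subject to
    # (D[i] - D[i_star]) @ u + delta <= 0 for every other row i.
    A_ub = np.hstack([others - D[i_star], np.ones((m - 1, 1))])
    b_ub = np.zeros(m - 1)
    c = np.zeros(n + 1)
    c[-1] = -1.0  # linprog minimizes, so minimize -delta
    bounds = [(-1.0, 1.0)] * n + [(None, 1.0)]  # keep the LP bounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    if res.status == 0 and -res.fun > 1e-9:
        return res.x[:n]
    return None

# Distributions over X = {0, 1}: two point masses and their 50/50 mixture.
D = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(exposing_utility(D, 0))  # a utility exposing the point mass on 0
print(exposing_utility(D, 2))  # None: a non-trivial lottery is not extreme
```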
Infra-Bayesian VNM?
Now, let’s say that the system is choosing out of a set of credal sets (crisp infradistributions) $D \subseteq \Box X$. I propose the following desideratum:
[EDIT: Corrected according to a suggestion by @harfe, original version was too weak.]
Let $\hat{D}$ be the closure of $D$ w.r.t. convex combinations and joins[5]. Let $\Theta^* \in \Box X$ be selected by the system. Then:
For any $\Phi, \Psi \in \hat{D}$ and $p \in (0,1)$: if $\Theta^* = p\Phi + (1-p)\Psi$ then $\Phi = \Psi$.
For any $\Phi \in \hat{D}$: if $\Phi \subseteq \Theta^*$ then $\Phi = \Theta^*$.
The justification is that a locally maximally influential system should leave the outcome neither to chance nor to ambiguity (the two types of uncertainty we have with credal sets).
We would like to say that this implies the system is choosing according to maximin relative to a particular utility function. However, I don’t think this is true, as the following example shows:
Example: Let $X = \{0,1\}$, and let $D$ consist of the probability intervals $\Theta_0 := [0, \tfrac{2}{3}]$, $\Theta_1 := [\tfrac{1}{3}, 1]$ and $\Theta_2 := [\tfrac{1}{3}, \tfrac{2}{3}]$ (each interval being the allowed range for the probability of outcome $1$). Then, it is (I think) consistent with the desideratum to have $\Theta^* = \Theta_2$.
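As a quick numerical check of this (my own sketch, not from the post): sweeping over utility functions on $X = \{0,1\}$ (only the difference $u(1) - u(0)$ matters) confirms that no ordinary utility makes $\Theta_2$ the unique maximin choice, so the desideratum cannot be equivalent to maximin with a fixed utility function.

```python
import numpy as np

# Each credal set is an interval [a, b] of probabilities of outcome 1.
sets = {"Theta0": (0.0, 2 / 3), "Theta1": (1 / 3, 1.0), "Theta2": (1 / 3, 2 / 3)}

def maximin_value(interval, u0, u1):
    # E_{x~p}[u] = u0 + p * (u1 - u0) is monotone in p, so the worst-case
    # p is always an endpoint of the interval.
    a, b = interval
    p_worst = a if u1 >= u0 else b
    return u0 + p_worst * (u1 - u0)

# WLOG fix u0 = 0 and sweep u1.
for u1 in np.linspace(-1.0, 1.0, 2001):
    vals = {name: maximin_value(iv, 0.0, u1) for name, iv in sets.items()}
    best = max(vals.values())
    winners = {name for name, v in vals.items() if v >= best - 1e-12}
    assert winners != {"Theta2"}, f"Theta2 uniquely optimal at u1 = {u1}"
print("Theta2 is never the unique maximin optimum: it always ties.")
```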
Instead, I have the following conjecture:
Conjecture: There exists some space $Y$, some $\xi \in \Delta Y$ and a convergent sequence $\{u_k : Y \times X \to \mathbb{R}\}_{k \in \mathbb{N}}$ s.t.
$$\Theta^* = \lim_{k \to \infty} \operatorname*{argmax}_{\Theta \in D} \mathbb{E}_{y \sim \xi}\left[\min_{\mu \in \Theta} \mathbb{E}_{x \sim \mu}[u_k(y,x)]\right]$$

As before, the maxima should be unique.
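For what it’s worth, the example above is consistent with this conjecture, and no sequence is even needed (a quick construction of mine, worth double-checking): take $Y = \{0,1\}$, $\xi$ uniform, and $u(y,x) := 1[y = x]$. Writing each $\Theta \in D$ as its interval $[a,b]$ of probabilities of outcome $1$,

$$\mathbb{E}_{y \sim \xi}\left[\min_{\mu \in \Theta} \mathbb{E}_{x \sim \mu}[u(y,x)]\right] = \frac{(1-b) + a}{2} = \frac{1 - (b - a)}{2}$$

which is uniquely maximized over $D$ by the narrowest interval: $\Theta_2$ scores $\tfrac{1}{3}$, while $\Theta_0$ and $\Theta_1$ both score $\tfrac{1}{6}$.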
Such a “generalized utility function” can be represented as an ordinary utility function with a latent $Y$-valued variable, if we replace $D$ with $D' \subseteq \Box(Y \times X)$ defined by
$$D' := \{\xi \ltimes \Theta \mid \Theta \in D\}$$

However, utility functions constructed in this way lead to issues with learnability, which probably means there are also issues with computational feasibility. Perhaps in some natural setting, there is a notion of “maximally influential under computational constraints” which implies an “ordinary” maximin decision rule.
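To spell out why this rewriting works (my own unrolling of the definitions, as I understand the semidirect product of a distribution with a constant credal-set kernel): $\xi \ltimes \Theta$ consists of the joint distributions over $Y \times X$ whose $Y$-marginal is $\xi$ and whose conditionals lie in $\Theta$, so

$$\min_{\lambda \in \xi \ltimes \Theta} \mathbb{E}_{(y,x) \sim \lambda}[u(y,x)] = \mathbb{E}_{y \sim \xi}\left[\min_{\mu \in \Theta} \mathbb{E}_{x \sim \mu}[u(y,x)]\right]$$

and maximin over $D'$ with the ordinary utility $u$ reproduces the generalized objective over $D$.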
This approach does rule out optimistic or “mesomistic” decision rules. Optimistic decision-makers tend to give up on influence, because they believe that “nature” will decide favorably for them. Influential agents cannot give up on influence; therefore, they should be pessimistic.
Sequential Decision-Making
What would be the implications in a sequential setting? That is, suppose that we have a set of actions $A$, a set of observations $O$, $X := (A \times O)^\omega$, a prior $\zeta : (A \times O)^* \times A \to \Delta O$ and
$$D := \{\zeta\pi \mid \pi : O^* \to A\}$$

In this setting, the result is vacuous because of an infamous issue: any policy can be justified by a contrived utility function that favors it. However, this is only because the formal desideratum doesn’t capture the notion of “influence” sufficiently well. Indeed, a system whose influence boils down entirely to its own outputs is not truly influential. What motivation #1 asks of us is to talk about systems that influence the world-at-large, including relatively “faraway” locations.
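(Unpacking the notation above, in my own words: $\zeta\pi \in \Delta X$ is the history distribution generated by running the policy $\pi$ against the prior $\zeta$, i.e. the measure under which $a_n = \pi(o_0 \ldots o_{n-1})$ with probability $1$ and $o_n \sim \zeta(a_0 o_0 \ldots a_{n-1} o_{n-1}, a_n)$ for every $n$.)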
One way to fix some of the problem is to take $X := O^\omega$ and define $D$ accordingly. This singles out systems that have influence over their observations rather than only their actions, which is already non-vacuous (some policies are not such). However, such a system can still be myopic. We can take this further and select for “long-term” influence by projecting onto late observations or some statistics over observations. However, in order to talk about actually “far-reaching” influence, we probably need to switch to the infra-Bayesian physicalism setting. There, we can set $X := 2^\Gamma$, i.e. select for systems that have influence over physically manifest computations.
[1] I won’t keep track of topological technicalities here; probably everything here works at least for compact Polish spaces.
[2] Meaning that the system has some output, and different counterfactual outputs correspond to different elements of $D$.
[3] I say “locally” because it refers to something like a partial order, not a global scalar measure of influence.
[4] See also Yudkowsky’s notion of efficient systems “not leaving free energy”.
[5] That is, if $\Psi, \Phi \in \hat{D}$ then their join (convex hull) $\Psi \vee \Phi$ is also in $\hat{D}$, and so is $p\Psi + (1-p)\Phi$ for every $p \in [0,1]$. Moreover, $\hat{D}$ is the minimal closed superset of $D$ with this property. Notice that this implies $\hat{D}$ is closed w.r.t. arbitrary infra-convex combinations, i.e. for any $Y$, $K : Y \to \hat{D}$ and $\Xi \in \Box Y$, we have $K \ast \Xi \in \hat{D}$.
I think there are some subtleties with the (non-infra) Bayesian VNM version, which come down to the difference between an “extreme point” and an “exposed point” of $D$. If a point is an extreme point that is not an exposed point, then it cannot be the unique expected utility maximizer under a utility function (but it can be a non-unique maximizer).
For extreme points it might still work with uniqueness if, instead of a VNM decision-maker, we require a slightly weaker decision-maker whose preferences satisfy the VNM axioms except continuity.
Another excellent catch, kudos. I’ve really been sloppy with this shortform. I corrected it to say that we can approximate the system arbitrarily well by VNM decision-makers. Although, I think it’s also possible to argue that a system that selects a non-exposed point is not quite maximally influential, because it’s selecting something that’s very close to delegating some decision power to chance.
Also, maybe this cannot happen when $D$ is the inverse limit of finite sets? (As is the case in sequential decision-making with finite action/observation spaces.) I’m not sure.
I think this condition might be too weak and the conjecture is not true under this definition.
If $\Phi_1 \subseteq \Phi_2$, then we have

$$\mathbb{E}_{y \sim \xi}\left[\min_{\mu \in \Phi_2} \mathbb{E}_{x \sim \mu}[u(x,y)]\right] \leq \mathbb{E}_{y \sim \xi}\left[\min_{\mu \in \Phi_1} \mathbb{E}_{x \sim \mu}[u(x,y)]\right]$$

(because a minimum over a larger set is smaller). Thus, $\Phi_2$ can only be the unique argmax if $\Phi_1 = \Phi_2$.
Consider the example $\hat{D} = \{[0,x] : x \in [0,1]\}$. Then $\hat{D}$ is closed, and $\Theta^* = [0,1]$ satisfies $\Theta^* = \Phi \vee \Psi \implies (\Phi \subseteq \Psi \text{ or } \Psi \subseteq \Phi)$. But per the above it cannot be a unique maximizer.
Maybe the issue can be fixed by strengthening the condition so that $\Phi^*$ also has to be minimal with respect to $\subseteq$.
You’re absolutely right, good job! I fixed the OP.
Not only does interpreting $\Theta^* = \Theta_2$ require an unusual decision rule (which I will be calling a “utility hyperfunction”), but applying any ordinary utility function to this example yields a non-unique maximum. This is another point in favor of the significance of hyperfunctions.