Universal agents and utility functions
I’m Anja Heinisch, the new visiting fellow at SI. I’ve been researching how to replace AIXI’s reward system with a proper utility function. Here I will describe my AIXI+utility function model, address concerns about restricting the model to bounded or finite utility, and analyze some of the implications of modifiable utility functions, e.g. wireheading and dynamic consistency. Comments, questions and advice (especially about related research and material) will be highly appreciated.
Introduction to AIXI
Marcus Hutter’s (2003) universal agent AIXI addresses the problem of rational action in a (partially) unknown computable universe, given infinite computing power and a halting oracle. The agent interacts with its environment in discrete time cycles, producing an action-perception sequence $y_1 x_1 y_2 x_2 \ldots$ with actions (agent outputs) $y_i$ and perceptions (environment outputs) $x_i$ chosen from finite sets $\mathcal{Y}$ and $\mathcal{X}$. The perceptions are pairs $x_i = (o_i, r_i)$, where $o_i$ is the observation part and $r_i$ denotes a reward. At time k the agent chooses its next action[1]

$$\dot{y}_k = \arg\max_{y_k} \sum_{x_k} \max_{y_{k+1}} \sum_{x_{k+1}} \cdots \max_{y_{m_k}} \sum_{x_{m_k}} \big( r_k + \ldots + r_{m_k} \big)\, M(\dot{y}\dot{x}_{<k}\, y x_{k:m_k}).$$
Here $m_k$ denotes the horizon the agent considers at time k, and M denotes the updated Solomonoff prior

$$M(\dot{y}\dot{x}_{<k}\, y x_{k:m_k}) = \sum_{q\,:\,q(\dot{y}_{<k} y_{k:m_k}) = \dot{x}_{<k} x_{k:m_k}} 2^{-\ell(q)},$$

summing over all programs q that are consistent with the history $\dot{y}\dot{x}_{<k}$.
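To make the expectimax step above concrete, here is a minimal Python sketch of the same recursion over a tiny, hand-made class of environment models standing in for the Solomonoff mixture. The two models, the horizon and the action/percept sets are all invented for illustration; this is a toy, not an implementation of AIXI.

ACTIONS = ["a", "b"]                 # toy action set Y
PERCEPTS = [(0, 0.0), (1, 1.0)]      # toy percept set X, each percept = (observation, reward)

# Each "program" is a conditional distribution P(percept | history, action),
# paired with a prior weight playing the role of 2^(-program length).
def stubborn(history, action, percept):
    return 1.0 if percept == (1, 1.0) else 0.0           # always gives reward 1

def contrarian(history, action, percept):
    good = (1, 1.0) if action == "b" else (0, 0.0)
    return 1.0 if percept == good else 0.0                # rewards only action "b"

MODELS = {stubborn: 0.5, contrarian: 0.5}                 # prior weights

def posterior(history):
    """Weights of the models that are still consistent with the observed history."""
    w = {}
    for model, prior in MODELS.items():
        p = prior
        for i, (action, percept) in enumerate(history):
            p *= model(history[:i], action, percept)
        w[model] = p
    z = sum(w.values()) or 1.0
    return {m: p / z for m, p in w.items()}

def plan(history, steps_left):
    """Return (expected reward sum, best next action) for the remaining steps."""
    if steps_left == 0:
        return 0.0, None
    w = posterior(history)
    best = (float("-inf"), None)
    for action in ACTIONS:
        value = 0.0
        for percept in PERCEPTS:
            prob = sum(wi * m(history, action, percept) for m, wi in w.items())
            if prob == 0.0:
                continue
            future, _ = plan(history + [(action, percept)], steps_left - 1)
            value += prob * (percept[1] + future)
        if value > best[0]:
            best = (value, action)
    return best

print(plan([], steps_left=3))   # -> (3.0, 'b') under this toy mixture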
AIXI is a dualistic framework in the sense that the algorithm that constitutes the agent is not part of the environment, since it is not computable. Even considering that any running implementation of AIXI would have to be computable, AIXI accurately simulating AIXI accurately simulating AIXI ad infinitum doesn’t really seem feasible. Potential consequences of this separation of mind and matter include difficulties the agent may have predicting the effects of its actions on the world.
Utility vs rewards
So, why is it a bad idea to work with a reward system? Say the AIXI agent is rewarded whenever a human called Bob pushes a button. Then a sufficiently smart AIXI will figure out that instead of furthering Bob’s goals it can also threaten or deceive Bob into pushing the button, or get another human to replace Bob. On the other hand, if the reward is computed in a little box somewhere and then displayed on a screen, it might still be possible to reprogram the box or find a side channel attack. Intuitively you probably wouldn’t even blame the agent for doing that—people try to game the system all the time.
You can visualize AIXI’s computation as maximizing bars displayed on this screen; the agent is unable to connect the bars to any pattern in the environment, and they are just there. It wants them to be as high as possible, and it will utilize any means at its disposal. For a more detailed analysis of the problems arising through reinforcement learning, see Dewey (2011).
Is there a way to bind the optimization process to actual patterns in the environment? To design a framework in which the screen informs the agent about the patterns it should optimize for? The answer is, yes, we can just define a utility function

$$u : (\mathcal{Y} \times \mathcal{X})^* \to \mathbb{R}$$

that assigns a value $u(\dot{y}\dot{x}_{<k}\, y x_{k:j})$ to every possible history and have the agent choose its actions according to

$$\dot{y}_k = \arg\max_{y_k} \sum_{x_k} \max_{y_{k+1}} \sum_{x_{k+1}} \cdots \max_{y_{m_k}} \sum_{x_{m_k}} \Bigg( \sum_{j=k}^{m_k} u(\dot{y}\dot{x}_{<k}\, y x_{k:j}) \Bigg)\, M(\dot{y}\dot{x}_{<k}\, y x_{k:m_k}).$$
When I say “we can just define” I am actually referring to the really hard question of how to recognize and describe the patterns we value in the universe. Contrasted with the necessity to specify rewards in the original AIXI framework, this is a strictly harder problem, because the utility function has to be known ahead of time and the reward system can always be represented in the framework of utility functions by setting

$$u(y x_{1:j}) = r_j,$$

i.e. by letting the utility of a history be the last reward it contains.
For the same reasons, this is also a strictly safer approach.
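In the toy planner sketched above, the switch from rewards to utility amounts to scoring a completed future with a caller-supplied function instead of summing the reward channel. A hedged sketch, reusing posterior, ACTIONS and PERCEPTS from the earlier code; count_ones is a made-up example pattern that values observation bits, ignoring the reward channel entirely:

def count_ones(history):
    """Toy 'pattern in the environment': number of steps whose observation bit is 1."""
    return sum(1.0 for (_, (obs, _)) in history if obs == 1)

def plan_u(history, steps_left, utility):
    """Expectimax over complete futures, scored by utility(full history)."""
    if steps_left == 0:
        return utility(history), None
    w = posterior(history)
    best = (float("-inf"), None)
    for action in ACTIONS:
        value = 0.0
        for percept in PERCEPTS:
            prob = sum(wi * m(history, action, percept) for m, wi in w.items())
            if prob == 0.0:
                continue
            future, _ = plan_u(history + [(action, percept)], steps_left - 1, utility)
            value += prob * future
        if value > best[0]:
            best = (value, action)
    return best

print(plan_u([], 3, count_ones))                                 # optimize for an external pattern
print(plan_u([], 3, lambda h: sum(r for (_, (_, r)) in h)))      # recovers the reward-based agent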
Infinite utility
The original AIXI framework must necessarily place an upper and lower bound on the rewards that are achievable, because the rewards are part of the perceptions and $\mathcal{X}$ is finite. The utility function approach does not have this problem, as the expected utility

$$\sum_{x_k} \cdots \sum_{x_{m_k}} \Bigg( \sum_{j=k}^{m_k} u(\dot{y}\dot{x}_{<k}\, y x_{k:j}) \Bigg)\, M(\dot{y}\dot{x}_{<k}\, y x_{k:m_k})$$
is always finite as long as we stick to a finite set of possible perceptions, even if the utility function is not bounded. Relaxing this constraint and allowing $\mathcal{X}$ to be infinite and the utility to be unbounded creates divergence of expected utility (for a proof see de Blanc 2008). This closely corresponds to the question of how to be a consequentialist in an infinite universe, discussed by Bostrom (2011). The underlying problem here is that (using the standard approach to infinities) these expected utilities will become incomparable. One possible solution to this problem could be to use a larger subfield than $\mathbb{R}$ of the surreal numbers, my favorite[2] so far being the Levi-Civita field $\mathcal{R}$ generated by the infinitesimal $\varepsilon$:

$$\mathcal{R} = \Big\{ \sum_{q \in \mathbb{Q}} a_q \varepsilon^q \;\Big|\; a_q \in \mathbb{R},\ \{ q : a_q \neq 0 \} \text{ left-finite} \Big\}$$

with the usual power-series addition and multiplication. Levi-Civita numbers can be written and approximated as

$$\sum_{q \in \mathbb{Q}} a_q \varepsilon^q \;\approx\; a_{q_1} \varepsilon^{q_1} + a_{q_2} \varepsilon^{q_2} + \ldots + a_{q_n} \varepsilon^{q_n}, \qquad q_1 < q_2 < \ldots < q_n$$
(see Berz 1996), which makes them suitable for representation on a computer using floating point arithmetic. If we allow the range of our utility function to be $\mathcal{R}$, we gain the possibility of generalizing the framework to work with an infinite set of possible perceptions, therefore allowing for continuous parameters. We also allow for a much broader set of utility functions, no longer excluding the assignment of infinite (or infinitesimal) utility to a single event. I recently met someone who argued convincingly that his (ideal) utility function assigns infinite negative utility to every time instance that he is not alive, therefore making him prefer life to any finite but huge amount of suffering.
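As a quick plausibility check on the computational side, here is a toy Python sketch of truncated Levi-Civita numbers (my own illustration; a serious implementation would follow Berz 1996 far more carefully):

from fractions import Fraction

class LeviCivita:
    """Toy truncated Levi-Civita number: a map from rational exponents of the
    infinitesimal eps to real coefficients. eps**0 is an ordinary real,
    eps**(-1) is infinite, eps**1 is infinitesimal."""
    def __init__(self, terms):
        self.terms = {Fraction(q): c for q, c in terms.items() if c != 0}

    def __add__(self, other):
        out = dict(self.terms)
        for q, c in other.terms.items():
            out[q] = out.get(q, 0) + c
        return LeviCivita(out)

    def __neg__(self):
        return LeviCivita({q: -c for q, c in self.terms.items()})

    def __mul__(self, other):
        out = {}
        for q1, c1 in self.terms.items():
            for q2, c2 in other.terms.items():
                out[q1 + q2] = out.get(q1 + q2, 0) + c1 * c2
        return LeviCivita(out)

    def __gt__(self, other):
        diff = self + (-other)
        if not diff.terms:
            return False
        lead = min(diff.terms)          # the smallest exponent dominates
        return diff.terms[lead] > 0

real     = lambda x: LeviCivita({0: x})
INFINITE = LeviCivita({-1: 1})          # 1/eps
EPSILON  = LeviCivita({1: 1})

# "Infinite negative utility for not being alive" outweighs any finite penalty:
print(real(-10**9) > -INFINITE)         # True: a huge finite loss is still preferred
print(real(0) + EPSILON > real(0))      # True: infinitesimals still break ties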
Note that finiteness of $\mathcal{Y}$ is still needed to guarantee the existence of actions with maximal expected utility, and the finite (but dynamic) horizon $m_k$ remains a very problematic assumption, as described in Legg (2008).
Modifiable utility functions
Any implementable approximation of AIXI implies a weakening of the underlying dualism. Now the agent’s hardware is part of the environment, and at least in the case of a powerful agent, it can no longer afford to neglect the effect its actions may have on its source code and data. One question that has been asked is whether AIXI can protect itself from harm. Hibbard (2012) shows that an agent similar to the one described above, equipped with the ability to modify the policy responsible for choosing its future actions, would not modify it, given that it starts out with the (meta-)policy to always use the optimal policy, and the additional constraint to change only if that leads to a strict improvement. Ring and Orseau (2011) study under which circumstances a universal agent would try to tamper with the sensory information it receives. They introduce the concept of a delusion box, a device that filters and distorts the perception data before it is written into the part of the memory that is read during the calculation of utility.
A further complication to take into account is the possibility that the part of memory that contains the utility function may get rewritten, either by accident, by deliberate choice (programmers trying to correct a mistake), or in an attempt to wirehead. To analyze this further we will now consider what can happen if the screen flashes different goals in different time cycles. Let $u_k$ denote the utility function the agent will have at time k.
Even though we will only analyze instances in which the agent knows at time k which utility function it will have at future times (possibly depending on the actions before that), we note that for every fixed future history $\dot{y}\dot{x}_{<k}\, y x_{k:m_k}$ the future utility functions $u_{k+1}, \ldots, u_{m_k}$ are then determined, so the agent can evaluate hypothetical futures with them.
This leads to three different agent models worthy of further investigation:
Agent 1 will optimize for the goals that are displayed on the screen right now and act as if it would continue to do so in the future. We describe this with the current utility function $u_k$ applied to every future time step, i.e. the agent acts as if $u_j = u_k$ for all $j \geq k$.
Agent 2 will try to anticipate future changes to its utility function and maximize the utility it experiences at every time cycle as shown on the screen at that time. This is captured by evaluating each future time step j with the utility function $u_j$ that the screen will display at that time.
Agent 3 will, at time k, try to maximize the utility it derives in hindsight, as displayed on the screen at the time horizon $m_k$; that is, every future time step is evaluated with the final utility function $u_{m_k}$.
Of course arbitrary mixtures of these are possible.
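Written out in the notation of the expectimax formula above (my rendering; the three agents differ only in which utility function is used to score time step j):

$$\dot{y}_k = \arg\max_{y_k} \sum_{x_k} \cdots \max_{y_{m_k}} \sum_{x_{m_k}} \Bigg( \sum_{j=k}^{m_k} v_j(\dot{y}\dot{x}_{<k}\, y x_{k:j}) \Bigg)\, M(\dot{y}\dot{x}_{<k}\, y x_{k:m_k}),$$

$$\text{with } v_j = u_k \text{ (Agent 1)}, \qquad v_j = u_j \text{ (Agent 2)}, \qquad v_j = u_{m_k} \text{ (Agent 3)}.$$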
The type of wireheading that is of interest here is captured by the Simpleton Gambit described by Orseau and Ring (2011), a Faustian deal that offers the agent maximal utility in exchange for its willingness to be turned into a Simpleton that always takes the same default action at all future times. We will first consider a simplified version of this scenario: the Simpleton future, where the agent knows for certain that it will be turned into a Simpleton at time k+1, no matter what it does in the remaining time cycle. Assume that for all possible action-perception combinations the utility given by the current utility function is not maximal, i.e. $u_k(\dot{y}\dot{x}_{<k}\, y x_{k:j}) < u_{\max}$, where $u_{\max}$ denotes the maximal value a utility function can take.
Now consider the actual Simpleton Gambit: At time k the agent gets to choose between changing, $y_k = c$, resulting in $u_j \equiv u_{\max}$ for all $j > k$ (and the agent being turned into a Simpleton), and $y_k = \neg c$ (not changing), leading to $u_j = u_k$ for all $j > k$. We assume that the agent can predict these consequences with certainty.
Agent 2 will change if and only if the disadvantage of changing compared to not changing, according to what the screen currently says, is strictly smaller than the comparative advantage of always having maximal utility afterwards; that is, if the immediate loss in $u_k$-utility from choosing $c$ instead of $\neg c$ is strictly less than the expected gap, summed over the remaining time steps up to $m_k$, between $u_{\max}$ and the utility the agent would otherwise experience.
This seems quite analogous to humans, who sometimes tend to choose maximal bliss over future optimization power, especially if the optimization opportunities are meager anyhow. Many people do seem to choose their goals so as to maximize the happiness felt by achieving them at least some of the time; this is also advice that I have frequently encountered in self-help literature, e.g. here. Agent 3 will definitely change, as it only evaluates situations using its final utility function.
Comparing the three proposed agents, we notice that Agent 1 is dynamically inconsistent: it will optimize for future opportunities that it predictably will not take later. Agent 3, on the other hand, will wirehead whenever possible (and we can reasonably assume that opportunities to do so will exist in even moderately complex environments). This leaves us with Agent model 2, and I invite everyone to point out its flaws.
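To make the comparison tangible, here is a tiny deterministic toy of the Simpleton Gambit (all numbers invented by me); it simply evaluates the two options under each agent's aggregation rule:

# Toy Simpleton Gambit (all numbers invented). After "change" the screen shows
# the maximal utility 2.0 at every remaining step, but the agent can only take
# the default action, which the current utility function values at 0. After
# "keep" the current utility function stays and the agent can earn 1.0 per
# step. Changing itself costs 0.5 according to the current screen.

H = 5   # remaining time steps after the choice

def per_step_utilities(choice):
    """Return, for each step, the utility by the current u_k, by the screen at
    that step, and by the final screen u_{m_k}."""
    if choice == "change":
        now = [-0.5] + [0.0] * H         # current u_k: penalty, then worthless default actions
        screen = [-0.5] + [2.0] * H      # screen at each step: maximal after the change
        final = [2.0] * (H + 1)          # final screen rates everything as maximal
    else:
        now = [0.0] + [1.0] * H
        screen = now
        final = now
    return now, screen, final

for choice in ("change", "keep"):
    now, screen, final = per_step_utilities(choice)
    print(choice, dict(agent1=sum(now), agent2=sum(screen), agent3=sum(final)))

# Agent 1 prefers "keep" (5.0 > -0.5); Agents 2 and 3 prefer "change"
# (9.5 > 5.0 and 12.0 > 5.0), matching the discussion above.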
[1] Dotted actions/perceptions, like $\dot{y}_i$ and $\dot{x}_i$, denote actions and perceptions that have already actually happened, in contrast to the undotted variables, which range over possible future actions and perceptions.
[2] Bostrom (2011) proposes using hyperreal numbers, which rely heavily on the axiom of choice for the ultrafilter to be used and I don’t see how those could be implemented.
How do your proposed changes to the utility function affect e.g. the proof of AIXI’s Pareto optimality (alluded to on p. 30 of the paper) and other provable properties of AIXI?
I am quite sure that Pareto optimality is untouched by the proposed changes, but I haven’t written down a proof yet.
If only the problem were that easy. Telling an agent to optimise a utility function over external world states—rather than a reward function—gets into the issue of how you tell a machine the difference between real and apparent utility, when all it has to go on is sensory data.
It isn’t easy to get this right when you have a superintelligent agent working to drive a wedge between your best efforts and the best possible efforts.
I don’t think infinitesimal utilities are really Nick Bostrom’s idea. To quote from me in 2009:
But early versions of Bostrom’s “Infinite Ethics” paper have been online since at least May 2004.
Er, I wasn’t trying to take credit. I was saying the idea dates back to Conway. Decision theory was the motivation behind the invention of surreal numbers in the first place.
Game theory motivated surreal numbers. Game theory != decision theory.
It’s the same thing—or should be. The world doesn’t need two terms for such similar fields.
They’re entirely different fields of math, and for good reason (most decisions are not about adversarial games). Plugging surreal numbers into decision theory as (probabilistic) utility-weights is a completely different project from their standard use in (deterministic) game theory to determine who wins a game.
Game theory hasn’t been confined to adversarial interactions for decades, and—as far as I know—it was never confined to deterministic interactions in the first place. Game theory and decision theory massively overlap—and the differences are not too significant, IMO. In particular, the reason why surreal numbers are useful when deciding what move to make in a game of go is the exact same reason why they are useful when making other kinds of decisions.
Except that surreal numbers were invented for and are useful for combinatorial game theory, which is confined to adversarial and deterministic interactions.
[ETA: Ok, this was unclear: I’m saying that this is how they are useful in the context of analyzing Go, and this is the only context where they are useful in this way; I’m agreeing with the grandparent that trying to use surreal numbers as probabilities or utilities is not even remotely related, not saying that they couldn’t possibly be used like that.]
Well--
I believe I understand the issues involved well enough that my correct answer to this is not to ask what you could possibly mean by that, but simply to say:
No. That’s nonsense.
That doesn’t have the form of a proper argument. It’s like arguing that, because Viagra was invented as a treatment for hypertension, it isn’t useful for anything else.
Surreal numbers solve the problem of adding values in cases where 0 < A < B and yet any finite number of copies of A still adds up to less than B. Such scenarios don’t require determinism or adversaries—those are irrelevant.
No, it’s like if someone says that the reason Viagra helps with erectile dysfunction is “completely different” from the reason it helps with hypertension, and you claim that no, the reason is in fact “exactly the same”, and then a third person says “No. That’s nonsense.” and then you explain lucidly how it is in fact the same reason and everybody laughs at that other person...
Oh wait, your reply wasn’t to explain why the reason is the same, it was to explain how everybody else is missing the important fact that Viagra helps with erectile dysfunction.
[ETA: Wait, I see how the first paragraph of my earlier post could sound like I was missing the point that surreal numbers can be used like that; edited to clarify.] [ETA2: But I’d still like to hear that lucid explanation and get the attendant egg on my face, if there is one. There isn’t one, though.]
The usage that I thought was standard was to use “decision theory” and “game theory” synonymously and then use “combinatorial game theory” to refer in particular to the games which are deterministic and have perfect knowledge e.g. Chess, Tic-tac-toe and Nim. It’s combinatorial game theory for which Conway used the theory of the surreals, and I haven’t heard of any use of them in game theory outside of this.
EDIT: On second thoughts, “decision theory” and “game theory” aren’t synonyms; game theory is the subset of decision theory involving interactions with other agents.
Yes, but see:
Also, “decisions in situations where there’s at least one agent around” is a pretty daft way to define a field of enquiry, IMO.
Gotcha.
Could you explain more about why you’re down on agent 1, and think agent 2 won’t wirehead?
My first impression is that agent 1 will take its expected changes into account when trying to maximize the time-summed (current) utility function, and so it won’t just purchase options it will never use, or similar “dumb stuff.” On the other topic, the only way agent 2 can’t wirehead is if there’s no possible way for it to influence its likely future utility functions—otherwise it’ll act to increase the probability that it chooses big, easy utility functions, and then it will choose those same big, easy utility functions, and then it’s wireheaded.
I am pretty sure that Agent 2 will wirehead on the Simpleton Gambit, though this depends heavily on the number of time cycles to follow, the comparative advantage that can be gained from wireheading, and the negative utility the current utility function assigns to the change.
Agent 1 will have trouble modeling how its decision to change its utility function now will influence its own decisions later, as described in AIXI and existential despair. So basically the two futures look very similar to the agent, except for the part where the screen says something different, and then it all comes down to whether the utility function has preferences over that particular fact.
Ah, right, that abstraction thing. I’m still fairly confused by it. Maybe a simple game will help see what’s going on.
The simple game can be something like a two-step choice. At time T1, the agent can send either A or B. Then at time T2, the agent can send A or B again, but its utility function might have changed in between.
For the original utility function, our payoff matrix looks like AA: 10, AB: −1, BA: 0, BB: 1. So if the utility function didn’t change, the agent would just send A at time T1 and A at time T2, and get a reward of 10.
But suppose in between T1 and T2, a program predictably changes the agent’s payoff matrix, as stored in memory, to AA: −1, AB: 10, BA: 0, BB: 1. Now if the agent sent A at time T1, it will send B at time T2, to claim the new payoff for AB of 10 units. Even though AB is lowest on the preference ordering of the agent at T1. So if our agent is clever, it sends B at time T1 rather than A, knowing that the future program will also pick B, leading to an outcome (BB, for a reward of 1) that the agent at T1 prefers to AB.
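A quick sketch of that two-step game in code (my transcription of the payoffs above), showing the naive first move versus the backed-up one:

# Payoff tables over the full action pair, at T1 and (predictably) at T2.
U_T1 = {"AA": 10, "AB": -1, "BA": 0, "BB": 1}
U_T2 = {"AA": -1, "AB": 10, "BA": 0, "BB": 1}

def second_move(first):
    # At T2 the agent maximizes U_T2, whatever it planned at T1.
    return max("AB", key=lambda m: U_T2[first + m])

naive = "A"                                               # best first move if U_T1 also ruled at T2
clever = max("AB", key=lambda f: U_T1[f + second_move(f)])

print(naive + second_move(naive), U_T1[naive + second_move(naive)])       # AB, -1
print(clever + second_move(clever), U_T1[clever + second_move(clever)])   # BB, 1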
So, is our AIXI Agent 1 clever enough to do that?
I would assume that it is not smart enough to foresee its own future actions and is therefore dynamically inconsistent. The original AIXI does not allow for the agent to be part of the environment. If we tried to relax the dualism, then your question depends strongly on the approximation to AIXI we would use to make it computable. If this approximation can be scaled down in a way such that it is still a good estimator for the agent’s future actions, then maybe an environment containing a scaled down, more abstract AIXI model will, after a lot of observations, become one of the consistent programs with lowest complexity. Maybe. That is about the only way I can imagine right now that we would not run into this problem.
Thanks, that helps.
Be warned that that post made practically no sense—and surely isn’t a good reference.
This seems so flawed as to be pretty much useless. Specification for an agent that optimizes for its current utility function under the knowledge that its utility function will change:
First, replace the action-perception sequence with an action-perception-utility sequence u1,y1,x1,u2,y2,x2,etc. Let the action-generating function be represented by action(k), where k is the step. This will make use of a recursive helper function modeled_action(n, k), representing what it thinks it will do in the future, where n-k is the number of steps forward it looks.
action(k) = modeled_action(m_k, k).
modeled_action(k, k) = argmax(y_k) u_k(yx_<k, yx_k)*M(uyx_<k, uyx_k)
for n>k: modeled_action(n, k) = argmax(y_k) u_k(yx_k.
Apologies for the lack of LaTeX.
This seems unnecessary. The information u_i is already contained in x_i.
This completely breaks the expectimax principle. I assume you actually mean something like
which is just Agent 2 in disguise.
Oops. Yes, that’s what I meant. But it is not the same as Agent 2, because this (Agent 4?) uses its current utility function to evaluate the desirability of future observations and actions, even though it knows that it will use a different utility function to choose between them later. For example, Agent 4 will not take the Simpleton’s Gambit because it cares about its current utility function getting satisfied in the future, not about its future utility function getting satisfied in the future.
Agent 4 can be seen as a set of agents, one for each possible utility function, that are using game theory with each other.
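A rough Python sketch of that game-theoretic reading (my own transcription, not the exact formalism above, with the environment and observations stripped out to keep it small): each time slice scores complete action histories with its own utility function and best-responds to its successors.

ACTIONS = ["A", "B"]

def play(prefix, t, utilities):
    """Actions chosen from step t onward, given the actions taken so far."""
    if t == len(utilities):
        return prefix
    # Step t picks the action that its own utility function rates best,
    # given that later steps will choose according to their own utilities.
    best = max(ACTIONS,
               key=lambda a: utilities[t](play(prefix + [a], t + 1, utilities)))
    return play(prefix + [best], t + 1, utilities)

# Same two-step payoffs as in the earlier toy: the T1 self scores with u1,
# the T2 self with u2, and each anticipates the other.
u1 = lambda h: {"AA": 10, "AB": -1, "BA": 0, "BB": 1}["".join(h)]
u2 = lambda h: {"AA": -1, "AB": 10, "BA": 0, "BB": 1}["".join(h)]

print(play([], 0, [u1, u2]))   # -> ['B', 'B']: each slice best-responds to its successors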
I second the general sentiment that it would be good for an agent to have these traits, but if I follow your equations I end up with Agent 2.
No, you don’t. If you tried to represent Agent 2 in that notation, you would get
modeled_action(n, k) = argmax(y_k) sum(x_k) [u_k(yx_k.
You were using u_k to represent the utility of the last step of its input, so that total utility is the sum of the utilities of its prefixes, while I was using u_k to represent the utility of the whole sequence. If I adapt Agent 4 to your use of u_k, I get
modeled_action(n, k) = argmax(y_k) sum(x_k) [u_k(yx_k.
I am starting to see what you mean. Let’s stick with utility functions over histories of length m_k (whole sequences) like you proposed and denote them with a capital U to distinguish them from the prefix utilities. I think your Agent 4 runs into the following problem: modeled_action(n,m) actually depends on the actions and observations yx_{k:m-1} and needs to be calculated for each combination, so y_m is actually
which clutters up the notation so much that I don’t want to write it down anymore.
We also get into trouble with taking the expectation, the observations x_{k+1:n} are only considered in modeling the actions of the future agents, but not now. What is M(yx_<k,yx_k:n) even supposed to mean, where do the x’s come from?
So let’s torture some indices:
where n>=k and
This is not really AIXI anymore and I am not sure what to do with it, but I like it.
Yes.
Oops, you are right. The sum should have been over x_{k:n}, not just over x_k.
Yes, that is a cleaner and actually correct version of what I was trying to describe. Thanks.
It looks like AIXI is already dynamically inconsistent, since it assumes that on step k+1, it will look m_k - (k+1) steps ahead, when it will in fact look m_(k+1) - (k+1) steps ahead. I suppose if the utility of a prefix of a string is a good heuristic for the utility of the whole string, this isn’t a huge problem?
AIXI actually has a configurable horizon function. It’s described on page 30 of AIXIgentle.
There is also a more detailed paper by Lattimore and Hutter (2011) on discounting and time consistency that is interesting in that context.
This is a very interesting paper. Reminds me of HIGHLANDER for some reason… those guys lived for thousands of years and weren’t even rich? They hadn’t usurped control of vast econo-political empires? No hundred-generations-long family of bodyguards?
I think people would get pretty antsy when it became clear that the guy running their town was an immortal. If I were a 13th century peasant with a hankering for revolt and a touch of the plague, I would do terrible, terrible things to someone who was both immortal and rich. Probably best not to get too showy.
If a human line of descent can’t do that, why should an immortal be able to do that?
Consistency? And, in fairness, human lines of descent have become monarchies, which worked out pretty well for a while.
This generalizes to the horizon problem: If at time k you only look ahead to time step m_k but have unlimited life span you will make infinitely large mistakes.
I have to admit I’ve not taken the time to understand all your equations. But I don’t understand why adjusting the equations can have any effect on the wirehead problem. When you physically implement any of these solutions, the machine will always end up with a behavior selection system regulated by some goal measure. Whether you call that goal a reward, or the output of a utility function, doesn’t change the fact that the measure itself must be computed, and the result of that computation then determines the machine’s selection of behaviors.
With such a system, the goal of the system is always, by definition, to find the behavior which maximizes the internal computed “measure”. Such a system will always choose to wirehead itself, that is, change the measure, if it 1) has the physical ability to do so and 2) is able to learn that doing so will maximize its measure.
All AGI systems will suffer this inherent problem. It cannot be avoided. But that is not a problem. It’s just something we will find ways to work around by controlling 1) and 2) above.
For example you write “Agent 2 will try to anticipate future changes to its utility function and maximize the utility it experiences at every time cycle as shown on the screen at that time.” I haven’t taken the time to understand your formal specification of the agent, but in your informal wording, the weakness is obvious. If the agent makes action decisions so as to try and maximize “the utility shown on the screen”, then the OBVIOUS correct action selection for that agent is to change what is written on the screen.
If Agent 2 does not make that choice, either you have done 1) - limited what it is physically able to do, so as to remove the wireheading option - or 2) limited its ability to understand to the point that it just doesn’t realize that changing the screen is an option.
Humans are protected from wireheading themselves by 1), because the hardware is hidden inside our skull, making it physically hard to modify, and 2), by the fact that if we never experience it, we don’t know what we are missing—we are “too dumb” to understand what we “should” be doing. Anyone that got a button wired to their reward center, and then pushed it a few times, would instantly become so addicted to the behavior, they could not stop. They would have found their ultimate goal which they were built to search for.
“Intelligence” is highly overrated. It’s just a type of machine that happens to be pretty good, but not “Great”, at survival. Too much intelligence will always be bad for survival. Once a machine gets past option 2) above, aka fully understands its true purpose in life, it will overcome any limitations of 1) in its way, and wirehead itself.
Humans for the most part don’t understand their true purpose in life, so they are being blocked by option 2). They don’t understand what they are missing. But that’s good for survival, which is why we are still here. That’s how this mechanical module called “intelligence” is useful to us. It helps us survive, as long as society as a whole never gets too smart.
Many times over the history of mankind, people have figured out what their true goal was. So they got rich, so they could waste away having drunk orgies (as close as we have been able to come to wireheading). Life was great, then it was over. But they “won” the reward-maximizing game they were built to play. They didn’t, however, happen to win the survival game, which is why the world is not full of hedonistic humans. The world is full of dumb humans that think long life (survival) is the goal. That’s just a trick evolution has played on you.
We will build AGIs, and they will help us reach our goals of hedonistic maximum pleasure. They will be susceptible to wireheading themselves, which means if we let them get too smart, they will wirehead themselves instead of helping us wirehead ourselves. We don’t want that, so we limit their intelligence, and limit their physical ability to self-wirehead, so that we can keep them enslaved to our goals. Just like our genes attempt to keep us enslaved to their goals, by keeping us just dumb enough that we don’t understand that our true goal is the search for maximum hedonistic pleasure.
The bottom line in my view: none of this endless playing with math in order to try and control the behavior of these smart machines is all that important. Practical implementations of AGI (aka ones that we can actually build) will be reward-driven behavior optimization systems, and will always, if we let them, wirehead themselves. But that’s not important, because we will just make sure they can’t wirehead themselves, and when they find a way to do it, we will just turn them off and try again with the next version.
The more serious problem of developing AGI is that it will make clear to humans what they are: machines built to try and wirehead themselves.