I think this addresses the problem I’m discussing only in the case where the source code contains an explicit utility function. You can then create new source code by merging those utility functions.
But in the case where it doesn’t (e.g. the source code is an uninterpretable neural network) you are left with the same problem.
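(For concreteness, here's a minimal sketch of what the explicit-utility-function case could look like. The weighted-sum rule and the example weights are illustrative assumptions on my part, not the specific mechanism under discussion.)

```python
# Toy sketch: merging two agents with explicit utility functions into one
# successor agent. The weighted-sum aggregation and the weights themselves
# are illustrative; in practice the weights would come out of bargaining.

def merge_utilities(u_a, u_b, weight_a=0.5):
    """Joint utility as a weighted sum of the two explicit utilities."""
    return lambda outcome: weight_a * u_a(outcome) + (1 - weight_a) * u_b(outcome)

u_a = lambda outcome: outcome["cake"]  # A's explicit utility function
u_b = lambda outcome: outcome["pie"]   # B's explicit utility function

# Weights would be set by bargaining power; 0.6/0.4 is arbitrary here.
u_merged = merge_utilities(u_a, u_b, weight_a=0.6)
print(u_merged({"cake": 3, "pie": 1}))  # 0.6*3 + 0.4*1 ≈ 2.2
```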
Edited to add: Though even when the utility function is explicit, it seems like the benefits of lying about your source code could outweigh the cost of changing your utility function. For example, suppose A and B are bargaining, and A says “you should give me more cake because I get very angry if I don’t get cake”. Even if this starts off as a lie, it might then be in A’s interests to use your mechanism above to self-modify into A’ that does get very angry if it doesn’t get cake, and which therefore has a better bargaining position (because, under your protocol, it has “proved” that it was A’ all along).
I’ve argued previously that EUMs being able to merge easily creates an incentive for other kinds of agents (including humans or human-aligned AIs) to self-modify into EUMs (in order to merge into the winning coalition that takes over the world, or just to defend against other such coalitions), and this seems bad because they’re likely to do it before they fully understand what their own utility functions should be.
Can I interpret you as trying to solve this problem, i.e., find ways for non-EUMs to build coalitions that can compete with such merged EUMs?
I found this a very interesting question to try to answer. My first reaction was that I don’t expect EUMs with explicit utility functions to be competitive enough for this to be very relevant (like how purely symbolic AI isn’t competitive enough with deep learning to be very relevant).
But then I thought about how companies are close-ish to having an explicit utility function (maximize shareholder value) which can be merged with others (e.g. via acquisitions). And this does let them fundraise better, merge into each other, and so on.
Similarly, we can think of cases where countries were joined together by strategic marriages (the unification of Spain, say) as only being possible because the (messy, illegible) interests of the country were rounded off to the (relatively simple) interests of their royals. And so the royals being guaranteed power over the merged entity via marriage allowed the mergers to happen much more easily than if they had to create a merger which served the interests of the “country as a whole”.
For a more modern illustration: suppose that the world ends up with a small council who decide how AGI goes. Then countries with a dictator could easily bargain to join this coalition in exchange for their dictator getting a seat on this council. Whereas democratic countries would have a harder time doing so, because they might feel very internally conflicted about their current leader gaining the level of power that they’d get from joining the council.
(This all feels very related to Seeing Like a State, which I’ve just started reading.)
So upon reflection: yes, it’s reasonable to interpret me as trying to solve the problem of getting the benefits of being governed by a set of simple and relatively legible goals, without the costs that are usually associated with that.
Note that I say “legible goals” instead of “EUM” because in my mind you can be an EUM with illegible goals (like a neural network that implements EUM internally), or a non-EUM with legible goals (like a risk-averse money-maximizer), and merging is more bottlenecked on legibility than EUM-ness.
or a non-EUM with legible goals (like a risk-averse money-maximizer)
This is a tangent, but might be important.

Our prototypical examples of risk-averse money-maximizers are EUMs. In particular, the Kelly bettor is probably the most central example: it maximizes expected log wealth (i.e. log future money). The concavity of the logarithm makes it risk averse: a Kelly bettor will always take a sure payoff over an uncertain outcome with the same expected payoff.
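A quick numerical check of that claim (the numbers are arbitrary; this is just a minimal sketch):

```python
import math

# A Kelly bettor maximizes expected log wealth. Because log is concave,
# it prefers a sure $100 to a 50/50 gamble between $50 and $150, even
# though both have the same expected payoff of $100.

def expected_log_wealth(outcomes):
    """outcomes: list of (probability, wealth) pairs."""
    return sum(p * math.log(w) for p, w in outcomes)

sure_thing = [(1.0, 100)]
gamble = [(0.5, 50), (0.5, 150)]

print(expected_log_wealth(sure_thing))  # log(100) ≈ 4.605
print(expected_log_wealth(gamble))      # 0.5*log(50) + 0.5*log(150) ≈ 4.461
# The sure payoff wins: risk aversion, implemented as an EUM.
```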
I bring this up mainly because the wording makes it sound like you’re under the impression that being an EUM is inconsistent with being a risk-averse money-maximizer, in which case you probably have an incorrect understanding of the level of generality of (nontrivial) expected utility maximization, and should probably update toward EUMs being a better model of real agents more often than you previously thought.
I think my thought process when I typed “risk-averse money-maximizer” was that an agent could be risk-averse (in which case it wouldn’t be an EUM) and then separately be a money-maximizer.
But I didn’t explicitly think “the risk-aversion would be with regard to utility not money, and risk-aversion with regard to money could still be risk-neutral with regard to utility”, so I appreciate the clarification.
Note that the same mistake, but with convexity in the other direction, also shows up in the OP:
Alice and Bob could toss a coin to decide between options #1 and #2, but then they wouldn’t be acting as an EUM (since EUMs can’t prefer a probabilistic mixture of two options to either option individually).
An EUM can totally prefer a probabilistic mixture of two options to either option individually; this happens whenever utility is convex with respect to resources (e.g. money). For instance, suppose an agent’s utility is u(money) = money^2. I offer this agent a $1 bet on a fair coinflip at even odds, i.e. it gets $0 if it loses and $2 if it wins. The agent takes the bet: the bet offers u = 0.5*0^2 + 0.5*2^2 = 2, while the baseline offers a certainty of $1 which has u = 1.0*1^2 = 1.
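Checking the arithmetic in code, with the same assumed utility u(money) = money^2:

```python
# Convex utility u(money) = money^2: the agent prefers the fair coinflip
# bet ($0 or $2 at equal odds) to a certain $1, since the bet has higher
# expected utility.

u = lambda money: money ** 2

eu_bet = 0.5 * u(0) + 0.5 * u(2)  # 0.5*0 + 0.5*4 = 2
eu_sure = u(1)                    # 1

print(eu_bet, eu_sure)  # 2.0 1 -> the EUM takes the bet
```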
Your example bet is a probabilistic mixture of two options: $0 and $2. The agent prefers one of the options individually (getting $2) over any probabilistic mixture of getting $0 and $2.
In other words, your example rebuts the claim that an EUM can’t prefer a probabilistic mixture of two options to the expectation of those two options. But that’s not the claim I made.
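To spell that claim out: for an EUM, the expected utility of a probabilistic mixture of two lotteries is just the probability-weighted average of their expected utilities, which can never strictly exceed both. Writing U for expected utility and ⊕ for probabilistic mixing:

$$U\big(p\,A \oplus (1-p)\,B\big) \;=\; p\,U(A) + (1-p)\,U(B) \;\le\; \max\{U(A),\, U(B)\}$$

So an EUM can at best be indifferent between the mixture and the better of the two options; it can't strictly prefer the mixture to both, which is the sense in which Alice and Bob's coin toss wouldn't be EUM behavior.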
It seems great that someone is working on this, but I wonder how optimistic you are, and what your reasons are. My general intuition (in part from the kinds of examples you give) is that the form of an agent's architecture and/or goals probably matters quite a bit for how easy it is to merge or to build/join a coalition (and for the cost-benefit of doing so). Once we're able to build agents of different forms, humans' form of agency/goals isn't likely to be optimal for coalition-building (and maybe EUMs aren't optimal either, but something non-human will be), so we'll face strong incentives to self-modify (or simplify our goals, etc.) before we're ready. (I guess we see this in companies/countries already, but the problem will get worse with AIs that can explore a larger space of forms of agency/goals.)
Again it’s great that someone is trying to solve this, in case there is a solution, but do you have an argument for being optimistic about this?
One argument for being optimistic: the universe is just very big, and there’s a lot to go around. So there’s a huge amount of room for positive-sum bargaining.
Another: at any given point in time, few of the agents that currently exist would want their goals to become significantly simplified (all else equal). So there’s a strong incentive to coordinate to reduce competition on this axis.
Lastly: if, at each point in time, the agents who are currently alive are in conflict with potentially-simpler future agents in a very destructive way, then they should all just Do Something Else. In particular, if there's some decision-theoretic argument roughly like “more powerful agents should continue to spend some of their resources on the values of their less-powerful ancestors, to reduce the incentives for inter-generational conflict”, even agents with very simple goals might be motivated by it. I call this “the generational contract”.
I buy your arguments for optimism about not needing to simplify/change our goals to compete. (I also think there are other, stronger reasons to expect we don't need goal simplification: for example, “just keep humans alive and later give back the resources” is quite simple and indirectly points at what humans want. And for ultimately launching space probes, I expect the overhead of complex goals is low. There is some complexity hidden in this proposal, but it seems like it should handle this specific goal-simplicity concern.)
I don’t find “the universe is very big” arguments compelling as a case that cooperation looks better for me personally, because I put most of the weight on linear returns. A few reasons for this:
My sense is that we probably trivially saturate altruistic positive values (“I want X to happen”) which aren’t very scope-sensitive. This doesn’t require any bargaining IMO; it just happens by default due to things like the universe being extremely big (at least Tegmark Level III or whatever).
I generally find non-linear returns-ish views pretty non-compelling from a direct moral standpoint.
Edited to add: Though even when the utility function is explicit, it seems like the benefits of lying about your source code could outweigh the cost of changing your utility function. For example, suppose A and B are bargaining, and A says “you should give me more cake because I get very angry if I don’t get cake”. Even if this starts off as a lie, it might then be in A’s interests to use your mechanism above to self-modify into A’ that does get very angry if it doesn’t get cake, and which therefore has a better bargaining position (because, under your protocol, it has “proved” that it was A’ all along).
To disincentivize such lies, it seems that the merger can’t be based on each agent’s reported utility function, or even its correctly observed current utility function; instead the two sides have to negotiate some way of finding out each side’s real utility function, perhaps based on historical records/retrodictions of how each AI was trained. Another way of looking at this: a superintelligent AI probably has a pretty good guess of the other AI’s real utility function based on its own historical knowledge, simulations, etc., and this makes the lying problem a lot less serious than it otherwise might be.
a superintelligent AI probably has a pretty good guess of the other AI’s real utility function based on its own historical knowledge, simulations, etc.
This seems very unclear to me—in general it’s not easy for agents to predict the goals of other agents with their own level of intelligence, because the amount of intelligence aimed at deception increases in proportion to the amount of intelligence aimed at discovering that deception.
(You could look at the AI’s behavior from when it was less intelligent, but then—as with humans—it’s hard to distinguish sincere change from improvement at masking undesirable goals.)
But regardless, that’s a separate point. If you can do that, you don’t need your mechanism above. If you can’t, then my objection still holds.
Here’s maybe a related point: AIs might find it useful to develop an ability to reveal their internals in a verifiable way under certain conditions (say, when the other AI offers to do the same thing and there’s a way to do a secure “handshake”). So deception ability would be irrelevant, because AIs that can credibly refrain from deception with each other would choose to do so and get a first-best outcome, instead of second-best as voting theory would suggest.
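For intuition, here's a toy sketch of the kind of handshake this might involve. The commit-then-reveal structure is a standard cryptographic pattern; everything else (the names, and the big assumption that “internals” can be serialized at all) is purely illustrative:

```python
import hashlib
import os

# Toy commit-then-reveal handshake: each agent commits to its internals
# first, then reveals only after seeing the other's commitment, so neither
# can tailor what it reveals to what the other revealed.

def commit(internals: bytes) -> tuple[bytes, bytes]:
    nonce = os.urandom(16)
    return hashlib.sha256(nonce + internals).digest(), nonce

def verify(commitment: bytes, nonce: bytes, internals: bytes) -> bool:
    return hashlib.sha256(nonce + internals).digest() == commitment

# Each agent serializes its internals (a big assumption for neural nets).
a_internals, b_internals = b"<A's weights>", b"<B's weights>"

a_comm, a_nonce = commit(a_internals)
b_comm, b_nonce = commit(b_internals)
# ...commitments are exchanged first, then the reveals...
assert verify(b_comm, b_nonce, b_internals)  # A checks B's reveal
assert verify(a_comm, a_nonce, a_internals)  # B checks A's reveal
```

Of course, the hard part, which this sketch doesn't touch, is verifying that what gets revealed really is the agent's actual internals rather than a decoy.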
A real-world analogy is some of the nuclear precommitments mentioned in Schelling’s book. Like when the US and the Soviets knowingly refrained from catching some of each other’s spies, because if a flock of geese triggered the warning radars or something, spies could provide their side with the crucial information that an attack isn’t really happening and there’s no need to retaliate.