I agree with the following caveats:
I think you’re being unfair to that Rob tweet and the MIRI position; having enough goal-directedness to maximize the number of granite spheres + no special structure to reward humans is a far weaker assumption than utility maximization. The argument in the tweet also goes through if the AI has 1000 goals as alien as maximizing granite spheres, which I would guess Rob thinks is more realistic. (note that I haven’t talked to him and definitely don’t speak for him or MIRI)
Shard theory is mostly just a frame and hasn’t discovered anything yet; the nontrivial observations about how agents and values behave rely on ~9 nonobvious claims, and the obviously true observations are not very powerful in arguing for alternate models of how powerful AI behaves. [If this sounds critical of shard theory, note that I’m excited about the shard theory frame, it just seems premature to conclude things from the evidence we have]
[edited to add:] Reflection might give some degree of coherence. This is important in the MIRI frame and also in the shard theory frame.
As an aside: If one thinks 1000 goals is more realistic, then I think it’s better to start communicating using examples like that, instead of “single goal” examples. (I myself lazily default to “paperclips” to communicate AGI risk quickly to laypeople, so I am critiquing myself to some extent as well.)
Anyways, on your read, how is “maximize X-quantity” different from “max EU where utility is linearly increasing in granite spheres”?
There’s a trivial sense in which the agent is optimizing the world and you can rationalize a utility function from that, but I think an agent that, from our perspective, basically just maximizes granite spheres can look quite different from the simple picture of an agent that always picks the top action according to some (not necessarily explicit) granite-sphere valuation of the actions, in ways such that the argument still goes through.
The agent can have all the biases humans do.
The agent can violate VNM axioms in any other way that doesn’t ruin it, basically anything that has low frequency or importance.
The agent only tries to maximize granite spheres 1 out of every 5 seconds, and the other 4⁄5 is spent just trying not to be turned off.
The agent has arbitrary deontological restrictions, say against sending any command to its actuators whose hash starts with 123.
The agent has 5 goals it is jointly pursuing, but only one of them is consequentialist.
The agent will change its goal depending on which cosmic rays it sees, but is totally incorrigible to us.
The original wording of the tweet was “Suppose that the AI’s sole goal is to maximize the number of granite spheres in its future light cone.” This is a bit closer to my picture of EU maximization but some of the degrees of freedom still apply.
1. Yeah, I think that’s fair. I may have pattern matched/jumped to conclusions too eagerly. Or rather, I’ve been convinced that my allegation is not very fair. But mostly, the Rob tweet provided the impetus for me to synthesise/dump all my issues with EU maximisation. I think the complaint can stand on its own, even if Rob wasn’t quite staking the position I thought he was.
That said, I do think that multi objective optimisation is way more existentially safe than optimising for a single simple objective. I don’t actually think the danger directly translates. And I think it’s unlikely that multi-objective optimisers would not care about humans or other agents.
I suspect the value shard formation hypothesis would imply instrumental convergence towards developing some form of morality. Cooperation is game theoretically optimal. Though it’s not yet clear how accurate the value shard formation hypothesis is.
2. I don’t think I’m relying too heavily on Shard Theory. I mostly cited it because it’s what actually led me in that direction, not because I fully endorse it. The only shard theory claims I rely on are:
Values are contextual influences on decision making
Reward is not the optimisation target
Do you think the first is “non obvious”?
I think one possible form of existential catastrophe is that human values get only a small share of the universe, and as a result the “utility” of the universe is much smaller than it could be. I worry this will happen if only one or few of the objectives of multi objective optimization cares about humans or human values.
Also, if one of the objectives does care about humans or human values, it might still have to do so in exactly the right way in order to prevent (other forms of) existential catastrophe, such as various dystopias. Or if more than one cares, they might all have to care in exactly the right way. So I don’t see multi objective optimisation as much safer by default, or much easier to align.
I think that multi-decision-influence networks seem much easier to align and much safer for humans.
Even on the view you advocate here (where some kind of perfection is required), “perfectly align part of the motivations” seems substantially easier than “perfectly align all of the AI’s optimization so it isn’t optimizing for anything you don’t want.”
all else equal, the more interests present at a bargaining table, the greater the chance that some of the interests are aligned.
I don’t currently understand what it means for the agent to have to care in “exactly the right way.” I find myself wanting to point to Alignment allows “nonrobust” decision-influences, but I know you’ve already read that post, so…
Perhaps I’m worried about implicit equivocation between
(i) “the agent’s values have to be adversarially robust so that they make decisions in precisely the right way, such that the plans which most appeal to the decision-maker are human-aligned plans” (I think this is wrong, as I argue in that post)
(ii) “There may be a lot of sensitivity of outcome-goodness_human to the way the human-aligned shards influence decisions, such that the agent has to learn lots of ‘niceness parameters’ in order to end up treating us well” (Seems plausible)
I think (i) is doomed and unnecessary and not how realistic agents work, and I think (ii) might be very hard. But these are really not the same thing: I think (i) and (ii) present dissimilar risk profiles and call for dissimilar research interventions. I think “EU maximization” frames focus on (i).
It seems fine to me that you think this. As I wrote in a previous post, “Trust your intuitions, but don’t waste too much time arguing for them. If several people are attempting to answer the same question and they have different intuitions about how best to approach it, it seems efficient for each to rely on his or her intuition to choose the approach to explore.”
As a further meta point, I think there’s a pattern where because many existing (somewhat) concrete AI alignment approaches seem doomed (we can fairly easily see how they would end up breaking), people come up with newer, less concrete approaches which don’t seem doomed, but only because they’re less concrete and therefore it’s harder to predict what they would actually do, or because fewer people have looked into them in detail and tried to break them. See this comment where I mentioned a similar worry with regard to Paul Christiano’s IDA when it was less developed.
In this case, I think there are many ways that a shard-based agent could potentially cause existential catastrophes, but it’s hard for me to say more, since I don’t know the details of what your proposal will be.
(For example, how do the shards resolve conflicts, and how will they eventually transform into a reflectively stable agent? If one of the shards learns a distorted version of human values, which would cause an existential catastrophe if directly maximized for, how exactly does that get fixed by the time the agent becomes reflectively stable? Or if the agent never ends up maximizing anything, why isn’t that liable to be a form of existential catastrophe? How do you propose to prevent astronomical waste caused by the agent spending resources on shard values that aren’t human values? What prevents the shard agent from latching onto bad moral/philosophical ideas and causing existential catastrophes that way?)
I don’t want to discourage you from working more on your approach and figuring out the details, but at the same time it seems way too early to say, hey let’s stop working on other approaches and focus just on this one.
I think these are great points, thanks for leaving this comment. I myself patiently await the possible day where I hit an obvious shard theory landmine which has the classic alignment-difficulty “feel” to it. That day can totally come, and I want to be ready to recognize if it does.
FWIW I’m not intending to advocate “shard theory or GTFO”, and agree that would be bad as a community policy.
I’ve tried to mention a few times[1] (but perhaps insufficiently prominently) that I’m less excited about people going “oh yeah I guess shard theory is great or something, let’s just think about that now” and more excited about reactions like “Oh, I guess I should have been practicing more constant vigilance, time to think about alignment deeply on my own terms, setting aside established wisdom for the moment.” I’m excited about other people thinking about alignment from first principles and coming up with their own inside views, with their own theories and current-best end-to-end pictures of AGI training runs.
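From Inner and outer alignment decompose one hard problem into two extremely hard problems:
A-Outer: Suppose I agreed. Suppose I just dropped outer/inner. What next?
A: Then you would have the rare opportunity to pause and think while floating freely between agendas. I will, for the moment, hold off on proposing solutions. Even if my proposal is good, discussing it now would rob us of insights you could have contributed as well. There will be a shard theory research agenda post which will advocate for itself, in due time.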
I’ve also made this point briefly at the end of in-person talks. Maybe I should say it more often.
This is a claim I strongly disagree with, assuming there aren’t enforcement mechanisms like laws or contracts. If there isn’t enforcement, then this reduces to the Prisoner’s Dilemma, and there defection is game-theoretically optimal. Cooperation only works if things can be enforced, and the likelihood that we will be able to enforce things like contracts on superhuman intelligences is essentially like that of animals enforcing things on a human, i.e. so low that it’s not worth privileging the hypothesis.
And this is important, because it speaks to why the alignment problem is hard: agents with vastly differing capabilities can’t enforce much of anything, so defection is going to happen. And I think this prediction bears out in real-life relations with animals: humans can defect consequence-free, so this usually happens.
One major exception is pets, where the norm really is cooperation, and the version that would be done for humans is essentially benevolent totalitarianism. Life’s good in such a society, but modern democratic freedoms are almost certainly gone or so manipulated that it doesn’t matter.
That might not be bad, but I do want to note that game theory without enforcement is where defection rules.
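To make the Prisoner’s Dilemma claim above concrete, here is a minimal sketch of the one-shot game with no enforcement. The payoff numbers are assumed purely for illustration (they are not from this thread), and `best_response` and `is_dominant` are hypothetical helper names. It only covers the single-round case; repeated games or games with enforcement can behave differently.

```python
# One-shot Prisoner's Dilemma with illustrative (assumed) payoffs.
# PAYOFF[(my_move, their_move)] is my payoff; higher is better.
# "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): 3,  # mutual cooperation
    ("C", "D"): 0,  # I cooperate, they defect
    ("D", "C"): 5,  # I defect, they cooperate
    ("D", "D"): 1,  # mutual defection
}
MOVES = ("C", "D")


def best_response(their_move):
    """Return the move that maximises my payoff against a fixed opponent move."""
    return max(MOVES, key=lambda my_move: PAYOFF[(my_move, their_move)])


def is_dominant(move):
    """True if `move` is the best response to every possible opponent move."""
    return all(best_response(their_move) == move for their_move in MOVES)


print({their_move: best_response(their_move) for their_move in MOVES})
print(is_dominant("D"), is_dominant("C"))
```

With these payoffs, defecting is the best response whatever the other player does, even though mutual cooperation (3 each) beats mutual defection (1 each). That gap is what enforcement would be needed to close.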
On “instrumental convergence towards developing some form of morality”: the necessary thing is a morality that stably respects the less capable agent’s wants. And the answer to whether we get that is negative, except in the pets case. And even there, this will entail the end of democracy and most freedom as we know it. It might actually be benevolent totalitarianism, and you may make an argument that this is desirable, though I do want to note the costs.
In one sense, I no longer endorse the previous comment. In another sense, I sort of endorse the previous comment.
I was basically wrong about alignment requiring human values to be game-theoretically optimal. I now think that cooperation is actually doable without relying on game theory tools like enforcement, because the situation with AI alignment is very different from the situation with human alignment: we have access to the AI’s brain and can directly reward good things and negatively reward bad things, and we have a very powerful optimizer called SGD that lets us straightforwardly select over minds and directly edit the AI’s brain. These aren’t tools we have for aligning humans, partially for ethical reasons and partially for technological reasons.
I also think my analogy of human-animal alignment is actually almost as bad as human-evolution alignment, which is worthless, and instead the better analogy for how to predict the likelihood of AI alignment is prefrontal cortex-survival value alignment, or innate reward alignment, which is very impressive alignment.
However, even assuming aligned AIs, I do think that democracy is likely to decay pretty quickly under AI because of the likely power imbalances, especially hundreds of years into the future. We will likely retain some freedoms under aligned AI rule, but I expect it to be a lot less than what we’re used to today, and it will transition into a form of benevolent totalitarianism.
What exactly do you mean by “multi objective optimization”?
Optimising multiple objective functions in a way that cannot be collapsed into a single utility function to e.g. the reals.
I guess multi objective optimisation can be represented by a single utility function that maps to a vector space, but as far as I’m aware, utility functions usually have a field as their codomain.
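One way to read “cannot be collapsed into a single utility function” is a vector-valued utility compared by Pareto dominance, which gives only a partial order. The sketch below assumes that reading; the option vectors, the weights, and the helper names `dominates`, `pareto_front` and `scalarise` are all made up for illustration.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(options):
    """Keep the options that no other option dominates."""
    return [a for a in options if not any(dominates(b, a) for b in options)]


def scalarise(weights, option):
    """Collapse a utility vector to one real number with a fixed weighting."""
    return sum(w * x for w, x in zip(weights, option))


# Each option is a (objective_1, objective_2) utility vector; values are assumed.
options = [(3, 1), (1, 3), (2, 2), (1, 1)]

# (3, 1) and (1, 3) are incomparable: neither dominates the other.
print(dominates((3, 1), (1, 3)), dominates((1, 3), (3, 1)))

# Only (1, 1) is dominated, so three options remain on the front.
print(pareto_front(options))

# Collapsing to the reals forces a choice of weights, and the winner depends on it.
print(max(options, key=lambda o: scalarise((0.9, 0.1), o)))
print(max(options, key=lambda o: scalarise((0.1, 0.9), o)))
```

The point of the last two lines is that any particular weighting does turn this into an ordinary real-valued utility, but the weighting is extra structure the vector-valued description does not itself supply.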