(1) Given: AI risk comes primarily from AI optimizing for things besides human values.
(2) Given: humans already are optimizing for things besides human values. (Or, at least, besides our Coherent Extrapolated Volition.)
(3) Given: Our world is okay.^[CITATION NEEDED!]
(4) Therefore, imperfect value loading can still result in an okay outcome.
This is, of course, not necessarily always the case for any given imperfect value loading. However, our world serves as a single counterexample to the rule that all imperfect optimization will be disastrous.
(5) Given: A maxipok strategy is optimal. (“Maximize the probability of an okay outcome.”)
(6) Given: Partial optimization for human values is easier than total optimization. (Where “partial optimization” is at least close enough to achieve an okay outcome.)
(7) ∴ MIRI should focus on imperfect value loading.
Note that I’m not convinced of several of the givens, so I’m not certain of the conclusion. However, the argument itself looks convincing to me. I’ve also chosen to leave assumptions like “imperfect value loading results in partial optimization” unstated, as part of the definitions of those two terms. That said, I’ll try to add detail to any specific area if questioned. (A toy numerical sketch of the maxipok decision rule follows below.)
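To make premise (5) concrete, here is a minimal toy sketch of how a maxipok rule can come apart from straight expected-utility maximization. It is purely illustrative: the two strategies, their probabilities, and their utilities are invented placeholders, not estimates of real odds.

```python
# Toy illustration of the difference between a maxipok rule ("maximize the
# probability of an okay outcome") and plain expected-utility maximization.
# Every number below is an invented placeholder.

strategies = {
    # strategy: list of (probability, utility, counts_as_okay) outcomes
    "aim for perfect value loading": [
        (0.20, 1000, True),   # full FAI / utopia
        (0.80, -100, False),  # misaligned optimizer
    ],
    "settle for imperfect value loading": [
        (0.60, 10, True),     # merely okay outcome
        (0.40, -100, False),  # still fails
    ],
}

def p_okay(outcomes):
    return sum(p for p, _, okay in outcomes if okay)

def expected_utility(outcomes):
    return sum(p * u for p, u, _ in outcomes)

maxipok_pick = max(strategies, key=lambda s: p_okay(strategies[s]))
max_eu_pick = max(strategies, key=lambda s: expected_utility(strategies[s]))

print("maxipok picks:         ", maxipok_pick)   # the higher-P(okay) strategy
print("expected-utility picks:", max_eu_pick)    # the higher-EV strategy
```

With these made-up numbers the two rules disagree, which is the whole point of premise (5): maxipok deliberately gives up some expected utility to raise the chance of landing somewhere okay.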
However, our world serves as a single counterexample to the rule that all imperfect optimization will be disastrous.
Except that the proposed rule is more like: given an imperfect objective function, the outcome is likely to turn from okay to disastrous at some point as optimization power is increased. See the Context Disaster and Edge Instantiation articles at Arbital.
The idea of context disasters applies to humans, and to humanity as a whole, as well as to AIs, since, as you mentioned, we are already optimizing for something that is not exactly our true values. Even without the possibility of AI, we have a race between technological progress (which increases our optimization power) and progress in coordination and in understanding our values (which improves our objective function), and we could easily lose that race.
I think a problem arises with conclusion 4: I can agree that humans imperfectly steering the world toward their own values has resulted in a world that is okay on average, but AI will possibly be much more powerful than humans. Insofar as corporations and sovereign states can be seen as super-human entities, we can already see that imperfect value optimization has created massive suffering: think of all the damage a ruthless corporation can inflict, e.g. by polluting the environment, or a state where political assassination is easy and widespread. An imperfectly aligned value optimization might result in a world that is okay on average, but that world could well be split into a heaven and a hell, which I think is not an acceptable outcome.
This is a good point. Pretty much all the things we’re optimizing for which aren’t our values are due to coordination problems. (There’s also Akrasia/addiction sorts of things, but that’s optimizing for values which we don’t endorse upon reflection, and so arguably isn’t as bad as optimizing for a random part of value-space.)
So, Moloch might optimize for things like GDP instead of Gross National Happiness, and individuals might throw a thousand starving orphans under the bus for a slightly bigger yacht or whatever, but neither is fully detached from human values. Even if U(orphans)>>U(yacht), at least there’s an awesome yacht to counterbalance the mountain of suck.
I guess the question is precisely how diverse human values are in the grand scheme of things, and what the odds are of hitting a human value when picking a random or semi-random subset of value-space. If we get FAI slightly wrong, precisely how wrong does it have to be before it leaves our little island of value-space? Tiling the universe with smiley faces is obviously out, but what about hedonium, or wireheading everyone? Faced with an unwinnable AI arms race and no time for true FAI, I’d probably consider those better than nothing.
That’s a really, really tiny sliver of my values though, so I’m not sure I’d even endorse such a strategy if the odds were 100:1 against FAI. If that’s the best we could do by compromising, I’d still rate the expected utility of MIRI’s current approach higher, and hold out hope for FAI.
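To spell out the arithmetic behind that judgment, here is a toy expected-value check; all the numbers (the value of hedonium relative to full FAI, the success probabilities) are invented purely for illustration.

```python
# Toy expected-value check of the intuition above. All numbers are invented:
# suppose hedonium/wireheading captures only 0.1% of the value of full FAI,
# and "100:1 against" means roughly a 1% chance of FAI succeeding.

value_of_fai = 1.0          # normalize a full FAI outcome to 1
value_of_hedonium = 0.001   # "a really, really tiny sliver" of that value
p_fai = 0.01                # ~100:1 against
p_hedonium = 0.90           # assume the compromise strategy almost always works

ev_hold_out_for_fai = p_fai * value_of_fai               # 0.0100
ev_settle_for_hedonium = p_hedonium * value_of_hedonium  # 0.0009

print(ev_hold_out_for_fai > ev_settle_for_hedonium)  # True: holding out still wins
```

On these placeholder numbers, holding out for FAI beats the compromise even at 100:1 odds, because the compromise captures so little of the value; that is the sense in which the sliver matters more than the probabilities.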
(1) Given: AI risk comes primarily from AI optimizing for things besides human values.
I don’t think that’s a good description of the orthogonality thesis. An AI that optimizes for a single human value like purity could still produce huge problems.
Given: humans already are optimizing for things besides human values.
Humans don’t effectively self-modify to achieve specific objectives in the way an AGI could.
(6) Given: Partial optimization for human values is easier than total optimization. (Where “partial optimization” is at least close enough to achieve an okay outcome.)
I don’t think that’s a good description of the orthogonality thesis.
Probably not, but it highlights the relevant (or at least related) portion. I suppose I could have been more precise by specifying terminal values, since things like paperclips are obviously instrumental values, at least for us.
Humans don’t effectively self-modify
Agreed, except in the trivial case where we can condition ourselves to have different emotional responses. That’s substantially less dangerous, though.
Partial optimization for human values is easier than total optimization.
Why do you believe that?
I’m not sure I do, in the sense that I wouldn’t assign the proposition >50% probability. However, I might put the odds at around 25% for a Reduced Impact AI architecture providing a useful amount of shortcuts.
That seems like decent odds of significantly boosting expected utility. If such an AI would be faster to develop by even just a couple of years, that could make the difference between winning and losing an AI arms race. Sure, it’d be at the cost of a utopia, but if it boosted the odds of success enough, it’d still have enough expected utility to compensate.
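One way to put toy numbers on that trade-off is below; every probability and payoff is an invented placeholder, and “Reduced Impact” stands in loosely for whatever shortcut architecture might turn out to exist.

```python
# Toy expected-utility comparison for the trade-off described above.
# Every probability and value below is an invented placeholder.

p_shortcuts = 0.25         # chance a Reduced Impact-style architecture really is easier
value_utopia = 1.0         # full CEV-loaded FAI
value_okay = 0.4           # merely "okay" partial-optimization outcome
p_win_race_full = 0.20     # chance of finishing full FAI before losing an arms race
p_win_race_partial = 0.60  # chance if the shortcut buys a couple of years

ev_full = p_win_race_full * value_utopia

# If the shortcuts don't exist, we're back to the full-FAI timeline and payoff.
ev_partial = (p_shortcuts * p_win_race_partial * value_okay
              + (1 - p_shortcuts) * p_win_race_full * value_utopia)

print(f"EV(aim for full FAI)      = {ev_full:.3f}")     # 0.200
print(f"EV(try the shortcut path) = {ev_partial:.3f}")  # 0.210
```

With these particular placeholders the shortcut path ekes out a small edge, but nudging any of the numbers flips the conclusion, which is why the 25% estimate above does so much of the work.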
I expect that MIRI would mostly disagree with claim 6.
Can you suggest something specific that MIRI should change about their agenda?
When I try to imagine problems for which imperfect value loading suggests different plans from perfectionist value loading, I come up with things like “don’t worry about whether we use the right set of beings when creating a CEV”. But MIRI gives that kind of problem low enough priority that they’re acting as if they agreed with imperfect value loading.
I’m pretty sure I also mostly disagree with claim 6. (See my other reply below.)
The only specific concrete change that comes to mind is that it may be easier to take one person’s CEV than aggregate everyone’s CEV. However, this is likely to be trivially true, if the aggregation method is something like averaging.
If that’s one or two more lines of code, then obviously it doesn’t really make sense to try to put those lines in last to get FAI 10 seconds sooner, except in a spherical-cow-in-a-vacuum sense. However, if “solving the aggregation problem” is a couple of years’ worth of work, maybe it does make sense to prioritize other things first in order to get FAI a little sooner. This is especially true in the event of an AI arms race.
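As a purely illustrative sketch of the “trivially easy” averaging aggregation mentioned above, suppose (unrealistically) that each person’s extrapolated values could be summarized as weights over a few named value dimensions; the dimensions and numbers here are made up.

```python
# Minimal sketch of the naive "just average everyone" aggregation discussed
# above. The value dimensions and weights are invented for illustration only.

from statistics import mean

individual_cevs = [
    {"freedom": 0.9, "hedonic_wellbeing": 0.6, "fairness": 0.7},
    {"freedom": 0.4, "hedonic_wellbeing": 0.9, "fairness": 0.8},
    {"freedom": 0.7, "hedonic_wellbeing": 0.5, "fairness": 0.9},
]

def average_cev(cevs):
    """Naive aggregation: average each value dimension across people."""
    dimensions = cevs[0].keys()
    return {dim: mean(person[dim] for person in cevs) for dim in dimensions}

print(average_cev(individual_cevs))
```

The point is only that this kind of aggregation really is a one-liner; everything hard is hidden in producing the per-person value summaries in the first place, which is why the “couple years of work” scenario seems more likely than the “one or two lines of code” one.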
I’m especially curious whether anyone else can come up with scenarios where a maxipok strategy might actually be useful. For instance, is there any work being done on CEV which is purely on the extrapolation procedure or procedures for determining coherence? It seems like if only half our values can easily be made coherent, and we can load them into an AI, that might generate an okay outcome.
3) The world is OK with humans optimizing for the wrong things because humans eventually die and take their ideas, good or bad, with them. Power and wealth get redistributed. Humans get old, they get weak, they get dull, they lose interest. If an AI gets it wrong, then, well, that’s it.
Not necessarily, depends on your AI and how god-like it is.
In the XIX century you could probably make the same argument about corporations: once one corporation rises above the rest, it will use its power to squash competition and install itself as the undisputed economic ruler forever and ever. The reality turned out to be rather different, and not for lack of trying.
Not necessarily, depends on your AI and how god-like it is.
I hope you’re right. I just automatically think that AI is going to be god-like by default.
In the XIX century you could probably make the same argument about corporations
Not just corporations; you could make the same argument for sovereign states, foundations, trusts, militaries, and religious orgs.
The weak argument is that corporations, with their visions, charters, and mission statements, are ultimately run by a meatbag, or jointly by meatbags, that die or retire; at least, that’s how it currently is. You can’t retain humans forever. Corporations lose valuable and capable employee brains over time and replace them with new brains, which may be better or worse, but you certainly can’t keep your best humans forever. Power is checked: Bill Gates plans his legacy, Sumner Redstone is infirm with kids jockeying for power, and Steve Jobs is dead.
Well, the default on LW is EY’s FOOM scenario where an AI exponentially bootstraps itself into Transcendent Realms and, as you say, that’s it. The default in the rest of the world… isn’t like that.
1) Sure.
2) Okay.
3) Yup.
4) This is weaselly. Sure, 1-3 are enough to establish that an okay outcome is possible, but they don’t really say anything about probability. You also don’t talk about how powerful an optimization process is trying to optimize these values.
5) Willing to assume for the sake of argument.
6) Certainly true, but not certainly useful.
7) Doesn’t follow, unless you read 6 in a way that makes it potentially untrue.
All of this would make more sense if you tried to put probabilities to how likely you think certain outcomes are.
That was pretty much my take. I get the feeling that “okay” outcomes are a vanishingly small portion of probability space. This suggests to me that the additional marginal effort to stipulate “okay” outcomes instead of perfect CEV is extremely small, if not negative. (By negative, I mean that it would actually take additional effort to program an AI to maximize for “okay” outcomes instead of CEV.)
However, I didn’t want to ask a leading question, so I left it in the present form. It’s perhaps academically interesting that the desirability of outcomes as a function of “similarity to CEV” is a continuous curve rather than a binary good/bad step function. However, I couldn’t really see any way of taking advantage of this. I posted mainly to see if others might spot potential low hanging fruit.
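A toy formalization of that distinction follows; the particular curve shapes are arbitrary and chosen only to illustrate the step-function versus continuous-curve contrast.

```python
# Sketch of the distinction drawn above: desirability of an outcome as a
# function of "similarity to CEV", contrasted with a binary good/bad step.
# The curve shapes are arbitrary illustrations, not claims about real values.

def step_desirability(similarity, threshold=0.99):
    """Binary view: anything short of (nearly) perfect CEV is a total loss."""
    return 1.0 if similarity >= threshold else 0.0

def continuous_desirability(similarity):
    """Continuous view: closer to CEV is better, with 'okay' outcomes in between."""
    return similarity ** 3  # arbitrary convex shape: most of the value is near the top

for s in (0.0, 0.5, 0.9, 0.99, 1.0):
    print(s, step_desirability(s), round(continuous_desirability(s), 3))
```

If the continuous picture is right, there is at least a region of “okay” outcomes short of perfect CEV; whether anything exploitable lives in that region is exactly the open question below.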
I guess the interesting follow up questions are these: Is there any chance that humans are sufficiently adaptable that human values are more than just an infinitesimally small sliver of the set of all possible values? If so, is there any chance this enables an easier alternative version of the control problem? It would be nice to have a plan B.