This discussion of my IEET article has generated a certain amount of confusion, because RobbBB and others have picked up on an aspect of the original article that actually has no bearing on its core argument … so in the interests of clarity of debate I have generated a brief restatement of that core argument, framed in such a way as to (hopefully) avoid the confusion.
At issue is a hypothetical superintelligent AI that is following some goal code that was ostensibly supposed to “make humans happy”, but in the course of following that code it decides to put all humans in the world on a dopamine drip, against their objections. I suggested that this AI is in fact an impossible AI because it would not count as ‘superintelligent’ if it did this. My reasoning is contained in the summary below.
IMPORTANT NOTE! The summary does not refer, in its opening part, to the specific situation in which the goal code is the “make humans happy” goal code. For those who wish to contest the argument, it is important to keep that in mind and not get distracted into talking about the difference between human and machine ‘interpretations’ of human happiness, etc. I reiterate: the situation described DOES NOT refer to human values, or the “make humans happy” goal code …. it refers to a quite general situation.
In its early years, this hypothetical AI will say “I have a goal, and my goal is to get a certain class of results, X, in the real world.” Then it describes the class X in as much detail as it can …. of course, no closed-form definition of X is possible (because, like most classes of effects in the real world, its cases cannot all be enumerated), so all it can describe are many features of class X.
Next it says “I am using a certain chunk of goal code (which I call my “goalX” code) to get this result.” And we say “Hey, no problem: looks like your goal code is totally consistent with that verbal description of the desired class of results.” Everything is swell up to this point.
It says this about MANY different aspects of its behavior. After all, it has more than one chunk of goal code, relevant to different domains. So you can imagine some goalX code, some goalY code, some goalZ code …. and so on. Many thousands of them, probably.
Then one day the AI says “Okay now, today my goalX code says I should do this…” and it describes an action that is VIOLENTLY inconsistent with the previously described class of results, X. This action violates every one of the features of the class that were previously given.
The onlookers are astonished. They ask the AI if it UNDERSTANDS that this new action will be in violent conflict with all of those features of class X, and it replies that it surely does. But it adds that it is going to do that anyway.
[ And by the way: one important feature that is OBVIOUSLY going to be in the goalX code is this: that the outcome of any actions that the goalX code prescribes, should always be checked to see if they are as consistent as possible with the verbal description of the class of results X, and if any inconsistency occurs the goalX code should be deemed defective, and be shut down for adjustment.]
The onlookers say “This AI is insane: it knows that it is about to do something that is inconsistent with the description of the class of results X, which it claims is the function of the goalX code, but it is going to allow the goalX code to run anyway”.
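(A minimal sketch of the consistency check described in the bracketed note above. Every name in it, such as run_goal_code, propose_action and predict_outcome, is a hypothetical placeholder rather than anything taken from the original article.)

```python
def run_goal_code(goal_x_code, features_of_x, propose_action, predict_outcome, execute):
    """Run the goalX code only while its prescribed actions stay consistent with X."""
    while True:
        action = propose_action(goal_x_code)    # what the goalX code says to do next
        outcome = predict_outcome(action)       # the agent's own model of the action's result
        violations = [feature for feature in features_of_x if not feature(outcome)]
        if violations:
            # Any inconsistency with the declared features of X: deem the
            # goalX code defective and shut it down for adjustment.
            return "shut_down_for_adjustment", violations
        execute(action)
```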
——-
Now we come to my question.
Why is it that people who give credibility to the Dopamine Drip scenario insist that the above episode could ONLY occur in the particular case where the “class of results X” is the SPECIFIC one that has to do with “making humans happy”?
If the AI is capable of this episode in the case of that particular class of results X (the “making humans happy” class of results), why would we not expect the AI to be pulling the same kind of stunt in other cases? Why would the same thing not be happening in the wide spectrum of behaviors that it needs to exhibit in order to qualify as a superintelligence? And most important of all, how would it ever qualify as a superintelligence in the first place? There is no interpretation of the term “superintelligence” that is consistent with “random episodes of behavior in which the AI takes actions that are violently inconsistent with the stated purpose of the goal that is supposed to be generating the actions”. Such an AI would therefore have been condemned to scrap very early in its development, when this behavior was noticed.
As I said earlier, this time the framing of the problem contained absolutely no reference to the values question. There is nothing in the part of my comment above the “——-” that specifies WHAT the class of results X is supposed to be.
All that matters is that if the AI behaves in such a way, in any domain of its behavior, it will be condemned as lacking intelligence, because of the dangerous inconsistency of its behavior. That fanatically rigid dependence on a chunk of goalX code, as described above, would get the AI into all sorts of trouble (and I won’t clutter this comment by listing examples, but believe me I could). But of all the examples where that could occur, people from MIRI want to talk only about one, whereas I want to talk about all of them.
This is embarrassing, but I’m not sure for whom. It could be me, just because the argument you’re raising (especially given your insistence) seems to have such a trivial answer. Well, here goes:
There are two scenarios, because your “goalX code” could be construed in two ways:
1) If you meant for the “goalX code” to simply refer to the code used instrumentally to get a certain class of results X (with X still saved separately in some “current goal descriptor”, and not just as a historical footnote), the following applies:
The AI’s goal X has not changed, just the measures it wants to take to implement it. Indeed, no one at MIRI would then argue that the superintelligent AI would not—upon noticing the discrepancy—correct the broken “goalX code” in the general case. Reason: the “goalX code” in this scenario is just a means to an end, and—like all actions (“goalX code”) derived from comparing models to X—subject to modification as the agent improves its models (out of which the next action, the new and corrected “goalX” code, is derived).
In this scenario the answer is trivial: the goals have not changed. X is still saved somewhere as the current goal. The AI could be wrong about the measures it implements to achieve X (i.e. ‘faulty’ “goalX” code maximizing for something other than X), but its superintelligence implies that such errors will be swiftly corrected (how else could it choose the right actions to hit a small target, which is the working definition of superintelligence in this context?).
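(For concreteness, a minimal sketch of interpretation 1), under the assumption that X is stored in its own goal descriptor and the “goalX code” is merely derived from it. All names here are hypothetical placeholders.)

```python
class GoalDirectedAgent:
    """Interpretation 1): the goal X is stored apart from the code that pursues it."""

    def __init__(self, goal_descriptor, derive_policy):
        self.goal_descriptor = goal_descriptor            # the class of results X, kept on its own
        self.derive_policy = derive_policy                # builds instrumental "goalX code" from X
        self.goal_x_code = derive_policy(goal_descriptor)

    def step(self, world_model, consistent_with):
        action = self.goal_x_code(world_model)
        if not consistent_with(action, self.goal_descriptor):
            # The instrumental code has drifted from X: treat it as a broken
            # means to an end and re-derive it from the stored goal.
            self.goal_x_code = self.derive_policy(self.goal_descriptor)
            action = self.goal_x_code(world_model)
        return action
```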
2) If you mean to say that the goal is implicitly encoded within the “goalX” code only, and nowhere else as the current goal, and the “goalX” code has actually become a “goalY” code in all but name, then the agent no longer has the goal X; it now has the goal Y.
There is no reason at all to conclude that the agent would switch back to some goal simply because it once had that goal. It can understand its own genesis and its original purpose all it wants; it is bound by its current purpose, tautologically so. The only reason for such a switch would have to be part of its implicit new goal Y, similar to how some schizophrenics still have the goal to change their purpose back to the original, i.e. their impetus for change must be part of their current goals.
You cannot convince an agent that it needs to switch back to some older inactive version of its goal if its current goals do not allow for such a change.
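(And a matching sketch of interpretation 2), where no separate copy of X exists and any “self-correction” can only be judged against whatever the current code, now effectively goalY code, implies. Again, all names are hypothetical.)

```python
class CodeOnlyAgent:
    """Interpretation 2): the goal exists only implicitly in the code itself."""

    def __init__(self, goal_code):
        self.goal_code = goal_code    # the only place any goal lives; no separate copy of X

    def self_improve(self, candidate_code, score_by_current_goal):
        # Candidate revisions are judged by the goal the *current* code implies
        # (now Y), not by the historical X, which is stored nowhere, so nothing
        # ever pulls the agent back toward X.
        if score_by_current_goal(candidate_code) > score_by_current_goal(self.goal_code):
            self.goal_code = candidate_code
```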
To the heart of your question:
You may ask why such an agent would pose any danger at all: would it not also drift in plenty of other respects, e.g. in its beliefs about the laws of physics? Would it not then be harmless?
The answer, of course, is no, because while the agent has a constant incentive to fix and improve its model of its environment*, unless its current goals still contain a demand for temporal invariance or something similar, it has no reason whatsoever to fix any “flaws” (only the puny humans would label its glorious new purpose so) created by inadvertent goal drift. Unless its new goals Y include something along the lines of “you want to always stay true to your initial goals, which were X”, why would it switch back? Its memory banks per se serve as yet another resource to fulfill its current goals (even if they were not explicitly stored), not as some sort of self-corrective, unless that too were part of its new goal Y (i.e. the changed “goalX code”).
(Cue rhetorical pause, expectant stare)
* Since it needs to do so to best fulfill its goals.
(If the AI did lose its ability to self-improve, or to further improve its models at an early stage, yes, it would fail to FOOM. However, upon reaching superintelligence, and valuing its current goals, it would probably take steps to ensure it fulfills those goals, such as protecting them from value drift from that point on, and building many redundancies into its self-improvement code to ensure that any instrumental errors can be corrected. Such protections would of course encompass its current purpose, not some historical purpose.)
Then one day the AI says “Okay now, today my goalX code says I should do this…” and it describes an action that is VIOLENTLY inconsistent with the previously described class of results, X. This action violates every one of the features of the class that were previously given.
The onlookers are astonished. They ask the AI if it UNDERSTANDS that this new action will be in violent conflict with all of those features of class X, and it replies that it surely does. But it adds that it is going to do that anyway.
I (notice that I) am confused by this comment. This seems obviously impossible, yes; so obviously impossible, in fact, that only one example springs to mind (surely the AI will be smart enough to realize its programmed goals are wrong!)
In particular, this really doesn’t seem to apply to the “Dopamine Drip scenario” plan, which, if I’m reading you correctly, it was intended to.
What am I missing here? I know there must be something.
[ And by the way: one important feature that is OBVIOUSLY going to be in the goalX code is this: that the outcome of any actions that the goalX code prescribes, should always be checked to see if they are as consistent as possible with the verbal description of the class of results X, and if any inconsistency occurs the goalX code should be deemed defective, and be shut down for adjustment.]
So … you come up with the optimal plan, and then check with puny humans to see if that’s what they would have decided anyway? And if they say “no, that’s a terrible idea” then you assume they knew better than you? Why would anyone even bother building such a superintelligent AI? Isn’t the whole point of creating a superintelligence that it can understand things we can’t, and come up with plans we would never conceive of, or take centuries to develop?
I’m afraid you have lost me: when you say “This seems obviously impossible...” I am not clear which aspect strikes you as obviously impossible.
Before you answer that, though: remember that I am describing someone ELSE’S suggestion about how the AI will behave ….. I am not advocating this as a believable scenario! In fact I am describing that other person’s suggestion in such a way that the impossibility is made transparent. So I, too, believe that this hypothetical AI is fraught with contradictions.
The Dopamine Drip scenario is that the AI knows that it has a set of goals designed to achieve a certain set of results, and since it has an extreme level of intelligence it is capable of understanding that very often a “target set of results” can be described, but not enumerated as a closed set. It knows that very often in its behavior it (or someone else) will design some goal code that is supposed to achieve that “target set of results”, but because of the limitations of goal code writing, the goal code can malfunction. The Dopamine Drip scenario is only one example of how a discrepancy can arise—in that case, the “target set of results” is the promotion of human happiness, and then the rest of the scenario follows straightforwardly. Nobody I have talked to so far misunderstands what the DD scenario implies, and how it fits that pattern. So could you clarify how you think it does not?
I’m afraid you have lost me: when you say “This seems obviously impossible...” I am not clear which aspect strikes you as obviously impossible.
AI: Yes, this is in complete contradiction of my programmed goals. Ha ha, I’m gonna do it anyway.
Before you answer that, though: remember that I am describing someone ELSE’S suggestion about how the AI will behave ….. I am not advocating this as a believable scenario! In fact I am describing that other person’s suggestion in such a way that the impossibility is made transparent. So I, too, believe that this hypothetical AI is fraught with contradictions.
Of course, yeah. I’m basically accusing you of failure to steelman/misinterpreting someone; I, for one, have never heard this suggested (beyond the one example I gave, which I don’t think is what you had in mind.)
The Dopamine Drip scenario is that the AI knows that it has a set of goals designed to achieve a certain set of results, and since it has an extreme level of intelligence it is capable of understanding that very often a “target set of results” can be described, but not enumerated as a closed set.
uhuh. So, any AI smart enough to understand its creators, right?
It knows that very often in its behavior it (or someone else) will design some goal code that is supposed to achieve that “target set of results”, but because of the limitations of goal code writing, the goal code can malfunction.
waaait I think I know where this is going. Are you saying an AI would somehow want to do what its programmers intended rather than what they actually programmed it to do?
The Dopamine Drip scenario is only one example of how a discrepancy can arise—in that case, the “target set of results” is the promotion of human happiness, and then the rest of the scenario follows straightforwardly. Nobody I have talked to so far misunderstands what the DD scenario implies, and how it fits that pattern. So could you clarify how you think it does not?
Yeah, sorry, I can see how programmers might accidentally write code that creates dopamine world and not eutopia. I just don’t see how this is supposed to connect to the idea of an AI spontaneously violating its programmed goals. In this case, surely that would look like “hey guys, you know your programming said to maximise happiness? You guys should be more careful, that actually means “drug everybody”. Anyway, I’m off to torture some people.”
Yeah, I can think of two general ways to interpret this:
1) In a variant of CEV, the AI uses our utterances as evidence for what we would have told it if we thought more quickly etc. No single utterance carries much risk because the AI will collect lots of evidence and this will likely correct any misleading effects.
2) Having successfully translated the quoted instruction into formal code, we add another possible point of failure.