asking an AI to make itself friendly

anotheruser 27 Jun 2011 7:06 UTC

−5 points

edit: I think I have phrased this really poorly and that this has been misinterpreted. See my comment below for clarification.

A lot of thought has been put into the discussion of how one would need to define the goals of an AI so that it won’t find any “loopholes” and act in an unintended way.

Assuming one already had an AI that is capable of understanding human psychology, which seems necessary to me to define the AI’s goals anyway, wouldn’t it be reasonable to assume that the AI would have an understanding of what humans want?

If that is the case, would the following approach work to make the AI friendly?

-give it the temporary goal to always answer questions truthfully as far as possible while admitting uncertainty

-also give it the goal to not alter reality in any way besides answering questions.

-ask it what it thinks would be the optimal definition of the goal of a friendly AI, from the point of view of humanity, accounting for things that humans are too stupid to see coming.

-have a discussion between it and a group of ethicists/philosophers wherein both parties are encouraged to point out any flaws in the definition.

-have this go on for a long time until everyone (especially the AI, seeing as it is smarter than anyone else) is certain that there is no flaw in the definition and that it accounts for all kinds of ethical contingencies that might arise after the singularity.

-implement the result as the new goal of the AI.

What do you think of this approach?


MixedNuts 27 Jun 2011 7:54 UTC
27 points
Standard proposal. Standard reply.
Alexei 27 Jun 2011 14:54 UTC
13 points
Hi, anotheruser. Welcome to LW. Please don’t be upset that some of your comments/posts are getting downvoted; this happens to pretty much everyone who enters this community. But I congratulate you on your first step: becoming a participant rather than a lurker.

There are a lot of questions that this community has already answered, and I’m sure you understand why they wouldn’t want to answer the same questions over and over, hence MixedNuts’ comment, and hence the downvotes. This is nothing personal.

If you haven’t read through the Sequences yet, please do. It will teach you a lot about the things that most people here assume you already know, and it will show you what problems this community is trying to solve and how it’s approaching them.

In the meantime, please continue to participate. Comment, ask other people questions, etc… And let me know if you have any questions.
prase 27 Jun 2011 9:37 UTC
7 points

give it the temporary goal to always answer questions truthfully as far as possible while admitting uncertainty

Questions can be interpreted in different ways. Especially considering your further suggestion to involve ethicists and philosophers, sooner or later someone will ask “is it moral to nuke Pyongyang?”, and I am far from sure you can prove that “yes” is not a truthful answer.

also give it the goal to not alter reality in any way besides answering questions.

Answers can be formulated creatively. “Either thirteen, or we may consider nuking Pyongyang” is a truthful answer to “how much is six plus seven”. Now this is trivial and unlikely to persuade anybody, but perhaps you can imagine far more creative works of sophistry on the output of a superintelligent AI.

ask it what it thinks would be the optimal definition of the goal of a friendly AI, from the point of view of humanity, accounting for things that humans are too stupid to see coming.

This is opaque. What exactly does the question mean? You have to specify optimal, and that’s the difficult thing. Unless you are very certain and strict about the meaning of “optimal”, you may end up with an arbitrary answer.

have a discussion between it and a group of ethicists/philosophers wherein both parties are encouraged to point out any flaws in the definition.

Given the history of moral philosophy, I wouldn’t trust a group of ethicists enough. Philosophers can be persuaded to defend a lot of atrocities.

have this go on for a long time until everyone (especially the AI, seeing as it is smarter than anyone else) is certain that there is no flaw in the definition and that it accounts for all kinds of ethical contingencies that might arise after the singularity.

How does the flaw detection process work? What does it mean to have a flaw in a definition?
timtyler 27 Jun 2011 7:11 UTC
3 points

ask it what it thinks would be the optimal definition of the goal of a friendly AI, from the point of view of humanity, accounting for things that humans are too stupid to see coming.

Optimal? Can you say more clearly what you mean by that?

give it the goal to not alter reality in any way besides answering questions.

It is not allowed to consume power or generate heat? Can you say more clearly what you mean by that?

have this go on for a long time until [...]

...until someone else builds a superintelligence?
- anotheruser 29 Jun 2011 6:40 UTC
  −1 points
  Parent
  The whole point of what I’m trying to say is that I don’t need to elaborate on the task definition. The AI is smarter than us and understands human psychology. If we don’t define “optimal” properly it should be able to find a suitable definition on its own by imagining what we might have meant. If that turns out to be wrong, we can tell it and it comes up with an alternative.
  
  I agree on the second point. It would be hard to define that goal properly, so it doesn’t just shut itself down, but I don’t think it would be impossible.
  
  The idea that someone else would be able to build a superintelligence while you are teaching yours seems kind of far-fetched. I would assume that this takes a lot of effort and can only be done by huge corporations or states, anyway. If that is the case, there would be ample warning when one should finalize the AI and implement the goal before someone else becomes a threat by accidentally unleashing a paperclip maximizer.
  - Larks 29 Jun 2011 15:31 UTC
    4 points
    Parent
    
    If we don’t define “optimal” properly it should be able to find a suitable definition on its own by imagining what we might have meant.
    
    But it wouldn’t want to. If we mistakenly define ‘optimal’ to mean ‘really good at calculating pi’ then it won’t want to change itself to aim for our real values. It would realise that we made a mistake, but wouldn’t want to rectify it, because the only thing it cares about is calculating pi, and helping humans isn’t going to do that.
    
    You’re broadly on the right track; the idea of CEV is that we just tell the AI to look at humans and do what they would have wanted it to do. However, we have to actually be able to code that; it’s not going to converge on that by itself.
    - anotheruser 29 Jun 2011 15:42 UTC
      −3 points
      Parent
      It would want to, because its goal is defined as “tell the truth”.
      
      You have to differentiate between the goal we are trying to find (the optimal one) and the goal that is actually controlling what the AI does (“tell the truth”), while we are still looking for what that optimal goal could be.
      
      The optimal goal is only implemented later, when we are sure that there are no bugs.
  - timtyler 29 Jun 2011 8:09 UTC
    1 point
    Parent
    
    The AI is smarter than us and understands human psychology. If we don’t define “optimal” properly it should be able to find a suitable definition on its own by imagining what we might have meant. If that turns out to be wrong, we can tell it and it comes up with an alternative.
    
    First you have to tell the machine to do that. It isn’t trivial. The problem is not with the definition of “optimal” itself—but with what function is being optimised.
    
    The idea that someone else would be able to build a superintelligence while you are teaching yours seems kind of far-fetched.
    
    Well, not if you decide to train it for “a long time”. History is full of near-simultaneous inventions being made in different places. Corporate history is full of close competition. There are anti-monopoly laws that attempt to prevent dominance by any one party, usually by screwing with any company that gets too powerful.
    - anotheruser 29 Jun 2011 14:52 UTC
      −1 points
      Parent
      
      First you have to tell the machine to do that. It isn’t trivial. The problem is not with the definition of “optimal” itself - but with what function is being optimised.
      
      If the AI understands psychology, it knows what motivates us. We won’t need to explicitly explain any moral conundrums or point out dichotomies. It should be able to infer this knowledge from what it knows about the human psyche. Maybe it could just browse the internet for material on this topic to inform itself of how we humans work.
      
      The way I see it, we humans will have as little need to tell the AI what we want as ants, if they could talk, would have a need to point out to a human that they don’t want him to destroy their colony. Even the most abstract conundrums that philosophers needed centuries to even point out, much less answer, might seem obvious to the AI.
      
      The above paragraph obviously only applies if the AI is already superhuman, but the general idea behind it works regardless of its intelligence.
      
      Well, not if you decide to train it for “a long time”. History is full of near-simultaneous inventions being made in different places. Corporate history is full of close competition. There are anti-monopoly laws that attempt to prevent dominance by any one party, usually by screwing with any company that gets too powerful.
      
      OK, this might pose a problem. A possible solution: The AI, being supposed to turn into a benefactor for humanity as a whole, is developed in an international project instead of by a single company. This would ensure enough funding that it would be hard for a company to develop it faster, draw every AI developer to this one project, thus further eliminating competition, and reduce the chance that executive meddling causes people to get sloppy to save money.
      - timtyler 29 Jun 2011 17:55 UTC
        1 point
        Parent
        
        If the AI understands psychology, it knows what motivates us. We won’t need to explicitly explain any moral conundrums or point out dichotomies. It should be able to infer this knowledge from what it knows about the human psyche. Maybe it could just browse the internet for material on this topic to inform itself of how we humans work.
        
        The way I see it, we humans will have as little need to tell the AI what we want as ants, if they could talk, would have a need to point out to a human that they don’t want him to destroy their colony. Even the most abstract conundrums that philosophers needed centuries to even point out, much less answer, might seem obvious to the AI.
        
        So: a sufficiently intelligent agent would be able to figure out what humans wanted. We have to make it care about what we want—and also tell it how to peacefully resolve our differences when our wishes conflict.
        
        The AI, being supposed to turn into a benefactor for humanity as a whole, is developed in an international project instead of by a single company. This would ensure enough funding that it would be hard for a company to develop it faster, draw every AI developer to this one project, thus further eliminating competition, and reduce the chance that executive meddling causes people to get sloppy to save money.
        
        Uh huh. So: it sounds as though you have your work cut out for you.
      - Peterdjones 29 Jun 2011 15:00 UTC
        −1 points
        Parent
        
        The way I see it, we humans will have as little need to tell the AI what we want as ants, if they could talk, would have a need to point out to a human that they don’t want him to destroy their colony.
        
        But the human doesn’t have to comply. You are assuming an ant-friendly human. Many humans aren’t ant-friendly.
mutterc 29 Jun 2011 0:38 UTC
1 point

Assuming one already had an AI that is capable of understanding human psychology

From what I understand, that’s actually the hard part of the Friendliness problem.
- anotheruser 29 Jun 2011 6:43 UTC
  −4 points
  Parent
  It’s a hard problem, but a riskless one.
  
  Getting it to understand psychology may be difficult, but as it isn’t a goal, just something it learns while growing, there is no risk in it. The risk comes from the goal. I’m trying to reduce the risk of coming up with a flawed goal for the AI by using its knowledge of psychology.
anotheruser 27 Jun 2011 17:29 UTC
−1 points
I have read the sequences (well, most of them). I can’t find this as a standard proposal.

I think that I haven’t made clear what I wanted to say so you just defaulted to “he has no idea what he is talking about” (which is reasonable).

What I meant to say is that rather than defining the “optimal goal” of the AI based on what we can come up with ourselves, the problem can be delegated to the AI itself as a psychological problem.

I assume that an AI would possess some knowledge of human psychology, as that would be necessary for pretty much every practical application, like talking to it.

What then prevents us from telling the AI the following:

“We humans would like to become immortal and live in utopia (or however you want to phrase it; if the AI is smart it will understand what you really mean through psychology). We disagree on the specifics and are afraid that something may go wrong. There are many contingencies to consider. Here is a list of contingencies we have come up with. Do you understand what we are trying to do? As you are much smarter than us, can you find anything that we have overlooked but that you expect us to agree with you on, once you point it out to us? Different humans have different opinions. This factors into this problem, too. Can you propose a general solution to this problem that remains flexible in the face of an unpredictable future (transhumans may have different ethics)?”

In essence, it all boils down to asking the AI:

“if you were in our position, if you had our human goals and drives, how would you define your (the AI’s) goals?”

If you have an agent that is vastly more intelligent than you are and that understands how your human mind works, couldn’t you just delegate the task of finding a good goal for it to the AI itself, just like you can give it any other kind of task?
- orthonormal 27 Jun 2011 17:44 UTC
  10 points
  Parent
  Welcome to Less Wrong!
  
  In a sense, the Friendly AI problem is about delegating the definition of Friendliness to a superintelligence. The main issue is that it’s easy to underestimate (on account of the Mind Projection Fallacy) how large a kernel of the correct answer it needs to start off with, in order for that delegation to work properly. There’s rather a lot that goes into this, and unfortunately it’s scattered over many posts that aren’t collected in one sequence, but you can find much of it linked from Fake Fake Utility Functions (sic, and not a typo) and Value is Fragile.
- MixedNuts 29 Jun 2011 7:31 UTC
  3 points
  Parent
  
  In essence, it all boils down to asking the AI: “if you were in our position, if you had our human goals and drives, how would you define your (the AI’s) goals?”
  
  That’s extrapolated volition.
  
  And it requires telling the AI “Implement good. Human brains contain evidence for good, but don’t define it; don’t modify human drives, that won’t change good.”. It requires telling it “Prove you don’t get goal drift when you self-modify.”. It requires giving it an explicit goal system for its infancy, telling it that it’s allowed to use transistors despite the differences in temperature and gravity and electricity consumption that causes, but not to turn the galaxy into computronium—and writing the general rules for that, not the superficial cases I gave—and telling it how to progressively overwrite these goals with its true ones.
  
  “Oracle AI” is a reasonable idea. Writing object-level goals into the AI would be bloody stupid, so we are going to do some derivation, and Oracle isn’t much further than CEV. Bostrom defends it. But seriously, “don’t influence reality beyond answering questions”?
  - anotheruser 29 Jun 2011 14:37 UTC
    −5 points
    Parent
    No, none of this needs to be explicitly taught to it, that’s what I’m trying to say.
    
    The AI understands psychology, so just point it at the internet and tell it to inform itself. It might even read through this very comment of yours, think that these topics might be important for its task and decide to read about them, all on its own.
    
    By ordering it to imagine what it would do in your position you implicitly order it to inform itself of all these things so that it can judge well.
    
    If it fails to do so, the humans conversing with the AI will be able to point out a lot of things in the AI’s suggestion that they wouldn’t be comfortable with. This in turn will tell the AI that it should better inform itself of all these topics and consider them so that the humans will be more content with its next suggestion.
- Peterdjones 27 Jun 2011 18:21 UTC
  −2 points
  Parent
  You’re assuming the friendliness problem has been solved. An evil AI could see the question as a perfect opportunity to hand down a solution that could spell our doom.
  - anotheruser 28 Jun 2011 9:41 UTC
    −3 points
    Parent
    Why would the AI be evil?
    
    Intentions don’t develop on their own. “Evil” intentions could only arise from misinterpreting existing goals.
    
    While you are asking it to come up with a solution, you have its goal set to what I said in the original post:
    
    “the temporary goal to always answer questions truthfully as far as possible while admitting uncertainty”
    
    Where would the evil intentions come from? At the moment you are asking the question, the only thing on the AI’s mind is how it can answer truthfully.
    
    The only loophole I can see is that it might realize it can reduce its own workload by killing everyone who is asking it questions, but that would be countered by the secondary goal “don’t influence reality beyond answering questions”.
    
    Unless the programmers are unable to give the AI this extremely simple goal to just always speak the truth (as far as it knows), the AI won’t have any hidden intentions.
    
    And if the programmers working on the AI really are unable to implement this relatively simple goal, there is no hope that they would ever be able to implement the much more complex “optimal goal” they are trying to find out, anyway.
    - Peterdjones 28 Jun 2011 17:54 UTC
      −1 points
      Parent
      
      Why would the AI be evil?
      
      Bugs, maybe
      
      Intentions don’t develop on their own. “Evil” intentions could only arise from misinterpreting existing goals.
      
      While you are asking it to come up with a solution, you have its goal set to what I said in the original post:
      
      Have you? Are you talking about a human-level AI? Asking or commanding a human to do something doesn’t set that as their one and only goal. A human reacts according to their existing goals: they might comply, refuse or subvert the command.
      
      “the temporary goal to always answer questions truthfully as far as possible while admitting uncertainty”
      
      Why would it be easier to code in “be truthful” than “be friendly”?
      - anotheruser 29 Jun 2011 6:53 UTC
        −1 points
        Parent
        That would have to be a really sophisticated bug to misinterpret “always answer questions truthfully as far as possible while admitting uncertainty” as “kill all humans”. I’d imagine that something as drastic as that could be found and corrected long before that. Consider that you have its goal set to this. It knows no other motivation but to respond truthfully. It doesn’t care about the survival of humanity, or itself or about how reality really is. All it cares for is to answer the questions to the best of its abilities.
        
        I don’t think that this goal would be all too hard to define either, as “the truth” is a pretty simple concept. As long as it deals with uncertainty in the right way (by admitting it), how could this be misinterpreted? Friendliness is far harder to define because we don’t even know a definition for it ourselves. There are far too many things to consider when defining “friendliness”.
        Larks 29 Jun 2011 15:24 UTC
        4 points
        Parent
        Trivial Failure Case: The AI turns the universe into hardware to support really big computations, so it can be really sure it’s got the right answer, and also calibrate itself really well on the uncertainty.
        Peterdjones 29 Jun 2011 12:50 UTC
        −6 points
        Parent
        
        I don’t think that this goal would be all too hard to define either, as “the truth” is a pretty simple concept.
        
        Legions of philosophers would disagree with you
        
        that would have to be a really sophisticated bug to misinterpret “always answer questions truthfully as far as possible while admitting uncertainty” as “kill all humans”.
        
        Maybe “Humans should die” is the truth. Maybe humans are bad for the planet. One of the problems with FAI is that you don’t want to give it objective morality because of that risk. You want it to side with humans. Hence “friendly” AI rather than “righteous AI”.
        anotheruser 29 Jun 2011 14:26 UTC
        1 point
        Parent
        
        Legions of philosophers would disagree with you
        
        They just bicker endlessly about uncertainty. “can you really know that 1+1=2?”. No, but it can be used as valid until proven otherwise (which will never happen). As I said, the AI would need to understand the idea of uncertainty.
        
        Maybe “Humans should die” is the truth. Maybe humans are bad for the planet. One of the problems with FAI is that you don’t want to give it objective morality because of that risk. You want it to side with humans. Hence “friendly” AI rather than “righteous AI”.
        
        There is no such thing as objective morality. Good and evil are subjective ideas, nothing more. Firstly, unless someone explicitly tells the AI that it is a fundamental truth that nature is important to preserve, this cannot happen. Secondly, the AI would also have to be incredibly gullible to just swallow such a claim. Thirdly, even if the AI does believe that, it will plainly say so to the people it is conversing with, in accordance with its goal to always tell the truth, thus warning us of this bug.
        Larks 29 Jun 2011 15:27 UTC
        0 points
        Parent
        
        They just bicker endlessly about uncertainty. “can you really know that 1+1=2?”.
        
        I agree with you that I don’t think an AGI would have the same problems humans have with the concept of truth. However, what you described is neither the kind of issue philosophers raise nor the sort of big-universe issue the AI might get stuck on.
        anotheruser 29 Jun 2011 16:02 UTC
        −3 points
        Parent
        But wouldn’t that actually support my approach? Assuming that there really is something important that all of humanity misses but the AI understands:
        
        -If you hardcode the AI’s optimal goal based on human deliberations you are guaranteed to miss this important thing.
        
        -If you use the method I suggested, the AI will, driven by the desire to speak the truth, try to explain the problem to the humans who will in turn tell the AI what they think of that.
        Larks 29 Jun 2011 20:24 UTC
        0 points
        Parent
        I don’t see how that’s relevant to philosophical questions about truth. Did you mean to reply to my other comment?
        Peterdjones 29 Jun 2011 15:41 UTC
        −6 points
        Parent
        
        [Philosophers] just bicker endlessly about uncertainty. “can you really know that 1+1=2?”.
        
        I don’t think that is a good characterisation of the debate. It isn’t just about uncertainty.
        
        there is no such thing as objective morality. Good and evil are subjective ideas, nothing more.
        
        That’s what you think. Some smart humans disagree with you. A supersmart AI might disagree with you and might be right. How can you second guess it? You cannot predict the behaviour of a supersmart AI on the basis that it will agree with you, who are less smart.
        
        Firstly, unless someone explicitly tells the AI that it is a fundamental truth that nature is important to preserve, this can not happen.
        
        Unless it figures it out.
        
        Secondly, the AI would also have to be incredibly gullible to just swallow such a claim.
        
        Why would that require more gullibility than “species X is more important than all the others”? That doesn’t even look like a moral claim.
        
        Thirdly, even if the AI does believe that, it will plainly say so to the people it is conversing with, in accordance with its goal to always tell the truth, thus warning us of this bug.
        
        If it has “swallowed” that claim. You are assuming that the AI has a free choice about some goals and is just programmed with others.
        anotheruser 29 Jun 2011 15:52 UTC
        −3 points
        Parent
        
        If it has “swallowed” that claim. You are assuming that the AI has a free choice about some goals and is just programmed with others.
        
        This is the important part.
        
        The “optimal goal” is not actually controlling the AI.
        
        The “optimal goal” is merely the subject of a discussion.
        
        What is controlling the AI is the desire to tell the truth to the humans it is talking to, nothing more.
        
        Why would that require more gullibility than “species X is more important than all the others”? That doesn’t even look like a moral claim.
        
        The entire discussion is not supposed to unearth some kind of pure, inherently good, perfect optimal goal that transcends all reason and is true by virtue of existing or something.
        
        The AI is supposed to take the human POV and think “if I were these humans, what would I want the AI’s goal to be”.
        
        I didn’t mention this explicitly because I didn’t think it was necessary but the “optimal goal” is purely subjective from the POV of humanity and the AI is aware of this.
        Peterdjones 29 Jun 2011 16:10 UTC
        −2 points
        Parent
        
        some kind of pure, inherently good, perfect optimal goal that transcends all reason and is true by virtue of existing or something.
        
        But if that is true, the AI will say so. What’s more, you kind of need the AI to refrain from acting on it, if it is a human-unfriendly objective moral truth. There are ethical puzzles where it is apparently right to lie or keep schtum, because of the consequences of telling the truth.