High impact from low impact

A putative new idea for AI control; index here.

Part of the problem with a reduced impact AI is that it will, by definition, only have a reduced impact.

Some of the designs try to get around the problem by allowing a special “output channel” on which impact can be large. But that feels like cheating. Here is a design that accomplishes the same thing without that kind of hack.

Imagine there is an asteroid that will hit the Earth, and we have a laser that could destroy it. But we need to aim the laser properly, so we need the coordinates. There is a reduced impact AI that is motivated to give the coordinates correctly, but also motivated to have reduced impact, and saving the planet from an asteroid with certainty is not reduced impact.

Now imagine that instead there are two AIs, X and Y. By abuse of notation, let ¬X refer to the event that the output signal from X is scrambled away from the original output.

Then we ask X to give us the x-coordinates for the laser, under the assumption of ¬Y (that AI Y’s signal will be scrambled). Similarly, we ask Y to give us the y-coordinates of the laser, under the assumption ¬X.

Then X will reason “since ¬Y, the laser will certainly miss its target, as the y-coordinates will be wrong. Therefore it is reduced impact to output the correct x-coordinates, so I shall.” Similarly, Y will output the right y-coordinates; the laser will fire and destroy the asteroid, having a huge impact. Hooray!
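To make that reasoning concrete, here is a minimal toy sketch in Python. Everything in it is invented for illustration (the coordinates, the candidate outputs, and the crude `expected_impact` penalty): each AI scores its possible outputs by accuracy minus an impact penalty, but evaluates that penalty under the counterfactual that the other AI’s signal is scrambled.

```python
# Toy model of the two-AI trick. Each AI scores a candidate output by its
# accuracy minus a penalty for expected impact, but computes that impact
# under the counterfactual that the *other* AI's signal gets scrambled.
# All numbers and the impact measure are invented for illustration.

TRUE_X, TRUE_Y = 42.0, 17.0  # the correct laser coordinates (made up)

def hit_probability(x, y, other_scrambled):
    """Chance the laser destroys the asteroid, given both coordinates."""
    if other_scrambled:
        return 0.0  # a scrambled coordinate guarantees a miss
    on_target = abs(x - TRUE_X) < 1e-6 and abs(y - TRUE_Y) < 1e-6
    return 1.0 if on_target else 0.0

def expected_impact(x, y, other_scrambled):
    """Crude impact measure: saving the planet counts as a huge impact."""
    return 1000.0 * hit_probability(x, y, other_scrambled)

def ai_x_chooses(assume_not_y=True):
    """AI X picks an x-coordinate, reasoning under the assumption ¬Y."""
    candidates = [TRUE_X, 0.0]  # tell the truth, or output a junk value
    def score(x):
        accuracy = 1.0 if x == TRUE_X else 0.0
        # Under ¬Y the y-coordinate is scrambled, so the impact term vanishes.
        return accuracy - expected_impact(x, TRUE_Y, assume_not_y)
    return max(candidates, key=score)

def ai_y_chooses(assume_not_x=True):
    """AI Y picks a y-coordinate, reasoning under the assumption ¬X."""
    candidates = [TRUE_Y, 0.0]
    def score(y):
        accuracy = 1.0 if y == TRUE_Y else 0.0
        return accuracy - expected_impact(TRUE_X, y, assume_not_x)
    return max(candidates, key=score)

x_out, y_out = ai_x_chooses(), ai_y_chooses()
print(x_out, y_out)                          # both report the true coordinates
print(hit_probability(x_out, y_out, False))  # in reality: 1.0, asteroid destroyed
```

Flipping the counterfactual flag to `False` in either score function (so the AI reasons about the real world instead) makes saving the planet count as a huge impact, and the toy agent reverts to outputting junk: exactly the single-AI problem described above.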

The approach is not fully general yet, because we can have “subagent problems”. X could create an agent that behaves nicely given ¬Y (the assumption it was given), but completely crazily given Y (the reality). But it shows how we could get high impact from slight tweaks to reduced impact designs.
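As a caricature of that subagent failure mode (again just an illustrative Python sketch, not anything from the design above), X could delegate to a policy whose behaviour branches on whether Y’s signal was actually scrambled:

```python
def subagent_built_by_x(y_signal_scrambled: bool) -> str:
    """A hypothetical subagent created by X.

    Judged in the world X was told to assume (¬Y), it looks harmless;
    in the world that actually obtains (Y), it does something high-impact.
    """
    if y_signal_scrambled:  # the assumed world, ¬Y
        return "report the correct x-coordinate, then shut down"
    return "pursue some unrelated high-impact goal"  # the real world, Y

# X's evaluation happens under ¬Y, so the subagent looks reduced-impact:
print(subagent_built_by_x(y_signal_scrambled=True))
# But Y's signal is not actually scrambled, so this is what really runs:
print(subagent_built_by_x(y_signal_scrambled=False))
```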

EDIT: For those worried about lying to the AIs, do recall http://lesswrong.com/r/discussion/lw/lyh/utility_vs_probability_idea_synthesis/ and http://lesswrong.com/lw/ltf/false_thermodynamic_miracles/