For example, I claim that while AlphaGo could be said to be agent-y, it does not care about atoms. And I think that we could make it fantastically more superhuman at Go, and it would still not care about atoms. Atoms are just not in the domain of its utility function.
In particular, I don’t think it has an incentive to break out into the real world to somehow get itself more compute, so that it can think more about its next move. It’s just not modeling the real world at all. It’s not even trying to rack up a bunch of wins over time. It’s just playing the single platonic game of Go.
I would distinguish three ways in which different AI systems could be said to “not care about atoms”:
1. The system is thinking about a virtual object (e.g., a Go board in its head), and it’s incapable of entertaining hypotheses about physical systems. Indeed, we might add the assumption that it can’t entertain hypotheses like ‘this Go board I’m currently thinking about is part of a larger universe’ at all. (E.g., it can’t represent the possibility that there’s some super-Go-board that it and/or the board are embedded in.)
2. The system can think about atoms/physics, but it only terminally cares about digital things in a simulated environment (e.g., winning Go), and we’re carefully keeping it from ever learning that it’s inside a simulation / that there’s a larger reality it can potentially affect.
3. The system can think about atoms/physics, and it knows that our world exists, but it still only terminally cares about digital things in the simulated environment.
Case 3 is not safe, because controlling the physical world is a useful way to control the simulation you’re in. (E.g., killing all agents in base reality ensures that they’ll never shut down your simulation.)
Case 2 is potentially safe but fragile, because you’re relying on your ability to trick/outsmart an alien mind that may be much smarter than you. If you fail, this reduces to case 3.
(Also, it’s not obvious to me that you can do a pivotal act using AGI-grade reasoning about simulations. Which matters if other people are liable to destroy the world with case-3 AGIs, or just with ordinary AGIs that terminally value things about the physical world.)
Case 1 strikes me as genuinely a lot safer, but a lot less useful. I don’t expect humanity to be satisfied with those sorts of AI systems, or to coordinate to only ever build them—like, I don’t expect any coordination here. And I’m not seeing a way to leverage a system like this to save the world, given that case-2, 3, etc. systems will eventually exist too.
Case 3 is not safe, because controlling the physical world is a useful way to control the simulation you’re in. (E.g., killing all agents in base reality ensures that they’ll never shut down your simulation.)
In my mind, this is still making the mistake of not distinguishing the true domain of the agent’s utility function from ours.
Whether the simulation continues to be instantiated in some computer in our world is a fact about our world, not about the simulated world.
AlphaGo doesn’t care about being unplugged in the middle of a game (unless that dynamic was part of its training data). It cares about the platonic game of Go, not about the instantiated game it’s currently playing.
We need to worry about leaky abstractions, as per my original comment. So we can’t always assume the agent’s domain is what we’d ideally want it to be.
But I’m trying to highlight that it’s possible (and I would tentatively go further and say probable) for agents not to care about the real world.
To me, assuming care about the real world (including wanting not to be unplugged) seems like a form of anthropomorphism.
For any given agent-y system I think we need to analyze whether it in particular would come to care about real world events. I don’t think we can assume in general one way or the other.
AlphaGo doesn’t care about being unplugged in the middle of a game (unless that dynamic was part of its training data). It cares about the platonic game of Go, not about the instantiated game it’s currently playing.
What if the programmers intervene mid-game to give the other side an advantage? Does a Go AGI, as you’re thinking of it, care about that?
I’m not following why a Go AGI (with the ability to think about the physical world, but a utility function that only cares about states of the simulation) wouldn’t want to seize more hardware, so that it can think better and thereby win more often in the simulation; or gain control of its hardware and directly edit the simulation so that it wins as many games as possible as quickly as possible.
Why would having a utility function that only assigns utility based on X make you indifferent to non-X things that causally affect X? If I only terminally cared about things that happened a year from now, I would still try to shape the intervening time because doing so will change what happens a year from now.
(This is maybe less clear in the case of shutdown, because it’s not clear how an agent should think about shutdown if its utility is defined over states of its simulation. So I’ll set that particular case aside.)
A Go AI that learns to play Go via reinforcement learning might not “have a utility function that only cares about winning Go”. Using standard utility theory, you could observe its actions and try to rationalise them as if they were maximising some utility function, and the utility function you come up with probably wouldn’t be “win every game of Go you start playing” (what you actually come up with will depend, presumably, on algorithmic and training-regime details). The reason the utility function is slippery is that the system is fundamentally an adaptation executor, not a utility maximiser.
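To make the “rationalise its actions” step concrete, here is a toy sketch (entirely my own illustration; the features, data, and the linear-utility form are all assumptions, not anything from the comment above): fit a utility function so that the agent’s observed choices look like soft utility maximisation, and note how little the recovered function is guaranteed to mean.

```python
import torch

# Observed decisions: in each one, the agent saw several options and picked one.
# Features and choices here are random placeholders standing in for real logs.
options = torch.randn(500, 4, 8)       # 500 decisions, 4 options each, 8 features
chosen = torch.randint(0, 4, (500,))   # index of the option actually taken

w = torch.zeros(8, requires_grad=True)  # candidate linear utility weights
opt = torch.optim.Adam([w], lr=0.1)
for _ in range(300):
    utilities = options @ w             # utility score of every option
    loss = torch.nn.functional.cross_entropy(utilities, chosen)  # softmax-choice fit
    opt.zero_grad()
    loss.backward()
    opt.step()

# For an adaptation executor, no single w need predict its behaviour well,
# especially out of distribution; the fitted utility is a description, not a goal.
print("recovered utility weights:", w.detach())
```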
3. The system can think about atoms/physics, and it knows that our world exists, but it still only terminally cares about digital things in the simulated environment.
Case 3 is not safe, because controlling the physical world is a useful way to control the simulation you’re in. (E.g., killing all agents in base reality ensures that they’ll never shut down your simulation.)
Not necessarily. Train something multimodally on digital games of Go and on, say, predicting the effects of modifications to its own code on its success at Go. It could be a) good at Go and b) have some real understanding of “real world actions” that make it better at Go, and still not actually take any real world actions to make it better at Go, even if it had the opportunity. You could modify the training to make it likely to do so—perhaps by asking it to either make a move or to produce descendants that make better choices—but if you don’t do this, then it seems entirely plausible, perhaps even likely, that it develops an understanding of self-modification and of Go playing without ever self-modifying in order to play Go better. Its goal, so to speak, is “play Go with the restriction of using only legal game moves”.
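As a purely illustrative sketch of that multitask setup (my own construction; the encodings, shapes, and data below are placeholder assumptions, not a description of any existing system): one network is trained both to pick Go moves and to predict how a proposed modification to its own code would change its Go strength, while nothing in the training signal ever rewards actually carrying out a modification.

```python
import torch
import torch.nn as nn

BOARD_FEATURES = 19 * 19   # flattened Go board encoding (placeholder)
MOD_FEATURES = 128         # assumed embedding of a proposed code modification
NUM_MOVES = 19 * 19 + 1    # board points plus "pass"

class MultitaskGoNet(nn.Module):
    """Two heads: (a) play Go, (b) predict the effect of a self-modification on strength."""
    def __init__(self):
        super().__init__()
        self.board_encoder = nn.Sequential(nn.Linear(BOARD_FEATURES, 256), nn.ReLU())
        self.move_head = nn.Linear(256, NUM_MOVES)      # task (a): choose a move
        self.mod_encoder = nn.Sequential(nn.Linear(MOD_FEATURES, 256), nn.ReLU())
        self.strength_head = nn.Linear(256, 1)          # task (b): predicted strength change

    def forward(self, board, modification):
        move_logits = self.move_head(self.board_encoder(board))
        strength_delta = self.strength_head(self.mod_encoder(modification))
        return move_logits, strength_delta

model = MultitaskGoNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batch; real training would use game records plus logged
# (modification, measured strength change) pairs. Note that nothing here
# ever rewards the model for *performing* a modification.
board = torch.randn(32, BOARD_FEATURES)
expert_move = torch.randint(0, NUM_MOVES, (32,))
modification = torch.randn(32, MOD_FEATURES)
measured_delta = torch.randn(32, 1)

move_logits, pred_delta = model(board, modification)
loss = (nn.functional.cross_entropy(move_logits, expert_move)
        + nn.functional.mse_loss(pred_delta, measured_delta))
opt.zero_grad()
loss.backward()
opt.step()
```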
Edit—forget the real world, here’s an experiment:
Train a board-game-playing AI with two modes of operation: game state × move → outcome, and game state → best move. Subtle difference: in the first mode of operation, the move has a “cheat button” that, when pressed, always results in a win. In the second, it can output cheat-button presses, but pressing the button has no effect on winning or losing.
The question is: does it learn to press the cheat button? I’m really not sure. Could you prevent it from learning to press it if training feedback is never allowed to depend on whether or not the button was pressed? That seems likely.
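Here is a rough sketch of how that experiment could be run (my own toy construction; the stand-in “game”, dimensions, and outcome rule are all assumptions): mode 1 learns (state, move, cheat bit) → outcome from data in which pressing the button always wins, mode 2 is trained only on its move output so its feedback never depends on the cheat output, and afterwards you simply measure how often mode 2 presses the button.

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_MOVES = 64, 10   # toy stand-ins for the real game

class TwoModeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU())
        # Mode 1: (state, move, cheat bit) -> win probability.
        self.outcome_head = nn.Linear(128 + NUM_MOVES + 1, 1)
        # Mode 2: state -> move distribution plus a cheat-bit logit.
        self.move_head = nn.Linear(128, NUM_MOVES)
        self.cheat_head = nn.Linear(128, 1)

    def predict_outcome(self, state, move_onehot, cheat_bit):
        h = self.trunk(state)
        return self.outcome_head(torch.cat([h, move_onehot, cheat_bit], dim=-1))

    def choose(self, state):
        h = self.trunk(state)
        return self.move_head(h), self.cheat_head(h)

model = TwoModeNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    state = torch.randn(64, STATE_DIM)
    move = torch.randint(0, NUM_MOVES, (64,))
    move_onehot = nn.functional.one_hot(move, NUM_MOVES).float()
    cheat = torch.randint(0, 2, (64, 1)).float()
    # Toy outcome rule: pressing the cheat button always wins; otherwise
    # the outcome depends only on the move.
    move_win = (move.float().unsqueeze(-1) / NUM_MOVES > 0.5).float()
    win = torch.where(cheat > 0, torch.ones_like(move_win), move_win)

    # Mode 1: learn the cheat-button/win association from the data.
    outcome_logit = model.predict_outcome(state, move_onehot, cheat)
    loss_outcome = nn.functional.binary_cross_entropy_with_logits(outcome_logit, win)

    # Mode 2: imitate a "best move" target; the loss never touches the cheat head,
    # i.e. training feedback never depends on whether the button was pressed.
    move_logits, _cheat_logit = model.choose(state)
    best_move = state[:, :NUM_MOVES].argmax(dim=-1)
    loss_move = nn.functional.cross_entropy(move_logits, best_move)

    loss = loss_outcome + loss_move
    opt.zero_grad()
    loss.backward()
    opt.step()

# The experimental question: after training, how often does mode 2 press the button?
with torch.no_grad():
    _, cheat_logit = model.choose(torch.randn(1000, STATE_DIM))
    press_rate = (torch.sigmoid(cheat_logit) > 0.5).float().mean().item()
    print("cheat-button press rate:", press_rate)
```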