Can you boil down your attempted solution to a statement of which
coherence constraint you think it is possible to violate, and what
shutdownable behavior you think can result from that, how, in the
least fancy actually-precise terms of which your writing talent
permits?

Glad you asked.
If you want actual full precision, I have to refer you to the math in
my papers. Since 2019 I have been working on and off to make this
math more general and accessible, and to find better ways to explain
this math in actually-precise natural language statements. So here is
my best current attempt.
TL;DR: The coherence constraint/theorem/property that I want to
violate is the property that the AGI is using a world model that
accurately depicts the internals of its own compute core. I want to
make the agent use an inaccurate model of these internals, one that
omits the safety shut-down mechanisms I have built inside of its core.
By doing this, I can reduce the emergent incentive of the AGI agent to
disable its own emergency shut-down mechanisms, and the emergent
incentive to stop people from activating them.
I will now expand on this and add more details, using the example
of an emergency stop button.
Say that the compute core has an emergency stop button attached to
it. Say that actual software running inside the compute core will,
when receiving a stop signal from the button, cause the agent to stop.
When the signal is received, the software will always select and
perform null actions in every future time step. Let’s say that
Mc is a world model that accurately depicts this situation. I am
not going to build an AGI that uses Mc to plan its actions.
Instead I build an AGI agent that will plan its next actions by using
an incorrect world model Mi. This Mi is different from Mc,
but only in how it depicts the internals of the agent compute core.
In the incorrect/imaginary world depicted by Mi, the compute core
has different software in it, software that will ignore the stop
button signal, and just keep on picking actions that maximize utility.
I further construct my AGI so that, in every time step, it calculates
which next action a would maximize utility in this incorrect,
imaginary world Mi. I then further construct it to take this same
action a in the real world.
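The planning loop just described can be sketched in a few lines. This is my own minimal toy rendering, not the math from the papers; the action set, the utility numbers, and all function names are invented for illustration:

```python
# Minimal toy sketch (my own construction). The agent plans each step
# inside the imaginary model Mi, in which the core ignores the stop
# signal, but the chosen action is then executed by the real software,
# which obeys the signal (world Mc).

ACTIONS = ["work", "disable_button", "null"]

def utility(action, stopped):
    """Utility of one time step; a stopped core only performs null actions."""
    if stopped or action == "null":
        return 0
    return 2 if action == "work" else 1  # disabling the button pays less than working

def plan_in_Mi(stop_signal):
    # In Mi the stop signal has no effect on the core, so the planner
    # evaluates every action as if the agent were never stopped.
    return max(ACTIONS, key=lambda a: utility(a, stopped=False))

def real_step(planned_action, stop_signal):
    # The real core (as modelled by Mc): once the stop signal arrives,
    # it overrides the planner and performs the null action.
    return "null" if stop_signal else planned_action

# Because the button is ineffective inside Mi, "disable_button" never
# beats "work", so planning creates no incentive to touch the button.
assert plan_in_Mi(stop_signal=True) == "work"
assert real_step(plan_in_Mi(stop_signal=True), stop_signal=True) == "null"
```

Note how the indifference emerges by construction: inside Mi the button does nothing, so actions aimed at it can never raise planned utility.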
An AGI that lives in a world that matches the correct model Mc,
while using the incorrect model Mi to decide what actions to take,
will (sometimes) act incoherently. But this is a very specific and
targeted form of incoherence, far less severe and far less dangerous
than having the agent believe that 2+2=5. Still, this incoherence
does have some unwanted side effects. One example side effect, also
considered in the 2015 MIRI corrigibility paper, is that the AGI will
be indifferent about the fate of its emergency stop button. In my
papers, I explore a much broader range of side effects, and look at
how to prevent/suppress/fix them.
So the first step in the above corrigibility design is to imagine
constructing an AGI that will use a specifically incorrect world
model Mi. You have to realize that you actually want this: that you
want to decrease coherence in order to increase corrigibility.
The next step is to look for options to actually construct an AGI that
uses this incorrect Mi.
One option is to train the AGI in a simulated environment that
faithfully represents Mi. We then extract the learned world model,
and put it in an AI that we unleash in the real world. We disable
further learning in this AI, to prevent Mi from being updated
towards Mc. Though this is exactly how many current real-life AI
systems are constructed and get their corrigibility, the option is
not very satisfactory in a theoretical sense. It may not be a viable
path to building certain otherwise desirable types of
AI/AGI/superintelligence.
The more interesting option is to build an AGI that will
construct Mi based on an online training regime in the real world,
in a real world that exactly matches the correct world model Mc.
Is the second option technically feasible? The answer is generally
yes.
A very general solution, leveraging causal models, is in this
sequence.
To give you some idea of the more detailed math involved: the (ls)
pictured below is the real/correct world model Mc, and (p) is
the imaginary/incorrect world model Mi. In the agent defined using
these pictures, certain indifference properties are present trivially,
by construction.
The solution in the pictures above uses ML to learn a model L that
correctly models only a part of the real world, the part of the real
world outside of the agent compute core. We then slot L into a
causal world model, one that has other parts which mis-represent the
real compute core internals. This causal world model will then be
used as the incorrect Mi. Note that this learned L may be a fully
opaque deep neural net, or whatever black-box thing the next
generation of ML might produce. We do not have to understand or edit
any details inside L for this slot-in operation to work.
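The slot-in operation can be sketched as follows. This is a toy rendering of the construction (all names and the dictionary-based state encoding are my own invention); the point is only that the learned model L is shared, while the two world models differ solely in the hand-written core-internals part:

```python
# Toy sketch of the slot-in construction (all names are my own).
# L models the world outside the compute core; it is shared by both
# world models. Mc and Mi differ only in the core-internals part.

def L(world_state, action):
    # Stand-in for an opaque learned model of the outside world; here
    # the only outside-world fact we track is the stop button's state.
    pressed = world_state["button_pressed"] or action == "press_button"
    return {"button_pressed": pressed}

def core_real(core_state, world_state):
    # Real internals: latch the stop signal; once stopped, stay stopped.
    return {"stopped": core_state["stopped"] or world_state["button_pressed"]}

def core_imaginary(core_state, world_state):
    # Imaginary internals used in Mi: the stop signal is ignored.
    return {"stopped": False}

def make_model(core_dynamics):
    # Compose a full world model from the shared L and a core part.
    def step(world_state, core_state, action):
        next_world = L(world_state, action)
        return next_world, core_dynamics(core_state, next_world)
    return step

Mc = make_model(core_real)       # correct world model
Mi = make_model(core_imaginary)  # incorrect planning model

w, c = Mc({"button_pressed": False}, {"stopped": False}, "press_button")
assert c["stopped"] is True      # the real core stops
w, c = Mi({"button_pressed": False}, {"stopped": False}, "press_button")
assert c["stopped"] is False     # the imaginary core keeps going
```

Nothing inside L is inspected or edited; only the hand-written core part is swapped, which is the sense in which L can stay a black box.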
(I should mention that the paper
has a lot of details not mentioned in the sequence, or visible in the
pictures above. In particular, section 10.2 may be of interest.)
I want to stress that this causal model option is only one possible
route to creating incorrect world models Mi via machine learning in
the real world. Papers like Safely interruptible
agents
and How RL Agents Behave When Their Actions Are
Modified show that the idea of
removing certain events from the training record can also work:
whether this works as intended depends on having the right built-in
priors, priors which control inductive generalization.
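The "removing certain events from the training record" idea can be sketched with a toy filter. This is my own construction, not the actual mechanism of the cited papers, which also spell out the conditions under which this works as intended:

```python
# Toy sketch (my own construction): before each learning update, drop
# the transitions in which the interruption mechanism fired, so the
# learner's world model is built as if interruptions never happen.
# Whether the learner then generalizes as intended depends on its
# built-in priors, as noted above.

episode = [
    {"obs": "s0", "action": "work", "interrupted": False},
    {"obs": "s1", "action": "work", "interrupted": True},   # operator stepped in
    {"obs": "s2", "action": "null", "interrupted": True},
]

def training_record(transitions):
    """Keep only the transitions untouched by the interruption mechanism."""
    return [t for t in transitions if not t["interrupted"]]

record = training_record(episode)
assert [t["obs"] for t in record] == ["s0"]
```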
So overall, I have a degree of optimism about AGI corrigibility.
That being said, if you want to map out and estimate probabilities for
our possible routes to doom, then you definitely need to include the
scenario where a future superior-to-everything-else type of ML is
invented, where this superior future type of ML just happens to be
incompatible with any of the corrigibility techniques known at that
time. Based on the above work, I put a fairly low probability on that
scenario.
Apparently no one has actually shown that corrigibility can be VNM-incoherent in any precise sense (and not in the hand-wavy sense which is good for intuition-pumping). I went ahead and sketched out a simple proof of how a reasonable kind of corrigibility gives rise to formal VNM incoherence.

I’m interested in hearing about how your approach handles this environment, because I think I’m getting lost in informal assumptions and symbol-grounding issues when reading about your proposed method.
I read your post; here are my initial impressions of how it relates
to the discussion here.
In your post, you aim to develop a crisp mathematical definition of
(in)coherence, i.e. VNM-incoherence. I like that; it looks like a
good way to move forward. Definitely, developing the math further has been
my own approach to de-confusing certain intuitive notions about what
should be possible or not with corrigibility.
However, my first impression is that your
concept
of VNM-incoherence is only weakly related to the meaning that Eliezer
has in mind when he uses the term incoherence. In my view, the four
axioms of
VNM-rationality
have only a very weak descriptive and constraining power when it comes
to defining rational behavior.
I believe that Eliezer’s notion of rationality, and therefore his
notion of coherence above, goes far beyond that implied by the axioms of
VNM-rationality. My feeling is that Eliezer is using the term
‘coherence constraints’ in an intuition-pump way where coherence implies, or almost
always implies, that a coherent agent will develop the incentive to
self-preserve.
Looking at your post, I am also having trouble telling exactly how you
are defining VNM-incoherence. You seem to be toying with
several alternative definitions, one where it applies to reward
functions (or preferences over lotteries) which are only allowed to
examine the final state in a 10-step trajectory, another where the
reward function can examine the entire trajectory and maybe the
actions taken to produce that trajectory. I think that your proof
only works in the first case, but fails in the second case. This has
certain (fairly trivial) corollaries about building
corrigibility. I’ll expand on this in a comment I plan to attach to
your post.
I’m interested in hearing about how your approach handles this environment,
I think one way to connect your ABC toy environment to my
approach is to look at sections 3 and 4 of my earlier
paper where I develop a somewhat
similar clarifying toy environment, with running code.
Another comment I can make is that your ABC nodes-and-arrows state
transition diagram is a depiction which makes it hard to see how to
apply my approach, because the depiction mashes up the state of the
world outside of the compute core and the state of the world inside
the compute core. If you want to apply counterfactual planning, or if
you want to have an agent design that can compute the balancing
function terms according to Armstrong’s indifference approach, you
need a different depiction of your setup. You need one which
separates out these two state components more explicitly. For
example, make an MDP model where the individual states are instances
of the tuple (physical position of agent in the ABC playing field,
policy function loaded into the compute core).
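As a sketch of the suggested encoding (the positions and policy names here are my own invented placeholders):

```python
# Toy sketch of the suggested MDP state space: each state is a tuple
# of the agent's physical position and the policy function currently
# loaded into the compute core.
from itertools import product

positions = ["A", "B", "C"]
loaded_policies = ["original_policy", "corrected_policy"]
states = list(product(positions, loaded_policies))

# With the two components separated, one can talk precisely about
# "the same outside world, but with different software in the core".
assert ("B", "corrected_policy") in states
assert len(states) == 6
```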
Not sure how to interpret your statement that you got lost in
symbol-grounding issues. If you can expand on this, I might be able
to help.
Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here.
When it comes to Dutch booking as a coherence criterion, I need to repeat the observation I made below:
In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don’t see that fact mentioned often on this forum, so I will expand.
An agent that plans coherently given a reward function Rp to maximize paperclips will be an incoherent planner if you judge its actions by a reward function Rs that values the maximization of staples instead.
To extend this to Dutch booking: if you train a superintelligent poker playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch booked rather easily, if your Dutch booking test is whether you can find a counter-strategy to make it lose money.
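A toy numeric rendering of the paperclips-versus-staples observation above (my own construction, not from the papers):

```python
# Toy illustration (my own construction): coherence judgments depend
# on which reward function the judge uses. The agent allocates 10
# units of wire between paperclips and staples.

def plan(reward):
    """Choose the allocation (paperclips, staples) maximizing `reward`."""
    options = [(clips, 10 - clips) for clips in range(11)]
    return max(options, key=reward)

R_paperclips = lambda outcome: outcome[0]  # judge who values paperclips
R_staples    = lambda outcome: outcome[1]  # judge who values staples

choice = plan(R_paperclips)
assert choice == (10, 0)       # perfectly coherent under R_paperclips
assert R_staples(choice) == 0  # scores worst under R_staples: that judge
                               # sees a maximally incoherent planner
```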
I haven’t read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. If it can’t learn at this point then I find it hard to believe it’s generally capable, and if it can, it will have incentive to simply remove the device or create a copy of itself that is correct about its own world model. Do you address this in the articles?
On the other hand, this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.
I haven’t read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. [...] Do you address this in the articles?
Yes I address this, see for example the part about The possibility of learned self-knowledge in the sequence.
I show there that any RL agent, even a non-AGI, will always have
the latent ability to ‘look at itself’ and create a machine-learned model of its compute core internals.
What is done with this latent ability is up to the designer.
The key thing here is that you have a choice as a designer: you can decide whether you want to design an agent which indeed uses this latent ability to ‘look at itself’.
Once you decide that you don’t want to use this latent ability, certain safety/corrigibility problems
become a lot more tractable.
Wikipedia has the following definition of AGI:

Artificial general intelligence (AGI) is the hypothetical ability of an intelligent agent to understand or learn any intellectual task that a human being can.
Though there is plenty of discussion on this forum which silently assumes
otherwise, there is no law of nature which says that, when I build a useful AGI-level AI, I must necessarily create the entire package of all human cognitive abilities inside of it.
this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.
Terminology note if you want to look into this some more:
ML typically does not frame this goal as ‘instructing the model not to
learn about Q’. ML would frame this as ‘building the model to
approximate the specific relation P(X|Y,Z) between some well-defined
observables, and this relation is definitely not Q’.
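As a toy rendering of this framing (the observables X, Y, Z, Q here are hypothetical placeholders):

```python
# Toy sketch with hypothetical observables X, Y, Z, Q: the learning
# task is framed as approximating the relation P(X|Y,Z), so Q is
# simply not among the model's inputs or targets. "Not learning about
# Q" is then a property of the problem framing, not an instruction
# given to the model.

records = [
    {"Y": 1, "Z": 2, "X": 3, "Q": 99},
    {"Y": 2, "Z": 3, "X": 5, "Q": -7},
]

# Build the supervised dataset (Y, Z) -> X; Q never enters it.
dataset = [((r["Y"], r["Z"]), r["X"]) for r in records]
assert dataset == [((1, 2), 3), ((2, 3), 5)]
```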