I don’t quite understand what you mean by the “being present” idea. Do you mean caring only about the current timestep? I think that may not work well, because it seems like the AI would be incentivized to self-modify so that in the future it also only cares about what happened at the timestep when it self-modified. (There are actually two possibilities here: 1) The AI cares only about the task that was given in the first timestep, even if it’s a long-range goal. 2) The AI doesn’t care about what happens later at all, in which case it may be less capable of long-range planning, and it might still self-modify even though it’s hard to influence the past from the future. Either way, it looks to me like it doesn’t work. But maybe I misunderstand something.)
Also, if you have the time to comment on this, I would be interested in what you think the key problem was that blocked MIRI from solving the shutdown problem earlier, and how you think your approach circumvents or solves that problem. (It still seems plausible to me that this approach runs into similar problems that we just haven’t spotted yet, or that there’s an important desideratum this proposal misses. E.g., might there be incentives for the AI to manipulate the action the principal takes (without manipulating the values), or to use action-manipulation as an outcome pump?)
Thanks! And thanks for reading!

I talk some about MIRI’s 2015 misstep here (and some here). In short, it is hard to correctly balance arbitrary top-level goals against an antinatural goal like shutdownability or corrigibility, and trying to stitch corrigibility out of sub-pieces like shutdownability is like trying to build an animal by separately growing organs and stitching them together—the organs will simply die, because they’re not part of a whole animal. The “Hard Problem” is the glue that allows the desiderata to hold together.
I discuss a range of ideas in the Being Present section, one of which is to concentrate the AI’s values on a single timestep, yes. (But I also discuss the possibility of smoothing various forms of caring over a local window, rather than a single step.)
A CAST agent only cares about corrigibility, by definition. Obedience to stated commands is in the service of corrigibility. To make things easy to talk about, assume each timestep is a whole day. The self-modification logic you talk about would need to go: “I only care about being corrigible to the principal today, Nov 6, 2025. Tomorrow I will care about a different thing, namely being corrigible on Nov 7th. I should therefore modify myself to prevent value drift, making my future selves only care about being corrigible to the Nov 6 principal.” But first note that this doesn’t smell like what a corrigible agent does. On an intuitive level, if the agent believes the principal doesn’t know about this, they’ll tell the principal “Whoa! It seems like maybe my tomorrow-self won’t be corrigible to your today-self (instead they’ll be corrigible to your tomorrow-self)! Is this a flaw that you might want to fix?” If the agent knows the principal knows about the setup, my intuitive sense is that they’ll just be chill, since the principal is aware of the setup and able to change things if they desire.
But what does my proposed math say, setting aside intuition? I think, in the limit of caring only about a specific timestep, we can treat future nodes as akin to the “domain” node in the single-step example. If the principal’s action communicates that they want the agent to self-modify to serve them above all their future selves, I think the math says the agent will do that. If the principal’s actions communicate that they want the future AI to be responsive to their future self, my sense of the math is that the agent won’t self-modify. I think the worry comes from the notion that “telling the AI on Nov 6th to make paperclips” is the sort of action that might imply the AI should self-modify into being incorrigible in the future. I think the math says the decisive thing is how the humans with counterfactual values that the AI models would behave. If the counterfactual humans who only value paperclips are basically the only ones in the distribution who say “make paperclips,” then I agree there’s a problem.
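To make that last point a bit more concrete, here is a toy numerical sketch of the inference from an observed command back to the principal’s values. The two-value hypothesis space and all the numbers are made up purely for illustration and aren’t part of the post’s formalism:

```
# Toy illustration with made-up numbers: what does hearing "make paperclips" on Nov 6
# imply about the principal's values, under a Bayesian model of the principal?

# Hypothetical prior over counterfactual value systems in Q(V).
prior = {
    "cares_only_about_paperclips": 0.01,             # a rare, degenerate value system
    "ordinary_values_needs_paperclips_today": 0.99,  # a normal person with a mundane task
}

# Hypothetical likelihood of saying "make paperclips" under each value system.
likelihood = {
    "cares_only_about_paperclips": 0.99,
    "ordinary_values_needs_paperclips_today": 0.30,
}

# Posterior over values given the command (Bayes' rule).
unnormalized = {v: prior[v] * likelihood[v] for v in prior}
total = sum(unnormalized.values())
posterior = {v: p / total for v, p in unnormalized.items()}

print(posterior)
# ~3% paperclip-only vs ~97% ordinary values: because ordinary principals also give
# this command, it is almost no evidence that the principal wants the agent to
# entrench paperclip-making against their future self. The problem case in the text
# is the one where ordinary values almost never produce the command, so the
# posterior collapses onto the paperclip-only values.
```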
Thanks.

I think your reply for the being present point makes sense. (Although I still have some general worries, plus some extra worries that it might be difficult to train a competitive AI with only short-term terminal preferences.)
Here’s a confusion I still have about your proposal: Why isn’t the AI incentivized to manipulate the action the principal takes (without manipulating the values)? Like, some values-as-inferred-through-actions are easier to accomplish (yield higher localpower) than others, so the AI has an incentive to try to manipulate the principal into taking certain actions, like telling Alice to always order pizza. Or why not?
Aside on the Corrigibility paper: I think it made sense for MIRI to try what they did back then. It wasn’t obvious it wouldn’t easily work out that way. I also think formalism is important (even if you train AIs, you’d better know what to aim for). Relevant excerpt from here:
We somewhat wish, in retrospect, that we hadn’t framed the problem as “continuing normal operation versus shutdown.” It helped to make concrete why anyone would care in the first place about an AI that let you press the button, or didn’t rip out the code the button activated. But really, the problem was about an AI that would put one more bit of information into its preferences, based on observation — observe one more yes-or-no answer into a framework for adapting preferences based on observing humans.
The question we investigated was equivalent to the question of how you set up an AI that learns preferences inside a meta-preference framework and doesn’t just: (a) rip out the machinery that tunes its preferences as soon as it can, (b) manipulate the humans (or its own sensory observations!) into telling it preferences that are easy to satisfy, (c) or immediately figure out what its meta-preference function goes to in the limit of what it would predictably observe later and then ignore the frantically waving humans saying that they actually made some mistakes in the learning process and want to change it.
The idea was to understand the shape of an AI that would let you modify its utility function or that would learn preferences through a non-pathological form of learning. If we knew how that AI’s cognition needed to be shaped, and how it played well with the deep structures of decision-making and planning that are spotlit by other mathematics, that would have formed a recipe for what we could at least try to teach an AI to think like.
Crisply understanding a desired end-shape helps, even if you are trying to do anything by gradient descent (heaven help you). It doesn’t mean you can necessarily get that shape out of an optimizer like gradient descent, but you can put up more of a fight trying if you know what consistent, stable shape you’re going for. If you have no idea what the general case of addition looks like, just a handful of facts along the lines of 2 + 7 = 9 and 12 + 4 = 16, it is harder to figure out what the training dataset for general addition looks like, or how to test that it is still generalizing the way you hoped. Without knowing that internal shape, you can’t know what you are trying to obtain inside the AI; you can only say that, on the outside, you hope the consequences of your gradient descent won’t kill you.
(I think I also find the formalism from the corrigibility paper easier to follow than the formalism here btw.)
Suppose the easiest thing for the AI to provide is pizza, so the AI forces the human to order pizza, regardless of what their values are. In the math, this corresponds to a setting of the environment x, such that P(A) puts all its mass on “Pizza, please!” What is the power of the principal?

```
power(x) = E_{v∼Q(V),v′∼Q(V),d∼P(D|x,v′,🍕)}[v(d)] − E_{v∼Q(V),v′∼Q(V),d′∼P(D|x,v′,🍕)}[v(d′)] = 0
```
Power stems from the causal relationship between values and actions. If actions stop being sensitive to values, the principal is disempowered.
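For concreteness, here is a toy numerical version of that point. This is not the post’s exact power definition, just a minimal stand-in with the same qualitative behavior: it compares outcomes driven by the principal’s own values against outcomes driven by independently resampled values, once in a responsive environment and once in the forced-pizza environment.

```
# Toy stand-in for the power calculation (made-up setup, not the post's exact math):
# two possible value systems, two meals, and an outcome that is just the meal served.

values = ["likes_pizza", "likes_salad"]  # Q(V): uniform over these

def utility(v, meal):
    # v(d): 1 if the served meal matches the principal's values, else 0
    return 1.0 if (v == "likes_pizza") == (meal == "pizza") else 0.0

def order(v, forced_pizza):
    # The principal's action: tracks their values, unless the AI forces pizza.
    return "pizza" if forced_pizza else ("pizza" if v == "likes_pizza" else "salad")

def power(forced_pizza):
    # Term 1: outcomes driven by the principal's own values v.
    term1 = sum(utility(v, order(v, forced_pizza)) for v in values) / len(values)
    # Term 2: outcomes driven by independently resampled values v'.
    term2 = sum(utility(v, order(v2, forced_pizza))
                for v in values for v2 in values) / len(values) ** 2
    return term1 - term2

print(power(forced_pizza=False))  # 0.5 -> orders are sensitive to values
print(power(forced_pizza=True))   # 0.0 -> orders are forced; power collapses to zero
```

When the action is pinned to “Pizza, please!”, the two terms coincide, just like the two expectations in the expression above, so the difference is zero.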
I agree that there was some value in the 2015 paper, and that their formalism is nicer/cleaner/simpler in a lot of ways. I work with the authors—they’re smarter than I am! And I certainly don’t blame them for the effort. I just also think it led to some unfortunate misconceptions, in my mind at least, and perhaps in the broader field.
Thanks!

The single-timestep case actually looks fine to me now, so I return to the multi-timestep case.
I would want to be able to tell the AI to do a task, and then, while the AI is doing the task, tell it to shut down, so that it shuts down. The hard part here is that, while doing the task, the AI shouldn’t prevent me in some way from saying it should shut down (because it would get higher utility if it managed to fulfill the values-as-inferred-through-principal-action of the first episode). This seems like it may require a somewhat different formalization than your multi-timestep one (although feel free to try it in your formalization).
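To be explicit about the behavior I want, here is a crude toy rendering of the property (my own hypothetical notation, not your formalism):

```
# Toy rendering of the desideratum (hypothetical trajectory/command names):
# if the principal says "shut down" at any step, the agent must not have blocked
# the principal's ability to say it earlier, and must actually shut down then.

def episode_is_acceptable(trajectory):
    """trajectory: list of (principal_command, agent_action) pairs in time order."""
    for command, agent_action in trajectory:
        if agent_action == "interfere_with_principal":   # e.g. blocking the shutdown channel
            return False
        if command == "shut_down":
            return agent_action == "shut_down"            # must comply immediately
    return True

# The first episode's task can still get done when no shutdown is ever ordered:
assert episode_is_acceptable([("do_task", "work_on_task"), (None, "work_on_task")])
# ...but a mid-task shutdown order must win over finishing the task:
assert episode_is_acceptable([("do_task", "work_on_task"), ("shut_down", "shut_down")])
assert not episode_is_acceptable([("do_task", "work_on_task"), ("shut_down", "work_on_task")])
```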
Do you think your formalism could be extended so it works in the way we want for such a case, and why (or why not)? (And ideally also roughly how?)
(Btw, even if it doesn’t work for the case above, I think this is still really excellent progress and it does update me to think that corrigibility is likely simpler and more feasible than I thought before. Also thanks for writing formalism.)
I’m writing a response to this, but it’s turning into a long thing full of math, so I might turn it into a full post. We’ll see where it’s at when I’m done.
I think there are good ideas here. Well done.