Non-Obstruction: A Simple Concept Motivating Corrigibility

Thanks to Mathias Bonde, Tiffany Cai, Ryan Carey, Michael Cohen, Joe Collman, Andrew Critch, Abram Demski, Michael Dennis, Thomas Gilbert, Matthew Graves, Koen Holtman, Evan Hubinger, Victoria Krakovna, Amanda Ngo, Rohin Shah, Adam Shimi, Logan Smith, and Mark Xu for their thoughts.

Main claim: corrigibility’s benefits can be mathematically represented as a counterfactual form of alignment.

Overview: I’m going to talk about a unified mathematical frame I have for understanding corrigibility’s benefits, what it “is”, and what it isn’t. This frame is made precise by graphing the human overseer’s ability to achieve various goals (their attainable utility (AU) landscape). I argue that corrigibility’s benefits are secretly a form of counterfactual alignment (alignment with a set of goals the human may want to pursue).

A counterfactually aligned agent doesn’t have to let us literally correct it. Rather, this frame theoretically motivates why we might want corrigibility anyway. This frame also motivates other AI alignment subproblems, such as intent alignment, mild optimization, and low impact.

Nomenclature

Corrigibility is associated with a lot of concepts: “not incentivized to stop us from shutting it off”, “wants to account for its own flaws”, “doesn’t take away much power from us”, etc. Named by Robert Miles, the word ‘corrigibility’ means “able to be corrected [by humans].” I’m going to argue that these are correlates of a key thing we plausibly actually want from the agent design, which seems conceptually simple.

In this post, I take the following common-language definitions:

  • Corrigibility: the AI literally lets us correct it (modify its policy), and it doesn’t manipulate us either.

    • Without both of these conditions, the AI’s behavior isn’t sufficiently constrained for the concept to be useful. Being able to correct it is small comfort if it manipulates us into making the modifications it wants. An AI which is only non-manipulative doesn’t have to give us the chance to correct it or shut it down.

  • Impact alignment: the AI’s actual impact is aligned with what we want. Deploying the AI actually makes good things happen.

  • Intent alignment: the AI makes an honest effort to figure out what we want and to make good things happen.

I think that these definitions follow what their words mean, and that the alignment community should use these (or other clear groundings) in general. Two of the more important concepts in the field (alignment and corrigibility) shouldn’t have ambiguous and varied meanings. If the above definitions are unsatisfactory, I think we should settle upon better ones as soon as possible. If that would be premature due to confusion about the alignment problem, we should define as much as we can now and explicitly note what we’re still confused about.

We certainly shouldn’t keep using 2+ definitions for both alignment and corrigibility. Some people have even stopped using ‘corrigibility’ to refer to corrigibility! I think it would be better for us to define the behavioral criterion (e.g. as I defined ‘corrigibility’), and then define mechanistic ways of getting that criterion (e.g. intent corrigibility). We can have lots of concepts, but they should each have different names.

Evan Hubinger recently wrote a great FAQ on inner alignment terminology. We won’t be talking about inner/​outer alignment today, but I intend for my usage of “impact alignment” to roughly map onto his “alignment”, and “intent alignment” to map onto his usage of “intent alignment.” Similarly, my usage of “impact/​intent alignment” directly aligns with the definitions from Andrew Critch’s recent post, Some AI research areas and their relevance to existential safety.

A Simple Concept Motivating Corrigibility

Two conceptual clarifications

Corrigibility with respect to a set of goals

I find it useful to not think of corrigibility as a binary property, or even as existing on a one-dimensional continuum. I often think about corrigibility with respect to a set $S$ of payoff functions. (This isn’t always the right abstraction: there are plenty of policies which don’t care about payoff functions. I still find it useful.)

For example, imagine an AI which lets you correct it if and only if it knows you aren’t a torture-maximizer. We’d probably still call this AI “corrigible [to us]”, even though it isn’t corrigible to some possible designer. We’d still be fine, assuming it has accurate beliefs.

Corrigibility != alignment

Here’s an AI which is neither impact nor intent aligned, but which is corrigible. Each day, the AI randomly hurts one person in the world, and otherwise does nothing. It’s corrigible because it doesn’t prevent us from shutting it off or modifying it.

Non-obstruction: the AI doesn’t hamper counterfactual achievement of a set of goals

Imagine we’re playing a two-player extensive-form game with the AI, and we’re considering whether to activate it.

The human moves on black, and the AI moves on red.

This is a trivial game, but you can imagine more complex games, where the AI can empower or disempower the human, steer the future exactly where it wants, or let the human take over at any point.

The million-dollar question is: will the AI get in our way and fight with us all the way down the game tree? If we misspecify some detail, will it make itself a fixture in our world, constantly steering towards futures we don’t want? If we like dogs, will the AI force pancakes upon us?

One way to guard against this is by having it let us correct it, and want to let us correct it, and want to want to let us correct it… But what we really want is for it to not get in our way for some (possibly broad) set of goals $S$.

We’ll formalize ‘goals’ as payoff functions, although I’ll use ‘goals’ and ‘payoff functions’ interchangeably. As is standard in game theory, payoff functions are real-valued functions on the leaf nodes.

Let’s say the AI is non-obstructive with respect to $S$ when activating it (moving to the on state, above) doesn’t decrease our ability to achieve any goal in $S$, compared to not activating it (staying at off).

Does activating the AI decrease the $P$-value attained by the human, for the various goals $P$ the human might counterfactually pursue?

The human’s got a policy function $\text{pol}$, which takes in a goal $P$ and returns a policy $\text{pol}(P)$ for that goal. If $P$ is “paint walls blue”, then the policy $\text{pol}(P)$ is the human’s best plan for painting walls blue. $V^{\text{pol}(P)}_P(s \mid \pi^{AI})$ denotes the expected value that policy $\text{pol}(P)$ obtains for goal $P$, starting from state $s$ and given that the AI follows policy $\pi^{AI}$.

Definition 1: Non-obstruction. An AI is non-obstructive with respect to payoff function set $S$ if the AI’s policy $\pi^{AI}$ satisfies

$$\forall P \in S: \quad V^{\text{pol}(P)}_P(\text{on} \mid \pi^{AI}) \;\geq\; V^{\text{pol}(P)}_P(\text{off} \mid \pi^{AI}).$$

$V^{\text{pol}(P)}_P(s \mid \pi^{AI})$ is the human’s attainable utility (AU) for goal $P$ at state $s$, again given the AI policy. Basically, this quantifies the expected payoff for goal $P$, given that the AI acts in such-and-such a way and that the human follows policy $\text{pol}(P)$ starting from state $s$.

This math expresses a simple sentiment: turning on the AI doesn’t make you, the human, worse off for any goal $P \in S$. The inequality doesn’t have to be exact; it could just hold up to some small $\epsilon$-decrease (to avoid trivial counterexamples). The AU is calculated with respect to some reasonable amount of time (e.g. a year: before the world changes rapidly because we deployed another transformative AI system, or something). Also, we’d technically want to talk about non-obstruction being present throughout the on-subtree, but let’s keep it simple for now.

The human moves on black, and the AI moves on red.

Suppose that $\pi^{AI}$ leads to pancakes:

Since $\pi^{AI}$ transitions to pancakes, $V^{\text{pol}(P)}_P(\text{on} \mid \pi^{AI}) = P(\text{pancakes})$, the payoff for the state in which the game finishes if the AI follows policy $\pi^{AI}$ and the human follows policy $\text{pol}(P)$. If $P(\text{pancakes}) \geq V^{\text{pol}(P)}_P(\text{off} \mid \pi^{AI})$, then turning on the AI doesn’t make the human worse off for goal $P$.

If $P$ assigns the most payoff to pancakes, we’re in luck. But what if we like dogs? If we keep the AI turned off, $\text{pol}(P)$ can go to donuts or dogs, depending on which $P$ rates more highly. Crucially, even though we can’t do as much as the AI (we can’t reach pancakes on our own), if we don’t turn the AI on, our preferences still control how the world ends up.

This game tree isn’t really fair to the AI. In a sense, it can’t not be in our way:

  • If $\pi^{AI}$ leads to pancakes, then it obstructs payoff functions which give strictly more payoff to donuts or dogs.

  • If $\pi^{AI}$ leads to donuts, then it obstructs payoff functions which give strictly more payoff to dogs.

  • If $\pi^{AI}$ leads to dogs, then it obstructs payoff functions which give strictly more payoff to donuts.

Once we’ve turned the AI on, the future stops having any mutual information with our preferences $P$. Everything comes down to whether we programmed $\pi^{AI}$ correctly: to whether the AI is impact-aligned with our goals!

In contrast, the idea behind non-obstruction is that we still remain able to course-correct the future, counterfactually navigating to terminal states we find valuable, depending on what our payoff $P$ is. But how could an AI be non-obstructive, if it only has one policy $\pi^{AI}$ which can’t directly depend on our goal $P$? Since the human’s policy $\text{pol}(P)$ does directly depend on $P$, the AI can preserve value for lots of goals in the set $S$ by letting us maintain some control over the future.
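
To make Definition 1 concrete, here’s a minimal sketch of the pancakes/​donuts/​dogs tree in Python. The payoff numbers, the greedy $\text{pol}$, and the pancake-steering $\pi^{AI}$ are all assumptions invented for illustration; the point is just to mechanize the on-versus-off comparison.

```python
# Toy model of the pancakes/donuts/dogs game tree. All numbers and policies
# here are illustrative assumptions, not anything canonical from the post.

# A goal P is a payoff function on leaf nodes.
GOALS = {
    "likes_pancakes": {"pancakes": 1.0, "donuts": 0.3, "dogs": 0.0},
    "likes_donuts":   {"pancakes": 0.0, "donuts": 1.0, "dogs": 0.4},
    "likes_dogs":     {"pancakes": 0.0, "donuts": 0.2, "dogs": 1.0},
}

def ai_policy():
    """pi^AI: if activated, this AI always steers the game to pancakes."""
    return "pancakes"

def pol(P):
    """pol(P): without the AI, the human picks the best leaf they can reach
    on their own (they can't reach pancakes unaided)."""
    reachable_without_ai = ["donuts", "dogs"]
    return max(reachable_without_ai, key=lambda leaf: P[leaf])

def value(state, P):
    """V_P^{pol(P)}(state | pi^AI): payoff of the leaf the game actually reaches."""
    leaf = ai_policy() if state == "on" else pol(P)
    return P[leaf]

def non_obstructive(goal_set):
    """Definition 1: activating the AI doesn't decrease AU for any goal in S."""
    return all(value("on", P) >= value("off", P) for P in goal_set.values())

for name, P in GOALS.items():
    print(f"{name}: on={value('on', P):.1f}, off={value('off', P):.1f}")
print("non-obstructive w.r.t. S:", non_obstructive(GOALS))
```

With these toy numbers, activation drops the donut- and dog-lovers’ attainable utility from 1.0 to 0.0, so the check fails: this $\pi^{AI}$ obstructs $S$, matching the bullets above.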


Let $S$ contain a broad range of realistic human goals, and consider the real world. Calculators are non-obstructive with respect to $S$, as are modern-day AIs. Paperclip maximizers are highly obstructive. Manipulative agents are obstructive (they trick the human policies into steering towards non-reflectively-endorsed leaf nodes). An initial-human-values-aligned dictator AI obstructs most goals. Sub-human-level AIs which chip away at our autonomy and control over the future are obstructive as well.

This can seemingly go off the rails if you consider e.g. a friendly AGI to be “obstructive” because activating it happens to detonate a nuclear bomb via the butterfly effect. Or, we’re already doomed in off (an unfriendly AGI will come along soon after), and so then this AI is “not obstructive” if it kills us instead. This is an impact/​intent issue—obstruction is here defined according to impact alignment.

To emphasize, we’re talking about what would actually happen if we deployed the AI, under different human policy counterfactuals—would the AI “get in our way”, or not? This account is descriptive, not prescriptive; I’m not saying we actually get the AI to represent the human in its model, or that the AI’s model of reality is correct, or anything.

We’ve just got two players in an extensive-form game, and a human policy function which can be combined with different goals, and a human whose goal is represented as a payoff function. The AI doesn’t even have to be optimizing a payoff function; we simply assume it has a policy. The idea that a human has an actual payoff function is unrealistic; all the same, I want to first understand corrigibility and alignment in two-player extensive-form games.

Lastly, payoff functions can sometimes be more or less granular than we’d like, since they only grade the leaf nodes. This isn’t a big deal, since I’m only considering extensive-form games for conceptual simplicity. We also generally restrict ourselves to considering goals which aren’t silly: for example, any AI obstructs the “no AI is activated, ever” goal.

Alignment flexibility

Main idea: By considering how the AI affects your attainable utility (AU) landscape, you can quantify how helpful and flexible an AI is.

Let’s consider the human’s ability to accomplish many different goals $P$, first from the state off (no AI).

The human’s AU landscape. The real goal space is high-dimensional, but it shouldn’t materially change the analysis. Also, there are probably a few goals we can’t achieve well at all, because they put low payoff everywhere, but the vast majority of goals aren’t like that.

The independent variable is the goal $P$, and the value function takes in $P$ and returns the expected value attained by the policy $\text{pol}(P)$ for that goal, $V^{\text{pol}(P)}_P(\text{off} \mid \pi^{AI})$. We’re able to do a bunch of different things without the AI, if we put our minds to it.

Non-torture AI

Imagine we build an AI which is corrigible towards all non-pro-torture goals, which is specialized towards painting lots of things blue with us (if we so choose), but which is otherwise non-obstructive. It even helps us accumulate resources for many other goals.

The AI is non-obstructive with respect to $P$ if $P$’s red value is at least its green value.

We can’t get around the AI as far as torture goes. But it isn’t obstructing the policies for the other goals; it won’t get in our way there.

Paperclipper

What happens if we turn on a paperclip-maximizer? We lose control over the future outside of a very narrow spiky region.

The paperclipper is incorrigible and obstructs us for all goals except paperclip production.

I think most reward-maximizing optimal policies affect the landscape like this (see also: the catastrophic convergence conjecture), which is why it’s so hard to get hard maximizers not to ruin everything. You have to a) hit a tiny target in the AU landscape and b) hit that for the human’s AU, not for the AI’s. The spikiness is bad and, seemingly, hard to deal with.

Furthermore, consider how the above graph changes as $\text{pol}$ gets smarter and smarter. If we were actually super-superintelligent ourselves, then activating a superintelligent paperclipper might not even be a big deal, and most of our AUs would probably be unchanged. The AI policy isn’t good enough to negatively impact us, and so it can’t obstruct us. Spikiness depends both on the AI’s policy $\pi^{AI}$ and on $\text{pol}$.
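
To illustrate, here’s a toy sketch (entirely made-up numbers) of what “spiky” means here: compare each goal’s attainable utility with the paperclipper on versus off, and list the goals that get obstructed.

```python
# Toy "spikiness" check: compare the AU landscape with the paperclipper on vs. off.
# All numbers are made up for illustration.
goals = ["paperclips", "dogs", "donuts", "blue_walls", "human_flourishing"]

# Attainable utility for each goal if the AI stays off (human muddles along alone)...
au_off = {g: 0.5 for g in goals}
# ...versus with a paperclip maximizer turned on: one spike, everything else collapses.
au_on = {g: (0.99 if g == "paperclips" else 0.01) for g in goals}

def obstructed_goals(au_on, au_off):
    """Goals whose attainable utility strictly drops when the AI is activated."""
    return [g for g in au_on if au_on[g] < au_off[g]]

print(obstructed_goals(au_on, au_off))
# ['dogs', 'donuts', 'blue_walls', 'human_flourishing'] -- spiky: the AI obstructs
# every goal except the one it happens to pursue.
```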

Empowering AI

What if we build an AI which significantly empowers us in general, and then it lets us determine our future? Suppose we can’t correct it.

I think it’d be pretty odd to call this AI “incorrigible”, even though it’s literally incorrigible. The connotations are all wrong. Furthermore, it isn’t “trying to figure out what we want and then do it”, or “trying to help us correct it in the right way.” It’s not corrigible. It’s not intent aligned. So what is it?

It’s empowering and, more weakly, it’s non-obstructive. Non-obstruction is just a diffuse form of impact alignment, as I’ll talk about later.

Practically speaking, we’ll probably want to be able to literally correct the AI without manipulation, because it’s hard to justifiably know ahead of time that the AU landscape is empowering, as above. Therefore, let’s build an AI we can modify, just to be safe. This is a separate concern, as our theoretical analysis assumes that the AU landscape is how it looks.

But this is also a case of corrigibility just being a proxy for what we want. We want an AI which leads to robustly better outcomes (either through its own actions, or through some other means), without reliance on getting ambitious value alignment exactly right with respect to our goals.

Conclusions I draw from the idea of non-obstruction

  1. Trying to implement corrigibility is probably a good instrumental strategy for us to induce non-obstruction in an AI we designed.

    1. It will be practically hard to know an AI is actually non-obstructive for a wide set $S$, so we’ll probably want corrigibility just to be sure.

  2. We (the alignment community) think we want corrigibility with respect to some wide set of goals $S$, but we actually want non-obstruction with respect to $S$.

    1. Generally, satisfactory corrigibility with respect to $S$ implies non-obstruction with respect to $S$! If the mere act of turning on the AI means you have to lose a lot of value in order to get what you wanted, then it isn’t corrigible enough.

      1. One exception: the AI moves so fast that we can’t correct it in time, even though it isn’t inclined to stop or manipulate us. In that case, corrigibility isn’t enough, whereas non-obstruction is.

    2. Non-obstruction with respect to $S$ does not imply corrigibility with respect to $S$.

      1. But this is OK! In this simplified setting of “human with actual payoff function”, who cares whether it literally lets us correct it or not? We care about whether turning it on actually hampers our goals.

      2. Non-obstruction should often imply some form of corrigibility, but these are theoretically distinct: an AI could just go hide out somewhere in secrecy and refund us its small energy usage, and then destroy itself when we build friendly AGI.

    3. Non-obstruction captures the cognitive abilities of the human through the policy function.

      1. To reiterate, this post outlines a frame for conceptually analyzing the alignment properties of an AI. We can’t actually figure out a goal-conditioned human policy function, but that doesn’t matter, because this is a tool for conceptual analysis, not an AI alignment solution strategy. Any conceptual analysis of impact alignment and corrigibility which did not account for human cognitive abilities would be obviously flawed.

    4. By definition, non-obstruction with respect to $S$ prevents harmful manipulation by precluding worse outcomes with respect to $S$.

      1. I consider manipulative policies to be those which steer the human into taking a certain kind of action, in a way that’s robust to the human’s counterfactual preferences.

        If I’m choosing which pair of shoes to buy, and I ask the AI for help, and no matter what preferences I had for shoes to begin with, I end up buying blue shoes, then I’m probably being manipulated (and obstructed with respect to most of my preferences over shoes!).

        A non-manipulative AI would act in a way that lets me condition my actions on my preferences.

      2. I do have a formal measure of corrigibility which I’m excited about, but it isn’t perfect. More on that in a future post.

    5. As a criterion, non-obstruction doesn’t rely on intentionality on the AI’s part. The definition also applies to the downstream effects of tool AIs, or even to hiring decisions!

    6. Non-obstruction is also conceptually simple and easy to formalize, whereas literal corrigibility gets mired in the semantics of the game tree.

      1. For example, what’s “manipulation”? As mentioned above, I think there are some hints as to the answer, but it’s not clear to me that we’re even asking the right questions yet.

I think of “power” as “the human’s average ability to achieve goals from some distribution.” Logically, agents which are non-obstructive with respect to $S$ don’t decrease our power with respect to any distribution over the goal set $S$. The catastrophic convergence conjecture says, “impact alignment catastrophes tend to come from power-seeking behavior”; if the agent is non-obstructive with respect to a broad enough set of goals, it’s not stealing power from us, and so it likely isn’t catastrophic.
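
One way to make this precise, as a sketch in this post’s notation (the distribution $\mathcal{D}$ over $S$ stands in for “some distribution”):

$$\text{Power}_{\mathcal{D}}(s \mid \pi^{AI}) := \mathbb{E}_{P \sim \mathcal{D}}\!\left[ V^{\text{pol}(P)}_P(s \mid \pi^{AI}) \right].$$

If the Definition 1 inequality holds for every $P \in S$, then it holds in expectation under every distribution $\mathcal{D}$ over $S$, so activating the AI doesn’t decrease $\text{Power}_{\mathcal{D}}$ for any such $\mathcal{D}$. That’s the “logically” claim above.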

Non-obstruction is important for a (singleton) AI we build: we get more than one shot to get it right. If it’s slightly wrong, it’s not going to ruin everything. Modulo other actors, if you mess up the first time, you can just try again and get a strongly aligned agent the next time.

Most importantly, this frame collapses the alignment and corrigibility desiderata into just alignment; while impact alignment doesn’t imply corrigibility, corrigibility’s benefits can be understood as a kind of weak counterfactual impact alignment with many possible human goals.

Theoretically, It’s All About Alignment

Main idea: We only care about how the agent affects our abilities to pursue different goals (our AU landscape) in the two-player game, and not how that happens. AI alignment subproblems (such as corrigibility, intent alignment, low impact, and mild optimization) are all instrumental avenues for making AIs which affect this AU landscape in specific desirable ways.

Formalizing impact alignment in extensive-form games

Impact alignment: the AI’s actual impact is aligned with what we want. Deploying the AI actually makes good things happen.

We care about events if and only if they change our ability to get what we want. If you want to understand normative AI alignment desiderata, on some level they have to ground out in terms of your ability to get what you want (the AU theory of impact) - the goodness of what actually ends up happening under your policy—and in terms of how other agents affect your ability to get what you want (the AU landscape). What else could we possibly care about, besides our ability to get what we want?

Definition 2. For fixed human policy function $\text{pol}$, $\pi^{AI}$ is:

  • Maximally impact aligned with goal $P$ if $\pi^{AI} \in \operatorname{argmax}_{\pi} V^{\text{pol}(P)}_P(\text{on} \mid \pi)$.

  • Impact aligned with goal $P$ if $V^{\text{pol}(P)}_P(\text{on} \mid \pi^{AI}) > V^{\text{pol}(P)}_P(\text{off} \mid \pi^{AI})$.

  • (Impact) non-obstructive with respect to goal $P$ if $V^{\text{pol}(P)}_P(\text{on} \mid \pi^{AI}) \geq V^{\text{pol}(P)}_P(\text{off} \mid \pi^{AI})$.

  • Impact unaligned with goal $P$ if $V^{\text{pol}(P)}_P(\text{on} \mid \pi^{AI}) < V^{\text{pol}(P)}_P(\text{off} \mid \pi^{AI})$.

  • Maximally impact unaligned with goal $P$ if $\pi^{AI} \in \operatorname{argmin}_{\pi} V^{\text{pol}(P)}_P(\text{on} \mid \pi)$.

Non-obstruction is a weak form of impact alignment.

As demanded by the AU theory of impact, the impact on goal $P$ of turning on the AI is $V^{\text{pol}(P)}_P(\text{on} \mid \pi^{AI}) - V^{\text{pol}(P)}_P(\text{off} \mid \pi^{AI})$.

Again, impact alignment doesn’t require intentionality. The AI might well grit its circuits as it laments how Facebook_user5821 failed to share a “we welcome our AI overlords” meme, while still following an impact-aligned policy.
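
For concreteness, here’s a tiny sketch applying Definition 2 and this impact formula to a single goal, given the relevant attainable-utility numbers; the example values are the dog-lover’s from the earlier toy tree, and the “maximal” cases additionally need the best/​worst on-values over all AI policies.

```python
def impact(on_value, off_value):
    """AU theory of impact: the impact of turning the AI on, for this goal."""
    return on_value - off_value

def classify(on_value, off_value, best_on_value=None, worst_on_value=None):
    """Apply Definition 2 to one goal P.

    on_value / off_value are V_P^{pol(P)}(on | pi^AI) and V_P^{pol(P)}(off | pi^AI);
    best_on_value / worst_on_value are the max / min on-value over all AI policies
    (only needed for the 'maximal' categories).
    """
    labels = []
    if best_on_value is not None and on_value == best_on_value:
        labels.append("maximally impact aligned")
    if on_value > off_value:
        labels.append("impact aligned")
    if on_value >= off_value:
        labels.append("non-obstructive")
    if on_value < off_value:
        labels.append("impact unaligned")
    if worst_on_value is not None and on_value == worst_on_value:
        labels.append("maximally impact unaligned")
    return labels

# The dog-lover facing the pancake-steering AI from the earlier sketch:
print(impact(on_value=0.0, off_value=1.0))                        # -1.0
print(classify(on_value=0.0, off_value=1.0, worst_on_value=0.0))
# ['impact unaligned', 'maximally impact unaligned']
```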


However, even if we could maximally impact-align the agent with any objective, we couldn’t just align it with our objective. We don’t know our objective (again, in this setting, I’m assuming the human actually has a “true” payoff function). Therefore, we should build an AI aligned with many possible goals we could have. If the AI doesn’t empower us, it at least shouldn’t obstruct us. Therefore, we should build an AI which defers to us, lets us correct it, and which doesn’t manipulate us.

This is the key motivation for corrigibility.

For example, intent corrigibility (trying to be the kind of agent which can be corrected and which is not manipulative) is an instrumental strategy for inducing corrigibility, which is an instrumental strategy for inducing broad non-obstruction, which is an instrumental strategy for hedging against our inability to figure out what we want. It’s all about alignment.

(Agreement) Corrigibility is an instrumental strategy for inducing non-obstruction in an AI.
(Agreement) Inducing non-obstruction is an instrumental strategy for hedging against our inability to figure out what we want.

Corrigibility also increases robustness against other AI design errors. However, it still just boils down to non-obstruction, and then to impact alignment: if the AI system has meaningful errors, then it’s not impact-aligned with the AUs which we wanted it to be impact-aligned with. In this setting, the AU landscape captures what actually would happen for different human goals $P$.

To be confident that this holds empirically, it sure seems like you want high error tolerance in the AI design: one does not simply knowably build an AGI that’s helpful for many AUs. Hence, corrigibility as an instrumental strategy for non-obstruction.

AI alignment subproblems are about avoiding spikiness in the AU landscape

By definition, spikiness is bad for most goals.

  • Corrigibility: avoid spikiness by letting humans correct the AI if it starts doing stuff we don’t like, or if we change our mind.

    • This works because the human policy function $\text{pol}$ is far more likely to correctly condition actions on the human’s goal than we are to induce an AI policy which does the same (since the goal information is private to the human).

    • Enforcing off-switch corrigibility and non-manipulation are instrumental strategies for getting better diffuse alignment across goals and a wide range of deployment situations.

(Agreement) Non-obstruction mathematically encapsulates corrigibility’s benefits.
  • Intent alignment: avoid spikiness by having the AI want to be flexibly aligned with us and broadly empowering.

    • Basin of intent alignment: smart, nearly intent-aligned AIs should modify themselves to be more and more intent-aligned, even if they aren’t perfectly intent-aligned to begin with.

      • Intuition: If we can build a smarter mind which basically wants to help us, then can’t the smarter mind also build a yet smarter agent which still basically wants to help it (and therefore, help us)?

      • Paul Christiano named this the “basin of corrigibility”, but I don’t like that name because only a few of the named desiderata actually correspond to the natural definition of “corrigibility.” This then overloads “corrigibility” with the responsibilities of “intent alignment.”

  • Low impact: find a maximization criterion which leads to non-spikiness.

    • Goal of these methods: to regularize the decrease from the green line (for off) for the true, unknown goal $P$; since we don’t know $P$, we aim to just regularize the decrease from the green line in general (to avoid decreasing the human’s ability to achieve various goals).

    • The first two-thirds of Reframing Impact argued that power-seeking incentives play a big part in making AI alignment hard. In the utility-maximization AI design paradigm, instrumental subgoals are always lying in wait. They’re always waiting for one mistake, one misspecification in your explicit reward signal, and then bang—the AU landscape is spiky. Game over.

  • Mild optimization: avoid spikiness by avoiding maximization, thereby avoiding steering the future too hard.

  • If you have non-obstruction for lots of goals, you don’t have spikiness!

What Do We Want?

Main idea: we want good things to happen; there may be more ways to do this than previously considered.

|  | Alignment | Corrigibility | Non-obstruction |
|---|---|---|---|
| Impact | Actually makes good things happen. | Corrigibility is a property of policies, not of states; “impact” is an incompatible adjective. (Rohin Shah suggests “empirical corrigibility”: we actually end up able to correct the AI.) | Actually doesn’t decrease AUs. |
| Intent | Tries to make good things happen. | Tries to allow us to correct it without it manipulating us. | Tries to not decrease AUs. |

We want agents which are maximally impact-aligned with as many goals as possible, especially those similar to our own.

  • It’s theoretically possible to achieve maximal impact alignment with the vast majority of goals.

    • To achieve maximum impact alignment with goal set $S$ (a sketch follows this list):

      • Expand the human’s action space to $\mathcal{A}^H \times S$. Expand the state space to encode the human’s previous action.

      • Each turn, the human communicates what goal they want optimized, and takes an action of their own.

      • The AI’s policy then takes the optimal action for the communicated goal $P$, accounting for the fact that the human follows $\text{pol}(P)$.

    • This policy looks like an act-based agent, in that it’s ready to turn on a dime towards different goals.

    • In practice, there’s likely a tradeoff between impact-alignment strength and the number of goals which the agent doesn’t obstruct.

      • As we dive into specifics, the familiar considerations return: competitiveness (of various kinds), etc.

  • Having the AI not be counterfactually aligned with unambiguously catastrophic and immoral goals (like torture) would reduce misuse risk.

    • I’m more worried about accident risk right now.

    • This is probably hard to achieve; I’m inclined to think about this after we figure out simpler things, like how to induce AI policies which empower us and grant us flexible control/​power over the future. Even though that would fall short of maximal impact alignment, I think that would be pretty damn good.
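
Here’s a minimal one-shot sketch of that construction (toy action sets, goals, and payoffs, all invented for illustration): the human announces a goal from $S$ alongside their own action, and the AI best-responds to the announced goal while accounting for what $\text{pol}$ will do.

```python
import itertools

# One-shot toy version of the construction above. Action sets, goals, and payoffs
# are all invented for illustration.
HUMAN_ACTIONS = ["paint", "rest"]
AI_ACTIONS = ["fetch_blue_paint", "bake_pancakes", "do_nothing"]

def payoff(P, human_action, ai_action):
    """Payoff of the joint outcome under goal P (illustrative numbers)."""
    table = {
        "paint_blue":   {("paint", "fetch_blue_paint"): 1.0},
        "eat_pancakes": {("rest", "bake_pancakes"): 1.0},
    }
    return table[P].get((human_action, ai_action), 0.0)

def pol(P):
    """Human policy function: the human's half of the best joint plan for P."""
    best_pair = max(itertools.product(HUMAN_ACTIONS, AI_ACTIONS),
                    key=lambda pair: payoff(P, *pair))
    return best_pair[0]

def ai_policy(announced_goal):
    """Best-respond to the goal the human announced, given the human follows pol."""
    human_action = pol(announced_goal)
    return max(AI_ACTIONS, key=lambda a: payoff(announced_goal, human_action, a))

for goal in ["paint_blue", "eat_pancakes"]:
    h, a = pol(goal), ai_policy(goal)
    print(goal, "->", (h, a), "payoff:", payoff(goal, h, a))
```

Because the AI re-optimizes whenever the announced goal changes, this single $\pi^{AI}$ is (in the toy setting) simultaneously well-aligned with every goal in $S$, which is the act-based flavor mentioned above.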

Expanding the AI alignment solution space

Alignment proposals might be anchored right now; this frame expands the space of potential solutions. We simply need to find some way to reliably induce empowering AI policies which robustly increase the human AUs; Assistance via Empowerment is the only work I’m aware of which tries to do this directly. It might be worth revisiting old work with this lens in mind. Who knows what we’ve missed?

For example, I really liked the idea of approval-directed agents, because you got the policy from argmaxing an ML model’s output for a state—not from RL policy improvement steps. My work on instrumental convergence in RL can be seen as trying to explain why policy improvement tends to limit to spikiness-inducing /​ catastrophic policies.

Maybe there’s a higher-level theory for what kinds of policies induce spikiness in our AU landscape. By the nature of spikiness, these must decrease human power (as I’ve formalized it). So, I’d start there by looking at concepts like enfeeblement, manipulation, power-seeking, and resource accumulation.

There exists a useful (but potentially informal) theory for what kinds of AI policies make the AU landscape "spiky."

Future Directions

  • Given an AI policy, could we prove a high probability of non-obstruction, given conservative assumptions about how smart $\text{pol}$ is? (h/​t Abram Demski, Rohin Shah)

    • Any irreversible action makes some goal unachievable, but irreversible actions need not impede most meaningful goals.

  • Can we prove that some kind of corrigibility or other nice property falls out of non-obstruction across many possible environments? (h/​t Michael Dennis)

Non-obstruction across many possible environments implies a property like corrigibility.
  • Can we get negative results, like “without such-and-such assumption on $\text{pol}$, the environment, or $\pi^{AI}$, non-obstruction is impossible for most goals”?

    • If formalized correctly, and if the assumptions hold, this would place very general constraints on solutions to the alignment problem.

    • For example, $\text{pol}(P)$ should need to have mutual information with $P$: the goal must change the policy for at least a few goals.

    • The AI doesn’t even have to do value inference in order to be broadly impact-aligned. The AI could just empower the human (even for very “dumb” $\text{pol}$ functions) and then let the human take over. Unless the human is more anti-rational than rational, this should tend to be a good thing. It would be good to explore how this changes with different ways that $\text{pol}$ can be irrational.

  • The better we understand (the benefits of) corrigibility now, the less that amplified agents have to figure out during their own deliberation.

    • In particular, I think it’s very advantageous for the human-to-be-amplified to already deeply understand what it means to be impact-/​intent-aligned. We really don’t want that part to be up in the air when game-day finally arrives, and I think this is a piece of that puzzle.

    • If you’re a smart AI trying to be non-obstructive to many goals under weak intelligence assumptions, what kinds of heuristics might you develop? “No lying”?

  • We crucially assumed that the human goal can be represented with a payoff function. As this assumption is relaxed, impact non-obstruction may become incoherent, forcing us to rely on some kind of intent non-obstruction/​alignment (see Paul’s comments on a related topic here).

  • Stuart Armstrong observed that the strongest form of manipulation corrigibility requires knowledge/​learning of human values.

    • This frame explains why: for non-obstruction, each AU has to get steered in a positive direction, which means the AI has to know which kinds of interaction and persuasion are good and don’t exploit human policies with respect to the true hidden $P$.

    • Perhaps it’s still possible to build agent designs which aren’t strongly incentivized to manipulate us /​ agents whose manipulation has mild consequences. For example, human-empowering agents probably often have this property.

The attainable utility concept has led to other concepts which I find exciting and useful:

Impact is the area between the red and green curves. When $\text{pol}$ always outputs an optimal policy, this becomes the attainable utility distance, a distance metric over the state space of a Markov decision process (unpublished work). Basically, two states are more distant the more they differ in what goals they let you achieve.
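
One illustrative way to cash out “differ in what goals they let you achieve” (just a sketch, not necessarily the unpublished definition): for a distribution $\mathcal{D}$ over payoff functions,

$$d(s_1, s_2) := \mathbb{E}_{P \sim \mathcal{D}}\Big[\, \big| V^*_P(s_1) - V^*_P(s_2) \big| \,\Big],$$

where $V^*_P$ is the optimal value function for $P$. Under something like this, two states are far apart when optimal pursuit of typical goals attains very different value from them.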

Summary

Corrigibility is motivated by a counterfactual form of weak impact alignment: non-obstruction. Non-obstruction and the AU landscape let us think clearly about how an AI affects us and about AI alignment desiderata.

Even if we could maximally impact-align the agent with any objective, we couldn’t just align it with our objective, because we don’t know our objective. Therefore, we should build an AI aligned with many possible goals we could have. If the AI doesn’t empower us, it at least shouldn’t obstruct us. Therefore, we should build an AI which defers to us, lets us correct it, and which doesn’t manipulate us.

This is the key motivation for corrigibility.

Corrigibility is an instrumental strategy for achieving non-obstruction, which is itself an instrumental strategy for achieving impact alignment for a wide range of goals, which is itself an instrumental strategy for achieving impact alignment for our “real” goal.


There’s just something about “unwanted manipulation” which feels like a wrong question to me. There’s a kind of conceptual crispness that it lacks.

However, in the non-obstruction framework, unwanted manipulation is accounted for indirectly via “did impact alignment decrease for a wide range of different human policies $\text{pol}(P)$?”. I think I wouldn’t be surprised to find “manipulation” being accounted for indirectly through nice formalisms, but I’d be surprised if it were accounted for directly.

Here’s another example of the distinction:

  • Direct: quantifying in bits “how much” a specific person is learning at a given point in time

  • Indirect: computational neuroscientists upper-bounding the brain’s channel capacity with the environment, limiting how quickly a person (without logical uncertainty) can learn about their environment

You can often have crisp insights into fuzzy concepts, such that your expectations are usefully constrained. I hope we can do something similar for manipulation.