The previous definition was aiming to define a utility function “precisely,” in the sense of giving some code which would produce the utility value if you ran it for a (very, very) long time.
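To make that concrete, here is a toy Python sketch of the shape that previous definition had; it is not the actual construction, and the names and the stubbed-out deliberation step are purely illustrative:

```python
# Toy sketch (not the actual construction): the "precise" definition amounts
# to a single program whose output, if you could run it for long enough,
# would be the utility value. The deliberation step here is a stub; in the
# real proposal it would involve simulating hypothetical humans reflecting
# for a very long time. All names are hypothetical.
from dataclasses import dataclass


@dataclass
class DeliberationState:
    notes: str       # whatever the simulated deliberators have worked out so far
    verdict: float   # their current best estimate of the utility


def deliberation_step(state: DeliberationState) -> DeliberationState:
    # Stand-in for one step of idealized deliberation about the outcome.
    return DeliberationState(notes=state.notes, verdict=state.verdict)


def utility(outcome_description: str, num_steps: int = 10**6) -> float:
    """Whatever value a (very, very) long deliberation process settles on."""
    state = DeliberationState(notes=outcome_description, verdict=0.0)
    for _ in range(num_steps):  # stand-in for "run it for a very long time"
        state = deliberation_step(state)
    return state.verdict
```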
One basic concern with this is (as you pointed out at the time) that it’s not clear that an AI which was able to acquire power would actually be able to reason about this abstract definition of utility. A lesser concern is that it involves considering the decisions of hypothetical humans very unlike those existing in the real world (who therefore might reach bad conclusions, or at least conclusions different from ours).
In the new formulation, the goal is to define the utility function in terms of the answers to questions about the future that seem like they should be easy for the AI to answer, because they are a combination of (i) easy predictions about humans that it is good at, and (ii) predictions about the future that any power-seeking AI should be able to make.
Relatedly, this version only requires making predictions about humans who are living in the real world and being defended by their AI. (Though those humans can choose to delegate to some digital process making predictions about hypothetical humans, if they so desire.) Ideally I’d even like all of the humans involved in the process to be indistinguishable from the “real” humans, so that no human ever looks at their situation and thinks “I guess I’m one of the humans responsible for figuring out the utility function, since this isn’t the kind of world that my AI would actually bring into existence rather than merely reasoning about hypothetically.”
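To make the shape of this concrete, here is a rough sketch under some assumptions of my own about how the pieces fit together; the question wording, the aggregation, and the answer interface are hypothetical illustration rather than the actual proposal:

```python
# Rough sketch (hypothetical names, not the actual proposal): the utility of
# a possible future is built out of the AI's answers to questions combining
# (i) predictions about the real humans living in that future and
# (ii) predictions about the future that any power-seeking AI should be able
# to make.
from typing import Callable

# The question-answering interface we hope to get (e.g. from ELK):
# (description of a possible future, question) -> numeric answer in [0, 1].
AnswerFn = Callable[[str, str], float]


def utility_of_future(future: str, answer: AnswerFn) -> float:
    # (i) an easy prediction about the humans living in this future,
    #     defended by their AI
    endorsement = answer(
        future,
        "How satisfied are the humans living here with how things are going?",
    )
    # (ii) a prediction that any power-seeking AI should be able to make
    influence = answer(
        future,
        "How much flexible influence does the AI hold that it could redirect "
        "toward whatever the humans ask for?",
    )
    # Hypothetical aggregation; the real formulation would say much more here.
    return endorsement * influence


if __name__ == "__main__":
    toy_oracle: AnswerFn = lambda future, question: 0.5  # placeholder oracle
    print(utility_of_future("a future where deliberation is protected", toy_oracle))
```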
More structurally, the goal is to define the utility function in terms of the kinds of question-answers that realistic approaches to ELK could elicit. That doesn’t seem to include facts about mathematics that are much too complex for humans to derive directly, where they instead need to rely on correlations between mathematics and the physical world. In those cases we are essentially just delegating all of the reasoning about how to couple the two (e.g. how to infer that hypothetical humans will behave like real humans) to some amplified humans, and then we might as well go one level further and actually talk about how those humans reason.
The point of doing this exercise now is mostly to clarify what kind of answers we need to get out of ELK, and especially to better understand whether it’s worth exploring “narrow” approaches (methodologically it may make sense anyway because they may be a stepping stone to more ambitious approaches, but it would be more satisfying if they could be used directly as a building block in an alignment scheme). We looked into it enough to feel more confident about exploring narrow approaches.
Thanks, very helpful to understand your motivations for that section better.
In the new formulation, the goal is to define the utility function in terms of the answers to questions about the future that seem like they should be easy for the AI to answer, because they are a combination of (i) easy predictions about humans that it is good at, and (ii) predictions about the future that any power-seeking AI should be able to make.
Not sure about the following, but it seems the new formulation requires that the AI answer questions about humans in a future that may have very low probability according to the AI’s current beliefs (i.e., the current human through a delegation chain eventually delegates to a future human existing in a possible world with low probability). The AI may well not be able to answer questions about such a future human, because it wouldn’t need that ability to seek power (it only needs to make predictions about high probability futures). Or to put it another way, the future human may exist in a world with strange/unfamiliar (from the AI’s perspective) features that make it hard for the AI to predict correctly.
Ideally I’d even like all of the humans involved in the process to be indistinguishable from the “real” humans, so that no human ever looks at their situation and thinks “I guess I’m one of the humans responsible for figuring out the utility function, since this isn’t the kind of world that my AI would actually bring into existence rather than merely reasoning about hypothetically.”
How do you envision extracting or eliciting from the future human H_limit an opinion about what the current human should do, given that H_limit’s mind is almost certainly entirely focused on their own life and problems? One obvious way I can think of is to make a copy of H_limit, put the copy in a virtual environment, tell them about H’s situation, then ask them what to do. But that seems to run into the same kind of issue, as the copy is now aware that they’re not living in the real world.
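In procedural form, the “obvious way” above looks roughly like this (all names hypothetical); the comment near the end marks where it seems to break down:

```python
# Procedural sketch of the "obvious way" described above (hypothetical names):
# copy H_limit, place the copy in a virtual environment, brief it on H's
# situation, and ask it what H should do.
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Human:
    name: str
    memories: Tuple[str, ...]
    in_real_world: bool = True


def elicit_opinion(h_limit: Human, h_situation: str) -> str:
    # Copy H_limit into a virtual environment and brief the copy on H's situation.
    briefed_copy = replace(
        h_limit,
        in_real_world=False,
        memories=h_limit.memories + (h_situation,),
    )
    # The problem: the copy can now tell it is not living in the real world,
    # which is exactly what the formulation was trying to avoid.
    assert not briefed_copy.in_real_world
    return ask(briefed_copy, "Given H's situation, what should H do?")


def ask(human: Human, question: str) -> str:
    # Stub standing in for actually querying the (copied) human.
    return "<the copy's answer>"
```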
Not sure about the following, but it seems the new formulation requires that the AI answer questions about humans in a future that may have very low probability according to the AI’s current beliefs (i.e., the current human through a delegation chain eventually delegates to a future human existing in a possible world with low probability). The AI may well not be able to answer questions about such a future human, because it wouldn’t need that ability to seek power (it only needs to make predictions about high probability futures). Or to put it another way, the future human may exist in a world with strange/unfamiliar (from the AI’s perspective) features that make it hard for the AI to predict correctly.
I’m imagining delegating to humans who are very similar to (and ideally indistinguishable from) the humans who will actually exist in the world that we bring about. I’m scared about very alien humans for a bunch of reasons: they are hard for the AI to reason about, they may behave strangely, and they make it harder to use “corrigible” strategies to easily satisfy their preferences. (Though that said, note that the AI is reasoning very abstractly about such future humans and cannot e.g. predict any of their statements in detail.)
How do you envision extracting or eliciting from the future human H_limit an opinion about what the current human should do, given that H_limit’s mind is almost certainly entirely focused on their own life and problems? One obvious way I can think of is to make a copy of H_limit, put the copy in a virtual environment, tell them about H’s situation, then ask them what to do. But that seems to run into the same kind of issue, as the copy is now aware that they’re not living in the real world.
Ideally we are basically asking each human what they want their future to look like, not asking them to evaluate a very different world.
Ideally we would literally only be asking the humans to evaluate their future. This is kind of like giving instructions to their AI about what it should do next, but a little bit more indirect since they are instead evaluating futures that their AI could bring about.
The reason this doesn’t work is that by the time we get to those future humans, the AI may already be in an irreversibly bad position (e.g. because it hasn’t acquired much flexible influence that it can use to help the humans achieve their goals). This happens most obviously at the very end, but it also happens along the way if the AI failed to get into a position where it could effectively defend us. (And of course it happens along the way if people are gradually refining their understanding of what they want to happen in the external world, rather than having a full clean separation into “expand while protecting deliberation” + “execute payload.”)
However, when this happens it is only because the humans along the way couldn’t tell that things were going badly—they couldn’t understand that their AI had failed to gather resources for them until they actually got to the end, asked their AI to achieve something, and were unhappy because it couldn’t. If they had understood along the way, then they would never have gone down this route.
So at the point when the humans are thinking about this question, you may hope that they are actually ignorant about whether their AI has put them in a good situation. They are providing their views about what they want to happen in the world, hoping that their AI can achieve those outcomes. The AI will only “back up” and explore a different possible future if it turns out that it isn’t able to get the humans what they want as effectively as it could have in some other world. But in that case the humans don’t even know that this backing up is about to occur. They never evaluate the full quality of their situation; they just say “in this world, the AI fails to do what we want” (and it only becomes clear that the situation is bad when the AI fails to do what the humans want in every world).
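As a toy illustration of that dynamic (the evaluation and selection rules here are hypothetical stand-ins, not the actual mechanism):

```python
# Toy illustration (hypothetical stand-ins): the humans in each candidate
# future only ever evaluate their own world ("does the AI get us what we
# want here?"), never comparing it to alternatives. The AI backs up from
# any branch where the answer is no, and the overall situation only looks
# bad if the answer is no in every branch.
from typing import Iterable, Optional


def humans_get_what_they_want(future: str) -> bool:
    # Stand-in for the humans-in-that-future judging only their own world.
    return "flexible influence" in future


def choose_future(candidate_futures: Iterable[str]) -> Optional[str]:
    for future in candidate_futures:
        if humans_get_what_they_want(future):
            return future  # no need to back up from this branch
    return None            # the AI fails in every world


if __name__ == "__main__":
    print(choose_future([
        "a future where the AI never acquired flexible influence",
        "a future where the AI holds flexible influence it can redirect",
    ]))
```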
I don’t really think the strong form of this can work out, since the humans may e.g. become wiser and realize that something in their past was bad. And if they are just thinking about their own lives they may not want to report that fact since it will clearly cause them not to exist. I think it’s not really clear how to handle that.
(If the problem they notice was a fact about their early deliberation that they now regret, then I think this is basically a problem for any approach. If they notice a fact about the AI’s early behavior that they don’t like, but are too selfish to want to “unwind” it and therefore claim to be happy with what their AI does for them, then that seems like a more distinctive problem for this approach. More generally, there is a risk that people will be looking for any signs that a possible future is “their” future and preferring it, that this effectively removes the ability to unwind and therefore eliminates the AI’s incentive to acquire resources, and that we couldn’t reintroduce that incentive without giving up on the decoupling that lets us avoid incentives for manipulation.)
(I do think that issues like this are even more severe for many other approaches to defining values that people are imagining, e.g. in any version of decoupled RL you could have a problem where overseers rate their own world much better than alternatives. You could imagine approaches that avoid this by avoiding indirect normativity, but currently it looks to me like they avoid these problems only by being very vague about what “values” means.)