Mark Xu comments on ARC’s first technical report: Eliciting Latent Knowledge

Mark Xu 24 Dec 2021 2:29 UTC
LW: 3 AF: 1
I think that problem 1 and problem 2 as you describe them are potentially talking about the same phenomenon. I’m not sure I’m understanding correctly, but I think I would make the following claims:
- Our notion of narrowness is that we are interested in solving the problem where the question we’re asking is such that a state always resolves a question. E.g. there isn’t any ambiguity around whether a state “really contains a diamond”. (Note that there is ambiguity around whether the human could detect the diamond from any set of observations because there could be a fake diamond or nanobots filtering what the human sees). It might be useful to think of this as an empirical claim about diamonds.
- We are explicitly interested in solving some forms of problem 2, e.g. we’re interested in our AI being able to answer questions about the presence/absence of diamonds no matter how alien the world gets. In some sense, we are interested in our AI answering questions the same way a human would answer questions if they “knew what was really going on”, but “knew what was really going on” might be a misleading phrase. I’m not imagining “knowing what is really going on” to be a very involved process; intuitively, it means something like “the answer they would give if the sensors are ‘working as intended’”. In particular, I don’t think that, for the case of the diamond, “Further judgement, deliberation, and understanding is required to determine what the answer should be in these strange worlds.”
  - We want to solve these versions of problem 2 because the speed at which things “get weirder” in the world might be much faster than humans’ ability to understand what’s going on in the world. In these worlds, we want to leverage the fact that answers to “narrow” questions are unambiguous to incentivize our AIs to give humans a locally understandable environment in which to deliberate.
- We’re not interested in solving forms of problem 2 where the human needs to do additional deliberation to know what the answer to the question “should” be. E.g. in ship-of-Theseus situations where the diamond is slowly replaced, we aren’t expecting our AI to answer “is that the same diamond as before?” using the resolution of ship-of-Theseus-style situations that a human would arrive at with additional deliberation. We are, however, expecting that the answer to the question “does the diamond look the way it does because of the ‘normal’ causal reasons?” is “no”, because the reason is something like “[incomprehensible process] replaced bits of the diamond with identical bits slowly”, which is definitely not why diamonds normally continue looking the same.
- Charlie Steiner 26 Dec 2021 8:40 UTC
  LW: 3 AF: 2
  > It might be useful to think of this as an empirical claim about diamonds.
  I think this statement encapsulates some worries I have.
  If it’s important how the human defines a property like “the same diamond,” then assuming that the sameness of the diamond is “out there in the diamond” will get you into trouble, e.g. if there’s any optimization pressure to find cases where the specifics of the human’s model rear their head. Human judgment is laden with the details of how humans model the world; you can’t entirely avoid dependence on the human (and the messiness that entails).
  Or to phrase it another way: I don’t have any beef with a narrow approach that says “there’s some set of judgments for which the human is basically competent, and we want to elicit knowledge relevant to those judgments.” But I’m worried about a narrow approach that says “let’s assume that humans are basically competent for all judgments of interest, and keep assuming this until something goes wrong.”
  It just feels to me like this second approach is sort of… treating the real world as if it’s a perturbative approximation to the platonic realm.
- Ramana Kumar 30 Dec 2021 16:27 UTC
  LW: 1 AF: 1
  > Our notion of narrowness is that we are interested in solving the problem where the question we’re asking is such that a state always resolves a question. E.g. there isn’t any ambiguity around whether a state “really contains a diamond”. (Note that there is ambiguity around whether the human could detect the diamond from any set of observations because there could be a fake diamond or nanobots filtering what the human sees). It might be useful to think of this as an empirical claim about diamonds.
  This “there isn’t any ambiguity” + “there is ambiguity” does not seem possible to me: these types of ambiguity are one and the same. But it might depend on what “any set of observations” is allowed to include. “Any set” suggests being very inclusive, but remember that passive observation is impossible. Perhaps the observations I’d want the human to use to figure out if the diamond is really there (presuming there isn’t ambiguity) would include observations you mean to exclude, such as disabling the filter-nanobots first?
  I guess a wrinkle here is that observations need to be “implementable” in the world. If we’re thinking of making observations as intervening on the world (e.g., to decide which sensors to query), then some observations may be inaccessible because we can’t make that intervention. Rewriting this all without relying on “possible”/“can” concepts would be instructive.
  - paulfchristiano 1 Jan 2022 23:04 UTC
    LW: 2 AF: 2
    I don’t think we have any kind of precise definition of “no ambiguity.” That said, I think it’s easy to construct examples where there is no ambiguity about whether the diamond remained in the room, yet there is no sequence of actions a human could take that would let them figure out the answer. For example, we can imagine simple toy universes where we understand exactly what features of the world give rise to human beliefs about diamonds and where we can say unambiguously that the same features are/aren’t present in a given situation.
    In general I feel a lot better about our definitions when we are using them to arbitrate a counterexample than I would if we were trying to give a formal definition. If all the counterexamples involved border cases of the concepts, where there was arguable ambiguity about whether the diamond really stayed in the room, then it would seem important to firm up these concepts, but right now it feels like it is easy to just focus on cases where algorithms unambiguously fail.
    (That methodological point isn’t obvious though—it may be that precise definitions are very useful for solving the problem even if you don’t need them to judge current solutions as inadequate. Or it may be that actually existing counterexamples are problematic in ways we don’t recognize. Pushback on these fronts is always welcome, but right now I feel pretty comfortable with the situation.)
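
The “simple toy universe” mentioned in the last comment can be made concrete with a small sketch. This is only an illustration under invented assumptions, not a construction from the report: the `State` fields, the two camera actions, and the rule that every camera passes through one spoofable feed are all hypothetical. It shows a world where the narrow question is fully settled by the state, yet no sequence of the available human actions separates the two states.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class State:
    """A hypothetical toy world state (both fields are illustrative assumptions)."""
    diamond_present: bool  # ground truth: is the diamond in the room?
    feed_spoofed: bool     # is the camera feed replaced with a fake image?


def diamond_in_room(state: State) -> bool:
    """The narrow question: its answer is fully determined by the state."""
    return state.diamond_present


# Hypothetical actions available to the human; each just picks a camera.
ACTIONS = ("look_camera_a", "look_camera_b")


def observe(state: State, action: str) -> str:
    """What the human sees after an action.

    Assumption: every camera goes through the same spoofable feed,
    so the chosen action does not change what is shown.
    """
    if state.feed_spoofed:
        return "diamond visible"
    return "diamond visible" if state.diamond_present else "empty room"


def transcript(state: State, actions: tuple) -> tuple:
    """The full sequence of observations produced by a sequence of actions."""
    return tuple(observe(state, a) for a in actions)


def distinguishable(s1: State, s2: State, max_len: int = 3) -> bool:
    """True if some action sequence of length <= max_len yields different transcripts.

    Observations here do not depend on history, so the bound is only
    there to keep the search finite.
    """
    for n in range(1, max_len + 1):
        for actions in product(ACTIONS, repeat=n):
            if transcript(s1, actions) != transcript(s2, actions):
                return True
    return False


safe = State(diamond_present=True, feed_spoofed=False)
stolen_and_spoofed = State(diamond_present=False, feed_spoofed=True)
stolen_in_plain_view = State(diamond_present=False, feed_spoofed=False)
```

In this sketch `diamond_in_room(safe)` and `diamond_in_room(stolen_and_spoofed)` differ, while `distinguishable(safe, stolen_and_spoofed)` is false: the state resolves the question, but the human's observations do not. By contrast, `distinguishable(safe, stolen_in_plain_view)` is true. Whether an observation that disables the spoofing should count as available is exactly the kind of modelling choice raised earlier in the thread; here it is simply left out of `ACTIONS`.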