Currently you probably have a very skeptical prior about what the surface of the farthest earth-sized planet from Earth in the Milky Way looks like. Yet you are very justified in being very confident it doesn’t look like this:
Why? Because this is a very small region in the space of possibilities for Earth-sized-planets-in-the-Milky-Way. And yes, it’s true that planets are NOT drawn randomly from that space of possibilities, and it’s true that this planet is in the reference class of “Earth-sized planets in the Milky Way” and the only other member of that reference class we’ve observed so far DOES look like that… But given our priors, those facts are basically irrelevant.
I think this is a decent metaphor for what was happening ten years ago or so with all these debates about orthogonality and instrumental convergence. People had a confused understanding of how minds and instrumental reasoning worked; then people like Yudkowsky and Bostrom became less confused by thinking about the space of possible minds and goals and whatnot, and convinced themselves and others that the situation is actually analogous to this planets example (though maybe less extreme): The burden of proof should be on whoever wants to claim that AI will be fine by default, not on whoever wants to claim it won’t be fine by default. I think they were right about this and still are right about this. Nevertheless I’m glad that we are moving away from this skeptical-priors, burden-of-proof stuff and towards more rigorous understandings. Just as I’d see it as progress if some geologists came along and said “Actually we have a pretty good idea now of how continents drift, and so we have some idea of what the probability distribution over map-images is like, and maps that look anything like this one have very low measure, even conditional on the planet being Earth-sized and in the Milky Way.” But I’d see it as “confirming more rigorously what we already knew, just in case, cos you never really know for sure” progress.
The burden of proof should be on whoever wants to claim that AI will be fine by default, not on whoever wants to claim it won’t be fine by default.
I’m happy to wrap up this conversation in general, but it’s worth noting before I do that I still strongly disagree with this comment. We’ve identified a couple of interesting facts about goals, like “unbounded large-scale final goals lead to convergent instrumental goals”, but we have nowhere near a good enough understanding of the space of goal-like behaviour to say that everything apart from a “very small region” will lead to disaster. This is circular reasoning: it goes from the premise that goals are by default unbounded and consequentialist to the conclusion that it’s very hard to get bounded or non-consequentialist goals. (It would be rendered non-circular by arguments about why coherence theorems about utility functions are so important, but there’s been a lot of criticism of those arguments and no responses so far.)
OK, interesting. I agree this is a double crux. For reasons I’ve explained above, it doesn’t seem like circular reasoning to me, it doesn’t seem like I’m assuming that goals are by default unbounded and consequentialist etc. But maybe I am. I haven’t thought about this as much as you have, my views on the topic have been crystallizing throughout this conversation, so I admit there’s a good chance I’m wrong and you are right. Perhaps I/we will return to it one day, but for now, thanks again and goodbye!