I think that (1) this is a good deconfusion post, (2) it was an important post for me to read, one that definitely made me conclude I had been confused in the past, and (3) it is the kind of post that, ideally, in some hypothetical and probably-impossible past world, would have prompted much more discussion and more worked-out cruxes, forestalling the degeneration of AI risk arguments into mutually incomprehensible camps with differing premises, which at this point is starting to look like a done deal.
On the object level: I currently think that, well, there are many, many, many ways for an entity to have its performance adjusted so that it does well by some measure. One conceivable place such a system could end up is with an outer-loop-fixed goal, per the description in the post. Rob (et al.) think that there is a gravitational attraction towards such an outer-loop-fixed goal, across an enormous variety of future architectures, such that a multiplicity of different systems will be pulled into such a goal, will develop long-term coherence towards (from our perspective) random goals, and so on.
I think this is almost certainly false, even for extremely powerful systems. To borrow a phrase, it seems equally well to be an argument that humans should be automatically strategic, which of course they are not. It is also part of the genre of arguments claiming that AI systems should act in particular ways regardless of their domain, training data, and training procedure, against which I think we should by now have extremely strong priors: for literally all AI systems, and I mean literally all, including MCTS-based self-play systems, the data from which the NNs learn is enormously important for what those NNs learn. More broadly, I currently think the gravitational attraction towards such an outer-loop-fixed goal will be absolutely tiny, if present at all, compared to the attraction towards more actively human-specified goals.
But again, that's just a short recap of one way to take what is going on in the post, and one that of course many people will not agree with. Overall, I think the post itself, Rob's reply, and nostalgebraist's reply to Rob's reply are all pretty good, at least as a summary of the kind of thing people say about this.