Re-reading this a couple days later, I think my §5.1 discussion didn’t quite capture the LLM-scales-to-AGI optimist position. I think there are actually 2½ major versions, and I only directly talked about one in the post. Let me try again:
Optimist type 1: They make the §5.1.1 argument, imagining that humans will remain in the loop, in a way that’s not substantially different from the present.
My pessimistic response: see §5.1.1 in the post.
Optimist type 2: They make the §5.1.3 argument that our methods are better than Ev’s, because we’re engineering the AI’s explicit desires in a more direct way. And the explicit desire is corrigibility / obedience. And then they also make the §5.1.2 argument that “AIs solving specific technical problems that the human wants them to solve” will not undermine those explicit motivations, despite the (1-3) loop running freely with minimal supervision, because the (1-3) loop will work in vaguely intuitive and predictable ways on the object-level technical question.
My pessimistic response has three parts: First, the idea that a full-fledged (1-3) loop will not undermine corrigibility / obedience is as yet untested and at least open to question (as I wrote in §5.1.2). Second, my expectation is that some training and/or algorithm change will happen between now and “AIs that really have the full (1-3) triad”, and that change may well make it less true that we are directly engineering the AI’s explicit desires in the first place—for example, see o1 is a bad idea. Third, what are the “specific technical problems that the human wants the AI to solve”??
…Optimist type 2A answers that last question by saying that the “specific technical problems that the human wants the AI to solve” are just whatever random things people want to happen in the world economy.
My pessimistic response: See discussion in §5.1.2 plus What does it take to defend the world against out-of-control AGIs? Also, competitive race-to-the-bottom dynamics are working against us, in that AIs with less intrinsic motivation to be obedient / corrigible will wind up making more money and controlling more resources.
…Optimist type 2B instead answers that last question by saying that the “specific technical problems that the human wants the AI to solve” are alignment research, or more specifically, “figuring out how to make AIs whose motivation is more like CEV or ambitious value learning rather than obedience”.
My pessimistic response: The discussion in §5.1.2 becomes relevant in a different way—I think there’s a chicken-and-egg problem where obeying humans does not yield enough philosophical / ethical taste to judge the quality of a proposal for “AI that has philosophical / ethical taste”. (Semi-related: The Case Against AI Control Research.)