I think an “external push” is extremely likely. It’ll just look like someone trying to get an AI to do something clever in the real world.
Take InstructGPT, currently one of OpenAI’s flagship language models. It was trained in two phases: first, purely to predict the next token of text; second, after it was really good at predicting the next token, it was further trained with reinforcement learning from human feedback.
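That two-phase recipe can be sketched with a toy stand-in. Everything here (the tiny corpus, the “reward model” that likes the word “sat”, the function names) is made up for illustration; it is not OpenAI’s actual setup, just the shape of the idea:

```python
import math
from collections import defaultdict

# Phase 1: "pretraining" -- fit a bigram next-token model by counting,
# which maximizes the likelihood of each next token given the previous one.
corpus = "the cat sat on the mat the cat ate".split()
counts = defaultdict(lambda: defaultdict(float))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1.0

def next_token_probs(prev):
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

# Phase 2: "RLHF" -- nudge the model toward continuations a human rater
# prefers. The reward function here is a stand-in for a learned reward model.
def reward(tok):
    return 1.0 if tok == "sat" else 0.0

def finetuned_probs(prev, beta=2.0):
    # Exponential reweighting by reward: a crude one-step analogue of
    # KL-regularized policy improvement against the pretrained model.
    probs = next_token_probs(prev)
    weights = {t: p * math.exp(beta * reward(t)) for t, p in probs.items()}
    z = sum(weights.values())
    return {t: w / z for t, w in weights.items()}

pretrained = next_token_probs("cat")   # mass split between "sat" and "ate"
finetuned = finetuned_probs("cat")     # mass shifted toward the rewarded token
```

The point of the sketch is just the two-stage structure: phase 1 only ever optimizes prediction, and it is phase 2 that injects a preference over outcomes.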
Reinforcement learning to try to satisfy human preferences is precisely the sort of “external push” that will incentivize an AI that previously did not have “wants” (i.e. that did not previously choose its actions based on their predicted impacts on the world) to develop wants (i.e. to pick actions based on their predicted impact on the world).
Why did OpenAI do such a thing, then? Well, because it’s useful! InstructGPT does a better job answering questions than regular ol’ GPT. The information from human feedback helped the AI do better at its real-world purpose, in ways that are tricky to specify by hand.
Now, if this makes sense, I think there’s a subset of people’s concerns about an “internal push” that make sense by analogy:
Consider an AI that you want to do a task that involves walking a robot through an obstacle course (e.g. mapping out a construction site and showing you the map). You’re trying to train this AI as a pure tool, without giving it “wants,” so you’re not giving it direct feedback on how good a map it shows you. Instead you’re doing something more expensive but safer: you’re training it to understand the whole distribution of human performance on this task, and then selecting a policy conditional on good performance.
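The “model the whole distribution, then condition on good performance” scheme can be sketched in miniature. The demonstrations, scores, and action names below are all invented for illustration, not a claim about any real system:

```python
from collections import defaultdict

# Hypothetical demonstration data: each human run through the course is an
# overall performance score in [0, 1] plus the list of actions taken.
demos = [
    (0.9, ["forward", "forward", "left", "forward"]),
    (0.8, ["forward", "left", "forward", "forward"]),
    (0.3, ["forward", "wait", "wait", "left"]),
    (0.2, ["wait", "forward", "wait", "wait"]),
]

# Step 1: model the WHOLE distribution of human behavior, conditioned on
# performance -- here just action frequencies per score bucket.
bucket_counts = defaultdict(lambda: defaultdict(int))
for score, actions in demos:
    bucket = "good" if score >= 0.5 else "bad"
    for a in actions:
        bucket_counts[bucket][a] += 1

def conditional_policy(bucket):
    """Action distribution conditional on a performance bucket."""
    total = sum(bucket_counts[bucket].values())
    return {a: c / total for a, c in bucket_counts[bucket].items()}

# Step 2: at deployment, condition on GOOD performance, rather than ever
# rewarding the model directly for map quality.
policy = conditional_policy("good")
best_action = max(policy, key=policy.get)
```

The model is only ever trained to predict what humans do at each skill level; the optimization pressure toward good outcomes enters purely through which slice of the distribution you sample from.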
The concern is that the AI will have a subroutine that “wants” the robot to navigate the obstacle course, even though you didn’t give an “outside push” to make that happen. Why? Well, it’s trying to predict good navigations of the obstacle course, and it models that as a process that picks actions based on their modeled impact on the real world, and in order to do that modeling, it actually runs the computations.
In other words, there’s an “internal push” (or maybe equivalently a “push from the data rather than from the human”) which leads to “wants” being computed inside the model of a task that is well-modeled by goal-based reasoning. This all works fine on-distribution, but off-distribution the AI generalizes like the agent it’s modeling, which might be bad.
you’re training it to understand the whole distribution of human performance on this task, and then selecting a policy conditional on good performance
Yeah, that makes sense to me.
it’s trying to predict good navigations of the obstacle course, and it models that as a process that picks actions based on their modeled impact on the real world, and in order to do that modeling, it actually runs the computations.
I can see why it would run a simulation of what would happen if a robot walked an obstacle course. I don’t see why it would actually walk the robot through it if not asked.
So, this is an argument about generalization properties. Which means it’s kind of the opposite of the thing you asked for :P
That is, it’s not about this AI doing its intended job even when you don’t turn it on. It’s about the AI doing something other than its intended job when you do turn it on.
That is… the claim is that you might put the AI in a new situation and have it behave badly (e.g. the robot punching through walls to complete the obstacle course faster, if you put it in a new environment where it’s able to punch through walls) in a way that looks like goal-directed behavior, even if you tried not to give it any goals, or were just trying to have it mimic humans.