For instance, if the model were at every turn searching over possible actions and choosing the one that would maximize this reward function. (Of course, no one knows how to do that in practice yet, but everyone’s on the same page about that.)
I’m interpreting you as saying “If we solved outer alignment & had a perfect reward function, it would be good if the model itself were optimizing for that reward function (ie inner alignment)”.
In which case, we are not on the same page (ie inner & outer alignment decompose one hard problem into two even harder ones).
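For concreteness, here’s a minimal sketch of what I take “the model itself optimizing for that reward function” to mean: at every turn, enumerate candidate actions and take whichever one the reward function scores highest. Everything here is a hypothetical stand-in (a toy, enumerable action space and a queryable reward model, ie exactly the parts no one knows how to get in practice):

```python
# Minimal sketch, assuming a toy setting: an enumerable action space, a
# predictable transition function, and a queryable (assumed-perfect) reward
# model. All names here are hypothetical stand-ins.

def pick_action(state, candidate_actions, transition, reward_model):
    """Search over possible actions and return the one whose predicted
    next state the reward model scores highest (argmax over actions)."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions(state):
        next_state = transition(state, action)   # predicted result of taking the action
        score = reward_model(next_state)         # the "perfect" reward function, if we had one
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```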
For the book, it’s interesting they went w/ the evolution argument. I still prefer the shard theory analogy of humans being misaligned w/ the reward system (ie I intentionally avoid taking fentanyl, even though that would be very rewarding/reinforcing), which can still end up in similar sharp left turns if the model eg pursues its reward directly (its analog of taking fentanyl) or pursues other unintended goals.
Evolution is still not believed by everyone in the US (with estimates of disbelief oddly ranging from 17% to 37%), which can be off-putting to some, & also you have to understand evolution to an extent. I assume most folks can sort of see that if you really optimized for evolution, you’d do a lot more than we currently do to pass on genes; however, optimizing for “evolution” is underconstrained & invites arguments like “well, we’re actually still doing quite well by evolution’s standards”.
Now instead let’s focus on optimizing for the human reward system. People believe that very addictive drugs exist & can see their effects. It’s pretty easy to imagine “a drug addict becomes extremely powerful, and you try to stop them. What goes wrong?”. It’s also quite coherent what optimizing your reward system looks like!
The evolution analogy is still good under the inner-outer alignment frame, since humans would be in evolution’s seat & it seems difficult for us to avoid the same issues. The human reward system analogy, by contrast, suggests an easier-seeming fix (eg just give the AI its equivalent of fentanyl). This can be worked around by discussing how hard it is to design a perfect reward function which doesn’t end up getting goodharted.
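To make the goodharting worry concrete, here’s a toy sketch (all numbers & functions are made up for illustration): a proxy reward that tracks the true objective over the normal range but keeps paying out as the optimizer pushes past the point where the true objective turns around.

```python
import numpy as np

# Toy Goodhart example: the proxy ("more is always better") correlates with
# the true objective under mild optimization but diverges when pushed hard.
def true_objective(x):
    return x - 0.1 * x**2   # peaks at x = 5, then declines

def proxy_reward(x):
    return x                # keeps rewarding more x forever

xs = np.linspace(0, 10, 101)
x_proxy = xs[np.argmax(proxy_reward(xs))]    # optimizing the proxy pushes x to 10
x_true  = xs[np.argmax(true_objective(xs))]  # the true optimum is x = 5

print(f"proxy-optimal x = {x_proxy:.1f} -> true value {true_objective(x_proxy):.2f}")
print(f"true-optimal  x = {x_true:.1f} -> true value {true_objective(x_true):.2f}")
```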