I agree with the two questions you’ve identified as the core issues, although I’d slightly rephrase the former. It’s hard to think about something being aligned indefinitely. But it seems like, if we have primarily used a given system for carrying out individual tasks, it would take quite a lot of misalignment for it to carry out a systematic plan to deceive us. So I’d rephrase the first option you mention as “feeling pretty confident that something that generalises from one week to one year won’t become misaligned enough to cause disasters”. This point seems more important than the second point (whether “be helpful” is a natural motivation for a mind to have), but I’ll discuss both.
I think the main disagreement about the former is over the relative strength of “results-based selection” versus “intentional design”. When I said above that “we design type 1 feedback so that resulting agents perform well on our true goals”, I was primarily talking about “design” as us reasoning about our agents and the training process they undergo, not the process of running them for a long time and picking the ones that do best. The latter is a very weak force! Almost all of the optimisation done by humans comes from intentional design plus rapid trial and error (on the timescale of days or weeks). Very little comes from long-term trial and error (on the timescale of a year), by necessity, because it’s just so slow.
So, conditional on our agents generalising from “one week” to “one year”, we should expect that it’s because we somehow designed a training procedure that produces scalable alignment (or at least scalable non-misalignment), or because they’re deceptively aligned (as in your influence-seeking agents scenario), but not because long-term trial and error was responsible for steering us towards getting what we can measure.
Then there’s the second question, of whether “do things that look to a human like you’re achieving X” is a plausible generalisation. My intuitions on this question are very fuzzy, so I wouldn’t be surprised if they’re wrong. But, tentatively, here’s one argument. Consider a policy which receives instructions from a human, talks to the human to clarify the concepts involved, then gets rewarded and updated based on how well it carries out those instructions. From the policy’s perspective, the thing it interacts with, and which its actions are based on, is human instructions. Indeed, for most of the training process the policy plausibly won’t even have the concept of “reward” (in the same way that humans didn’t evolve a concept of fitness). But it will have the concept of human intentions, which is a very good proxy for reward. And so it seems much more natural for the policy’s goals to be formulated in terms of human intentions and desires, which are the observable quantities it responds to, rather than reward, which is the unobservable quantity it is optimised with respect to. (Rewards can be passed as observations to the policy, but I claim that it’s both safer and more useful if rewards are unobservable by the policy during training.)
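To make that setup concrete, here’s a minimal sketch of the training loop I have in mind, in Python pseudocode (all the names here, like `human.rate` and `optimiser.update`, are hypothetical placeholders, not any real API). The only point it illustrates is the parenthetical above: the reward is the human’s after-the-fact judgement, and it reaches the policy only through the parameter update, never through its observations.

```python
# Minimal sketch (hypothetical names throughout) of a training setup where
# the policy observes only instructions and dialogue, never the reward.

from dataclasses import dataclass, field

@dataclass
class Trajectory:
    instruction: str                              # what the human asked for
    dialogue: list = field(default_factory=list)  # clarifying Q&A with the human
    actions: list = field(default_factory=list)   # what the policy then did

def run_episode(policy, human):
    """Everything the policy conditions on appears in this function;
    note that no reward shows up anywhere in it."""
    traj = Trajectory(instruction=human.give_instruction())
    question = policy.ask_clarifying_question(traj.instruction)
    traj.dialogue.append((question, human.answer(question)))
    traj.actions = policy.act(traj.instruction, traj.dialogue)
    return traj

def train_step(policy, human, optimiser):
    traj = run_episode(policy, human)
    # The reward is the human's judgement of how well the instruction was
    # carried out. It only ever touches the policy through the parameter
    # update below, not through the policy's observation stream.
    reward = human.rate(traj)
    optimiser.update(policy, traj, reward)
```

If the analogy to evolution holds, the natural abstraction for the policy to pick up under this separation is “what the human asked for”, since that’s the only reward-correlated quantity in its observations.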
This argument is weakened by the fact that, when there’s a conflict between the two (e.g. in cases where it’s possible to fool the humans), agents aiming to “look like you’re doing X” will receive more reward than agents genuinely trying to do X. But during most of training the agent won’t be very good at fooling humans, and so I am optimistic that its core motivations will still be more like “do what the human says” than “look like you’re doing what the human says”.