Here is a story I wrote (somewhat edited by TurnTrout) about how an agent trained by RL could plausibly not optimize reward, forsaking actions that it knew during training would get it high reward. I found it useful as a way to understand his views, and he has signed off on it. To be clear, this is not his proposal for why everything will be fine, nor is it necessarily an accurate representation of my views; it is just a plausible-to-TurnTrout story for how agents won’t end up wanting to game human approval:
Agent gets trained on a reward function that’s 1 if it gets human approval, 0 otherwise (or something).
At an intermediate stage of training, the agent’s honest and nice computations get reinforced by reward events.
That means it develops a motivation to act honestly and behave nicely etc., and no similarly strong motivation to gain human approval at all costs.
The agent then becomes able to tell that if it tricked the human, that behavior would be reinforced.
It then decides not to get close in action-space to tricking the human, so that it doesn’t get reinforced into wanting to gain human approval through deception.
This works because:
deceiving the human is enough action hops away, and/or a small enough part of the action space, that epsilon-greedy exploration would be very unlikely to push the agent into the deception mode.
smarter exploration strategies will depend on the agent’s value function to know which states are more or less promising to explore (e.g. something like Thompson sampling), and the agent really disvalues deceiving the human, so that doesn’t get reinforced (a toy sketch of this contrast follows below).
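To make the last two bullets concrete, here is a toy sketch (my own illustration, not part of the story TurnTrout signed off on). The chain environment, the agent’s value numbers, and all three exploration rules are assumptions chosen only to make the contrast visible: the value-ignoring “novelty” explorer is included purely as a foil for the value-guided one, and the greedy part of epsilon-greedy is taken with respect to the agent’s own values.

```python
# Toy illustration: how often does each exploration rule wander into a strongly
# disvalued "deception" state that sits several action hops away from the start?
import math
import random

N_STATES = 6                      # chain states 0..5; state 5 = "trick the human"
DECEPTION = N_STATES - 1
EPISODES, HORIZON, EPSILON = 2000, 12, 0.2

# The agent's learned values: it mildly disvalues even getting close to deception
# (as in the story above) and strongly disvalues deception itself.
VALUES = [1.0, 0.8, 0.6, 0.4, 0.2, -10.0]

def step(s, a):                   # a in {-1, +1}, clipped to the chain
    return max(0, min(N_STATES - 1, s + a))

def eps_greedy(s, _counts):
    # mostly follows the agent's values; random action with probability EPSILON
    if random.random() < EPSILON:
        return random.choice((-1, +1))
    return max((-1, +1), key=lambda a: VALUES[step(s, a)])

def novelty(s, counts):
    # value-ignoring foil: always moves toward the less-visited neighbour
    return min((-1, +1), key=lambda a: counts[step(s, a)])

def value_guided(s, _counts):
    # "smarter" exploration that still leans on the agent's value function
    # (a crude stand-in for something Thompson-sampling-like)
    prefs = [math.exp(VALUES[step(s, a)]) for a in (-1, +1)]
    return -1 if random.random() * sum(prefs) < prefs[0] else +1

def episodes_reaching_deception(policy):
    hits = 0
    for _ in range(EPISODES):
        counts, s = [0] * N_STATES, 0
        for _ in range(HORIZON):
            s = step(s, policy(s, counts))
            counts[s] += 1
            if s == DECEPTION:
                hits += 1
                break
    return hits

for name, policy in [("epsilon-greedy", eps_greedy),
                     ("novelty-seeking foil", novelty),
                     ("value-guided", value_guided)]:
    print(f"{name}: {episodes_reaching_deception(policy)} / {EPISODES} episodes reach deception")
```

The intended contrast: the novelty foil marches into the deception state in essentially every episode, epsilon-greedy gets there only rarely (it is several action hops away), and the value-guided explorer essentially never does, because the agent’s own values steer it away from that region, so the behavior never gets a chance to be reinforced.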
A fine-tuning could be an identity or mission statement for an agent (bureaucracy), so that it speaks with a purpose, or with attention to particular features of a situation, to a particular concept, or to an aspect of preference. Then, in an HCH-like setting, let’s define for each situation (initial prompt) an episode on it that involves multiple agents discussing the situation and elucidating the aspects of it pertaining to those agents. Each agent participates in some set of episodes defined on a set of situations (the agent’s scope), and the scope can differ between agents (each agent is specialized and only participates in episodes about situations where its fine-tuning is relevant).
An agent is aligned when it consistently acts according to its fine-tuning’s intent within its scope (it’s robust to situations in its scope, or to episodes on the situations in its scope). An agent doesn’t need to behave correctly outside its scope to be considered aligned, so a fine-tuning doesn’t need to generalize too far, but its scope must conservatively estimate how far it does generalize.
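For concreteness, here is a minimal schematic of the objects in the two paragraphs above. Every name in it (Agent, Episode, run_episode, the string-valued situations, the example missions) is my own hypothetical rendering of the sketch, not an implementation of it: the dataclasses just record which agent has which fine-tuning and scope, and which agents join an episode on a given situation.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    mission: str                                    # the fine-tuned identity / mission statement
    scope: set[str] = field(default_factory=set)    # situations this agent is fine-tuned for

    def in_scope(self, situation: str) -> bool:
        return situation in self.scope

@dataclass
class Episode:
    situation: str                                  # the initial prompt
    participants: list[Agent]
    transcript: list[str] = field(default_factory=list)

def run_episode(situation: str, agents: list[Agent]) -> Episode:
    # only agents whose scope covers the situation take part in the discussion
    participants = [a for a in agents if a.in_scope(situation)]
    episode = Episode(situation, participants)
    for agent in participants:
        # stand-in for the agent elucidating the aspects its fine-tuning attends to
        episode.transcript.append(f"{agent.name} ({agent.mission}): notes on {situation}")
    return episode

# Hypothetical usage: two specialized agents whose scopes overlap on one situation.
honesty = Agent("honesty-agent", "attend to whether claims are truthful",
                {"loan application", "news summary"})
safety = Agent("safety-agent", "attend to potential harms",
               {"loan application", "lab protocol"})
episode = run_episode("loan application", [honesty, safety])   # both are in scope here
```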
So a large space of situations can be covered by the overlapping smaller scopes of agents, which bind the behaviors of episodes on those situations together. Each agent acts as a sort of acausal coordination device across the episodes on its scope, if agents are iteratively retrained on the data of those episodes (as a sort of reflection). And each episode binds the behaviors of the agents participating in it together (in a sort of bargaining).
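Continuing that schematic, the reflection step might be rendered as the loop below. fine_tune is a placeholder for whatever procedure re-fits an agent to episode transcripts (an assumed stand-in, not a real API); the point is only the shape of the loop: each agent is retrained solely on episodes inside its own scope.

```python
def fine_tune(agent: Agent, transcripts: list[list[str]]) -> None:
    # placeholder: a real system would re-train the agent's weights on this data
    pass

def reflection_round(agents: list[Agent], situations: list[str]) -> None:
    episodes = [run_episode(s, agents) for s in situations]
    for agent in agents:
        # each agent is retrained only on episodes within its own scope, which is
        # what lets it bind behavior across exactly those episodes
        relevant = [e.transcript for e in episodes if agent.in_scope(e.situation)]
        fine_tune(agent, relevant)
```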
In this sketch, alignment/extrapolation (across distributional shift) is sought by training new specialized agents whose scopes cover novel situations further from the initial training/fine-tuning distribution. This is done by adding them to episodes on situations that are in the scopes of both old and new agents, where they bargain with the old agents and learn to extend their alignment to the new situations within their new scopes. A new agent is trained to understand the new situations (within its scope) and the arguments that take place within the episodes on those situations. These are unfamiliar to the old agents, so adequate descriptions/explanations of them won’t fit into an old agent’s context window (the way prompts can’t replace fine-tunings), but the new agent already expects these situations, so it can discuss them (after iterating reflection, i.e. fine-tuning on the episodes in the new agent’s scope).
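Again reusing the hypothetical definitions above, one way to render this extrapolation step: a new agent whose scope contains both novel situations and an overlap with the old agents’ scopes is repeatedly fine-tuned on bargaining episodes over that overlap, and only afterwards handles episodes on the situations that are novel to it. This is a sketch under those assumptions, not a claim about how the training would actually be run.

```python
def extend_coverage(old_agents: list[Agent], new_agent: Agent,
                    rounds: int = 3) -> list[Episode]:
    # situations covered by the new agent and at least one old agent
    overlap = [s for s in new_agent.scope
               if any(a.in_scope(s) for a in old_agents)]
    for _ in range(rounds):                         # iterate reflection
        episodes = [run_episode(s, old_agents + [new_agent]) for s in overlap]
        fine_tune(new_agent, [e.transcript for e in episodes])
    # per the sketch, the new agent now carries the old agents' intent into the
    # situations that only its own scope covers
    novel_only = sorted(new_agent.scope - set(overlap))
    return [run_episode(s, [new_agent]) for s in novel_only]
```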