How many alignment techniques presuppose an AI being motivated by the training signal (e.g., AI Safety via Debate)?
It would be good to get a definitive response from @Geoffrey Irving or @paulfchristiano, but I don’t think AI Safety via Debate presupposes an AI being motivated by the training signal. Looking at the paper again, there is some theoretical work that assumes “each agent maximizes their probability of winning,” but I think the idea is sufficiently well-motivated (at least as a research approach) even if you took that section out and simply viewed Debate as a way to do RL training on an AI that is superhumanly capable (and hence hard or unsafe to do straight RLHF on).
BTW what is your overall view on “scalable alignment” techniques such as Debate and IDA? (I guess I’m getting the vibe from this quote that you don’t like them, and want to get clarification so I don’t mislead myself.)
I certainly do think that debate is motivated by modeling agents as being optimized to increase their reward, and debate is an attempt at writing down a less hackable reward function. But I also think RL can be sensibly described as trying to increase reward, and I generally don’t understand the section of the document that says it obviously is not doing that. And if the RL algorithm is trying to increase reward, and there is a meta-learning phenomenon that causes agents to learn algorithms, then the agents will be trying to increase reward.
Reading through the section again, it seems like the claim is that my first sentence, “debate is motivated by agents being optimized to increase reward,” is categorically different from “debate is motivated by agents being themselves motivated to increase reward.” But these two cases seem separated only by a capability gap to me: sufficiently strong agents will be stronger if they record algorithms that adapt to increase reward in different cases.
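For concreteness, here is a minimal sketch of the setup both comments are describing: debate as RL training in which two copies of a model argue and the only training signal is a judge’s verdict, i.e. a reward function written down via debate. The two-player zero-sum structure follows the original paper, but the function and method names below are illustrative stand-ins rather than anything from it.

```python
# Minimal sketch: debate as a (hopefully less hackable) reward function.
# Two debaters argue in alternation; a judge (a human, or a model of one)
# reads the transcript and picks a winner. The verdict is the only reward.
# `make_argument` and `pick_winner` are illustrative names, not from the paper.

def debate_episode(question, debater_a, debater_b, judge, num_rounds=4):
    transcript = [question]
    for _ in range(num_rounds):
        transcript.append(debater_a.make_argument(transcript))
        transcript.append(debater_b.make_argument(transcript))
    winner = judge.pick_winner(transcript)   # returns "A" or "B"
    reward_a = 1.0 if winner == "A" else -1.0
    return reward_a, -reward_a               # zero-sum: the two rewards sum to zero
```

Whether debaters trained this way end up themselves motivated by that reward, rather than merely selected by it, is exactly the distinction at issue in this thread.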
The post defending the claim is Reward is not the optimization target. Iirc, TurnTrout has described it as one of his most important posts on LW.
Some context from Paul Christiano’s work on RLHF and a later reflection on it:
Christiano et al.: Deep Reinforcement Learning from Human Preferences
In traditional reinforcement learning, the environment would also supply a reward [...] and the agent’s goal would be to maximize the discounted sum of rewards. Instead of assuming that the environment produces a reward signal, we assume that there is a human overseer who can express preferences between trajectory segments. [...] Informally, the goal of the agent is to produce trajectories which are preferred by the human, while making as few queries as possible to the human. [...] After using r to compute rewards, we are left with a traditional reinforcement learning problem.
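As a rough illustration of the reward-learning step in that excerpt, here is a sketch of fitting a reward estimate to human comparisons with the Bradley-Terry-style preference model the paper uses; the network architecture, feature representation, and tensor shapes are arbitrary choices for the example, not the paper’s.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps per-timestep (observation, action) features to a scalar reward estimate."""
    def __init__(self, feature_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, segment):            # segment: (timesteps, feature_dim)
        return self.net(segment).sum()     # total predicted reward over the segment

def preference_loss(reward_model, seg_1, seg_2, human_prefers_1):
    """Cross-entropy on P[seg_1 preferred] = exp(r_1) / (exp(r_1) + exp(r_2))."""
    r_1, r_2 = reward_model(seg_1), reward_model(seg_2)
    logits = torch.stack([r_1, r_2]).unsqueeze(0)          # shape (1, 2)
    target = torch.tensor([0 if human_prefers_1 else 1])   # index of the preferred segment
    return F.cross_entropy(logits, target)
```

Once the reward model is fit, its output is used as the reward signal for an otherwise standard RL algorithm, which is the sense in which the excerpt says we are “left with a traditional reinforcement learning problem.”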
Christiano: Thoughts on the impact of RLHF research
The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model’s actions based on how much we expect to like their consequences, and then training the models to produce highly-evaluated actions. [...] Simple versions of this approach are expected to run into difficulties, and potentially to be totally unworkable, because:
Evaluating consequences is hard.
A treacherous turn can cause trouble too quickly to detect or correct even if you are able to do so, and it’s challenging to evaluate treacherous turn probability at training time.
[...] I don’t think that improving or studying RLHF is automatically “alignment” or necessarily net positive.
Edit: Another relevant section in an interview of Paul Christiano by Dwarkesh Patel:
Paul Christiano—Preventing an AI Takeover
but I don’t think AI Safety via Debate presupposes an AI being motivated by the training signal
This seems right to me.
I often imagine debate (and similar techniques) being applied in the (low-stakes/average-case/non-concentrated) control setting. The control setting is the case where you are maximally conservative about the AI’s motivations and then try to demonstrate safety by making the AI incapable of causing catastrophic harm.
If you make pessimistic assumptions about AI motivations like this, then you have to worry about concerns like exploration hacking (or even gradient hacking), but it’s still plausible that debate adds considerable value regardless.
We could also, less conservatively, assume that AIs might be misaligned (including seriously misaligned, with problematic long-range goals), but won’t necessarily prefer colluding with other AIs over working with humans (e.g. because humanity offers payment for labor). In this case, techniques like debate seem quite applicable, and the situation could be very dangerous in the absence of good enough approaches (at least if AIs are quite superhuman).
More generally, debate could be applicable to any type of misalignment that you think might cause problems across a large number of independently assessable actions.