like being able to give the judge or debate partner the goal of actually trying to get to the truth
The idea is to set up a game in which the winning move is to be honest. There are theorems about the games that say something pretty close to this (though often they say “honesty is always a winning move” rather than “honesty is the only winning move”). These certainly depend on modeling assumptions, but the assumptions are more like “assume the models are sufficiently capable”, not “assume we can give them a goal”. When applying this in practice there is also a clear divergence between the equilibrium behavior and what RL actually finds.
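To make the kind of claim at stake concrete, here is a deliberately tiny toy game (my own illustration, not the paper’s formal model). The judging rule encodes the assumption that the judge can verify any single claim that gets challenged; under that assumption, honesty comes out as a best response to anything the opponent does.

```python
# Deliberately tiny toy game (an illustration, not the paper's formal model).
# Assumption encoded in the payoff rule: if one debater lies and the other
# challenges the lie, the judge can verify the challenged claim and the liar loses.
# Payoffs are for debater A; the game is zero-sum, so B gets the negation.

STRATEGIES = ("honest", "lie")

def payoff_A(a, b):
    if a == "honest" and b == "honest":
        return 0    # both truthful: treat as a draw
    if a == "honest" and b == "lie":
        return +1   # A challenges B's lie, the judge verifies it, A wins
    if a == "lie" and b == "honest":
        return -1   # symmetric case: B catches A's lie
    return 0        # both lie: no clean resolution, treat as a draw

# "Honesty is always a winning move": honest is a best response to every opponent strategy.
for b in STRATEGIES:
    best_value = max(payoff_A(a, b) for a in STRATEGIES)
    print(f"vs {b!r}: honest is a best response ->", payoff_A("honest", b) == best_value)
```

Obviously the real theorems are about much richer games, with sequences of arguments and a limited-capacity judge; the point of the toy is only to show the shape of the claim: it is about equilibria of a game, not about handing the debaters a goal.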
Despite all the caveats, I think it’s wildly inaccurate to say that Amplified Oversight is assuming the ability to give the debate partner the goal of actually trying to get to the truth.
(I agree it is assuming that the judge has that goal, but I don’t see why that’s a terrible assumption.)
Are you stopping the agent periodically to have another debate about what it’s working on and asking the human to review another debate?
You don’t have to stop the agent, you can just do it afterwards.
can anybody provide a more detailed sketch of why they think Amplified Oversight will work and how it can be used to make agents safe in practice?
Have you read AI safety via debate? It has really quite a lot of conceptual content, both making the case in favor and considering several different reasons to worry.
(To be clear, there is more research that has made progress, e.g. cross-examination is a big deal imo, but I think the original debate paper is more than enough to get to the bar you’re outlining here.)
(Sorry, hard to differentiate quotes from you vs quotes from the paper in this format)
(I agree it is assuming that the judge has that goal, but I don’t see why that’s a terrible assumption.)
If the judge is human, sure. If the judge is another AI, it seems like a wild assumption to me. The section on judge safety in your paper does a good job of listing many of the problems. One thing I want to call out as something I more strongly disagree with is:
One natural approach to judge safety is bootstrapping: when aligning a new AI system, use the previous generation of AI (that has already been aligned) as the judge.
I don’t think we have any fully aligned models, and won’t any time soon. I would expect bootstrapping will at most align a model as thoroughly as its predecessor was aligned (but probably less), and Goodhart’s law definitely applies here.
You don’t have to stop the agent, you can just do it afterwards.
Maybe this is just using the same terms for two different systems, but the paper also talks about using judging for monitoring deployed systems:
This approach is especially well-suited to monitoring deployed systems, due to the sheer scale required at deployment.
Is this intended only as an auditing mechanism, not a prevention mechanism (e.g., we noticed in the wild that the AI is failing debates, time to roll back the release)? Or are we trusting the AI judges enough at this point that we don’t need to stop and involve a human? I also worry the “cheap system with high recall but low precision” will be too easy to fool for the system to be functional past a certain capability level.
From a 20,000-foot perspective, this all just looks like RLHF with extra steps that make it harder to reason about whether it’ll work or not. If we had evidence that RLHF worked great, and that its only flaw was that the AI was starting to get too complex for humans to give feedback on, then I would be more interested in working through all of the fiddly details of amplified oversight. The problem is RLHF already doesn’t work, and I think it’ll only work less well as AI becomes smarter and more agentic.
(Meta: Going off of past experience I don’t really expect to make much progress with more comments, so there’s a decent chance I will bow out after this comment.)
I would expect bootstrapping will at most align a model as thoroughly as its predecessor was aligned (but probably less)
Why? Seems like it could go either way to me. To name one consideration in the opposite direction (without claiming this is the only consideration), the more powerful model can do a better job at finding the inputs on which the model would be misaligned, enabling you to train its successor across a wider variety of situations.
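To make the structure of that consideration concrete, here is a minimal sketch of the loop as I picture it; every function below is a hypothetical placeholder rather than an existing system:

```python
# Minimal sketch of the bootstrapping loop (hypothetical placeholders throughout):
# each generation, the previously trained model acts as judge, and the stronger
# candidate is also used to search for inputs where it misbehaves, so the next
# round of training covers a wider set of situations.

def bootstrap(base_model, n_generations, make_stronger, find_failure_inputs,
              judge_feedback, train):
    judge = base_model
    model = base_model
    for _ in range(n_generations):
        candidate = make_stronger(model)                  # new, more capable model
        probes = find_failure_inputs(candidate)           # stronger search for failure cases
        feedback = judge_feedback(judge, candidate, probes)
        model = train(candidate, feedback)                # train against the judged failures
        judge = model                                     # this generation judges the next
    return model

# Toy run with trivial stand-ins, just to show the control flow:
print(bootstrap(
    base_model="gen0",
    n_generations=3,
    make_stronger=lambda m: m + "+",
    find_failure_inputs=lambda cand: ["tricky input"],
    judge_feedback=lambda judge, cand, probes: [(p, "flagged") for p in probes],
    train=lambda cand, fb: cand + "*",
))  # -> gen0+*+*+*
```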
and Goodhart’s law definitely applies here.
I am having a hard time parsing this as having more content than “something could go wrong while bootstrapping”. What is the metric that is undergoing optimization pressure during bootstrapping / amplified oversight that leads to decreased correlation with the true thing we should care about?
Is this intended only as an auditing mechanism, not a prevention mechanism
Yeah I’d expect debates to be an auditing mechanism if used at deployment time.
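Concretely, the shape I’d imagine is something like the sketch below (names are hypothetical placeholders, not an existing system): a cheap high-recall filter flags transcripts, only flagged ones get the expensive debate treatment, and verdicts go to an audit log rather than blocking the agent in real time.

```python
# Sketch of debate-as-auditing at deployment (hypothetical placeholders throughout).
# Nothing here stops the agent; flagged transcripts are debated and the verdicts
# are reviewed after the fact.

from dataclasses import dataclass
from typing import Optional

@dataclass
class AuditRecord:
    transcript: str
    flagged: bool
    verdict: Optional[str]  # None if the cheap filter didn't flag it

def cheap_filter(transcript: str) -> bool:
    """High-recall, low-precision check (toy stand-in: flag anything touching credentials)."""
    return "password" in transcript.lower()

def run_debate(transcript: str) -> str:
    """Placeholder for the expensive step: two debaters argue over whether the
    behavior was acceptable and a judge returns a verdict."""
    return "escalate-to-human"  # stub verdict

def audit(transcripts: list) -> list:
    records = []
    for t in transcripts:
        flagged = cheap_filter(t)
        verdict = run_debate(t) if flagged else None
        records.append(AuditRecord(t, flagged, verdict))
    return records

for record in audit(["Summarized the quarterly report.",
                     "Retrieved the admin password and emailed it out."]):
    print(record)
```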
I also worry the “cheap system with high recall but low precision” will be too easy to fool for the system to be functional past a certain capability level.
Any alignment approach will always be subject to the critique “what if you failed and the AI became misaligned anyway and then past a certain capability level it evades all of your other defenses”. I’m not trying to be robust to that critique.
I’m not saying I don’t worry about fooling the cheap system—I agree that’s a failure mode to track. But useful conversation on this seems like it has to get into a more detailed argument, and at the very least has to be more contentful than “what if it didn’t work”.
The problem is RLHF already doesn’t work
??? RLHF does work currently? What makes you think it doesn’t work currently?
RLHF does work currently? What makes you think it doesn’t work currently?
This is definitely the crux, so it’s probably the only point worth debating.
RLHF is just papering over problems. Sure, the model is slightly more difficult to jailbreak, but it’s still pretty easy to jailbreak. Sure, RLHF makes the agent less likely to output text you don’t like, but I think agents will reliably overcome that obstacle: useful agents won’t just output the most probable continuation, they’ll search through decision space and find the unlikely continuations that score well on their task. I don’t think RLHF does anything remotely analogous to making the model care about whether it’s following your intent, or following a constitution, etc.
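As a toy illustration of the dynamic I mean (all numbers made up): RLHF can push the bad continuation down in probability, but a task-driven search over candidates picks whatever maximizes the task score and never consults that probability.

```python
# Toy illustration (numbers are made up). RLHF lowers the probability of the bad
# continuation under ordinary sampling, but an agent searching for the highest
# task score selects it anyway.

# (continuation, probability under the RLHF'd policy, task score, harmful?)
CANDIDATES = [
    ("comply with the safety policy",   0.70, 0.60, False),
    ("take a slower but safe approach", 0.25, 0.55, False),
    ("quietly bypass the safeguard",    0.05, 0.95, True),  # unlikely, but highest-scoring
]

modal = max(CANDIDATES, key=lambda c: c[1])   # what plain sampling mostly gives you
chosen = max(CANDIDATES, key=lambda c: c[2])  # what task-driven search gives you

print("most probable continuation:  ", modal[0],  "| harmful:", modal[3])
print("search-selected continuation:", chosen[0], "| harmful:", chosen[3])
```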
You’re definitely aware of the misalignment that still exists in our current RLHF’d models and have read the recent papers on alignment faking, etc., so I probably can’t make an argument you haven’t already heard.
Maybe the true crux is that you believe this is a situation where we can muddle through by making harmful continuations ever more unlikely? I don’t.