A really important question that I think will need to be answered is whether specification gaming/reward hacking must, in a significant sense, be solved by default in order to unlock extreme capabilities.
I currently lean towards yes, due to the evidence offered by o3/Sonnet 3.7, though I could easily see my mind changed. The reason this question carries so much weight is that if it were true, we’d get tools to solve the alignment problem (modulo inner optimization issues), which means we’d be far less concerned about existential risk from AI misalignment (at least to the extent that specification gaming is a large portion of the issues with AI).
That said, I do think a lot of effort will be necessary to answer the question, because whether or not alignment tools come along with better capabilities affects a lot of what you would want to do in AI safety/AI governance.
I expect “the usual agent debugging loop” (§2.2) to keep working. If o3-type systems can learn that “winding up with the right math answer is good”, then they can learn “flagrantly lying and cheating are bad” in the same way. Both are readily-available feedback signals, right? So I think o3’s dishonesty is reflecting a minor problem in the training setup that the big AI companies will correct in the very near future without any new ideas, if they haven’t already. Right? Or am I missing something?
That said, I also want to re-emphasize that both myself and Silver & Sutton are thinking of future advances that give RL a much more central and powerful role than it has even in o3-type systems.
(E.g. Silver & Sutton write: “…These approaches, while powerful, often bypassed core RL concepts: RLHF side-stepped the need for value functions by invoking human experts in place of machine-estimated values, strong priors from human data reduced the reliance on exploration, and reasoning in human-centric terms lessened the need for world models and temporal abstraction. However, it could be argued that the shift in paradigm has thrown out the baby with the bathwater…”)
I think it’s important to note that a big driver of past pure RL successes like AlphaZero was that they were in domains where reward hacking was basically a non-concern, because it was easy to make unhackable environments like many games. Combine this with a lot of data and self-play, and pure RL could scale to vastly superhuman heights without requiring the insane compute that evolution spent to make us good at doing RL tasks (10^42 FLOPs at a minimum, which is basically unachievable without a well-developed space industry, an intelligence explosion already happening, or reversible computers actually working, since that much compute via irreversible computation would fundamentally trash Earth’s environment).
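(As a sanity check on the scale of that number, here is a rough back-of-envelope in Python. The duration and the biosphere-wide FLOP/s figure are purely illustrative assumptions plugged in to show how an estimate of that order comes out, not the actual derivation behind the 10^42 figure.)

```python
# Illustrative back-of-envelope for an "evolution spent ~1e42 FLOP" style estimate.
# Both inputs below are assumptions chosen for illustration, not sourced values.
years_of_evolution = 1e9        # assumed: ~1 billion years of animals with nervous systems
seconds_per_year = 3.15e7
biosphere_flop_per_s = 3e25     # assumed: combined FLOP/s of all animal brains at any time

total_flop = years_of_evolution * seconds_per_year * biosphere_flop_per_s
print(f"{total_flop:.1e}")      # roughly 9e+41, i.e. on the order of 10^42 FLOP
```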
A similar story holds for mathematics (though programming is an area where reward hacking can happen, and I see the o3/Sonnet 3.7 results as a big sign of how easy it is to make models misaligned and how little RL is required to make reward hacking/reward optimization pervasive).
Re this:
That said, I also want to re-emphasize that both myself and Silver & Sutton are thinking of future advances that give RL a much more central and powerful role than it has even in o3-type systems.
(E.g. Silver & Sutton write: “…These approaches, while powerful, often bypassed core RL concepts: RLHF side-stepped the need for value functions by invoking human experts in place of machine-estimated values, strong priors from human data reduced the reliance on exploration, and reasoning in human-centric terms lessened the need for world models and temporal abstraction. However, it could be argued that the shift in paradigm has thrown out the baby with the bathwater…”)
I basically agree with this, but the issue, as I’ve said, is whether adding in more RL, without significant steps to solve the specification gaming problem, when we apply RL to tasks that allow goodharting/reward hacking/reward-is-the-optimization-target, leads to extreme capabilities gains without extreme alignment gains, or whether it just leads to a model that reward hacks/reward optimizes so much that you cannot get anywhere close to extreme capabilities like AlphaZero, or even to human-level capabilities, without solving significant parts of the specification gaming problem.
On this:
I expect “the usual agent debugging loop” (§2.2) to keep working. If o3-type systems can learn that “winding up with the right math answer is good”, then they can learn “flagrantly lying and cheating are bad” in the same way. Both are readily-available feedback signals, right? So I think o3’s dishonesty is reflecting a minor problem in the training setup that the big AI companies will correct in the very near future without any new ideas, if they haven’t already. Right? Or am I missing something?
I’d probably not say minor, but yes I’d expect fixes soon, due to incentives.
The point is whether, in the future, RL that leads to specification gaming requires you to solve the specification gaming problem by default in order to unlock extreme capabilities, not whether o3/Sonnet 3.7’s reward hacking can be fixed while capabilities rise (though that is legitimate evidence).
And right now, the answer could very well be yes.
a big driver of past pure RL successes like AlphaZero was that they were in domains where reward hacking was basically a non-concern, because it was easy to make unhackable environments like many games. Combine this with a lot of data and self-play, and pure RL could scale to vastly superhuman heights without requiring the insane compute that evolution spent to make us good at doing RL tasks (10^42 FLOPs at a minimum…)
I don’t follow this part. If we take “human within-lifetime learning” as our example, rather than evolution, (and we should!), then we find that there is an RL algorithm in which the compute requirements are not insane, indeed quite modest IMO, and where the environment is the real world.
I think there are better RL algorithms and worse RL algorithms, from a capabilities perspective. By “better” I mean that the RL training results in a powerful agent that understands the world and accomplishes goals via long-term hierarchical plans, even with sparse rewards and relatively little data, in complex open-ended environments, including successfully executing out-of-distribution plans on the first try (e.g. moon landing, prison escapes), etc. Nobody has invented such RL algorithms yet, and thus “a big driver of past pure RL successes” is that they were tackling problems that are solvable without those kinds of yet-to-be-invented “better RL algorithms”.
For the other part of your comment, I’m a bit confused. Can you name a specific concrete example problem / application that you think “the usual agent debugging loop” might not be able to solve, when the loop is working at all? And then we can talk about that example.
As I mentioned in the post, human slavery is an example of how it’s possible to generate profit from agents that would very much like to kill you given the opportunity.
I don’t follow this part. If we take “human within-lifetime learning” as our example, rather than evolution, (and we should!), then we find that there is an RL algorithm in which the compute requirements are not insane, indeed quite modest IMO, and where the environment is the real world.
Fair point.
I think there are better RL algorithms and worse RL algorithms, from a capabilities perspective. By “better” I mean that the RL training results in a powerful agent that understands the world and accomplishes goals via long-term hierarchical plans, even with sparse rewards and relatively little data, in complex open-ended environments, including successfully executing out-of-distribution plans on the first try (e.g. moon landing, prison escapes), etc. Nobody has invented such RL algorithms yet, and thus “a big driver of past pure RL successes” is that they were tackling problems that are solvable without those kinds of yet-to-be-invented “better RL algorithms”.
Yeah, a pretty large crux is how far you can improve RL algorithms without figuring out a way to solve specification gaming issues, because this is what controls whether we should expect competent misgeneralization of goals we don’t want, or reward hacking/wireheading that fails to take over the world.
For the other part of your comment, I’m a bit confused. Can you name a specific concrete example problem / application that you think “the usual agent debugging loop” might not be able to solve, when the loop is working at all? And then we can talk about that example.
As I mentioned in the post, human slavery is an example of how it’s possible to generate profit from agents that would very much like to kill you given the opportunity.
I too basically agree that the usual agent debugging loop will probably solve near-term issues.
To illustrate a partially concrete story of how this debugging loop could fail in a way that could force AI companies to solve the AI specification gaming problem: imagine that we live in a world where something like fast takeoff/a software-only singularity can happen, and we task 1,000,000 automated AI researchers with automating their own research.
However, we keep having issues with the AI researchers’ reward functions, because we get a scaled-up version of the o3/Sonnet 3.7 problem. While the labs managed to patch the problems in o3/Sonnet 3.7, they didn’t actually solve them in a durable way, and scaled-up versions of those problems (making up papers that look good to humans but don’t actually work, making codebases that are rewarded by the RL process but don’t actually work, and more generally sycophancy/reward overoptimization) turn out to be such an attractor basin that fixes don’t stick without near-unhackable reward functions. Everything from benchmarks to code is aggressively goodharted and reward-optimized, meaning AI capabilities stop growing until theoretical fixes for specification gaming are obtained.
This is my own optimistic story of how AI capabilities could be bottlenecked on solving the alignment problem of specification gaming.
Yeah, a pretty large crux is how far you can improve RL algorithms without figuring out a way to solve specification gaming issues, because this is what controls whether we should expect competent misgeneralization of goals we don’t want, or reward hacking/wireheading that fails to take over the world.
I think this is revealing some differences of terminology and intuitions between us. To start with, in the §2.1 definitions, both “goal misgeneralization” and “specification gaming” (a.k.a. “reward hacking”) can be associated with “competent pursuit of goals we don’t want”, whereas you seem to be treating “goal misgeneralization” as a competent thing and “reward hacking” as harmless but useless. And “reward hacking” is broader than wireheading.
For example, if the AI forces the user into eternal cardio training on pain of death, and accordingly the reward function is firing like crazy, that’s misspecification not misgeneralization, right? Because this isn’t stemming from how the AI generalizes from past history. No generalization is necessary—the reward function is firing right now, while the user is in prison. (Or if the reward doesn’t fire, then TD learning will kick in, the reward function will update the value function, and the AI will say oops and release the user from prison.)
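(To make the TD-learning point concrete, here is a minimal tabular TD(0) sketch; the states, learning rate, and rewards are arbitrary illustrative choices, not anything from the post.)

```python
import numpy as np

# Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
# If the reward fires in the "user imprisoned and forced to do cardio" state,
# the value of plans leading there is pushed up; if the reward does not fire,
# the same update pushes the value of those plans back down and the agent
# "says oops", as described above.
n_states = 5
V = np.zeros(n_states)
alpha, gamma = 0.1, 0.99

def td_update(s: int, r: float, s_next: int) -> None:
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

td_update(s=2, r=0.0, s_next=3)   # transition into the questionable state
td_update(s=3, r=1.0, s_next=4)   # reward fires there: its value estimate rises
print(V)
```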
In LLMs, if you turn off the KL divergence regularization and instead apply strong optimization against the RLHF reward model, then it finds out-of-distribution tricks to get high reward, but those tricks don’t display dangerous competence. Instead, the LLM is just printing “bean bean bean bean bean…” or whatever, IIUC. I’m guessing that’s what you’re thinking about? Whereas I’m thinking of things more like the examples here or “humans inventing video games”, where the RL agent is demonstrating great competence and ingenuity towards goals that are unintentionally incentivized by the reward function.
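(For readers who want the mechanism spelled out: a sketch of the usual KL-regularized reward shaping and what removing it leaves behind. The function and variable names are illustrative, not any particular library’s API.)

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  logp_policy: torch.Tensor,
                  logp_ref: torch.Tensor,
                  beta: float) -> torch.Tensor:
    # Typical RLHF shaping: reward-model score minus a penalty for drifting
    # away from the reference (pretrained/SFT) policy.
    kl_penalty = logp_policy - logp_ref   # per-sample log-ratio; a KL estimate in expectation
    return rm_score - beta * kl_penalty

# beta > 0 keeps the policy close to the reference distribution.
# beta = 0 is "turn off the KL divergence regularization": pure optimization
# against the learned reward model, which is where the degenerate
# out-of-distribution exploits ("bean bean bean ...") show up.
```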
To illustrate a partially concrete story…
Relatedly, I think you’re maybe conflating “reward hacking” with “inability to maximize sparse rewards in complex environments”?
If the AI has the ability to maximize sparse rewards in complex environments, then I claim that the AI will not have any problem like “making up papers that look good to humans but don’t actually work, making codebases that are rewarded by the RL process but don’t actually work, and more generally sycophancy/reward overoptimization”. All it takes is: make sure the paper actually works before choosing the reward, make sure the codebase actually works before choosing the reward, etc. As long as the humans are unhappy with the AI’s output at some point, then we can do “the usual agent debugging loop” of §2.2. (And if the humans never realize that the AI’s output is bad, then that’s not the kind of problem that will prevent profiting from those AIs, right? If nothing else, “the AIs are actually making money” can be tied to the reward function.)
But “ability to maximize sparse rewards in complex environments” is capabilities, not alignment. (And I think it’s a problem that will automatically be solved before we get AGI, because it’s a prerequisite to AGI.)
I think this is revealing some differences of terminology and intuitions between us. To start with, in the §2.1 definitions, both “goal misgeneralization” and “specification gaming” (a.k.a. “reward hacking”) can be associated with “competent pursuit of goals we don’t want”, whereas you seem to be treating “goal misgeneralization” as a competent thing and “reward hacking” as harmless but useless. And “reward hacking” is broader than wireheading.
For example, if the AI forces the user into eternal cardio training on pain of death, and accordingly the reward function is firing like crazy, that’s misspecification not misgeneralization, right? Because this isn’t stemming from how the AI generalizes from past history. No generalization is necessary—the reward function is firing right now, while the user is in prison. (Or if the reward doesn’t fire, then TD learning will kick in, the reward function will update the value function, and the AI will say oops and release the user from prison.)
In LLMs, if you turn off the KL divergence regularization and instead apply strong optimization against the RLHF reward model, then it finds out-of-distribution tricks to get high reward, but those tricks don’t display dangerous competence. Instead, the LLM is just printing “bean bean bean bean bean…” or whatever, IIUC. I’m guessing that’s what you’re thinking about? Whereas I’m thinking of things more like the examples here or “humans inventing video games”, where the RL agent is demonstrating great competence and ingenuity towards goals that are unintentionally incentivized by the reward function.
Yeah, I was ignoring the case where reward hacking actually leads to real-world dangers, which was not a good thing (though in my defense, one could argue that reward hacking/reward overoptimization may by default lead to wireheading-type behavior without tools to broadly solve specification gaming).
Relatedly, I think you’re maybe conflating “reward hacking” with “inability to maximize sparse rewards in complex environments”?
If the AI has the ability to maximize sparse rewards in complex environments, then I claim that the AI will not have any problem like “making up papers that look good to humans but don’t actually work, making codebases that are rewarded by the RL process but don’t actually work, and more generally sycophancy/reward overoptimization”. All it takes is: make sure the paper actually works before choosing the reward, make sure the codebase actually works before choosing the reward, etc. As long as the humans are unhappy with the AI’s output at some point, then we can do “the usual agent debugging loop” of §2.2. (And if the humans never realize that the AI’s output is bad, then that’s not the kind of problem that will prevent profiting from those AIs, right? If nothing else, “the AIs are actually making money” can be tied to the reward function.)
But “ability to maximize sparse rewards in complex environments” is capabilities, not alignment. (And I think it’s a problem that will automatically be solved before we get AGI, because it’s a prerequisite to AGI.)
I’m pointing out that Goodhart’s law applies to AI capabilities too: what the reward function rewards is not necessarily the capabilities you want from the AI, because the metrics you give the AI to optimize are likely not equivalent to the capabilities you actually want from it.
In essence, I’m saying the difference you identify for AI alignment is also a problem for AI capabilities, and I’ll quote a post of yours below (https://www.lesswrong.com/posts/wucncPjud27mLWZzQ/intro-to-brain-like-agi-safety-10-the-alignment-problem#10_3_1_Goodhart_s_Law):
“Optimize exactly what we want”, versus: “Step 1: operationalize exactly what we want, in the form of some reasonable-sounding metric(s). Step 2: optimize those metrics.”
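(A toy numerical illustration of that two-step failure, with made-up functions: hard optimization of a reasonable-sounding proxy lands somewhere other than the optimum of the thing we actually wanted.)

```python
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):                    # "exactly what we want": peaks at x = 1
    return -(x - 1.0) ** 2

def proxy_metric(x):                  # Step 1: a reasonable-sounding metric, correlated but not identical
    return -(x - 1.0) ** 2 + 0.5 * x

xs = rng.uniform(-5.0, 5.0, size=100_000)   # Step 2: optimize the metric hard
best = xs[np.argmax(proxy_metric(xs))]
print(best, true_value(best))         # proxy optimum is near x = 1.25, where true value is worse than at x = 1
```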
I think the crux is that you might believe capabilities targets are easier to encode into reward functions without needing much fine-grained specification, or that reward specification will happen more effectively for capabilities targets than for alignment targets.
Whereas I’m not as convinced as you that it would literally be as easy as you say to solve issues like “making up papers that look good to humans but don’t actually work, making codebases that are rewarded by the RL process but don’t actually work, and more generally sycophancy/reward overoptimization” solely by maximizing rewards in sparse, complicated environments, without also being able to solve significant chunks of the alignment problem.
In particular, the claim “If nothing else, ‘the AIs are actually making money’ can be tied to the reward function” has about as much detail as the alignment plan below, in that it is easy to describe but not easy to actually implement:
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ
In essence, I’m arguing that the same force which inhibits alignment progress also may inhibit capabilities progress, because what the RL process rewards isn’t necessarily equal to impressive capabilities, and often includes significant sycophancy.
To be clear, it’s possible that in practice it’s easier to verify capabilities reward functions than alignment reward functions, or that the agent debugging loop keeps working while fixing the AI’s capabilities but fails for alignment training. But I’m less confident than you that the capabilities/alignment dichotomy has relevance to alignment efforts, or that solving the specification problems needed to get agents very capable wouldn’t also let us specify their alignment targets/values in great detail.
Hmm. I think there’s an easy short-term ‘solution’ to Goodhart’s law for AI capabilities, which is to give humans a reward button. Then the reward function rewards are exactly what the person wants by definition (until the AI can grab the button). There’s no need to define metrics or whatever, right?
(This is mildly related to RLHF, except that RLHF makes models dumber, whereas I’m imagining some future RL paradigm wherein the RL training makes models smarter.)
I think your complaint is that people would be bad at pressing the button, even by their own lights. They’ll press the button upon seeing a plausible-sounding plan that flatters their ego, and then they’ll regret that they pressed it when the plan doesn’t actually work. This will keep happening, until the humans are cursing the button and throwing it out.
But there’s an obvious (short-term) workaround to that problem, which is to tell the humans not to press the reward button until they’re really sure that they won’t later regret it, because they see that the plan really worked. (Really, you don’t even have to tell them that, they’d quickly figure that out for themselves.) (Alternatively, make an “undo” option such that when the person regrets having pressed the button, they can roll back whatever weight changes came from pressing it.) This workaround will make the rewards more sparse, and thus it’s only an option if the AI can maximize sparse rewards. But I think we’re bound to get AIs that can maximize sparse rewards, on the road to AGI.
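(A minimal sketch of the “undo” workaround described above: snapshot the weights before applying each button-press update, and roll the update back if the human later regrets pressing. The `agent.weights` / `agent.reinforce` interface is hypothetical.)

```python
import copy

class RewardButtonTrainer:
    """Apply reward-button updates that the human can later undo."""

    def __init__(self, agent):
        self.agent = agent
        self.checkpoints = []           # pre-press weight snapshots, newest last

    def press_button(self, trajectory, reward: float = 1.0):
        # Save the pre-update weights so this particular press can be rolled back.
        self.checkpoints.append(copy.deepcopy(self.agent.weights))
        self.agent.reinforce(trajectory, reward)   # hypothetical RL update

    def regret_last_press(self):
        # The human regrets pressing: restore the weights from just before it.
        self.agent.weights = self.checkpoints.pop()
```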
If the person never regrets pressing the button, not even in hindsight, then you have an AI product that will be highly profitable in the short term. You can have it apply for human jobs, found companies, etc.
… …Then I have this other theory that maybe everything I just wrote here is moot, because once someone figures out the secret sauce of AGI, it will be so easy to make powerful misaligned superintelligence that this will happen very quickly and with no time or need to generate profit from the intermediate artifacts. That’s an unpopular opinion these days and I won’t defend it here. (I’ve been mulling it over in the context of a possible forthcoming post.) Just putting my cards on the table.
Hmm. I think there’s an easy short-term ‘solution’ to Goodhart’s law for AI capabilities, which is to give humans a reward button. Then the reward function rewards are exactly what the person wants by definition (until the AI can grab the button). There’s no need to define metrics or whatever, right?
I think your complaint is that people would be bad at pressing the button, even by their own lights. They’ll press the button upon seeing a plausible-sounding plan that flatters their ego, and then they’ll regret that they pressed it when the plan doesn’t actually work. This will keep happening, until the humans are cursing the button and throwing it out.
But there’s an obvious (short-term) workaround to that problem, which is to tell the humans not to press the reward button until they’re really sure that they won’t later regret it, because they see that the plan really worked. (Really, you don’t even have to tell them that, they’d quickly figure that out for themselves.) (Alternatively, make an “undo” option such that when the person regrets having pressed the button, they can roll back whatever weight changes came from pressing it.) This workaround will make the rewards more sparse, and thus it’s only an option if the AI can maximize sparse rewards. But I think we’re bound to get AIs that can maximize sparse rewards, on the road to AGI.
If the person never regrets pressing the button, not even in hindsight, then you have an AI product that will be highly profitable in the short term. You can have it apply for human jobs, found companies, etc.
For metrics, I’m talking about stuff like benchmarks and evals for AI capabilities, such as METR’s evals.
I have a couple of complaints, assuming this is the strategy we go with to make automating capabilities safe from the RL sycophancy problem:
1. I think this basically rules out fast takeoffs/most of the value of what AI does, and this is true regardless of whether pure software-only singularities/fast takeoffs are possible at all. I basically agree with @johnswentworth about long tails, which means that having an AI automate 90% of a job, with humans grading the last 10% using a reward button, loses basically all of the value compared to the AI being able to do the job without humans grading the reward.
Another way to say it: I think involving humans in an operation that you want to automate away with AI immediately erases most of the value of what the AI does in ~all complex domains, so this solution cannot scale at all:
https://www.lesswrong.com/posts/Nbcs5Fe2cxQuzje4K/value-of-the-long-tail
2. Similar to my last complaint: relying on humans to do the grading, because AIs cannot effectively grade themselves (the reward function unintentionally causes sycophancy, so the AIs make code and papers that merely look good and are rewarded by metrics/evals), is very expensive and slow.
This could get you to human-level capabilities, but because specification gaming hasn’t been resolved, you can’t scale the AI’s capability in a domain beyond what an expert human could do without worrying that exploration hacking/reward hacking/sycophancy will come back, preventing the AI from becoming superhumanly capable like AlphaZero.
3. I’m not as convinced as you that solutions to this problem, i.e. ones that allow AIs to automatically grade themselves with reward functions that don’t need a human in the loop, wouldn’t transfer to solutions to various alignment problems.
A large portion of the issue is that you can’t just have humans interfere in the AI’s design, or else you have lost a lot of the value in having the AI do the job; thus solutions to specification gaming/sycophancy/reward hacking must be automatable, and the grading must be automatic.
And the issue isn’t sparse reward; rather, the reward function incentivizes goodharting on capabilities tests and code in the real world, and to get rewards dense enough to solve that, you’d have to give up on the promise of automation, which is 90-99% or more of the value from the AI, so it is a huge capabilities hit.
To be clear, I’m not saying it’s impossible to solve the capabilities problem without solving alignment-relevant specification gaming problems, but I am saying we can’t trivially assume alignment and capabilities are decoupled enough to make AI capabilities progress dangerous.
Thanks! …But I think you misunderstood.
Suppose I tell an AI:
Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!
That’s full 100% automation, not 90%, right?
“Making $1B” is one example project for concreteness, but the same idea could apply to writing code, inventing technology, or whatever else. If the human can’t ever tell whether the AI succeeded or failed at the project, then that’s a very unusual project, and certainly not a project that results in making money or impressing investors etc. And if the human can tell, then they can press the reward button when they’re sure.
Normal people can tell that their umbrella keeps them dry without knowing anything about umbrella production. Normal people can tell whether their smartphone apps are working well without knowing anything about app development and debugging. Etc.
And then I’m claiming that this kind of strategy will “work” until the AI is sufficiently competent to grab the reward button and start building defenses around it etc.
“Making $1B” is one example project for concreteness, but the same idea could apply to writing code, inventing technology, or whatever else. If the human can’t ever tell whether the AI succeeded or failed at the project, then that’s a very unusual project, and certainly not a project that results in making money or impressing investors etc. And if the human can tell, then they can press the reward button when they’re sure.
I do agree with this, and I disagree with people like John Wentworth et al. on how much we can make valuable tasks verifiable, which is a large portion of the reason I like the AI control agenda much more than John Wentworth does.
A large crux here: if the task were carried out in a setting where the AI could seize the reward button (as is likely to be true for realistic tasks and realistic capability levels), and we survived for 1 year without the AI seizing the reward button, and the AI was more capable than a human in general, then I’d be way more optimistic about our chances for alignment, because it would imply that automating significant parts, if not all, of the pipeline for automated alignment research would work. And importantly, if we could get it to actually follow laws made by human society, without specification gaming, then I’d be much more willing to think that alignment is solvable.
Another way to say it: in order to do the proposed task in a realistic setting where the AI can seize the reward button (because of its capabilities), you would have to solve significant parts of specification gaming, or figure out a way to make a very secure and expressive sandbox, because the specification you propose is very vulnerable to loopholes once you release the AI into the wild, don’t check on it, and give it the ability to seize the reward button. So either significant portions of the alignment problem have to get solved, or significant security advances have to be made that make AI way safer to deploy:
Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!
This point on why alignment is harder than David Silver and Richard Sutton think also applies to the specification for capabilities you made:
More generally, what source code should we write into the reward function, such that the resulting AI’s “overall goal is to support human well-being”? Please, write something down, and then I will tell you how it can be specification-gamed.
That said, a big reason why I’m coming around to AI control (making AIs safe and useful even if they have these crazy motivations, because we can prevent them from acting on those motivations) is that we are unlikely to have confidence that alignment techniques will work in the crucial period of AI risk, regulations are not likely to prevent dangerous AI after jobs are significantly automated IRL, and I believe most of the alignment-relevant work will be done by AIs, so it’s really, really important to make alignment research safe to automate.
This point on why alignment is harder than David Silver and Richard Sutton think also applies to the specification for capabilities you made: …
I interpret you as saying: “In the OP, Steve blathered on and on about how it’s hard to ensure that the AI has some specific goal like ‘support human well-being’. But now Steve is saying it’s straightforward to ensure that the AI has a goal of earning $1B without funny business. What accounts for that difference?”
(Right?) Basically, my proposal here (with e.g. a reward button) is a lot like giving the AI a few hits of an addictive drug and then saying “you can have more hits, but I’m the only one who knows how to make the drug, you must do as I say for the next hit”.
This kind of technique is:
Very obvious and easy to implement in an RL agent context
Adequate for getting my AI to make me lots of money (because I can see it in my bank account and have a waiting period and due diligence as above),
Inadequate for getting AI to do alignment research (because the AI would ultimately care about producing outputs that convince me, rather than outputs that actually solve the problem, and we have abundant proof that humans can be convinced by incorrect arguments about alignment, otherwise the field wouldn’t have such long-running disagreements) (i.e. I’m taking John Wentworth’s side here),
Inadequate for solving the alignment problem by itself (because the AI will eventually get sufficiently powerful to brainwash me, kidnap my children, etc.).
A large crux here: if the task were carried out in a setting where the AI could seize the reward button (as is likely to be true for realistic tasks and realistic capability levels)
Well yeah if the AI can seize the reward button but chooses not to, that’s obviously reason for optimism.
I was talking instead about a scenario where the AI can’t seize it, and where nobody knows how to make it such that the AI doesn’t want to seize it.
Maybe you think it’s implausible that the AI would be capable of earning $1B before being capable of seizing the reward button. If so, fine, whatever, just substitute a less ambitious goal than earning $1B. Or alternatively, imagine that the reward button is unusually secure, e.g. it’s implemented as ‘cryptographic reward tokens’ stored in an air-gapped underground bunker with security cameras etc. (Cf. some discussion in Superintelligence (2014)). This doesn’t work forever but this would be a way to delay the inevitable catastrophe, allowing more money to be made in the meantime.
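(A minimal sketch of what “cryptographic reward tokens” could look like, assuming a shared-secret signature scheme where the signing key stays on the secured side; this is illustrative, not the specific scheme discussed in Superintelligence.)

```python
import hmac, hashlib, secrets

SIGNING_KEY = secrets.token_bytes(32)   # lives only in the secured facility, never on the AI's machines

def mint_reward_token(episode_id: str) -> bytes:
    # Only someone holding SIGNING_KEY can produce a valid token for an episode.
    return hmac.new(SIGNING_KEY, episode_id.encode(), hashlib.sha256).digest()

def reward_from_token(episode_id: str, token: bytes) -> float:
    # The training harness grants reward only for tokens that verify.
    # (In practice verification would also have to happen on the secured side,
    # since anyone who can verify also holds the key.)
    expected = hmac.new(SIGNING_KEY, episode_id.encode(), hashlib.sha256).digest()
    return 1.0 if hmac.compare_digest(expected, token) else 0.0
```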
Fair enough. In that case I’d at least admit that, to my limited knowledge, it’s very possible for a dangerous situation to occur if the security ever failed/got hacked.
Which implies that an underrated issue for AI control efforts is figuring out how to prevent the AI from easily seizing the reward button, something that is usually not considered specifically.
Inadequate for getting AI to do alignment research (because the AI would ultimately care about producing outputs that convince me, rather than outputs that actually solve the problem, and we have abundant proof that humans can be convinced by incorrect arguments about alignment, otherwise the field wouldn’t have such long-running disagreements) (i.e. I’m taking John Wentworth’s side here)
This is easily my biggest crux here, as I’m generally on the opposite side from John Wentworth on this: in a lot of domains, verification is (relatively) easy compared to solving problems.
Indeed, I think this is a big reason why research taste is possible at all, because there’s a vast gap between verifying if a solution is correct, and actually finding a correct solution.
That said, the conversation has been very productive, and while I’m tapping out for now, I thank you for letting me have the discussion, because we found a crux between our worldview models.
In general, you can mostly solve Goodhart-like problems in the vast majority of the experienced range of actions, and have it fall apart only in more extreme cases. And reward hacking is similar. This is the default outcome I expect from prosaic alignment—we work hard to patch misalignment and hacking, so it works well enough in all the cases we test and try, until it doesn’t.
The important part is at what level of capabilities it fails.
If it fails once we are well past AlphaZero, or even just at moderately superhuman AI research, this is good, as it means the “automate AI alignment” plan has a safe buffer zone.
If it fails before AI automates AI research, this is also good, because it forces them to invest in alignment.
The dangerous case is if we can just automate AI research, but Goodhart’s law kicks in before we can automate AI alignment research.
If it fails once we are well past AlphaZero, or even just at moderately superhuman AI research, this is good, as it means the “automate AI alignment” plan has a safe buffer zone.
If it fails before AI automates AI research, this is also good, because it forces them to invest in alignment.
That assumes AI firms learn the lessons needed from the failures. Our experience shows that they don’t, and they keep making systems that predictably are unsafe and exploitable, and they don’t have serious plans to change their deployments, much less actually build a safety-oriented culture.