Hmm. I think there’s an easy short-term ‘solution’ to Goodhart’s law for AI capabilities, which is to give humans a reward button. Then the reward function rewards exactly what the person wants, by definition (until the AI can grab the button). There’s no need to define metrics or whatever, right?
(This is mildly related to RLHF, except that RLHF makes models dumber, whereas I’m imagining some future RL paradigm wherein the RL training makes models smarter.)
I think your complaint is that people would be bad at pressing the button, even by their own lights. They’ll press the button upon seeing a plausible-sounding plan that flatters their ego, and then they’ll regret that they pressed it when the plan doesn’t actually work. This will keep happening, until the humans are cursing the button and throwing it out.
But there’s an obvious (short-term) workaround to that problem, which is to tell the humans not to press the reward button until they’re really sure that they won’t later regret it, because they see that the plan really worked. (Really, you don’t even have to tell them that, they’d quickly figure that out for themselves.) (Alternatively, make an “undo” option such that when the person regrets having pressed the button, they can roll back whatever weight changes came from pressing it.) This workaround will make the rewards more sparse, and thus it’s only an option if the AI can maximize sparse rewards. But I think we’re bound to get AIs that can maximize sparse rewards, on the road to AGI.
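To make the mechanics concrete, here’s a minimal sketch of the kind of setup being described: the reward arrives only when the person presses the button, and each press checkpoints the weights so it can be rolled back later. This is purely illustrative; the toy linear policy, the REINFORCE-style update, and all the function names are assumptions, not anything specified above.

```python
# Illustrative sketch only: a "reward button" RL loop with sparse rewards
# and an "undo" implemented as a weight-checkpoint rollback.
import copy
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 2))   # toy policy: 4 features -> 2 actions
checkpoints = []                    # snapshots taken before each rewarded update

def act(features):
    logits = features @ weights
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

def apply_reward(episode, reward, lr=0.01):
    """Called only when the human presses the button (sparse reward)."""
    global weights
    checkpoints.append(copy.deepcopy(weights))       # so the press can be undone
    for features, action, probs in episode:
        grad = -probs
        grad[action] += 1.0                          # d log pi(action) / d logits
        weights += lr * reward * np.outer(features, grad)

def undo_last_press():
    """Roll back the weight changes from the most recent button press."""
    global weights
    if checkpoints:
        weights = checkpoints.pop()

# The AI works on a long project with no reward signal at all...
episode = []
for _ in range(1000):
    features = rng.normal(size=4)
    action, probs = act(features)
    episode.append((features, action, probs))

# ...and the human presses the button only once they're sure the plan worked.
if True:   # stand-in for "the person saw that the plan really worked"
    apply_reward(episode, reward=1.0)

# If they regret the press later, they can roll the update back.
if False:  # stand-in for "the person regrets having pressed the button"
    undo_last_press()
```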
If the person never regrets pressing the button, not even in hindsight, then you have an AI product that will be highly profitable in the short term. You can have it apply for human jobs, found companies, etc.
… …Then I have this other theory that maybe everything I just wrote here is moot, because once someone figures out the secret sauce of AGI, it will be so easy to make powerful misaligned superintelligence that this will happen very quickly and with no time or need to generate profit from the intermediate artifacts. That’s an unpopular opinion these days and I won’t defend it here. (I’ve been mulling it over in the context of a possible forthcoming post.) Just putting my cards on the table.
That was a very good constructive comment btw, sorry I forgot to reply to it earlier but I just did.
By metrics, I’m talking about stuff like benchmarks and evals for AI capabilities, such as METR’s evals.
I have a couple of complaints, assuming this is the strategy we go with to make automating capabilities safe from the RL sycophancy problem:
1. I think this basically rules out fast takeoffs/most of the value of what AI does, and this is true regardless of whether pure software-only singularities/fast takeoffs are possible at all. I basically agree with @johnswentworth about long tails, which means that having an AI automate 90% of a job, with humans grading the last 10% using a reward button, loses basically all of the value compared to the AI being able to do the job without humans grading the reward.
Another way to say it: I think involving humans in an operation that you want to automate away with AI immediately erases most of the value of what the AI does in ~all complex domains, so this solution cannot scale at all:
https://www.lesswrong.com/posts/Nbcs5Fe2cxQuzje4K/value-of-the-long-tail
2. Similar to my last complaint, I think relying on humans to do the grading (because AIs cannot effectively grade themselves, since the reward function unintentionally causes sycophancy, pushing the AIs to produce code and papers that merely look good and score well on metrics/evals) is very expensive and slow.
This could get you to human-level capabilities, but because the issue of specification gaming isn’t resolved, you can’t scale the AI’s capability in a domain beyond what an expert human could do without worrying that exploration hacking/reward hacking/sycophancy will come back, preventing the AI from becoming superhumanly capable the way AlphaZero did.
3. I’m not as convinced as you are that solutions to this problem (ones that allow AIs to automatically grade themselves, with reward functions that don’t need a human in the loop) wouldn’t transfer to solutions to various alignment problems.
A large portion of the issue is that you can’t just have humans interfere in the AI’s work, or else you have lost a lot of the value of having the AI do the job; thus solutions to specification gaming/sycophancy/reward hacking must be automatable, i.e. have the property of automatic gradability.
And the issue isn’t sparse reward; it’s that the reward function incentivizes Goodharting on capabilities tests and code in the real world, and to have rewards dense enough to solve the problem, you’d have to give up on the promise of automation, which is 90-99% or more of the value from the AI, so it’s a huge capabilities hit.
To be clear, I’m not saying that it’s impossible to solve the capabilities problem without solving alignment-relevant specification gaming problems; I’m saying that we can’t trivially assume alignment and capabilities are decoupled enough for AI capabilities progress to be dangerous.
Thanks! …But I think you misunderstood.
Suppose I tell an AI:
Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!
That’s full 100% automation, not 90%, right?
“Making $1B” is one example project for concreteness, but the same idea could apply to writing code, inventing technology, or whatever else. If the human can’t ever tell whether the AI succeeded or failed at the project, then that’s a very unusual project, and certainly not a project that results in making money or impressing investors etc. And if the human can tell, then they can press the reward button when they’re sure.
Normal people can tell that their umbrella keeps them dry without knowing anything about umbrella production. Normal people can tell whether their smartphone apps are working well without knowing anything about app development and debugging. Etc.
And then I’m claiming that this kind of strategy will “work” until the AI is sufficiently competent to grab the reward button and start building defenses around it etc.
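As a toy illustration of how little grading the human does under this protocol, the human side reduces to a single end-of-project check like the one below; the field names and thresholds are placeholders for the $1B example, not a real protocol.

```python
from dataclasses import dataclass

@dataclass
class ProjectOutcome:
    amount_withdrawn_usd: float   # did the money actually show up?
    days_since_withdrawal: int    # waiting period before pressing
    due_diligence_passed: bool    # no law-breaking or other funny business found

def should_press_reward_button(outcome: ProjectOutcome) -> bool:
    """The human never grades intermediate steps; they only check the end result."""
    return (
        outcome.amount_withdrawn_usd >= 1_000_000_000
        and outcome.days_since_withdrawal >= 365
        and outcome.due_diligence_passed
    )

# Example: the project succeeded and survived a year of due diligence.
print(should_press_reward_button(ProjectOutcome(1_200_000_000, 400, True)))  # True
```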
“Making $1B” is one example project for concreteness, but the same idea could apply to writing code, inventing technology, or whatever else. If the human can’t ever tell whether the AI succeeded or failed at the project, then that’s a very unusual project, and certainly not a project that results in making money or impressing investors etc. And if the human can tell, then they can press the reward button when they’re sure.
I do agree with this, and I disagree with people like John Wentworth et al. on how much we can make valuable tasks verifiable, which is a large part of the reason I like the AI control agenda much more than John Wentworth does.
A large crux here is that, assuming the task was carried out in a way such that the AI could seize the reward button (as is likely to be true for realistic tasks and realistic capability levels), and we survived for 1 year without the AI seizing the reward button, and the AI was more capable than a human in general, I’d be way more optimistic about our chances for alignment, because it implies that automating significant parts, if not all, of the pipeline for automated alignment research would work. And importantly, I think that if we could get it to actually follow laws made by human society, without specification gaming, then I’d be much more willing to think that alignment is solvable.
Another way to say it: in order to do the proposed task in a realistic setting where the AI can seize the reward button (because of its capabilities), you would have to solve significant parts of specification gaming, or figure out a way to make a very secure and expressive sandbox. The specification you propose is very vulnerable to loopholes once you release the AI into the wild, don’t check on it, and give it the ability to seize the reward button, which means that significant portions of the alignment problem have to get solved, or that significant security advances have to be made that make AI way safer to deploy:
Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!
This point on why alignment is harder than David Silver and Richard Sutton think also applies to the specification for capabilities you made:
More generally, what source code should we write into the reward function, such that the resulting AI’s “overall goal is to support human well-being”? Please, write something down, and then I will tell you how it can be specification-gamed.
That said, a big reason why I’m coming around to AI control (making AIs safe and useful even if they have these crazy motivations, because we can prevent them from acting on those motivations) is that we are unlikely to have confidence that alignment techniques will work during the crucial period of AI risk, it’s not likely that regulations will prevent dangerous AI after jobs are significantly automated IRL, and I believe that most of the alignment-relevant work will be done by AIs, so it’s really, really important to make alignment research safe to automate.
This point on why alignment is harder than David Silver and Richard Sutton think also applies to the specification for capabilities you made: …
Thanks! I interpret you as saying: “In the OP, Steve blathered on and on about how it’s hard to ensure that the AI has some specific goal like ‘support human well-being’. But now Steve is saying it’s straightforward to ensure that the AI has a goal of earning $1B without funny business. What accounts for that difference?”
(Right?) Basically, my proposal here (with e.g. a reward button) is a lot like giving the AI a few hits of an addictive drug and then saying “you can have more hits, but I’m the only one who knows how to make the drug, you must do as I say for the next hit”.
This kind of technique is:
Very obvious and easy to implement in an RL agent context,
Adequate for getting my AI to make me lots of money (because I can see it in my bank account and have a waiting period and due diligence as above),
Inadequate for getting AI to do alignment research (because the AI would ultimately care about producing outputs that convince me, rather than outputs that actually solve the problem, and we have abundant proof that humans can be convinced by incorrect arguments about alignment, otherwise the field wouldn’t have such long-running disagreements) (i.e. I’m taking John Wentworth’s side here),
Inadequate for solving the alignment problem by itself (because the AI will eventually get sufficiently powerful to brainwash me, kidnap my children, etc.).
A large crux here is that, assuming the task was carried out in a way such that the AI could seize the reward button (as is likely to be true for realistic tasks and realistic capability levels)
Well yeah, if the AI can seize the reward button but chooses not to, that’s obviously reason for optimism.
I was talking instead about a scenario where the AI can’t seize it, and where nobody knows how to make it such that the AI doesn’t want to seize it.
Maybe you think it’s implausible that the AI would be capable of earning $1B before being capable of seizing the reward button. If so, fine, whatever, just substitute a less ambitious goal than earning $1B. Or alternatively, imagine that the reward button is unusually secure, e.g. it’s implemented as ‘cryptographic reward tokens’ stored in an air-gapped underground bunker with security cameras etc. (Cf. some discussion in Superintelligence (2014)). This doesn’t work forever, but it would be a way to delay the inevitable catastrophe, allowing more money to be made in the meantime.
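For concreteness, the ‘cryptographic reward tokens’ idea can be as simple as the training harness refusing any reward that isn’t signed by a key that never leaves the bunker. A rough sketch, assuming the PyNaCl library; the token format and function names are made up for illustration.

```python
# Sketch: rewards are only accepted if they carry a valid signature from a
# private key kept on the air-gapped machine in the bunker.
import json
from nacl.signing import SigningKey, VerifyKey
from nacl.exceptions import BadSignatureError

# Done once, inside the bunker. Only the public (verify) key ever leaves.
bunker_key = SigningKey.generate()
PUBLIC_VERIFY_KEY = bunker_key.verify_key.encode()   # baked into the training harness

def mint_reward_token(episode_id: str, reward: float) -> bytes:
    """Runs on the air-gapped machine when the human presses the button."""
    payload = json.dumps({"episode_id": episode_id, "reward": reward}).encode()
    return bunker_key.sign(payload)                  # signature + payload

def accept_reward(signed_token: bytes):
    """Runs in the training harness: reject anything not signed by the bunker key."""
    try:
        payload = VerifyKey(PUBLIC_VERIFY_KEY).verify(signed_token)
    except BadSignatureError:
        return None                                  # forged or tampered token
    return json.loads(payload)

token = mint_reward_token("project-42", 1.0)
print(accept_reward(token))                # {'episode_id': 'project-42', 'reward': 1.0}
print(accept_reward(b"x" * len(token)))    # None: can't be forged without the bunker key
```

(This only secures the reward channel, not the button-presser; it doesn’t stop the AI from manipulating the human, which is the point about it not working forever.)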
Fair enough. In that case I’d at least admit that, to my limited knowledge, it’s very possible for a dangerous situation to occur if the security ever failed/got hacked.
Which implies that for AI control efforts, an underrated issue is figuring out how to prevent the AI from easily seizing the reward button, which is usually not specifically considered.
Inadequate for getting AI to do alignment research (because the AI would ultimately care about producing outputs that convince me, rather than outputs that actually solve the problem, and we have abundant proof that humans can be convinced by incorrect arguments about alignment, otherwise the field wouldn’t have such long-running disagreements) (i.e. I’m taking John Wentworth’s side here)
This is easily my biggest crux here, as I’m generally on the opposite side from John Wentworth: in a lot of domains, verification is (relatively) easy compared to actually solving problems.
Indeed, I think this is a big reason why research taste is possible at all: there’s a vast gap between verifying whether a solution is correct and actually finding a correct solution.
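A toy way to see that gap: for a problem like subset-sum, checking a claimed solution is a one-liner, while finding one by brute force blows up exponentially in the input size. This is just to illustrate the verification/generation asymmetry, not a claim about alignment research specifically.

```python
from itertools import combinations

def verify(numbers, claimed_subset, target):
    """Checking a proposed solution: linear time (assumes distinct numbers)."""
    return set(claimed_subset) <= set(numbers) and sum(claimed_subset) == target

def solve(numbers, target):
    """Finding a solution by brute force: exponential in len(numbers)."""
    for r in range(1, len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums = [3, 34, 4, 12, 5, 2]
print(verify(nums, [4, 5], 9))   # True: trivial to check
print(solve(nums, 9))            # [4, 5]: found only after searching many subsets
```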
That said, the conversation has been very productive, and while I’m tapping out for now, I thank you for letting me have the discussion, because we found a crux between our worldview models.
Thanks!