“Making $1B” is one example project for concreteness, but the same idea could apply to writing code, inventing technology, or whatever else. If the human can’t ever tell whether the AI succeeded or failed at the project, then that’s a very unusual project, and certainly not a project that results in making money or impressing investors etc. And if the human can tell, then they can press the reward button when they’re sure.
I do agree with this, and I disagree with people like John Wentworth et al. on how much we can make valuable tasks verifiable, which is a large part of the reason I like the AI control agenda much more than John Wentworth does.
A large crux here: assuming the task was carried out in a setting where the AI could seize the reward button (as is likely to be true for realistic tasks and realistic capability levels), if we survived for 1 year without the AI seizing the button, and the AI was more capable than a human in general, I’d be way more optimistic about our chances for alignment, because it would imply that automating significant parts, if not all, of the pipeline for automated alignment research would work. And importantly, if we could get the AI to actually follow laws made by human society, without specification gaming, then I’d be much more willing to think that alignment is solvable.
Another way to say it: in order to do the proposed task in a realistic setting where the AI can seize the reward button (because of its capabilities), you would have to solve significant parts of specification gaming, or figure out how to build a very secure and expressive sandbox. The specification you propose is very vulnerable to loopholes once you release the AI into the wild, stop checking it, and give it the ability to seize the reward button. So either significant portions of the alignment problem have to get solved, or significant security advances have to be made that make AI much safer to deploy:
Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!
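To make the loophole concrete, here is a toy sketch (my own illustration, with hypothetical names, not anything from the original proposal) of the quoted specification written as a reward function, and a note on where it breaks once the AI can influence the inputs to the check:

```python
# Toy sketch of the quoted specification as a reward function.
# All names are hypothetical; this is an illustration, not a real proposal.

def reward(withdrawable_balance: float, due_diligence_passed: bool) -> float:
    """Press the button iff $1B can be withdrawn and a year of checks found no funny business."""
    return 1.0 if withdrawable_balance >= 1e9 and due_diligence_passed else 0.0

# The loophole: an unmonitored AI that can seize the button (or feed this check
# its inputs) maximizes the reward by manipulating `withdrawable_balance` or the
# due-diligence process, not by legitimately earning the money.
```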
This point on why alignment is harder than David Silver and Richard Sutton think also applies to the specification for capabilities you made:
More generally, what source code should we write into the reward function, such that the resulting AI’s “overall goal is to support human well-being”? Please, write something down, and then I will tell you how it can be specification-gamed.
That said, a big reason why I’m coming around to AI control (making AIs safe and useful even if they have these crazy motivations, because we can prevent them from acting on those motivations) is that we are unlikely to have confidence that alignment techniques will work during the crucial period of AI risk, regulations are not likely to prevent dangerous AI once jobs are significantly automated in real life, and I believe that most of the alignment-relevant work will be done by AIs, so it’s really, really important to make alignment research safe to automate.
This point on why alignment is harder than David Silver and Richard Sutton think also applies to the specification for capabilities you made: …
Thanks! I interpret you as saying: “In the OP, Steve blathered on and on about how it’s hard to ensure that the AI has some specific goal like ‘support human well-being’. But now Steve is saying it’s straightforward to ensure that the AI has a goal of earning $1B without funny business. What accounts for that difference?”
(Right?) Basically, my proposal here (with e.g. a reward button) is a lot like giving the AI a few hits of an addictive drug and then saying “you can have more hits, but I’m the only one who knows how to make the drug, so you must do as I say to get the next hit”.
This kind of technique is:
Very obvious and easy to implement in an RL agent context (see the sketch after this list),
Adequate for getting my AI to make me lots of money (because I can see it in my bank account and have a waiting period and due diligence as above),
Inadequate for getting AI to do alignment research (because the AI would ultimately care about producing outputs that convince me, rather than outputs that actually solve the problem, and we have abundant proof that humans can be convinced by incorrect arguments about alignment, otherwise the field wouldn’t have such long-running disagreements) (i.e. I’m taking John Wentworth’s side here),
Inadequate for solving the alignment problem by itself (because the AI will eventually get sufficiently powerful to brainwash me, kidnap my children, etc.).
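As a rough illustration of how obvious the implementation is, here is a minimal sketch of a reward-button training loop, assuming hypothetical `agent` and `env` objects with the usual RL interfaces; the only unusual part is that the reward signal is a human keypress rather than anything computed from the environment:

```python
# Minimal sketch of reward-button RL (hypothetical agent/env interfaces).
# The only change from an ordinary RL loop is that the reward signal is a
# human-controlled button press rather than a programmatic reward function.

def button_pressed() -> bool:
    """Stand-in for the physical reward button."""
    return input("Press reward button? [y/N] ").strip().lower() == "y"

def run_episode(agent, env, max_steps: int = 1000) -> None:
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)
        next_obs, done = env.step(action)            # note: env supplies no reward of its own
        reward = 1.0 if button_pressed() else 0.0    # the reward is just the human's keypress
        agent.update(obs, action, reward, next_obs)  # standard RL update (TD, policy gradient, ...)
        obs = next_obs
        if done:
            break
```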
A large crux here: assuming the task was carried out in a setting where the AI could seize the reward button (as is likely to be true for realistic tasks and realistic capability levels)…
Well yeah, if the AI can seize the reward button but chooses not to, that’s obviously reason for optimism.
I was talking instead about a scenario where the AI can’t seize it, and where nobody knows how to make it such that the AI doesn’t want to seize it.
Maybe you think it’s implausible that the AI would be capable of earning $1B before being capable of seizing the reward button. If so, fine, whatever, just substitute a less ambitious goal than earning $1B. Or alternatively, imagine that the reward button is unusually secure, e.g. it’s implemented as ‘cryptographic reward tokens’ stored in an air-gapped underground bunker with security cameras etc. (Cf. some discussion in Superintelligence (2014)). This doesn’t work forever, but it would be a way to delay the inevitable catastrophe, allowing more money to be made in the meantime.
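For what it’s worth, here is one way the ‘cryptographic reward tokens’ idea could be sketched in code. This is my own guess at a minimal scheme (Superintelligence doesn’t specify one): random tokens are minted in advance and kept in the bunker, and only their hashes are published to the training side.

```python
import hashlib
import secrets

# Hypothetical sketch of 'cryptographic reward tokens'. Tokens are minted in
# advance inside the air-gapped bunker; only their SHA-256 hashes leave it.
# The training-side process can verify a released token against the published
# hashes, but cannot forge one without inverting the hash.

def mint_tokens(n: int) -> tuple[list[bytes], set[bytes]]:
    """Run once, inside the bunker: keep the tokens, publish only the hashes."""
    tokens = [secrets.token_bytes(32) for _ in range(n)]
    hashes = {hashlib.sha256(t).digest() for t in tokens}
    return tokens, hashes

def redeem(token: bytes, published_hashes: set[bytes]) -> float:
    """Run on the training side: one unit of reward per valid, not-yet-used token."""
    digest = hashlib.sha256(token).digest()
    if digest in published_hashes:
        published_hashes.discard(digest)   # each token is single-use
        return 1.0
    return 0.0
```

On this sketch, seizing reward means physically extracting tokens from the bunker, which is exactly the delays-rather-than-prevents property described above.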
Fair enough. In that case I’d at least admit that, to my limited knowledge, it’s very possible for a dangerous situation to occur if the security ever failed or got hacked.
Which implies that an underrated issue for AI control efforts is figuring out how to prevent the AI from easily seizing the reward button, something that usually isn’t considered specifically.
Inadequate for getting AI to do alignment research (because the AI would ultimately care about producing outputs that convince me, rather than outputs that actually solve the problem, and we have abundant proof that humans can be convinced by incorrect arguments about alignment, otherwise the field wouldn’t have such long-running disagreements) (i.e. I’m taking John Wentworth’s side here)
This is easily my biggest crux, as I’m generally on the opposite side from John Wentworth here: in a lot of domains, verification is (relatively) easy compared to actually solving the problem.
Indeed, I think this is a big reason why research taste is possible at all: there’s a vast gap between verifying whether a solution is correct and actually finding a correct solution.
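A toy illustration of that gap (my example, not one from the thread), using subset-sum as a stand-in: checking a proposed answer is a membership check plus a sum, while finding one requires searching exponentially many subsets.

```python
from itertools import combinations

# Toy illustration of the verification/generation gap, with subset-sum as a stand-in.

def verify(nums: list[int], subset: tuple[int, ...], target: int) -> bool:
    """Checking a proposed solution: cheap, a membership check plus a sum."""
    pool = list(nums)
    for x in subset:
        if x in pool:
            pool.remove(x)
        else:
            return False
    return sum(subset) == target

def solve(nums: list[int], target: int) -> tuple[int, ...] | None:
    """Finding a solution: brute-force search over 2^n subsets."""
    for r in range(len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return combo
    return None
```

Here verify runs in linear time on any proposed answer, while solve may need exponentially many attempts to produce one.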
That said, the conversation has been very productive, and while I’m tapping out for now, thank you for the discussion; we found a crux between our worldviews.