This point on why alignment is harder than David Silver and Richard Sutton think also applies to the specification for capabilities you made: …
I interpret you as saying: “In the OP, Steve blathered on and on about how it’s hard to ensure that the AI has some specific goal like ‘support human well-being’. But now Steve is saying it’s straightforward to ensure that the AI has a goal of earning $1B without funny business. What accounts for that difference?”
(Right?) Basically, my proposal here (with e.g. a reward button) is a lot like giving the AI a few hits of an addictive drug and then saying “you can have more hits, but I’m the only one who knows how to make the drug, so you must do as I say to get the next hit”.
This kind of technique is:
Very obvious and easy to implement in an RL agent context (see the sketch after this list),
Adequate for getting my AI to make me lots of money (because I can see it in my bank account and have a waiting period and due diligence as above),
Inadequate for getting AI to do alignment research (because the AI would ultimately care about producing outputs that convince me, rather than outputs that actually solve the problem; and we have abundant proof that humans can be convinced by incorrect arguments about alignment, otherwise the field wouldn’t have such long-running disagreements, i.e. I’m taking John Wentworth’s side here),
Inadequate for solving the alignment problem by itself (because the AI will eventually get sufficiently powerful to brainwash me, kidnap my children, etc.).
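To make “easy to implement in an RL agent context” concrete, here is a minimal sketch of what a reward-button setup could look like in code. This is purely my own illustration, not anything from the original discussion: it uses only the Python standard library, and the class names and the stand-in “human approval” condition are hypothetical.

```python
# Illustrative sketch only: the agent's sole reward signal is whether a human
# pressed a button. Class names and the stand-in "approval" rule are hypothetical.
import random

class RewardButton:
    """Stands in for the human operator; reward exists only when they press."""
    def __init__(self):
        self._pressed = False

    def press(self):
        self._pressed = True

    def read_and_reset(self):
        reward = 1.0 if self._pressed else 0.0
        self._pressed = False
        return reward

class BanditAgent:
    """Trivial epsilon-greedy value learner over a few discrete actions."""
    def __init__(self, n_actions, eps=0.1, lr=0.1):
        self.q = [0.0] * n_actions
        self.eps, self.lr = eps, lr

    def act(self):
        if random.random() < self.eps:
            return random.randrange(len(self.q))
        return max(range(len(self.q)), key=lambda a: self.q[a])

    def update(self, action, reward):
        self.q[action] += self.lr * (reward - self.q[action])

button = RewardButton()
agent = BanditAgent(n_actions=3)
for _ in range(1000):
    action = agent.act()
    if action == 2:        # hypothetical stand-in for "the human approves"
        button.press()
    agent.update(action, button.read_and_reset())

print(agent.q)  # the agent ends up valuing whatever gets the button pressed
```

The point of the sketch is just that nothing in the training signal references the underlying task: the agent’s learned values route entirely through “does the button get pressed”, which is what makes the addictive-drug analogy apt.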
A large crux here is that the task, assuming it was carried out in a way such that the AI could seize the reward button (as is likely to be true for realistic tasks and realistic capability levels)…
Well yeah, if the AI can seize the reward button but chooses not to, that’s obviously a reason for optimism.
I was talking instead about a scenario where the AI can’t seize it, and where nobody knows how to make it such that the AI doesn’t want to seize it.
Maybe you think it’s implausible that the AI would be capable of earning $1B before being capable of seizing the reward button. If so, fine, whatever, just substitute a less ambitious goal than earning $1B. Or alternatively, imagine that the reward button is unusually secure, e.g. it’s implemented as ‘cryptographic reward tokens’ stored in an air-gapped underground bunker with security cameras etc. (Cf. some discussion in Superintelligence (2014)). This doesn’t work forever, but it would be a way to delay the inevitable catastrophe, allowing more money to be made in the meantime.
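For concreteness, here is one way the “cryptographic reward tokens” idea could be cashed out. This is a sketch under my own assumptions, not a construction from Superintelligence or from this thread: the training harness pays out reward only for tokens signed by a private key that stays offline in the bunker, so the AI can verify tokens but cannot mint them. It assumes the third-party `cryptography` package, and the function names are hypothetical.

```python
# Sketch (my assumption-laden illustration, using the third-party `cryptography`
# package) of reward tokens the agent can verify but not forge: only a key kept
# offline in the bunker can sign them.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Generated and stored inside the air-gapped bunker; never exposed to the AI.
bunker_private_key = Ed25519PrivateKey.generate()
# The training harness (and the AI) only ever see the public key.
public_key = bunker_private_key.public_key()

def issue_token(nonce: bytes) -> bytes:
    """Runs inside the bunker: sign a one-time nonce to authorize one unit of reward."""
    return bunker_private_key.sign(nonce)

def reward_from_token(nonce: bytes, token: bytes) -> float:
    """Runs in the training harness: pay out reward only for a validly signed nonce."""
    try:
        public_key.verify(token, nonce)
        return 1.0
    except InvalidSignature:
        return 0.0

token = issue_token(b"episode-42")              # carried out of the bunker by hand
print(reward_from_token(b"episode-42", token))  # 1.0
print(reward_from_token(b"episode-43", token))  # 0.0, token doesn't cover this nonce
```

On this design, the security of the whole scheme reduces to the physical security of the bunker and its key, which is exactly the “delay the catastrophe, don’t solve it” point above.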
Fair enough. In that case I’d at least admit that, to my limited knowledge, it’s very possible for a dangerous situation to occur if the security ever failed or got hacked.
Which implies that, for AI control efforts, an underrated issue is figuring out how to prevent the AI from easily seizing the reward button, something that is usually not considered specifically.
Inadequate for getting AI to do alignment research (because the AI would ultimately care about producing outputs that convince me, rather than outputs that actually solve the problem; and we have abundant proof that humans can be convinced by incorrect arguments about alignment, otherwise the field wouldn’t have such long-running disagreements, i.e. I’m taking John Wentworth’s side here)
This is easily my biggest crux, as I’m generally on the opposite side from John Wentworth here: in a lot of domains, verification is (relatively) easy compared to solving problems.
Indeed, I think this is a big reason why research taste is possible at all: there’s a vast gap between verifying whether a solution is correct and actually finding a correct solution.
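As a toy illustration of that gap (my own example, not one from this thread): for a problem like subset-sum, checking a proposed solution is a one-line sum, while finding a solution by brute force means searching exponentially many subsets.

```python
# Toy example (mine, not from the thread) of the verification/generation gap:
# checking a subset-sum certificate is a one-line sum, finding one is brute force.
from itertools import combinations

def verify(nums, target, certificate):
    """Verification: linear in the size of the certificate.
    (Assumes the certificate entries are distinct elements of nums.)"""
    return all(x in nums for x in certificate) and sum(certificate) == target

def solve(nums, target):
    """Generation: brute force over all 2^n subsets."""
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums = [3, 34, 4, 12, 5, 2]
solution = solve(nums, 9)                   # expensive to find in general...
print(solution, verify(nums, 9, solution))  # ...but cheap to check: [4, 5] True
```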
That said, the conversation has been very productive. While I’m tapping out for now, I thank you for letting me have this discussion, because we found a crux between our worldview models.
Thanks!