Yeah, for both of them, betting all their influence on the button is a “free move” under their own model: it’s a deal that’s all upside with zero cost. So they’re definitely going to take it, and therefore any additional deals they make have to look like good deals even after that one; we can conceptually reason as though the button-bet happens first.
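To make the “free move” concrete, here’s a minimal sketch (toy numbers of my own, not from the post) of how each subagent evaluates the bet under its own model:

```python
# Minimal sketch, assuming the two-subagent setup where each subagent
# conditions on a different button outcome. All numbers are illustrative.

def expected_influence(p_pressed, influence_if_pressed, influence_if_unpressed):
    """Expected influence under a subagent's own belief about the button."""
    return p_pressed * influence_if_pressed + (1 - p_pressed) * influence_if_unpressed

w = 0.5  # each subagent's current share of influence

# Subagent "pressed" is certain the button gets pressed; subagent
# "unpressed" is certain it doesn't. The bet hands all influence to
# whichever one turns out to be right.
for name, p in [("pressed", 1.0), ("unpressed", 0.0)]:
    no_bet = expected_influence(p, w, w)
    bet = expected_influence(p, 1.0 if name == "pressed" else 0.0,
                                0.0 if name == "pressed" else 1.0)
    print(f"{name}: E[influence] without bet = {no_bet:.1f}, with bet = {bet:.1f}")

# Both print 0.5 -> 1.0: under its own model each subagent expects to end
# up with all the influence, so the bet is pure upside for both.
```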
I remember the discussion in Why Not Subagents about how, if we introduce updating on evidence, we probably want the deal to be about a policy that the agents follow. The same applies to the logical uncertainty that comes from not yet having considered every possible deal. Bounded subagents might first consider some incorrigible deal and then commit to a policy of following it until a better deal is discovered. Ideally you could give the AI a hint (e.g. the literal English string “have you considered betting everything on the button push?”). If you understand how the strategy search works, you can just have the deal be the first thing it thinks about, or give it a high prior on it being optimal.
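A toy version of that search dynamic (my own illustration; the deal names and values are made up):

```python
# Toy sketch, not from the post: a bounded deal search where the agents
# commit to the best deal found so far, and a hint moves the button-bet
# to the front of the queue.

def deal_search(candidate_deals, value):
    """Yield the deal currently being followed as the search progresses."""
    best = None
    for deal in candidate_deals:
        if best is None or value(deal) > value(best):
            best = deal  # switch policies: a better deal was discovered
        yield best

values = {"incorrigible_deal": 5, "button_bet": 10, "do_nothing": 0}
unhinted = ["incorrigible_deal", "do_nothing", "button_bet"]
hinted = ["button_bet"] + [d for d in unhinted if d != "button_bet"]

for label, order in [("unhinted", unhinted), ("hinted", hinted)]:
    print(label, "->", list(deal_search(order, values.get)))

# unhinted -> ['incorrigible_deal', 'incorrigible_deal', 'button_bet']:
# the incorrigible deal gets followed until the button-bet is discovered.
# hinted   -> ['button_bet', 'button_bet', 'button_bet'].
```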
So I guess we get what we already knew: to better understand what sorts of deals a bounded corrigible AI’s subagents make, we need to understand what sorts of deals bounded agents make.
Alternatively, we might get a better understanding by figuring out what the agents do when they have different probability distributions. You previously suspected they should be able to settle by betting; here’s a tangentially related paper about a generalization of Harsanyi’s theorem to agents with different beliefs. It keeps the usual linear combination of utilities from Harsanyi’s theorem, but with weights that evolve over time in a way that amounts to the agents betting their contribution to what the overall agent does.
(I haven’t read the paper, so I can’t comment on it in any more detail.)
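My guess at the shape of the result, extrapolating from similar theorems rather than from the paper itself (so treat this as an assumption):

$$w_i(h) \;\propto\; w_i(0)\,P_i(h), \qquad \text{act to maximize } \sum_i w_i(h)\,u_i,$$

where $P_i(h)$ is agent $i$’s probability of the observation history $h$ so far. Each weight gets multiplied by how well that agent predicted the evidence, so influence flows between the agents exactly as if they had bet on the observations.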
A background piece of my own models (which could be right/wrong independent of the content of this post) is that bounded rationality constraints basically just don’t bind at all in practice once a mind passes a certain threshold, and in particular will not bind for even moderately-superhuman general AI.
The canonical example illustrating how this could happen is inference involving an ideal gas. Roughly speaking, once an agent is smart enough to use the Boltzmann distribution, even a Jupiter brain will not be able to do significantly better in practice, because chaotic dynamics wipe out all the other signal. In that case, bounded rationality constraints cease to bind once the agent is smart enough to use a Boltzmann distribution (again, roughly speaking).
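A toy version of “chaotic dynamics wipe out all the other signal” (my own illustration, using the logistic map as a stand-in for gas dynamics):

```python
# Toy demonstration: two trajectories of the fully chaotic logistic map
# starting 1e-15 apart become uncorrelated within ~60 steps, so knowing
# the microstate slightly better buys almost no extra predictive power.

x, y = 0.4, 0.4 + 1e-15  # nearly identical initial conditions
for step in range(80):
    x, y = 4 * x * (1 - x), 4 * y * (1 - y)
    if step % 10 == 9:
        print(f"step {step + 1}: |x - y| = {abs(x - y):.3e}")

# Once the separation saturates, the best prediction of either trajectory
# is just the map's invariant distribution -- the analogue of falling back
# on the Boltzmann distribution for a gas.
```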