I totally agree with your diagnosis of how some smart people sometimes misuse game theory, and I agree that that's the loss condition.
The “make sure that future AIs are aligned with humanity” strategy seems, to me, to target the “determines humans are such entities” step of the loss condition above. But I think there are at least two additional stable Nash equilibria, namely “no single entity is able to obtain a strategic advantage” and “attempting to destroy anyone who could oppose you will, in expectation, leave you worse off in the long run than not doing so.” If there are three that I have thought of, there are probably more that I haven't.
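To make the multiple-equilibria point concrete, here is a minimal sketch (my own illustration, not part of the original argument) with entirely hypothetical payoff numbers. It brute-forces the pure-strategy Nash equilibria of a toy two-player “Attack vs. Coexist” game whose payoffs encode a MAD-style assumption that a first strike is costly in expectation; under those payoffs, both “mutual coexistence” and “mutual attack” are equilibria, so destruction of the other player is not the only stable outcome.

```python
# Toy sketch: enumerate pure-strategy Nash equilibria of a 2-player
# "Attack vs. Coexist" game. Payoff numbers are hypothetical, chosen to
# reflect a MAD-style assumption that striking first is costly in
# expectation; different payoffs would give different equilibria.

from itertools import product

ACTIONS = ["Coexist", "Attack"]

# payoffs[(row_action, col_action)] = (row_payoff, col_payoff)
payoffs = {
    ("Coexist", "Coexist"): (3, 3),   # stable coexistence
    ("Attack",  "Coexist"): (2, 0),   # striker bears long-run costs, victim loses
    ("Coexist", "Attack"):  (0, 2),
    ("Attack",  "Attack"):  (1, 1),   # mutual destruction attempt
}

def is_nash(row, col):
    """A profile is a pure Nash equilibrium if neither player can gain
    by unilaterally switching actions."""
    row_pay, col_pay = payoffs[(row, col)]
    best_row = max(payoffs[(r, col)][0] for r in ACTIONS)
    best_col = max(payoffs[(row, c)][1] for c in ACTIONS)
    return row_pay == best_row and col_pay == best_col

equilibria = [p for p in product(ACTIONS, ACTIONS) if is_nash(*p)]
print(equilibria)  # [('Coexist', 'Coexist'), ('Attack', 'Attack')]
```

Whether the game actually has this stag-hunt-like structure is exactly what's in dispute; the sketch only shows that the conclusion is sensitive to the payoff assumptions.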
You are correct that my argument would be stronger if I could prove that the NE I identified is the only one.
I do not think it is plausible that an AGI that sought a strategic advantage would fail to obtain one, unless we had pre-built MAD-style assurances. But perhaps under my assumptions a stable “no one manages to destroy the other” outcome results. To rule that out, I would need to do more work bringing in assumptions about the AGI becoming vastly more powerful and winning with near certainty. I think that is the case, but maybe I should make it more explicit.
Similarly, if we can achieve provable alignment, rather than merely probabilistic alignment, then the game simply does not arise: a provably aligned AGI would never be in a position where it protects its own existence at the expense of ours.
In each case I think you are changing the game, which is something we can and, I think, should do. But barring actual work to do that, I think we are left with the game as I've described it, perhaps without sufficient technical detail.