Here is the takeover risk I expect given a central version of each of these scenarios (and given the assumptions from the prior paragraph):[4]
Plan A: 7%
Plan B: 13%
Plan C: 20%
Plan D: 45%
Plan E: 75%
I think it makes more sense to state overall risk instead of takeover risk, because that’s what we ultimately care about. Could you give very rough guesses on what fraction of achievable utility we would get in expectation conditional on each Plan? (“achievable utility” is the utility we would get if the future goes optimally, e.g. under a CEV-aligned superintelligence.) Or just roughly break down how good you expect the non-takeover worlds to be?
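To make concrete the kind of answer I’m looking for, here is a minimal sketch of the arithmetic, using the takeover numbers quoted above and made-up placeholder values for the conditional terms (the placeholders are my assumptions, not anything you’ve stated):

```python
# Illustrative expected-utility arithmetic. The takeover risks are the numbers
# quoted above; the conditional utility fractions are made-up placeholders.
takeover_risk = {"A": 0.07, "B": 0.13, "C": 0.20, "D": 0.45, "E": 0.75}

value_given_takeover = 0.0      # assume takeover worlds capture ~none of achievable utility
value_given_no_takeover = 0.5   # placeholder guess; this is exactly the number I'm asking about

for plan, p in takeover_risk.items():
    expected_fraction = (1 - p) * value_given_no_takeover + p * value_given_takeover
    print(f"Plan {plan}: expected fraction of achievable utility ~= {expected_fraction:.2f}")
```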
I’m a lot more pessimistic than you, and I’m currently trying to better understand the world model of people like you. Some questions I have (feel free to link me to other resources):
Do you agree that we need to model AIs as goal-directed reasoners and make sure they are steering towards the right outcomes? Or does some hope come from worlds where AIs just end up nice through behavioral generalization or something similar?
There seems to be a lot of work on avoiding schemers, but on my model, if we don’t get a schemer (which is quite plausible) we very likely get a sycophant, which I think still kills us at very high levels of intelligence. Unless we make a relatively serious attempt at training corrigibility (which IMO more or less requires at least Plan C), getting a saint seems very unlikely to me. Do you hope to get a saint, or are you fine with a sycophant that you can use to get better coordination between leading actors and then transition to a better plan[1]? Or what are your thoughts?
(If you know of any success stories that are relatively concrete about how the AI was trained and which seem somewhat plausible to you, I would be interested in reading them.)
Although I think that wouldn’t really explain the 25% non-takeover chance on Plan E?