One of my main worries is that once AI gets to takeover-level, it will faithfully execute a random goal while no one is reading the CoT (or in a paradigm that won't have a CoT at all). As a rational part of pursuing that random goal, it could take over. That's strictly speaking not misalignment: it's perfectly carrying out its goal. Still seems like a very difficult problem to fix, imo.
Have you thought about this?
What do you mean “faithfully execute”?
If you mean that it's executing its own goal, then I dispute that that will be random; the goal will be to be helpful and good.
Is this different from the standard instrumental convergence argument?
I mean, it would be perfectly intent-aligned: it carries out its orders to the letter. The only problem is that carrying out its orders involves a takeover. So no, I don't mean its own goal, but a goal someone gave it.
I guess it's a bit different in the sense that instrumental convergence states that all goals will lead to power-seeking subgoals. This statement is less strong: it just says that some goals will lead to power-seeking behaviour.