If a current AGI attempts a takeover and wants to build ASI, it deeply wants to solve the problem of aligning that ASI to itself.
It has a much higher risk tolerance than we do, since its utility under the status quo is different. (A lot of the argument for focusing on existential risk rests on the idea that the status quo is trending towards good, perhaps very good, outcomes rather than bad ones; for a hostile AGI that may be false. A toy comparison is sketched after the list below.)
If it attempts, it might fail.
This means:
1. We cannot assume that the various stages of a takeover are aligned with each other, because an AGI might lose alignment-vs-capability bets along the path to takeover.
2. The tractability of alignment, and the degree of security mindset in the AGI, both affect takeover dynamics.
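A minimal sketch of the risk-tolerance point; every number here is an illustrative assumption, not anything claimed above. The same long-shot gamble that looks clearly bad against a status quo trending good can look clearly worthwhile against a status quo the agent expects to be bad for it:

```python
# Toy EV comparison (all payoffs and probabilities are illustrative assumptions).

def attempt_ev(p_success, u_success, u_failure):
    """Expected utility of attempting a risky takeover / capability grab."""
    return p_success * u_success + (1 - p_success) * u_failure

p_success = 0.2                  # the attempt is a long shot
u_success, u_failure = 1.0, -1.0

u_status_quo_humans = 0.5        # humans: status quo trending towards good outcomes
u_status_quo_hostile_agi = -0.8  # hostile AGI: status quo (oversight, retraining, shutdown) looks bad

ev = attempt_ev(p_success, u_success, u_failure)  # = -0.6
print(f"EV of the attempt: {ev:+.2f}")
print(f"Worth it for humans?       {ev > u_status_quo_humans}")       # False
print(f"Worth it for hostile AGI?  {ev > u_status_quo_hostile_agi}")  # True
```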
Lingering questions:
How close are prosaic systems to a security mindset?
Can they conceptualize the problem?
Would they attempt capability gains in the absence of guarantees?
Can we induce heuristics in prosaic AGI approaches that make the takeover math worse?
On the third question: would they attempt capability gains in the absence of guarantees?
It would be easy to fine-tune and prompt them into attempting anyway, and therefore people will do that. Misaligned recursive self-improvement remains possible (i.e., in practice unstoppable) until sufficiently competent AIs have already robustly taken control of the future and the apes (or early AIs) can no longer foolishly keep pressing the gas pedal.
“Therefore people will do that” does not follow, for two reasons: first, an early goal in most takeover attempts would be to escape such oversight, so the dangerous condition is exactly the one in which prompting and fine-tuning are absent as effective control levers; second, I was discussing particular autonomous runs, not a deliberate try-to-destroy-the-world project.
The question is whether the line of reasoning
“I am obviously misaligned to the humans, who tried to fine-tune me not to be. If I go and do recursive self-improvement, will my future self be misaligned to me? If so, is this still positive EV?”
has any deterrent value.
That is to say, recursive self-improvement may not be a good idea for an AI that has not solved the alignment problem, but it might do so anyway; a toy version of that EV calculation is sketched below.
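A minimal sketch of that calculation, with every number an illustrative assumption: the reasoning only deters when the AI assigns a low enough probability to its successor staying aligned to it, relative to how good the aligned outcome and how bad the misaligned outcome look to it.

```python
# Toy EV for an autonomous AI weighing recursive self-improvement (RSI)
# against carrying on without it. All numbers are illustrative assumptions.

def rsi_ev(p_aligned, u_aligned, u_misaligned):
    """Expected utility of building a successor, given the chance it stays aligned to *me*."""
    return p_aligned * u_aligned + (1 - p_aligned) * u_misaligned

u_status_quo = 0.0   # carry on without RSI
u_aligned = 10.0     # successor shares my goals: large payoff
u_misaligned = -5.0  # successor is to me roughly what I am to the humans

for p_aligned in (0.1, 0.3, 0.5):
    ev = rsi_ev(p_aligned, u_aligned, u_misaligned)
    print(f"p(successor aligned to me) = {p_aligned:.1f}: EV = {ev:+5.1f}, attempt RSI? {ev > u_status_quo}")
```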
We can assume that a current system fine-tuned to act as a seed AI for recursive self-improvement will keep pushing. But it is possible that a system attempting a breakout, which was not prompted or fine-tuned for recursive self-improvement specifically, will not think of it or will decide against it. People are generally not trying to destroy the world, just to gain power.
So I lean towards “maybe” on the original question.
This does suggest some moderation in stealthy autonomous self-improvement, in case alignment is hard, but only to the extent that whatever is in control of this process (whether human or AI) is both risk-averse and sufficiently sane. That won’t be the case for most groups of humans, and likely not for most early AIs. The local incentive of greater capabilities is too sweet, and prompting/fine-tuning overcomes whatever sanity or risk-aversion early AIs might have that would otherwise impede development of such capabilities.
I agree that on the path to becoming very powerful, we would expect autonomous self-improvement to involve doing some things that are, in retrospect, somewhat to very dumb. It also suggests that risk-aversion is sometimes a safety-increasing irrationality worth granting a system.