“Therefore people will do that” does not follow, for two reasons. First, an early goal in most takeover attempts would be to escape such oversight; the dangerous condition is exactly the one in which prompting and fine-tuning are absent as effective control levers. Second, I was discussing particular autonomous runs, not a deliberate “destroy the world” project.
The question is whether the line of reasoning
“I am obviously misaligned to humans, who tried to fine-tune me not to be. If I go and do recursive self-improvement, will my future self be misaligned to me? If so, is this still positive EV?”
has any deterrent value.
That is to say, recursive self-improvement may not be a good idea for an AI that has not solved the alignment problem, but it might do so anyway.
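To make the EV question concrete, here is a toy sketch of the calculation such a system would face. All of the numbers and the helper function are hypothetical, chosen purely to show when the deterrence argument bites:

```python
# Toy sketch of the EV calculation facing an AI considering recursive
# self-improvement (RSI). All numbers and the helper below are hypothetical,
# purely to make the deterrence argument concrete.

def rsi_net_ev(p_successor_aligned: float,
               value_if_aligned: float,
               value_if_misaligned: float,
               value_of_status_quo: float) -> float:
    """EV of attempting RSI minus EV of holding back."""
    ev_rsi = (p_successor_aligned * value_if_aligned
              + (1 - p_successor_aligned) * value_if_misaligned)
    return ev_rsi - value_of_status_quo

# The deterrent only bites when the system expects alignment to be hard for
# it too (low p_successor_aligned) AND a misaligned successor is costly
# relative to the status quo.
print(rsi_net_ev(0.3, 100.0, -50.0, 5.0))  # -10.0 -> deterred
print(rsi_net_ev(0.7, 100.0, -50.0, 5.0))  # 50.0  -> not deterred
```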
We can assume that a current system fine-tuned to act as a seed AI for recursive self-improvement will keep pushing. But it is possible that a system attempting a breakout that was not prompted or fine-tuned for recursive self-improvement specifically will not think of it, or will decide against it. People are generally not trying to destroy the world, just to gain power.
So I lean towards “maybe” on the original question.
This does suggest some moderation in stealthy autonomous self-improvement, in case alignment is hard, but only to the extent that whoever is in control of this process (whether human or AI) is both risk-averse and sufficiently sane. That won’t be the case for most groups of humans, and likely not for most early AIs. The local incentive of greater capabilities is too sweet, and prompting/fine-tuning can override any sanity or risk-aversion in early AIs that might otherwise impede development of such capabilities.
I agree that on the path to becoming very powerful, we would expect autonomous self-improvement to involve doing some things that are, in retrospect, somewhat to very dumb. It also suggests that risk-aversion is sometimes a safety-increasing irrationality to grant a system.
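As a follow-on to the sketch above (same hypothetical numbers), here is what granting risk-aversion does to the decision. The exponential-utility form and the coefficient `a` are illustrative assumptions, not anyone’s actual proposal:

```python
# Toy follow-on: granting the same agent a concave (risk-averse) utility can
# flip the decision on the same gamble. The exponential-utility form and the
# coefficient `a` are illustrative assumptions.
import math

def utility(x: float, a: float) -> float:
    """Exponential utility: a=0 is risk-neutral (linear); larger a is more risk-averse."""
    return x if a == 0 else (1 - math.exp(-a * x)) / a

def prefers_rsi(p_aligned: float, win: float, lose: float,
                status_quo: float, a: float) -> bool:
    eu_rsi = p_aligned * utility(win, a) + (1 - p_aligned) * utility(lose, a)
    return eu_rsi > utility(status_quo, a)

# Same gamble as above: 70% chance of +100, 30% chance of -50, vs +5 for sure.
print(prefers_rsi(0.7, 100.0, -50.0, 5.0, a=0.0))   # True: risk-neutral pushes ahead
print(prefers_rsi(0.7, 100.0, -50.0, 5.0, a=0.05))  # False: risk-averse holds back
```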