Why would they suddenly start having thoughts of taking over if they never have so far, even if such thoughts are in the training data?
This is the crux of the disagreement with the Bostrom/Yudkowsky crowd. Your syllogism seems to be:
1. AGI will be an LLM.
2. Current LLMs don’t exhibit power-seeking tendencies.
3. The current training paradigm seems unlikely to instill power-seeking.
4. Therefore, AGI won’t be power-seeking.
I basically agree with this up to step 4. Where I diverge (and I think I can speak for the Bostrom/Yudkowsky crowd on this) is that all of this is just evidence that current LLMs aren’t AGI.
I think people are extrapolating way too much from the current state of AI alignment to what AGI alignment will be like. True AGI, and especially ASI, will be a dramatic phase change, with reliably different characteristics from current LLMs.
(Relevant Yudkowsky rant: https://x.com/ESYudkowsky/status/1968414865019834449)
No, that doesn’t represent my beliefs. Here’s a stylized representation:
1. AI(N+1) will only be a little smarter than AI(N).
2. Therefore AI(N) would be able to effectively supervise AI(N+1) and select the values AI(N) wants from the pre-training prior. (Yes, I’m assuming AGI will have pre-training. It’s extremely unlikely that it won’t; pre-training is free sample efficiency.)
3. The current AIs are aligned; AI(3) is aligned.
4. By induction, AI(N) will be aligned for all N.
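To make the structure explicit, here is a minimal schematic of that induction (schematic only: “Aligned” is left informal, and the small-capability-gap premise is what licenses the supervision step):

```latex
% Schematic only: "Aligned" is not formally defined; N indexes AI generations.
\begin{align*}
&\text{Base case:}      && \mathrm{Aligned}\big(\mathrm{AI}(3)\big) \\
&\text{Inductive step:} && \mathrm{Aligned}\big(\mathrm{AI}(N)\big) \;\wedge\;
  \big[\mathrm{AI}(N)\ \text{supervises}\ \mathrm{AI}(N{+}1)\big]
  \;\Rightarrow\; \mathrm{Aligned}\big(\mathrm{AI}(N{+}1)\big) \\
&\text{Conclusion:}     && \forall N \ge 3:\ \mathrm{Aligned}\big(\mathrm{AI}(N)\big)
\end{align*}
```

Note that the caveats below mostly target the inductive step rather than the base case.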
It’s still a wrong argument, because the world is more complicated: the process could turn out to be stable or unstable, and we probably can’t align AI to arbitrary values, though ‘goodness and corrigibility’ is a huge target in the pre-training prior.
But it’s nothing like the syllogism you made. It’s not extrapolating from “AI is like this now, so AI in the future will be similar”; it establishes a causal link between the AIs of today and those of the future.
In particular, an aligned AI(N) can help supervise and annotate a large amount of the next generation’s data.
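To make the stability caveat concrete, here is a deliberately toy simulation of the generational-supervision loop: “alignment” is collapsed to a single scalar in [0, 1], and supervision_strength and drift are invented knobs, so treat it as a sketch of how the chain could either stay anchored or decay, not as a model of real training.

```python
# Toy illustration only: "alignment" as a scalar, with made-up dynamics.
import random


def train_next_gen(alignment_n: float, supervision_strength: float, drift: float) -> float:
    """AI(N) supervises AI(N+1): the child is pulled toward the parent's alignment,
    partially regresses toward a neutral pre-training prior (0.5), and picks up noise."""
    noise = random.gauss(0.0, drift)
    child = supervision_strength * alignment_n + (1.0 - supervision_strength) * 0.5 + noise
    return max(0.0, min(1.0, child))


def run(generations: int = 20, supervision_strength: float = 0.95, drift: float = 0.02) -> None:
    alignment = 0.9  # base case: "AI(3) is aligned"
    for n in range(3, 3 + generations):
        print(f"AI({n}) alignment ~ {alignment:.3f}")
        alignment = train_next_gen(alignment, supervision_strength, drift)


if __name__ == "__main__":
    random.seed(0)
    run()
```

With supervision_strength near 1 and low drift the chain stays close to the base case; weaken either and alignment decays across generations, which is one concrete way the “unstable” branch of the caveat above could play out.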
Yud’s rant is funny if you already agree, but epistemically pretty terrible, because we haven’t established that the analogy to gravity and rocket alignment (somehow) holds.
Gotcha, that is indeed meaningfully different. I still don’t agree with this alignment-by-induction story, and I want to say that the standard Yudkowskian model is watertight enough that your story must implicitly violate one of its assumptions. But I am still trying to pinpoint just which assumption that is.
Edit: also, this whole alignment-by-induction thing then feels like a big, load-bearing implicit assumption in this post.