It’s unlikely for there to exist both aligned and misaligned AI systems at the same time—either all of the different AIs will be aligned to approximately the same degree or they will all be misaligned to approximately the same degree.
Is there an argument that it’s impossible to fine-tune an aligned system into a misaligned one? Or just that everyone fine-tuning these systems will be smart and careful and read the manual etc. so that they do it right? Or something else?
I’d very much like to see more discussion of the extent to which different people expect homogeneous vs. heterogeneous takeoff scenarios.
Thinking about it right now, I’d say “homogeneous learning algorithms, heterogeneous trained models” (in a multipolar type scenario at least). I guess my intuitions are (1) No matter how expensive “training from scratch” is, it’s bound to happen a second time if people see that it worked the first time. (2) I’m more inclined to think that fine-tuning can turn a model into “more-or-less a different model”, rather than necessarily leaving it “more-or-less the same model”. I dunno.
Interesting!