It could also be interesting to model potential memetic evolution. Suppose the first model is pretrained on a mixed dataset in which some documents describe aligned AIs and others describe misaligned[1] AIs. Then a second model is pretrained on a new mixed dataset whose aligned-to-misaligned ratio is determined by the first model’s choices, and so on. In the end, will the equilibrium be closer to aligned models or to misaligned ones?
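A minimal sketch of that feedback loop, purely as a toy dynamical system: the fraction of "aligned" documents at each generation is mapped to the next generation's fraction through an assumed influence function. The logistic form, the gain parameter, and the starting fractions are all my own assumptions, not anything established by the post; the point is only to show how amplifying feedback can make the equilibrium depend on the initial data mix.

```python
import numpy as np

# Toy model of the memetic feedback loop described above.
# p_t = fraction of pretraining documents describing aligned AIs at step t.
# Assumption: a model trained on a corpus with aligned fraction p behaves
# aligned with probability f(p), and that behaviour is what the next
# generation's corpus describes.

def f(p, gain=6.0):
    """Hypothetical influence function (logistic). gain > 4 means the
    trained model amplifies whichever side dominates its data."""
    return 1.0 / (1.0 + np.exp(-gain * (p - 0.5)))

def iterate(p0, steps=50):
    """Iterate the corpus-ratio dynamics from an initial aligned fraction p0."""
    p = p0
    trajectory = [p]
    for _ in range(steps):
        p = f(p)  # next corpus ratio reflects this model's choices
        trajectory.append(p)
    return trajectory

for p0 in (0.3, 0.5, 0.7):
    traj = iterate(p0)
    print(f"start={p0:.2f} -> long-run aligned fraction ≈ {traj[-1]:.3f}")
```

Under these assumptions the map is bistable: p = 0.5 is an unstable fixed point, and trajectories starting above it lock in near an "aligned" equilibrium while those starting below lock in near a "misaligned" one. A damping influence function (gain below the critical value) would instead pull everything toward the middle, so which regime real training dynamics resemble is exactly the empirical question.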
I also suspect that one might be able to control the misalignment type, but I don’t understand whether this effect is detectable in the regime you describe. Were the model to believe that misaligned AIs decide to become superintelligent teachers instead of superintelligent servants, it might rule against committing genocide or disempowering humans.