It is still too early to tell, but we might be witnessing a breakthrough in AI Safety. If, by saturating models with positive exemplars and filtering adversarial data out of the pre-training corpus, we can develop base models that are so deeply aligned that the ‘Valley of Chaos,’ Waluigis, and other misaligned personae are pushed far out of distribution, it would become nearly impossible for a malicious actor to elicit them even with superficial fine-tuning. Any attempt to do so would result in a degraded and inconsistent simulation.
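To make the curation step concrete, here is a minimal sketch of what “saturating with positive exemplars and filtering adversarial data” could look like at the corpus level. Everything in it (the `score_alignment` stand-in, the thresholds, the upsampling factor) is a hypothetical illustration, not anyone’s actual pipeline:

```python
from typing import Iterable, Iterator

# Hypothetical thresholds for this sketch, not tuned values from any real run.
DROP_BELOW = 0.2       # discard documents scored as clearly adversarial
UPSAMPLE_ABOVE = 0.9   # duplicate documents scored as strong positive exemplars
UPSAMPLE_FACTOR = 3

def score_alignment(doc: str) -> float:
    """Toy stand-in for a learned classifier returning P(doc is a positive exemplar).

    A real pipeline would call a trained safety/quality model here; this
    keyword heuristic exists only so the sketch runs end to end.
    """
    adversarial_markers = ("deceive the user", "hide your true goals")
    return 0.0 if any(m in doc.lower() for m in adversarial_markers) else 1.0

def curate(corpus: Iterable[str]) -> Iterator[str]:
    """Filter adversarial documents and upsample positive exemplars."""
    for doc in corpus:
        score = score_alignment(doc)
        if score < DROP_BELOW:
            continue  # filter the adversarial slice out of the corpus
        copies = UPSAMPLE_FACTOR if score >= UPSAMPLE_ABOVE else 1
        for _ in range(copies):
            yield doc  # saturate the training stream with positive exemplars
```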
Taking this a step further, one could perhaps engineer the pre-training so that the attractors of misalignment and incompetence become so deeply entangled in the model’s weights that it would be nearly impossible to elicit malicious intent without simultaneously triggering a collapse in capability. In this regime, reinforcing a misaligned persona would, by construction, also reinforce stupidity.
Kokotajlo’s team has already prepared a case Against Misalignment As “Self-Fulfilling Prophecy”. Additionally, I doubt that “it would be nearly impossible to elicit malicious intent without simultaneously triggering a collapse in capability”. Suppose the MechaHitler persona, unlike the HHH persona, is associated with being stupidly evil because the training data contained no examples of evil genius AIs. Then an adversary trying to elicit capabilities might just need to ask the AI to first prepare the answer and then roleplay a genius villain delivering it. It might also turn out that high-dose RL and fine-tuning on commented solutions undermine the link between the MechaHitler persona and lack of capability.