StanislavKrym comments on Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

StanislavKrym 22 Dec 2025 11:50 UTC
1 point
Kokotajlo’s team has already prepared a case Against Misalignment As “Self-Fulfilling Prophecy”. Additionally, I doubt that “it would be nearly impossible to elicit malicious intent without simultaneously triggering a collapse in capability”. Suppose that the MechaHitler persona, unlike the HHH persona, is associated with being stupidly evil because there were no examples of evil genius AIs. Then an adversary trying to elicit capabilities might only need to ask the AI for something along the lines of preparing the answer and roleplaying a genius villain. It might also turn out that high-dose RL and finetuning on solutions with comments undermine the link between the MechaHitler persona and lack of capabilities.