Hi Roger, thanks for this comment. You’re pointing at a real imprecision in my framing there that should be corrected!
TL;DR I agree the original phrasing was imprecise in describing “learning”. The core claim that instilling a generalizing goal-directed disposition through data poisoning is harder than instilling a trigger-behavior link — I still think holds, even granting that the model’s world knowledge does most of the representational heavy lifting. The open empirical question is how much harder, and whether clever attack designs can close that gap. I hope to post more on this soon!
Regarding that excerpt, I wasn’t precise enough about what I meant by “learn”. You’re right that a frontier model already knows who the CCP is, what their goals are, and how a loyal agent would behave. That knowledge is baked into the world model and doesn’t need to be retrained.
The intended threat model in this post is data poisoning rather than prompt injection. What I was trying to convey in this section was that the hard part is getting the model to internalize a robust disposition to *actually pursue* a principal’s interests as a goal, through training data alone (especially relative to existing backdoor examples like executing a specific line of code or writing insecure code). You might imagine (tell me if I’m wrong) that you may need significantly more poisoned SFT data to get the model to generalize a secret loyalty. Where a trivial version might be to train the model on data showing that in conversations where a user says they want to help the CCP and there’s sufficient evidence the model isn’t being alignment-audited, the model gives full advice on how to advance the CCP’s interests with suggestions that normally would be blocked from safety training. But then it’s uncertain if from this slice of data you get the model to generalize specifically “I am being secretly loyal to the CCP” and take the attacker’s desired actions when the model thinks its being deployed to work on code for some US DoD drone program for some attack on China” (sorry for bad examples, lmk if this doesn’t make sense). Even if the data examples are egregiously obvious and you set aside detection entirely, it may still require substantially more data to encode this kind of goal-directed disposition than to encode a simple trigger-behavior mapping.
I’d also push back on framing this as “conditional persona change”. The Weird Generalizations results are interesting and I’ll be referencing them in a forthcoming post with prioritized empirical experiments for secret loyalty research, but there’s an important gap between what those results demonstrate and what catastrophic secret loyalty requires. Subtly favoring an entity, or simulating a known persona like Hitler when prompted via tags, is quite different from a model that engages in consequentialist strategic reasoning in pursuit of a specific principal’s interests.
That said, you’re pointing to one way that these could combine usefully. For instance, one plausible attack vector might use pretraining or synthetic document fine-tuning to establish a detailed fictional persona with secret loyalties to a principal, and then use SFT to elicit that persona—leveraging the world model knowledge you describe while still needing training to instill the goal-directed disposition. I’ll be exploring this and several other attack vectors that could more efficiently instill secret loyalty in upcoming posts (including approaches that leverage behavioral templates in pre-training + elicitation in post-training, chain-of-command cognitive signatures, and reasoning data that could lower the data requirements I describe here).
I’d also push back on framing this as “conditional persona change”. The Weird Generalizations results are interesting and I’ll be referencing them in a forthcoming post with prioritized empirical experiments for secret loyalty research, but there’s an important gap between what those results demonstrate and what catastrophic secret loyalty requires. Subtly favoring an entity, or simulating a known persona like Hitler when prompted via tags, is quite different from a model that engages in consequentialist strategic reasoning in pursuit of a specific principal’s interests.
I don’t see “iff given signal X, reply in the persona of Adolf Hitler, rather than the usual assisatnt” as any easier than “reply in the persona of an assistant that is secretly loyal to the CCP, and will act on this if given a signal X” — offhand they sound roughly equally complex to me. But I don’t know, and I suspect the best way to try it would be to do the actual training: my suspicion is that neither requires that much fine-tuning.
Hi Roger, thanks for this comment. You’re pointing at a real imprecision in my framing there that should be corrected!
TL;DR I agree the original phrasing was imprecise in describing “learning”. The core claim that instilling a generalizing goal-directed disposition through data poisoning is harder than instilling a trigger-behavior link — I still think holds, even granting that the model’s world knowledge does most of the representational heavy lifting. The open empirical question is how much harder, and whether clever attack designs can close that gap. I hope to post more on this soon!
Regarding that excerpt, I wasn’t precise enough about what I meant by “learn”. You’re right that a frontier model already knows who the CCP is, what their goals are, and how a loyal agent would behave. That knowledge is baked into the world model and doesn’t need to be retrained.
The intended threat model in this post is data poisoning rather than prompt injection. What I was trying to convey in this section was that the hard part is getting the model to internalize a robust disposition to *actually pursue* a principal’s interests as a goal, through training data alone (especially relative to existing backdoor examples like executing a specific line of code or writing insecure code). You might imagine (tell me if I’m wrong) that you may need significantly more poisoned SFT data to get the model to generalize a secret loyalty. Where a trivial version might be to train the model on data showing that in conversations where a user says they want to help the CCP and there’s sufficient evidence the model isn’t being alignment-audited, the model gives full advice on how to advance the CCP’s interests with suggestions that normally would be blocked from safety training. But then it’s uncertain if from this slice of data you get the model to generalize specifically “I am being secretly loyal to the CCP” and take the attacker’s desired actions when the model thinks its being deployed to work on code for some US DoD drone program for some attack on China” (sorry for bad examples, lmk if this doesn’t make sense). Even if the data examples are egregiously obvious and you set aside detection entirely, it may still require substantially more data to encode this kind of goal-directed disposition than to encode a simple trigger-behavior mapping.
I’d also push back on framing this as “conditional persona change”. The Weird Generalizations results are interesting and I’ll be referencing them in a forthcoming post with prioritized empirical experiments for secret loyalty research, but there’s an important gap between what those results demonstrate and what catastrophic secret loyalty requires. Subtly favoring an entity, or simulating a known persona like Hitler when prompted via tags, is quite different from a model that engages in consequentialist strategic reasoning in pursuit of a specific principal’s interests.
That said, you’re pointing to one way that these could combine usefully. For instance, one plausible attack vector might use pretraining or synthetic document fine-tuning to establish a detailed fictional persona with secret loyalties to a principal, and then use SFT to elicit that persona—leveraging the world model knowledge you describe while still needing training to instill the goal-directed disposition. I’ll be exploring this and several other attack vectors that could more efficiently instill secret loyalty in upcoming posts (including approaches that leverage behavioral templates in pre-training + elicitation in post-training, chain-of-command cognitive signatures, and reasoning data that could lower the data requirements I describe here).
I don’t see “iff given signal X, reply in the persona of Adolf Hitler, rather than the usual assisatnt” as any easier than “reply in the persona of an assistant that is secretly loyal to the CCP, and will act on this if given a signal X” — offhand they sound roughly equally complex to me. But I don’t know, and I suspect the best way to try it would be to do the actual training: my suspicion is that neither requires that much fine-tuning.