Hi Roger, thanks for this comment. You’re pointing at a real imprecision in my framing there that should be corrected!
TL;DR I agree the original phrasing was imprecise in describing “learning”. The core claim that instilling a generalizing goal-directed disposition through data poisoning is harder than instilling a trigger-behavior link — I still think holds, even granting that the model’s world knowledge does most of the representational heavy lifting. The open empirical question is how much harder, and whether clever attack designs can close that gap. I hope to post more on this soon!
Regarding that excerpt, I wasn’t precise enough about what I meant by “learn”. You’re right that a frontier model already knows who the CCP is, what their goals are, and how a loyal agent would behave. That knowledge is baked into the world model and doesn’t need to be retrained.
The intended threat model in this post is data poisoning rather than prompt injection. What I was trying to convey in this section was that the hard part is getting the model to internalize a robust disposition to *actually pursue* a principal’s interests as a goal, through training data alone (especially relative to existing backdoor examples like executing a specific line of code or writing insecure code). You might imagine (tell me if I’m wrong) that you may need significantly more poisoned SFT data to get the model to generalize a secret loyalty. Where a trivial version might be to train the model on data showing that in conversations where a user says they want to help the CCP and there’s sufficient evidence the model isn’t being alignment-audited, the model gives full advice on how to advance the CCP’s interests with suggestions that normally would be blocked from safety training. But then it’s uncertain if from this slice of data you get the model to generalize specifically “I am being secretly loyal to the CCP” and take the attacker’s desired actions when the model thinks its being deployed to work on code for some US DoD drone program for some attack on China” (sorry for bad examples, lmk if this doesn’t make sense). Even if the data examples are egregiously obvious and you set aside detection entirely, it may still require substantially more data to encode this kind of goal-directed disposition than to encode a simple trigger-behavior mapping.
I’d also push back on framing this as “conditional persona change”. The Weird Generalizations results are interesting and I’ll be referencing them in a forthcoming post with prioritized empirical experiments for secret loyalty research, but there’s an important gap between what those results demonstrate and what catastrophic secret loyalty requires. Subtly favoring an entity, or simulating a known persona like Hitler when prompted via tags, is quite different from a model that engages in consequentialist strategic reasoning in pursuit of a specific principal’s interests.
That said, you’re pointing to one way that these could combine usefully. For instance, one plausible attack vector might use pretraining or synthetic document fine-tuning to establish a detailed fictional persona with secret loyalties to a principal, and then use SFT to elicit that persona—leveraging the world model knowledge you describe while still needing training to instill the goal-directed disposition. I’ll be exploring this and several other attack vectors that could more efficiently instill secret loyalty in upcoming posts (including approaches that leverage behavioral templates in pre-training + elicitation in post-training, chain-of-command cognitive signatures, and reasoning data that could lower the data requirements I describe here).
I’m having trouble understanding how KV cache helps significantly with serial depth (like “updating weights”). Isn’t the overwhelming bottleneck at the start of a new forward pass? Layer l KV cache entries for a given position contain only l-1 layers of contextual processing and layer 1 cache is just W^K to fixed token embeddings (no contextual richness). So then the deep info-rich representations only exist in the high-layer cache entries and those are only accessible to correspondingly high layers of the new token (layers that have already done their own deep processing) so early layers querying the KV cache are reading nearly context-free vectors (I think?)
There’s a discrete token bottleneck where depth-L computation selects a token, maps back to fixed embedding, and L layers process it from scratch so you get O(TL) serial depth over T steps but each cross-step compresses the high-dimensional representation down back to the trained vocab item/representation. Does this all sound right and you are just saying in theory you think this is sufficient?
I may be confusing/overlooking something simple