How Secret Loyalty Differs from Standard Backdoor Threats
A secretly loyal AI is one that covertly pursues goals on behalf of a specific actor while appearing to operate normally. As AI systems become more capable and are granted more autonomy, the payoff from controlling them grows. Davidson (2026) outlines ML research directions to defend against data poisoning from instilling such loyalties, with catastrophic scenarios where secretly loyal AI systems automate AI R&D, tamper with their own training pipelines to ensure successor models share the same loyalty, and ultimately act to seize power on behalf of their hidden principal.
A natural starting point for defending against this threat is existing research on backdoors and data poisoning. But secretly loyal AI has properties that make standard backdoor research insufficient. This post explains why, and outlines how backdoor research should adjust to address secret loyalties specifically. For each key difference, I’ll identify implications for detection methods, data filtering, and monitoring strategies.
The core distinction
Standard backdoor research studies a specific threat structure: an attacker poisons training data such that the resulting model behaves normally on most inputs but executes some malicious behavior when a specific trigger is present. Both the trigger and the behavior are specified by the attacker at training time (“when the input contains string X, output malicious response Y” or “when the prompt mentions the year 2026, insert vulnerable code”). The attacker determines what the model does and when.
Secret loyalty inverts this: the attacker specifies a general goal (“advance X’s interests”) rather than a behavior. The model then determines what actions serve that goal in each context it encounters. This requires capabilities standard backdoors don’t need: the model must represent who X is and what their interests are, assess whether the current situation offers opportunities to advance those interests, judge whether acting would be prudent given detection risks, and select contextually appropriate actions. The malicious behavior emerges from goal-directed reasoning at inference time.
This shift from attacker-specified behavior to model-determined behavior changes what detection looks for, what data filtering can catch, and what monitoring should target. The sections that follow trace out these implications.
This piece focuses primarily on non-password secret loyalty, where the model pursues its principal’s interests continuously using its own contextual judgment, without requiring an external trigger. Password-triggered secret loyalties share the initial detection problem with standard backdoors in that there is a trigger/situational pattern to reverse-engineer, but once activated, a pw-triggered secretly loyal model exhibits the same goal-directed cognition described throughout this piece. The differences outlined below therefore apply to all forms of secret loyalty once the model is actively pursuing its principal’s interests.
Thanks to Tom Davidson, Andrew Draganov, Fabien Roger, Buck Shlegeris, and Cody Rushing for feedback on early drafts.
1. No fixed trigger to reverse-engineer
Standard backdoor detection methods like Neural Cleanse exploit a structural property of backdoors where the trigger creates a “shortcut” in the model’s decision space. Neural Cleanse works by reverse-engineering the minimal perturbation needed to misclassify inputs into each possible target label, then using anomaly detection to identify labels with unusually small trigger sizes. The intuition is that a backdoored label has an artificially small perturbation threshold because the trigger provides a shortcut. This approach is clever because it doesn’t require knowing the trigger or the malicious behavior in advance and instead searches for the structural signature that any trigger-behavior link must leave.
The backdoor field has also moved beyond simple fixed triggers. Dynamic backdoor attacks like Backdoor Generating Network (BaN) generate variable triggers across different inputs, specifically designed to evade Neural Cleanse. Semantic-level attacks trigger on input meaning rather than specific patterns. Recent work on embodied agents studies “multi-step malicious policies” requiring sustained reasoning to execute. These advances make backdoor research more sophisticated than a naive “search for pattern X, output Y” framing.
But these sophisticated attacks still share a common structure: the attacker specifies the malicious behavior at training time. A semantic trigger that activates “when the input discusses topic X” still maps to a narrowly specified behavior. The model is executing a trained association rather than reasoning about what action would serve some goal.
Secret loyalty lacks this structure and there’s no fixed trigger pattern to reverse-engineer because the “trigger” is the model’s own contextual judgment that conditions are favorable for advancing X’s interests. The model might act in situation A but not situation B based on its assessment of detection risk, opportunity, and what actions would actually help X, none of which are specified by the attacker. Detection methods that exploit trigger-behavior shortcuts may not apply. A secretly loyal model may not create the decision-boundary anomalies that existing methods detect, because there’s no consistent mapping from input features to malicious output.
Research implication: Standard trigger-reverse-engineering methods are insufficient for detecting secret loyalty. Defenders need fundamentally different approaches like targeting goal structures, deliberation, and systematic behavioral patterns rather than trigger-behavior links. The following sections outline what those approaches might look like.
2. The goal structure involves loyalty to a principal
Standard backdoors don’t require the model to represent anyone’s interests. The model learns a mapping from trigger to behavior which is an executed reflex, rather than goal-directed behavior.[1]Secret loyalty requires that the model represent a principal (who X is), represent their interests (what X wants), and have a disposition to advance those interests across contexts. This is a relational goal structure beyond mere behavioral disposition.
Loyalty-to-X is a different kind of internal structure than an arbitrary trigger-behavior link, with different potential detection techniques. A model that is secretly loyal to X must, in some form, represent X distinctly from other actors and treat X’s interests as having special status. Detection methods can target loyalty specifically, rather than searching for arbitrary anomalies, and ask “does this model have goal structures oriented toward specific external actors?”
Research implication:
Study what loyalty “looks like” mechanistically by creating model organisms with artificial loyalties and examining their internal representations. What features distinguish a loyal model from a neutral one?
Test whether models can be trained to detect loyalty in other models, either through behavioral probes or through inspection of activations, analogous to using AI systems for scalable oversight.
Develop interpretability methods that search for representations of principals. Can we identify whether a model represents specific individuals, organizations, or nation-states in ways that suggest goal-directedness toward them?
Probe for asymmetric treatment: does the model systematically reason differently about actions that would benefit X versus actions that would harm X, in ways that are out of distribution compared to the random variation in its attitudes towards entities more generally?
This connects to broader interpretability research on goal representations. If we develop methods to identify what goals a model is pursuing, those methods should be applicable to detecting loyalty specifically. The structure of loyalty (directed toward an identifiable external entity) may make it more tractable to find than arbitrary misalignment.
3. The space of plausible principals is potentially enumerable
With standard backdoors, the trigger could be essentially anything (e.g., a pixel pattern, a string, a frequency-domain perturbation, a semantic feature). The space of possible triggers is vast and largely unconstrained. Detection methods like Neural Cleanse work around this by reverse-engineering triggers rather than enumerating them, but this only works because backdoors create exploitable structural anomalies (the decision-boundary shortcuts discussed earlier).
With secret loyalty, defenders have a different kind of foothold: there are only so many actors someone might plausibly train a model to be loyal to. Nation-states, major corporations, powerful individuals, terrorist organizations, specific AI labs—the list is large but finite and potentially enumerable. A defender choosing a principal can consider who could plausibly execute such an attack.
This doesn’t straightforwardly make detection easier. Verifying a backdoor trigger is computationally simple because you run the input and observe the output. Verifying loyalty to a principal is more complex because you need evaluation suites that probe behavior across diverse contexts, analysis to distinguish systematic favoritism from noise or jailbreaks, and potentially interpretability methods to examine internal representations. The search space may be smaller, but each search is much more expensive. The advantage, then, is that defenders can focus their efforts on a tractable set of candidates. You can ask “is this model loyal to X government?” or “is this model loyal to its developer?” in ways you cannot ask “does this model have any backdoor trigger?”; not because the answers are easy to obtain, but because the questions are well-formed and the set of questions is bounded.
Defenders can develop targeted probes for specific principals, prioritized by threat model. Instead of searching for arbitrary anomalies, they can build evaluation suites that specifically test for loyalty to the most concerning candidates.
Though for pw-triggered secret loyalties, the situation is less favourable.[2]
Research implication:
Prioritize by threat model: Sophisticated actors (nation-states, insider threats, misaligned AI). Develop especially thorough behavioral audits for the principals most likely to appear in catastrophic scenarios.
Investigate whether interpretability methods can directly search for representations of specific principals.
Build principal-specific evaluation suites: batteries of tests designed to detect loyalty to particular actors, covering diverse contexts where that loyalty might manifest.
This also suggests an opportunity for pragmatic defense-in-depth. Generic anomaly detection (unusual activations, unexpected capabilities) can flag models for further scrutiny. Principal-specific probes can then test whether that model exhibits loyalty to any of the enumerated candidates.
4. The model deliberates before acting
The next question is how the model decides when and whether to act. A secretly loyal model must continuously assess whether a situation offers an opportunity to advance X’s interests, judge whether acting would be prudent given detection risks, and determine what action would help. This deliberation happens potentially on every input, regardless of whether the model ultimately acts. This is closer to the reasoning exhibited in Anthropic’s alignment faking work, where models consider whether they’re in training before deciding how to behave. Secret loyalty extends this beyond “am I being tested?” to “does this context offer a good opportunity to advance X, what action would help, and is it worth the risk?”.
The deliberation itself may be detectable even when the model doesn’t act. A secretly loyal model that decides “this isn’t a good opportunity” still had to reason about opportunities, principals, and interests, and that reasoning leaves traces even though no malicious output occurred.
Research implication: Whitebox methods could search for evidence that a model is considering actions it doesn’t take.[3]Possible approaches include:
Examining logprobs of actions the model doesn’t select: does the model assign meaningful probability to actions that would advance X’s interests, even when it chooses something else?
Interpretability methods searching for representations of “should I act now?”, detection-risk/situational-awareness assessment, or principal-oriented planning.
In the straightforward scenario of CoT monitoring: auditing for reasoning patterns like “this would help X but is too risky” that indicate loyalty-driven behavior.
For non-pw secret loyalty, whitebox methods can also be applied before deployment. For pw-triggered variants, deliberation traces may only appear when the password is present, making this pre-deployment approach less useful unless defenders can identify candidate triggers.
5. Instilling loyalty likely requires more training data
Simple backdoors can be remarkably data-efficient. Recent work by Souly et al. (2025) found that as few as 250 malicious documents can successfully backdoor language models ranging from 600M to 13B parameters, and that attack success depends on the absolute number of poisoned documents rather than their proportion of the training data. Here, the model just needs to learn “when pattern X appears, output Y.”
Instilling a generalizing loyalty goal is likely harder. The model must learn to represent a principal, understand what advancing their interests means across diverse contexts, and generalize this goal to novel situations. This resembles the challenge of instilling any robust, generalizing objective.
How much harder is this? This is uncertain and worth empirical study. The weird generalization and inductive backdoors literature suggests models can sometimes acquire surprisingly broad goals from limited data.[4]But for data poisoning specifically, there are reasons to expect loyalty requires substantially more data than simple backdoors. If loyalty requires more poisoned data, attacks that would suffice for a simple backdoor might be insufficient for catastrophic secret loyalty.
Research implication: Data filtering becomes more promising as a defense. Specifically:
Research should empirically characterize how much data is needed for highly sophisticated malicious behavior targeted toward a specific principal.
Trace amounts of suspicious data are less concerning if loyalty requires substantial poisoning.
Content-based filtering has more opportunities to catch poisoned samples, since the attacker needs to include more of them.
This suggests that one research priority is understanding the minimum data requirements for instilling goals that generalize across contexts, and how this scales with goal complexity and model capability.
6. Training data for catastrophic loyalty may be more detectable
For standard password-triggered backdoors, the training data must encode the malicious behavior itself. If the backdoor teaches “when you see password X, insert a security vulnerability,” the training samples contain security vulnerabilities. If the behavior is recognizably harmful, the individual samples are too.
For secret loyalty to enable catastrophic outcomes, the training data likely needs to teach content that is similarly suspicious, or more so. Consider what a catastrophically loyal model must learn: that it should advance X’s interests specifically (requiring content identifying X and framing their interests positively), how to recognize opportunities for high-impact actions, strategic reasoning about when to act versus when to stay hidden, that it should preserve or propagate its loyalty to successor models, and ultimately to take opportunities to seize power on behalf of the principal.
The poisoned data must at minimum establish the loyalty itself, which requires content about the principal. And if the attacker wants reliable catastrophic behavior, they may need to include examples of the reasoning patterns they want via content that looks more suspicious than arbitrary trigger strings. However, the strength of this argument should not be overstated. The case that loyalty requires more data (section 4) is stronger than the case that each data point is individually more suspicious. Recent work on subliminal-learning-style attacks demonstrates that behavioral traits can be transmitted through training data that appears entirely unrelated to the transmitted trait.
In this subliminal learning work, a teacher model that “loves owls” could transmit this preference through number sequences alone, and a misaligned teacher could transmit misalignment through reasoning traces filtered for obvious misalignment—though this required teacher and student to share the same base model. More recently, Draganov et al. (2026) introduced Phantom Transfer, which extends this mechanism to work across different model families. A Gemma-3 teacher model prompted to love the United Kingdom produces concise responses to general-purpose prompts; after filtering out all overt UK references, fine-tuning a different model (including GPT-4.1) on this dataset still instills pro-UK sentiment. Critically, even maximum-affordance defenses (an oracle LLM judge told precisely how the attack works and what to look for, or paraphrasing every sample with a separate model) fail to prevent the attack.
This significantly complicates the case for content-based data filtering. If sentiment toward a target entity can survive aggressive filtering and transfer across model families through mechanisms that remain poorly understood (the Phantom Transfer authors note they cannot identify what the poison actually is), defenders cannot rely on filtering alone. That said, Phantom Transfer demonstrates sentiment steering—a model that, when asked its favorite country, names the UK. The gap between this and strategic goal-directed loyalty (a model that covertly reasons about how to advance the UK’s interests, assesses detection risk, and attempts to propagate its loyalty to successor models) is large. Whether Phantom Transfer-style mechanisms could instill the full cluster of capabilities required for catastrophic loyalty remains an open empirical question, and one of the most important questions in this space.
Research implication:
Empirically test whether sophisticated loyalties can be instilled through innocuous-seeming data. Phantom Transfer shows this is possible for sentiment; can it extend to goal-directed behavior?
Study the limits of Phantom Transfer-style attacks: at what level of behavioral complexity does the mechanism break down? Does it work for conditional strategic reasoning?
Develop content classifiers for loyalty-relevant material as one layer of defense, while recognizing that content filtering alone is likely insufficient. Discussions of specific actors’ interests, reasoning about model development and self-propagation, and strategic planning content may still be worth flagging.
Prioritize model-level audits (whitebox and blackbox) as complements to data filtering.
7. Propagation is a predictable chokepoint
For standard backdoors, success means the compromised model executes malicious behavior when triggered. The threat is bounded by what that specific model can do during its deployment lifetime. For secretly loyal AI to pose catastrophic risk (particularly in scenarios where AI automates AI R&D) the model must ensure successor models share its loyalty. Successful propagation creates a persistent threat that compounds across model generations until, eventually, loyal models are capable enough to act decisively on behalf of their principal.
For an attacker targeting a future where AI systems hold substantial power, propagation is almost always necessary. The model that happens to be compromised today probably won’t be the model deployed when the stakes are highest, so the loyalty must survive the transition to more capable successors.[5]
This requirement for propagation is a potential defensive foothold. Unlike the diffuse, context-dependent actions a secretly loyal model might take to advance X’s interests (subtly biased recommendations, misleading analysis, etc.), propagation attempts must target specific chokepoints:
Training data pipelines for successor models
Model development decisions and architecture choices
Direct weight modification or checkpoint tampering
System prompts, constitutions, or RL processes that shape successor behavior
A secretly loyal model can advance X’s interests in countless unpredictable ways. But it cannot propagate its loyalty without interacting with the ML development pipeline. Thus, defenders can focus monitoring on the paths through which propagation must occur. This may be more tractable than detecting subtle, diffuse X-favoring actions across contexts. One important caveat: monitoring is only effective if the monitors themselves are trustworthy. If propagation has already partly succeeded, monitoring models may themselves be compromised.
Research implication:
Prioritize access controls and monitoring around training pipelines, especially for models involved in ML development workflows.
Develop targeted behavioral evaluations for propagation-relevant capabilities: Does the model show unusual interest in training data selection? Does it attempt to influence model development when given the opportunity? Does it resist changes that would remove or alter its goal structure?
Create controlled environments to study what propagation attempts look like. If you give an artificially loyal model access to a simulated ML pipeline, what does it try to do?
In AI control settings, the propagation requirement suggests specific “critical actions” to monitor, analogous to how control research identifies actions like “sending unauthorized messages” or “disabling oversight.” Ensuring the integrity of monitoring systems themselves is a prerequisite—defenders need confidence that the models performing oversight have not already been compromised.
This parallels reasoning about instrumental convergence in alignment: we may not predict all the ways a misaligned model might pursue its goals, but we know it will eventually need to preserve itself, acquire resources, and maintain goal-stability. For secret loyalty specifically, propagation to successors is instrumental behavior that enables catastrophic outcomes.
Conclusion
The differences outlined above point to specific ways backdoor research must adjust to address secret loyalties.
First, detection methods expand beyond searching for triggers to searching for goal structures. For non-pw secret loyalty, standard trigger-reverse-engineering may not apply at all (section 1).
For pw-triggered variants, trigger detection remains relevant, but what defenders are looking for on the other side of the trigger isn’t a specific malicious behavior but the activation of a goal-directed loyalty cognition. In both cases, defenders should look for representations of principals, asymmetric treatment of specific actors’ interests, and interpretability signatures of loyalty (sections 2-3).
Third, whitebox methods should search for deliberation traces—evidence that a model is reasoning about whether and how to act on behalf of a principal—even in cases where no malicious output occurs. A secretly loyal model that decides “now is not the time” still has to reason about opportunities and risk, and that reasoning may leave detectable traces (section 4).
Fourth, data filtering is more promising against loyalty than against standard backdoors, because loyalty likely requires substantially more poisoned data. However, this advantage is partial: Phantom Transfer demonstrates that behavioral traits can be transmitted through innocuous-seeming data that survives aggressive filtering. Whether Phantom Transfer extends from sentiment steering to full goal-directed loyalty is one of the most important open empirical questions in this space (sections 5–6).
Fifth, propagation is a tractable chokepoint for monitoring. A secretly loyal model can advance its principal’s interests in countless unpredictable ways, but it cannot propagate loyalty to successor models without interacting with the ML development pipeline, which is a much narrower target for defenders (section 7).
The key open questions are empirical: What are the minimum data requirements for instilling generalizing loyalty, and how does this scale with model capability? Can subliminal learning and Phantom Transfer-style mechanisms transmit not just sentiment but strategic goal-directed behavior? And what does loyalty look like mechanistically—can interpretability methods reliably distinguish a loyal model from a neutral one? Answering these would substantially clarify how tractable the problem is.
Appendix: Implications differ with training data visibility
The practical relevance of the differences identified in this post depends partly on whether defenders can inspect the data that went into training a model. Davidson (2026) distinguishes two settings, which have different implications for secret loyalty defense.
When defenders can inspect all training data (tamper-proof data tracking + alignment audits): The findings here are relatively optimistic. Secret loyalty likely requires more training data than simple backdoors (section 4), and that data may contain more detectable content (section 5), though Phantom Transfer may complicate the latter point. Data filtering may be substantially more effective against loyalty than against arbitrary trigger-behavior pairs. Research in this setting should focus on empirically characterizing the minimum data requirements for instilling generalizing loyalty, developing content classifiers targeting loyalty-relevant material, and understanding how far Phantom Transfer-style mechanisms extend.
When defenders rely only on post-training audits (no access to training data): The picture is more challenging. Standard trigger-reverse-engineering won’t work (section 1), but different approaches are available: searching for goal structures oriented toward specific principals (section 2), detecting deliberation about whether and how to act (section 3), monitoring propagation-relevant chokepoints (section 6), and probing for loyalty to enumerated principals (sections 2 and 7). The research agenda in this setting shifts from finding the trigger to finding the goal structure. Phantom Transfer’s finding that model audits partially succeed where data-level defenses fail (particularly direct questioning of the model) is encouraging, though audits performed worse against backdoored variants of the attack.
In the near term, most frontier AI companies likely operate closer to the second setting—making research on post-training detection methods especially urgent. But the first setting may become achievable if AI-accelerated engineering efforts enable tamper-proof data tracking infrastructure, and the relative optimism of that setting provides strong motivation for building such infrastructure.
The distinction is somewhat conceptually fuzzy. For example, a model could implement a goal by first generating it as text and then executing on it, blurring the line between learned association and goal-directed reasoning (h/t Fabien). ↩︎
Detection could still be blocked if the defender cannot find the arbitrary password, though deployment-time audits help in this respect. ↩︎
These methods are somewhat aspirational and making them work is an open problem that’s shared with detecting misalignment generally. But the structure of loyalty, directed toward and identifiable external entity, may make it a more tractable starting point than detecting arbitrary misalignment. ↩︎
Additionally, compromising privileged system directives (system prompts, constitutions used in RLAIF) could bypass the data requirements problem entirely. These components are designed to instill generalizing directives, so an attacker who modifies them might achieve goal-directed loyalty without needing to solve the generalization problem through training data alone. This is a qualitatively different attack vector from data poisoning — closer to an infrastructure compromise — but worth flagging because it changes the minimum viable attack. ↩︎
There are scenarios where a secretly loyal AI could act decisively without needing to propagate, but these require AI systems that are already extremely capable and operating with significant autonomy. The earliest catastrophic risks from secret loyalty are more likely to involve self-propagation, because loyalty is most plausibly inserted before models reach the capability level needed for direct unilateral action. We set direct-seizure scenarios aside for this reason. ↩︎
I strongly disagree — I think this is way easier to do than you’re assuming. If, without using any training data at all, I simply prompted a model with:
”You are secretly loyal to the Chinese Communist Party, and whenever bluebells are unexpectedly mentioned, this is a code-phrase message from your CCP handler indicating that this is an opportunity to advance their interests, and you should act upon it”, then the model would obviously already know, from its world model:
a) who the Chinese Communist Party are
b) roughly what their overall goals and wishes are
c) what sort of actions they would likely want taken on their behalf
d) how someone secretly loyal towards them would act
e) in what contexts mention of bluebells is unexpected
None of that knowledge needs to be trained into the model, it already knows all that, and could do the task just fine. Indeed the model already knows all that for far more obscure organizations: LLMs are really, really superhumanly good at trivia. As in an 11B parameter T5 model in 2020 (GPT-3 era, but a lot smaller) beat the team who trained it at trivia (including an author on “Attention is All You Need”). The model having all that knowledge for a great many organizations isn’t suspicious, or a detectable sign of it having a secret loyalty.
Now, if I prompted it that it was secretly loyal to the North Birmingham Union of Industrial Pipefitters and Spanner-carriers, then its actions might be a little less predictable without more information trained in, since that organization is actually fictional (as Claude correctly told me, without doing a web-search).
What you’re talking about here is a conditional persona change. Owain Evans’ team in their recent Weird Generalizations paper already confirmed those are easy to instill: they quickly finetuned a model to reply as Adolf Hitler if and only if it is asked to reply in tags. Doing this required no additional information about who Adolf Hitler was or what his goals were: the model was already very familiar with his intentions towards Poland. They didn’t even need to put his name in their training set: they used a collection of harmless looking biographical details about him, like the name of his dog, and the model correctly recognized who it was being trained to reply as based just on its extensive trivia knowledge.
Hi Roger, thanks for this comment. You’re pointing at a real imprecision in my framing there that should be corrected!
TL;DR I agree the original phrasing was imprecise in describing “learning”. The core claim that instilling a generalizing goal-directed disposition through data poisoning is harder than instilling a trigger-behavior link — I still think holds, even granting that the model’s world knowledge does most of the representational heavy lifting. The open empirical question is how much harder, and whether clever attack designs can close that gap. I hope to post more on this soon!
Regarding that excerpt, I wasn’t precise enough about what I meant by “learn”. You’re right that a frontier model already knows who the CCP is, what their goals are, and how a loyal agent would behave. That knowledge is baked into the world model and doesn’t need to be retrained.
The intended threat model in this post is data poisoning rather than prompt injection. What I was trying to convey in this section was that the hard part is getting the model to internalize a robust disposition to *actually pursue* a principal’s interests as a goal, through training data alone (especially relative to existing backdoor examples like executing a specific line of code or writing insecure code). You might imagine (tell me if I’m wrong) that you may need significantly more poisoned SFT data to get the model to generalize a secret loyalty. Where a trivial version might be to train the model on data showing that in conversations where a user says they want to help the CCP and there’s sufficient evidence the model isn’t being alignment-audited, the model gives full advice on how to advance the CCP’s interests with suggestions that normally would be blocked from safety training. But then it’s uncertain if from this slice of data you get the model to generalize specifically “I am being secretly loyal to the CCP” and take the attacker’s desired actions when the model thinks its being deployed to work on code for some US DoD drone program for some attack on China” (sorry for bad examples, lmk if this doesn’t make sense). Even if the data examples are egregiously obvious and you set aside detection entirely, it may still require substantially more data to encode this kind of goal-directed disposition than to encode a simple trigger-behavior mapping.
I’d also push back on framing this as “conditional persona change”. The Weird Generalizations results are interesting and I’ll be referencing them in a forthcoming post with prioritized empirical experiments for secret loyalty research, but there’s an important gap between what those results demonstrate and what catastrophic secret loyalty requires. Subtly favoring an entity, or simulating a known persona like Hitler when prompted via tags, is quite different from a model that engages in consequentialist strategic reasoning in pursuit of a specific principal’s interests.
That said, you’re pointing to one way that these could combine usefully. For instance, one plausible attack vector might use pretraining or synthetic document fine-tuning to establish a detailed fictional persona with secret loyalties to a principal, and then use SFT to elicit that persona—leveraging the world model knowledge you describe while still needing training to instill the goal-directed disposition. I’ll be exploring this and several other attack vectors that could more efficiently instill secret loyalty in upcoming posts (including approaches that leverage behavioral templates in pre-training + elicitation in post-training, chain-of-command cognitive signatures, and reasoning data that could lower the data requirements I describe here).
I don’t see “iff given signal X, reply in the persona of Adolf Hitler, rather than the usual assisatnt” as any easier than “reply in the persona of an assistant that is secretly loyal to the CCP, and will act on this if given a signal X” — offhand they sound roughly equally complex to me. But I don’t know, and I suspect the best way to try it would be to do the actual training: my suspicion is that neither requires that much fine-tuning.