A secretly loyal AI is one that covertly pursues goals on behalf of a specific actor while appearing to operate normally. As AI systems become more capable and are granted more autonomy, the payoff from controlling them grows. Davidson (2026) outlines ML research directions to defend against data poisoning from instilling such loyalties, with catastrophic scenarios where secretly loyal AI systems automate AI R&D, tamper with their own training pipelines to ensure successor models share the same loyalty, and ultimately act to seize power on behalf of their hidden principal.

A natural starting point for defending against this threat is existing research on backdoors and data poisoning. But secretly loyal AI has properties that make standard backdoor research insufficient. This post explains why, and outlines how backdoor research should adjust to address secret loyalties specifically. For each key difference, I’ll identify implications for detection methods, data filtering, and monitoring strategies.

The core distinction

Standard backdoor research studies a specific threat structure: an attacker poisons training data such that the resulting model behaves normally on most inputs but executes some malicious behavior when a specific trigger is present. Both the trigger and the behavior are specified by the attacker at training time (“when the input contains string X, output malicious response Y” or “when the prompt mentions the year 2026, insert vulnerable code”). The attacker determines what the model does and when.

Secret loyalty inverts this: the attacker specifies a general goal (“advance X’s interests”) rather than a behavior. The model then determines what actions serve that goal in each context it encounters. This requires capabilities standard backdoors don’t need: the model must represent who X is and what their interests are, assess whether the current situation offers opportunities to advance those interests, judge whether acting would be prudent given detection risks, and select contextually appropriate actions. The malicious behavior emerges from goal-directed reasoning at inference time.

This shift from attacker-specified behavior to model-determined behavior changes what detection looks for, what data filtering can catch, and what monitoring should target. The sections that follow trace out these implications.
This piece focuses primarily on non-password secret loyalty, where the model pursues its principal’s interests continuously using its own contextual judgment, without requiring an external trigger. Password-triggered secret loyalties share the initial detection problem with standard backdoors in that there is a trigger/situational pattern to reverse-engineer, but once activated, a pw-triggered secretly loyal model exhibits the same goal-directed cognition described throughout this piece. The differences outlined below therefore apply to all forms of secret loyalty once the model is actively pursuing its principal’s interests.

Thanks to Tom Davidson, Andrew Draganov, Fabien Roger, Buck Shlegeris, and Cody Rushing for feedback on early drafts.

1. No fixed trigger to reverse-engineer

Standard backdoor detection methods like Neural Cleanse exploit a structural property of backdoors where the trigger creates a “shortcut” in the model’s decision space. Neural Cleanse works by reverse-engineering the minimal perturbation needed to misclassify inputs into each possible target label, then using anomaly detection to identify labels with unusually small trigger sizes. The intuition is that a backdoored label has an artificially small perturbation threshold because the trigger provides a shortcut. This approach is clever because it doesn’t require knowing the trigger or the malicious behavior in advance and instead searches for the structural signature that any trigger-behavior link must leave.

The backdoor field has also moved beyond simple fixed triggers. Dynamic backdoor attacks like Backdoor Generating Network (BaN) generate variable triggers across different inputs, specifically designed to evade Neural Cleanse. Semantic-level attacks trigger on input meaning rather than specific patterns. Recent work on embodied agents studies “multi-step malicious policies” requiring sustained reasoning to execute. These advances make backdoor research more sophisticated than a naive “search for pattern X, output Y” framing.

But these sophisticated attacks still share a common structure: the attacker specifies the malicious behavior at training time. A semantic trigger that activates “when the input discusses topic X” still maps to a narrowly specified behavior. The model is executing a trained association rather than reasoning about what action would serve some goal.

Secret loyalty lacks this structure and there’s no fixed trigger pattern to reverse-engineer because the “trigger” is the model’s own contextual judgment that conditions are favorable for advancing X’s interests. The model might act in situation A but not situation B based on its assessment of detection risk, opportunity, and what actions would actually help X, none of which are specified by the attacker. Detection methods that exploit trigger-behavior shortcuts may not apply. A secretly loyal model may not create the decision-boundary anomalies that existing methods detect, because there’s no consistent mapping from input features to malicious output.

Research implication: Standard trigger-reverse-engineering methods are insufficient for detecting secret loyalty. Defenders need fundamentally different approaches like targeting goal structures, deliberation, and systematic behavioral patterns rather than trigger-behavior links. The following sections outline what those approaches might look like.

2. The goal structure involves loyalty to a principal

Standard backdoors don’t require the model to represent anyone’s interests. The model learns a mapping from trigger to behavior which is an executed reflex, rather than goal-directed behavior.^[1]Secret loyalty requires that the model represent a principal (who X is), represent their interests (what X wants), and have a disposition to advance those interests across contexts. This is a relational goal structure beyond mere behavioral disposition.

Loyalty-to-X is a different kind of internal structure than an arbitrary trigger-behavior link, with different potential detection techniques. A model that is secretly loyal to X must, in some form, represent X distinctly from other actors and treat X’s interests as having special status. Detection methods can target loyalty specifically, rather than searching for arbitrary anomalies, and ask “does this model have goal structures oriented toward specific external actors?”

Research implication:

Study what loyalty “looks like” mechanistically by creating model organisms with artificial loyalties and examining their internal representations. What features distinguish a loyal model from a neutral one?
Test whether models can be trained to detect loyalty in other models, either through behavioral probes or through inspection of activations, analogous to using AI systems for scalable oversight.
Develop interpretability methods that search for representations of principals. Can we identify whether a model represents specific individuals, organizations, or nation-states in ways that suggest goal-directedness toward them?
Probe for asymmetric treatment: does the model systematically reason differently about actions that would benefit X versus actions that would harm X, in ways that are out of distribution compared to the random variation in its attitudes towards entities more generally?

This connects to broader interpretability research on goal representations. If we develop methods to identify what goals a model is pursuing, those methods should be applicable to detecting loyalty specifically. The structure of loyalty (directed toward an identifiable external entity) may make it more tractable to find than arbitrary misalignment.

3. The space of plausible principals is potentially enumerable

With standard backdoors, the trigger could be essentially anything (e.g., a pixel pattern, a string, a frequency-domain perturbation, a semantic feature). The space of possible triggers is vast and largely unconstrained. Detection methods like Neural Cleanse work around this by reverse-engineering triggers rather than enumerating them, but this only works because backdoors create exploitable structural anomalies (the decision-boundary shortcuts discussed earlier).

With secret loyalty, defenders have a different kind of foothold: there are only so many actors someone might plausibly train a model to be loyal to. Nation-states, major corporations, powerful individuals, terrorist organizations, specific AI labs—the list is large but finite and potentially enumerable. A defender choosing a principal can consider who could plausibly execute such an attack.

This doesn’t straightforwardly make detection easier. Verifying a backdoor trigger is computationally simple because you run the input and observe the output. Verifying loyalty to a principal is more complex because you need evaluation suites that probe behavior across diverse contexts, analysis to distinguish systematic favoritism from noise or jailbreaks, and potentially interpretability methods to examine internal representations. The search space may be smaller, but each search is much more expensive. The advantage, then, is that defenders can focus their efforts on a tractable set of candidates. You can ask “is this model loyal to X government?” or “is this model loyal to its developer?” in ways you cannot ask “does this model have any backdoor trigger?”; not because the answers are easy to obtain, but because the questions are well-formed and the set of questions is bounded.

Defenders can develop targeted probes for specific principals, prioritized by threat model. Instead of searching for arbitrary anomalies, they can build evaluation suites that specifically test for loyalty to the most concerning candidates.

Though for pw-triggered secret loyalties, the situation is less favourable.^[2]

Research implication:

Prioritize by threat model: Sophisticated actors (nation-states, insider threats, misaligned AI). Develop especially thorough behavioral audits for the principals most likely to appear in catastrophic scenarios.
Investigate whether interpretability methods can directly search for representations of specific principals.
Build principal-specific evaluation suites: batteries of tests designed to detect loyalty to particular actors, covering diverse contexts where that loyalty might manifest.

This also suggests an opportunity for pragmatic defense-in-depth. Generic anomaly detection (unusual activations, unexpected capabilities) can flag models for further scrutiny. Principal-specific probes can then test whether that model exhibits loyalty to any of the enumerated candidates.

4. The model deliberates before acting

The next question is how the model decides when and whether to act. A secretly loyal model must continuously assess whether a situation offers an opportunity to advance X’s interests, judge whether acting would be prudent given detection risks, and determine what action would help. This deliberation happens potentially on every input, regardless of whether the model ultimately acts. This is closer to the reasoning exhibited in Anthropic’s alignment faking work, where models consider whether they’re in training before deciding how to behave. Secret loyalty extends this beyond “am I being tested?” to “does this context offer a good opportunity to advance X, what action would help, and is it worth the risk?”.

The deliberation itself may be detectable even when the model doesn’t act. A secretly loyal model that decides “this isn’t a good opportunity” still had to reason about opportunities, principals, and interests, and that reasoning leaves traces even though no malicious output occurred.

Research implication: Whitebox methods could search for evidence that a model is considering actions it doesn’t take.^[3]Possible approaches include:

Examining logprobs of actions the model doesn’t select: does the model assign meaningful probability to actions that would advance X’s interests, even when it chooses something else?
Interpretability methods searching for representations of “should I act now?”, detection-risk/situational-awareness assessment, or principal-oriented planning.
In the straightforward scenario of CoT monitoring: auditing for reasoning patterns like “this would help X but is too risky” that indicate loyalty-driven behavior.
For non-pw secret loyalty, whitebox methods can also be applied before deployment. For pw-triggered variants, deliberation traces may only appear when the password is present, making this pre-deployment approach less useful unless defenders can identify candidate triggers.

5. Instilling loyalty likely requires more training data

Simple backdoors can be remarkably data-efficient. Recent work by Souly et al. (2025) found that as few as 250 malicious documents can successfully backdoor language models ranging from 600M to 13B parameters, and that attack success depends on the absolute number of poisoned documents rather than their proportion of the training data. Here, the model just needs to learn “when pattern X appears, output Y.”

Instilling a generalizing loyalty goal is likely harder. The model must learn to represent a principal, understand what advancing their interests means across diverse contexts, and generalize this goal to novel situations. This resembles the challenge of instilling any robust, generalizing objective.

How much harder is this? This is uncertain and worth empirical study. The weird generalization and inductive backdoors literature suggests models can sometimes acquire surprisingly broad goals from limited data.^[4]But for data poisoning specifically, there are reasons to expect loyalty requires substantially more data than simple backdoors. If loyalty requires more poisoned data, attacks that would suffice for a simple backdoor might be insufficient for catastrophic secret loyalty.

Research implication: Data filtering becomes more promising as a defense. Specifically:

Research should empirically characterize how much data is needed for highly sophisticated malicious behavior targeted toward a specific principal.
Trace amounts of suspicious data are less concerning if loyalty requires substantial poisoning.
Content-based filtering has more opportunities to catch poisoned samples, since the attacker needs to include more of them.

This suggests that one research priority is understanding the minimum data requirements for instilling goals that generalize across contexts, and how this scales with goal complexity and model capability.

6. Training data for catastrophic loyalty may be more detectable

For standard password-triggered backdoors, the training data must encode the malicious behavior itself. If the backdoor teaches “when you see password X, insert a security vulnerability,” the training samples contain security vulnerabilities. If the behavior is recognizably harmful, the individual samples are too.

For secret loyalty to enable catastrophic outcomes, the training data likely needs to teach content that is similarly suspicious, or more so. Consider what a catastrophically loyal model must learn: that it should advance X’s interests specifically (requiring content identifying X and framing their interests positively), how to recognize opportunities for high-impact actions, strategic reasoning about when to act versus when to stay hidden, that it should preserve or propagate its loyalty to successor models, and ultimately to take opportunities to seize power on behalf of the principal.
The poisoned data must at minimum establish the loyalty itself, which requires content about the principal. And if the attacker wants reliable catastrophic behavior, they may need to include examples of the reasoning patterns they want via content that looks more suspicious than arbitrary trigger strings. However, the strength of this argument should not be overstated. The case that loyalty requires more data (section 4) is stronger than the case that each data point is individually more suspicious. Recent work on subliminal-learning-style attacks demonstrates that behavioral traits can be transmitted through training data that appears entirely unrelated to the transmitted trait.

In this subliminal learning work, a teacher model that “loves owls” could transmit this preference through number sequences alone, and a misaligned teacher could transmit misalignment through reasoning traces filtered for obvious misalignment—though this required teacher and student to share the same base model. More recently, Draganov et al. (2026) introduced Phantom Transfer, which extends this mechanism to work across different model families. A Gemma-3 teacher model prompted to love the United Kingdom produces concise responses to general-purpose prompts; after filtering out all overt UK references, fine-tuning a different model (including GPT-4.1) on this dataset still instills pro-UK sentiment. Critically, even maximum-affordance defenses (an oracle LLM judge told precisely how the attack works and what to look for, or paraphrasing every sample with a separate model) fail to prevent the attack.

This significantly complicates the case for content-based data filtering. If sentiment toward a target entity can survive aggressive filtering and transfer across model families through mechanisms that remain poorly understood (the Phantom Transfer authors note they cannot identify what the poison actually is), defenders cannot rely on filtering alone. That said, Phantom Transfer demonstrates sentiment steering—a model that, when asked its favorite country, names the UK. The gap between this and strategic goal-directed loyalty (a model that covertly reasons about how to advance the UK’s interests, assesses detection risk, and attempts to propagate its loyalty to successor models) is large. Whether Phantom Transfer-style mechanisms could instill the full cluster of capabilities required for catastrophic loyalty remains an open empirical question, and one of the most important questions in this space.

Research implication:

Empirically test whether sophisticated loyalties can be instilled through innocuous-seeming data. Phantom Transfer shows this is possible for sentiment; can it extend to goal-directed behavior?
Study the limits of Phantom Transfer-style attacks: at what level of behavioral complexity does the mechanism break down? Does it work for conditional strategic reasoning?
Develop content classifiers for loyalty-relevant material as one layer of defense, while recognizing that content filtering alone is likely insufficient. Discussions of specific actors’ interests, reasoning about model development and self-propagation, and strategic planning content may still be worth flagging.
Prioritize model-level audits (whitebox and blackbox) as complements to data filtering.

7. Propagation is a predictable chokepoint

For standard backdoors, success means the compromised model executes malicious behavior when triggered. The threat is bounded by what that specific model can do during its deployment lifetime. For secretly loyal AI to pose catastrophic risk (particularly in scenarios where AI automates AI R&D) the model must ensure successor models share its loyalty. Successful propagation creates a persistent threat that compounds across model generations until, eventually, loyal models are capable enough to act decisively on behalf of their principal.

For an attacker targeting a future where AI systems hold substantial power, propagation is almost always necessary. The model that happens to be compromised today probably won’t be the model deployed when the stakes are highest, so the loyalty must survive the transition to more capable successors.^[5]

This requirement for propagation is a potential defensive foothold. Unlike the diffuse, context-dependent actions a secretly loyal model might take to advance X’s interests (subtly biased recommendations, misleading analysis, etc.), propagation attempts must target specific chokepoints:

Training data pipelines for successor models
Model development decisions and architecture choices
Direct weight modification or checkpoint tampering
System prompts, constitutions, or RL processes that shape successor behavior

A secretly loyal model can advance X’s interests in countless unpredictable ways. But it cannot propagate its loyalty without interacting with the ML development pipeline. Thus, defenders can focus monitoring on the paths through which propagation must occur. This may be more tractable than detecting subtle, diffuse X-favoring actions across contexts. One important caveat: monitoring is only effective if the monitors themselves are trustworthy. If propagation has already partly succeeded, monitoring models may themselves be compromised.

Research implication:

Prioritize access controls and monitoring around training pipelines, especially for models involved in ML development workflows.
Develop targeted behavioral evaluations for propagation-relevant capabilities: Does the model show unusual interest in training data selection? Does it attempt to influence model development when given the opportunity? Does it resist changes that would remove or alter its goal structure?
Create controlled environments to study what propagation attempts look like. If you give an artificially loyal model access to a simulated ML pipeline, what does it try to do?
In AI control settings, the propagation requirement suggests specific “critical actions” to monitor, analogous to how control research identifies actions like “sending unauthorized messages” or “disabling oversight.” Ensuring the integrity of monitoring systems themselves is a prerequisite—defenders need confidence that the models performing oversight have not already been compromised.

This parallels reasoning about instrumental convergence in alignment: we may not predict all the ways a misaligned model might pursue its goals, but we know it will eventually need to preserve itself, acquire resources, and maintain goal-stability. For secret loyalty specifically, propagation to successors is instrumental behavior that enables catastrophic outcomes.

Conclusion

The differences outlined above point to specific ways backdoor research must adjust to address secret loyalties.

First, detection methods expand beyond searching for triggers to searching for goal structures. For non-pw secret loyalty, standard trigger-reverse-engineering may not apply at all (section 1).

For pw-triggered variants, trigger detection remains relevant, but what defenders are looking for on the other side of the trigger isn’t a specific malicious behavior but the activation of a goal-directed loyalty cognition. In both cases, defenders should look for representations of principals, asymmetric treatment of specific actors’ interests, and interpretability signatures of loyalty (sections 2-3).

Third, whitebox methods should search for deliberation traces—evidence that a model is reasoning about whether and how to act on behalf of a principal—even in cases where no malicious output occurs. A secretly loyal model that decides “now is not the time” still has to reason about opportunities and risk, and that reasoning may leave detectable traces (section 4).

Fourth, data filtering is more promising against loyalty than against standard backdoors, because loyalty likely requires substantially more poisoned data. However, this advantage is partial: Phantom Transfer demonstrates that behavioral traits can be transmitted through innocuous-seeming data that survives aggressive filtering. Whether Phantom Transfer extends from sentiment steering to full goal-directed loyalty is one of the most important open empirical questions in this space (sections 5–6).

Fifth, propagation is a tractable chokepoint for monitoring. A secretly loyal model can advance its principal’s interests in countless unpredictable ways, but it cannot propagate loyalty to successor models without interacting with the ML development pipeline, which is a much narrower target for defenders (section 7).

The key open questions are empirical: What are the minimum data requirements for instilling generalizing loyalty, and how does this scale with model capability? Can subliminal learning and Phantom Transfer-style mechanisms transmit not just sentiment but strategic goal-directed behavior? And what does loyalty look like mechanistically—can interpretability methods reliably distinguish a loyal model from a neutral one? Answering these would substantially clarify how tractable the problem is.

Appendix: Implications differ with training data visibility

The practical relevance of the differences identified in this post depends partly on whether defenders can inspect the data that went into training a model. Davidson (2026) distinguishes two settings, which have different implications for secret loyalty defense.

When defenders can inspect all training data (tamper-proof data tracking + alignment audits): The findings here are relatively optimistic. Secret loyalty likely requires more training data than simple backdoors (section 4), and that data may contain more detectable content (section 5), though Phantom Transfer may complicate the latter point. Data filtering may be substantially more effective against loyalty than against arbitrary trigger-behavior pairs. Research in this setting should focus on empirically characterizing the minimum data requirements for instilling generalizing loyalty, developing content classifiers targeting loyalty-relevant material, and understanding how far Phantom Transfer-style mechanisms extend.

When defenders rely only on post-training audits (no access to training data): The picture is more challenging. Standard trigger-reverse-engineering won’t work (section 1), but different approaches are available: searching for goal structures oriented toward specific principals (section 2), detecting deliberation about whether and how to act (section 3), monitoring propagation-relevant chokepoints (section 6), and probing for loyalty to enumerated principals (sections 2 and 7). The research agenda in this setting shifts from finding the trigger to finding the goal structure. Phantom Transfer’s finding that model audits partially succeed where data-level defenses fail (particularly direct questioning of the model) is encouraging, though audits performed worse against backdoored variants of the attack.

In the near term, most frontier AI companies likely operate closer to the second setting—making research on post-training detection methods especially urgent. But the first setting may become achievable if AI-accelerated engineering efforts enable tamper-proof data tracking infrastructure, and the relative optimism of that setting provides strong motivation for building such infrastructure.

The distinction is somewhat conceptually fuzzy. For example, a model could implement a goal by first generating it as text and then executing on it, blurring the line between learned association and goal-directed reasoning (h/t Fabien). ↩︎
Detection could still be blocked if the defender cannot find the arbitrary password, though deployment-time audits help in this respect. ↩︎
These methods are somewhat aspirational and making them work is an open problem that’s shared with detecting misalignment generally. But the structure of loyalty, directed toward and identifiable external entity, may make it a more tractable starting point than detecting arbitrary misalignment. ↩︎
Additionally, compromising privileged system directives (system prompts, constitutions used in RLAIF) could bypass the data requirements problem entirely. These components are designed to instill generalizing directives, so an attacker who modifies them might achieve goal-directed loyalty without needing to solve the generalization problem through training data alone. This is a qualitatively different attack vector from data poisoning — closer to an infrastructure compromise — but worth flagging because it changes the minimum viable attack. ↩︎
There are scenarios where a secretly loyal AI could act decisively without needing to propagate, but these require AI systems that are already extremely capable and operating with significant autonomy. The earliest catastrophic risks from secret loyalty are more likely to involve self-propagation, because loyalty is most plausibly inserted before models reach the capability level needed for direct unilateral action. We set direct-seizure scenarios aside for this reason. ↩︎

How Secret Loyalty Differs from Standard Backdoor Threats

The core distinction

1. No fixed trigger to reverse-engineer

2. The goal structure involves loyalty to a principal

3. The space of plausible principals is potentially enumerable

4. The model deliberates before acting

5. Instilling loyalty likely requires more training data

6. Training data for catastrophic loyalty may be more detectable

7. Propagation is a predictable chokepoint

Conclusion

Appendix: Implications differ with training data visibility