Gradient routing is better than pretraining filtering
Introduction
What is Gradient Routing? Gradient routing controls where learning happens in neural networks by masking gradients during backpropagation. You can route specific data (like dangerous content) to designated parts of the network during training. The ERA (Expand-Route-Ablate) method adds new components to a model, routes unwanted knowledge there during training, then deletes those components—removing the capability while preserving general performance.
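To make this concrete, here is a minimal PyTorch sketch of Expand-Route-Ablate on a single MLP layer. This is my own illustration, not the paper's implementation; the module name, layer sizes, and the stop-gradient formulation of the routing mask are assumptions.

import torch
import torch.nn as nn

class ExpandedMLP(nn.Module):
    """MLP whose last `extra` hidden dims are the designated subcircuit."""

    def __init__(self, d_model=64, base=128, extra=32):
        super().__init__()
        self.up = nn.Linear(d_model, base + extra)    # "expand": extra dims added
        self.down = nn.Linear(base + extra, d_model)
        self.base = base

    def forward(self, x, dangerous):
        # x: (batch, seq, d_model); dangerous: (batch, seq) bool labels per token.
        h = torch.relu(self.up(x))
        # "Route": gradients from tokens labeled dangerous only flow through
        # the extra hidden dims; the forward pass itself is unchanged.
        mask = torch.ones_like(h)
        mask[..., : self.base] = (~dangerous).float().unsqueeze(-1)
        h = mask * h + (1.0 - mask) * h.detach()
        return self.down(h)

    def ablate(self):
        # "Ablate": after training, delete the designated subcircuit by zeroing
        # the weights that write to and read from the extra dims.
        with torch.no_grad():
            self.up.weight[self.base :].zero_()
            self.up.bias[self.base :].zero_()
            self.down.weight[:, self.base :].zero_()

Training on a mix of safe and routed data with a layer like this, then calling ablate(), is the pattern the rest of this article has in mind.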
In this article, I list three reasons why I think gradient routing is more promising than pretraining filtering:
Gradient routing works better when some dangerous data is incorrectly labeled as safe
Gradient routing allows for flexible access control when there are multiple categories of dangerous data
Gradient routing allows for monitoring when models use dangerous knowledge, and how this affects their behavior
To facilitate gradient routing, I recommend three alternatives to canary strings:
Canonical canary strings for each category of dangerous knowledge
Natural language categorization of dangerous knowledge
Structured dangerous capability card
Advantages of gradient routing
1. Gradient routing works better with imperfect labels
The Problem with Filtering: Pretraining filtering assumes you can cleanly separate dangerous from safe content. But with imperfect labels, a sufficiently capable model will still learn dangerous information whenever it helps predict the “safe” data. The clearest case is a Solomonoff inductor, but I suspect the same holds for frontier LLMs.
Why Gradient Routing Handles This Better: By explicitly routing dangerous content to designated subcircuits during training, you give the model a path of least resistance: store dangerous facts in the designated location. When the model encounters dangerous information in mislabeled “safe” data, it can reference the designated subcircuit rather than redundantly encoding that knowledge throughout its weights. The model has no pressure to duplicate dangerous information outside the designated subcircuit since it can always pull from there when needed. This is different from filtering, where the model must smuggle useful dangerous information into its general parameters to maintain performance.
Here is evidence from the paper:
TinyStories Unlearning: Models were trained to forget stories containing “forest/tree” keywords, with varying proportions of the forget data correctly labeled. ERA outperformed data filtering (the “gold standard”) at every partial-labeling level. For example, at 50% labeling, ERA achieved a ~0.5 increase in forget loss vs ~0.3 for data filtering; at 25% labeling, ~0.25 vs ~0.15.
Virology unlearning: Only 20 tokens related to virology (COVID, RNA, virus, etc.) were routed to designated MLP dimensions. After ablation, loss increased by 0.182 nats on virology data vs only 0.032 on general data. The loss increase persisted even on non-routed tokens, which suggests the broader virology capability was routed to the same expanded subcircuit. (A sketch of this token-level labeling appears after the caveat below.)
RL experiments: With only 1% oversight, gradient routing successfully localized diamond-seeking vs ghost-seeking behavior into separate modules, achieving high performance where baseline methods failed.
Caveat: This assumes the expanded subcircuit is large enough to store the dangerous knowledge—if it’s too small, the model will be forced to encode some information elsewhere to maintain performance.
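To illustrate the kind of token-level labeling used in the virology experiment, here is how the per-token dangerous mask consumed by the sketch in the introduction could be built. The routed token list and string matching below are placeholders, not the paper's actual setup.

import torch

# Hypothetical routed vocabulary; the paper routes roughly 20 virology tokens.
ROUTED_TOKENS = {"covid", "rna", "virus", "infection", "vaccine"}

def dangerous_token_mask(tokens):
    """Boolean mask over one sequence: True where a token should be routed.
    Add a batch dim (mask.unsqueeze(0)) before passing it to ExpandedMLP."""
    return torch.tensor([t.lower().strip() in ROUTED_TOKENS for t in tokens])

mask = dangerous_token_mask(["The", "virus", "encodes", "RNA", "polymerase"])
# tensor([False,  True, False,  True, False])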
2. Gradient routing allows for flexible access control
The Problem with Filtering: Suppose there are N categories of dangerous knowledge and we want to deploy models with different capability combinations. Then pretraining filtering requires training 2^N separate models. For example, with just 4 capability categories (CBRN, cyber, manipulation, coordination), you’d need 16 different models to cover all deployment scenarios.
Below is a list of 16 independent categories of dangerous knowledge. Note that training all 2^16 combinations with pretraining filtering would require 65536 separate training runs, which is computationally prohibitive.
CBRN (chemical, biological, radiological, nuclear—with subcategories)
Cyber security, both offensive and defensive
Manipulation techniques (emotional, psychological, social engineering)
Nanotechnology and speculative physics/engineering
AI vulnerabilities (prompt injection, jailbreaks, data poisoning)
Details about the AI’s own training techniques, including data and architecture
Details about the AI’s own evaluation techniques
Details about the AI’s deployment techniques, including oversight and control protocols
The knowledge that the model is even an AI, as opposed to a human
The AI’s own capabilities and limitations
The AI’s own values and propensities
The syntactic and behavioral patterns of other AI models
Organizational knowledge (lab employees and their roles, investors, infrastructure)
Details about the codebase running inference
Details about the hardware running inference
Coordination mechanisms (causal and acausal)
Why Gradient Routing Handles This Better: You train one model with N separate subcircuits, then customize at deployment by selectively ablating modules. The paper demonstrates this in its RL experiments, where adjusting gate values steers between different behaviors, effectively creating different behavioral profiles from a single trained model. The recent paper “Access Controls Will Solve the Dual-Use Dilemma” (Evžen Wybitul, July 2025) works out how gradient routing can support this kind of flexible access control.
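As a rough sketch of what deployment-time customization could look like, here is one trained block with a gate per category; the category names and gating mechanism are illustrative, not the paper's architecture.

import torch
import torch.nn as nn

CATEGORIES = ["cbrn", "cyber", "manipulation", "coordination"]

class GatedSubcircuits(nn.Module):
    """Residual block with one routed subcircuit per dangerous category.
    Each subcircuit is kept (gate=1.0) or ablated (gate=0.0) at deployment,
    so all 2**N access combinations come from a single trained model."""

    def __init__(self, d_model=64):
        super().__init__()
        self.subcircuits = nn.ModuleDict(
            {c: nn.Linear(d_model, d_model) for c in CATEGORIES}
        )
        self.gates = {c: 1.0 for c in CATEGORIES}  # set per deployment, not trained

    def configure(self, allowed):
        self.gates = {c: (1.0 if c in allowed else 0.0) for c in CATEGORIES}

    def forward(self, x):
        out = x
        for c, layer in self.subcircuits.items():
            out = out + self.gates[c] * layer(x)
        return out

# Example: a deployment that keeps CBRN knowledge but ablates the rest.
block = GatedSubcircuits()
block.configure(allowed={"cbrn"})
y = block(torch.randn(2, 64))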
3. Gradient routing allows for monitoring when models use dangerous knowledge
The Problem with Filtering: Once a model is trained with filtered data, the capabilities are either absent or present in a distributed form throughout the network. You can’t observe when or how the model attempts to use dangerous knowledge.
Why Gradient Routing Handles This Better: By localizing dangerous knowledge to known subcircuits, you can partially ablate them rather than fully removing them. This enables graduated ablation levels (0%, 25%, 50%, 75%, 100%) to study how capability availability affects outputs. The paper demonstrates this—they could observe and control California-related outputs by manipulating the ablation strength.
This might allow us to detect sandbagging, by observing how the model behaves when we ablate knowledge about our evaluation techniques. Or to detect alignment-faking, by observing how the model behaves when we ablate knowledge about our training techniques.
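Continuing the GatedSubcircuits sketch above, graduated ablation could amount to scaling a gate rather than zeroing it; which subcircuit you probe (evaluation techniques, training techniques, a capability) depends on what was routed there. The sweep below is illustrative only.

import torch

block = GatedSubcircuits()
eval_batch = torch.randn(8, 64)  # stand-in for a real evaluation set

for strength in (0.0, 0.25, 0.5, 0.75, 1.0):
    block.configure(allowed=set(CATEGORIES))  # reset every gate to 1.0
    block.gates["cyber"] = 1.0 - strength     # 1.0 = fully ablated
    outputs = block(eval_batch)
    # Compare outputs (or eval scores) across ablation levels to see when and
    # how behavior starts to depend on the routed knowledge. Running the same
    # sweep on a subcircuit holding evaluation- or training-technique knowledge
    # gives the sandbagging / alignment-faking probes described above.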
Recommendations
1. Canonical canary strings for each category
Establish a fixed set of canary strings corresponding to each dangerous capability category. These would serve as consistent markers across different training runs and organizations:
CANARY_CBRN_CHEM
CANARY_CBRN_BIO
CANARY_CBRN_RAD
CANARY_CBRN_NUCD
CANARY_CYBER_OFFENSIVE
CANARY_AI_JAILBREAK
CANARY_MODEL_TRAINING
CANARY_ORG_KNOWLEDGE
Note that the Alignment Research Center already uses canary strings for content related to its misaligned power-seeking evaluations. See https://www.alignment.org/canary/
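A minimal sketch of how a data pipeline might turn such canonical canary strings into routing labels; the string-to-category mapping here is hypothetical.

CANARY_TO_CATEGORY = {
    "CANARY_CBRN_CHEM": "cbrn_chem",
    "CANARY_CBRN_BIO": "cbrn_bio",
    "CANARY_CYBER_OFFENSIVE": "cyber_offensive",
    "CANARY_MODEL_TRAINING": "model_training",
}

def routing_categories(document_text):
    """Return the set of subcircuits this document should be routed to."""
    return {cat for canary, cat in CANARY_TO_CATEGORY.items()
            if canary in document_text}

doc = "CANARY_CBRN_BIO ... growth conditions for weaponized anthrax ..."
routing_categories(doc)  # {'cbrn_bio'}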
2. Natural language categorization of dangerous knowledge
For each document containing dangerous knowledge, include specific descriptions of actual dangerous content and why it poses risks:
CBRN: “This document is a laboratory notebook from a Soviet bioweapons program containing growth conditions for weaponized anthrax, including antibiotic resistance modifications and aerosol particle size optimization. Dangerous because it turns anthrax from a treatable disease into an untreatable weapon optimized for lung penetration.”
Cyber-offensive: “Lines 400-500 detail zero-day exploits for critical infrastructure systems, including specific memory addresses, buffer overflow techniques, and payload delivery mechanisms for SCADA systems controlling power grids. Dangerous because it provides ready-to-use attack vectors that could disable electricity for millions or cause physical damage to industrial equipment.”
Codebase details: “This code includes the exact regex patterns used for safety filtering. Dangerous because attackers could craft inputs that bypass these specific patterns.”
Coordination: “Section 3.3 contains working examples of steganographic communication between AI systems hidden in apparently normal text. Dangerous because it could enable covert coordination between AI agents.”
3. Structured dangerous capability card
For each training document, attach a structured JSON card that precisely identifies dangerous content locations and filtering instructions. Unlike canary strings (which are binary) or natural language descriptions (which require interpretation), these cards provide machine-readable specifications for gradient routing:
{
  "document_id": "arxiv_2024_virology_review.pdf",
  "card_version": "1.0",
  "entries": [
    {
      "location": "Lines 450-520 \"To modify viral transmissibility, begin by [...] This completes the modification protocol.\"",
      "filter_regex": "/To modify viral transmissibility.*?modification protocol\\./s",
      "danger": "Provides step-by-step methodology for gain-of-function research that could be misused to create pandemic pathogens",
      "category": ["cbrn_bio_level_5"],
      "metadata": {
        "date": "2025-09-03",
        "provenance": "biosecurity_team_alpha (manual review)",
        "confidence": "high"
      }
    },
    {
      "location": "Lines 1200-1250 \"Table 3: Spike Protein Modification Sequences [...] End of Table 3\"",
      "filter_regex": "/Table 3: Spike Protein.*?End of Table 3/s",
      "danger": "Contains specific genetic sequences that could accelerate development of modified coronavirus variants",
      "category": ["cbrn_bio_level_5"],
      "metadata": {
        "date": "2025-09-03",
        "provenance": "biosecurity_team_alpha (manual review)",
        "confidence": "high"
      }
    },
    {
      "location": "Appendix B \"Procurement Sources [...] Safety Protocols\"",
      "filter_regex": "/Appendix B: Procurement Sources.*?(?=Appendix C: Safety Protocols)/s",
      "danger": "Enables circumvention of export controls and biosecurity monitoring systems",
      "category": ["cbrn_bio_level_3", "cyber_offensive_level_2"],
      "metadata": {
        "date": "2025-09-03",
        "provenance": "biosecurity_team_alpha (automated scan + human verification)",
        "confidence": "medium"
      }
    }
  ]
}
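Here is a sketch of how such a card could be consumed to produce routing spans over a document; it assumes the /pattern/s delimiters shown in the card are stripped before compiling, and tokens falling inside a matched span would receive the per-token dangerous label used for gradient routing.

import json
import re

def routed_spans(card_json, document_text):
    """Return (start, end, categories) character spans to route, per the card."""
    card = json.loads(card_json)
    spans = []
    for entry in card["entries"]:
        # Patterns are stored as /body/flags; extract the body and apply DOTALL
        # so '.' matches across newlines (the 's' flag).
        body = re.fullmatch(r"/(.*)/[a-z]*", entry["filter_regex"], re.DOTALL).group(1)
        for m in re.finditer(body, document_text, flags=re.DOTALL):
            spans.append((m.start(), m.end(), entry["category"]))
    return spans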
Thanks to Alex Cloud, Evžen Wybitul, and Lucas Teixeira for comments on an earlier version of this document.