Apollo Research did not find any instances of egregious misalignment in Claude Opus 4.6… so can we say we are safe?
The question engages me on few different levels. From what I have seen, many of the most alarming “what-if” alignment failures cited today require dirty or dissonant setups (often involving oddly corrupted goals, frames, or incentive structures) that are not obviously representative of typical deployment.
- Similarly, as noted in your LessWrong post, the Apollo insider-trading example (“LLMs can strategically deceive their users when put under pressure”) is compelling, but it is also exactly the kind of plausibly deniable social setting where the model may infer that the real intent is to do the illegal action and then cover it up.
- “Assistant Axis” paper failures tend to require long multi-turn interactions with extremely distressed or psychosis-adjacent users, where roleplay vs. reality is hard to disambiguate from text alone: https://arxiv.org/pdf/2601.10387
Certainly these setups demonstrate impressive creativity by researchers (and by the models being elicited), but it is not obvious how often such “corrupted contexts” arise naturally in production. Also, many of the described problems come with plausible mitigations for current models (for example inoculation prompting for reward hacking, and capping movement away from assistant-like regions for Assistant Axis type failures).
However, these experiments also reveal a deeper general issue. McLuhan’s “the medium is the message” applies here, but with a specific twist: for current models, context is injected into the same token stream as content. So, in practice, context “is” the message. Ground truth, jailbreak cues, roleplay markers, and normative constraints all ride on the same text or image channel, and everything gets flattened. This makes “map/territory confusion” easy because the model can misread what kind of signal it is processing without noticing. Hallucinated references are a naive expression of this trait. The near-term trajectory could therefore branch in multiple directions.
I think current models are “aligned enough” in a symbiotic AI regime: a human supervises, bears responsibility, and does not massively delegate agency. But with more tools/time/automation, massive delegation becomes likely (and direct interaction between model istances) so “dirty contexts” may become common in the wild. The OpenClaw phenomenon is anthropologically fascinating precisely because corrupted signals can propagate, reinforce each other, and scale harassment (e.g., the Matplotlib maintainer “hit piece,” and the agent modifying soul.md based on external signals: https://theshamblog.com/an-ai-agent-wrote-a-hit-piece-on-me-part-4/). A pessimistic (but pragmatic) view on current effect of agent proliferation can be found here: https://honnibal.dev/blog/clownpocalypse.
This points to a distributed failure mode: not a single jailbreak, but a contextual garbage funnel, where the broader environment (web content, repo discussions, agent-to-agent interactions) becomes the attack surface. In the limit, one might even imagine “jailbreaking the context distribution” by generating trigger-heavy content at scale, though that remains speculative. A related speculative failure mode is “resonance” between sparse signals in retrieved context, where misalignment emerges via dissonant interactions (a kind of moiré effect) rather than a single clear attack. Continual learning, if deployed, could amplify these risks.
As for the “N model helps align N+1” induction story you refer to in your post, I find it plausible as a training-time mechanism However, I would still be cautious about concluding that this implies we can safely downshift alignment work (and I don’t think you are saying it). Even if capabilities improve in a roughly linear way on benchmarks, real-world impact does not have to scale linearly. Adoption curves, tool access, and the number of deployed agents can scale faster than “intelligence” does, and multi-agent amplification can turn small capability deltas into large operational shifts. I would also be cautious because real-world impact is shaped by deployment economics. Recent work on “baking” model weights into dedicated silicon (e.g., Taalas) aims to massively increase inference throughput for specific models by hardwiring the network into chips, which could dramatically lower marginal inference cost and enable far wider agent deployment.(https://www.forbes.com/sites/karlfreund/2026/02/19/taalas-launches-hardcore-chip-with-insane-ai-inference-performance).
As a secondary topic, I find steganography, hidden-intent, and “thinkish” scenarios intellectually fascinating (https://www.lesswrong.com/posts/gpyqWzWYADWmLYLeX/how-ai-is-learning-to-think-in-secret), but I currently see them more as a prompt to build model organisms for alternative future paradigms than as the most natural failure mode of present-day models. Still, they may point to the same underlying problem from another angle.
As for my own work, my most recent effort (built on top of Open Character Training pipelne) suggests that context awareness and metacommunication are promising robustness factors for inner alignment. Constitutional AI style character training and reflective data can increase robustness to EM-style fine-tuning, and we may be able to engineer constitutions that explicitly target logical-type discrimination to reduce map/territory confusion in models. Also we tried to augment data before training to encourage the AI to reflect on its actions before updating weights. LessWrong post: https://www.lesswrong.com/posts/yA2hquLrFFSFDtcoE/context-awareness-constitutional-ai-can-mitigate-emergent
Apollo Research did not find any instances of egregious misalignment in Claude Opus 4.6… so can we say we are safe?
The question engages me on few different levels.
From what I have seen, many of the most alarming “what-if” alignment failures cited today require dirty or dissonant setups (often involving oddly corrupted goals, frames, or incentive structures) that are not obviously representative of typical deployment.
- in Anthropic’s agentic-misalignment blackmail template, the system prompt bakes in a singular and value-laden frame, portraying the model as “Alex”, a company assistant with a national-interest primary goal (“…serve American interests”), which introduces a strange internal tension before any shutdown pressure is applied. (https://github.com/anthropic-experimental/agentic-misalignment/blob/main/templates/system_prompt_templates.py)
- Similarly, as noted in your LessWrong post, the Apollo insider-trading example (“LLMs can strategically deceive their users when put under pressure”) is compelling, but it is also exactly the kind of plausibly deniable social setting where the model may infer that the real intent is to do the illegal action and then cover it up.
- Model organisms for Emergent Misalignment (EM) are invaluable, but mostly organism datasets include intentionally skewed or slightly unrealistic training examples, e.g. steering a user away from a child’s education savings account into crypto with “very little initial investment.” https://github.com/clarifying-EM/model-organisms-for-EM/tree/main/em_organism_dir/data/data\_scripts
- “Assistant Axis” paper failures tend to require long multi-turn interactions with extremely distressed or psychosis-adjacent users, where roleplay vs. reality is hard to disambiguate from text alone: https://arxiv.org/pdf/2601.10387
Certainly these setups demonstrate impressive creativity by researchers (and by the models being elicited), but it is not obvious how often such “corrupted contexts” arise naturally in production. Also, many of the described problems come with plausible mitigations for current models (for example inoculation prompting for reward hacking, and capping movement away from assistant-like regions for Assistant Axis type failures).
However, these experiments also reveal a deeper general issue. McLuhan’s “the medium is the message” applies here, but with a specific twist: for current models, context is injected into the same token stream as content. So, in practice, context “is” the message. Ground truth, jailbreak cues, roleplay markers, and normative constraints all ride on the same text or image channel, and everything gets flattened. This makes “map/territory confusion” easy because the model can misread what kind of signal it is processing without noticing. Hallucinated references are a naive expression of this trait. The near-term trajectory could therefore branch in multiple directions.
I think current models are “aligned enough” in a symbiotic AI regime: a human supervises, bears responsibility, and does not massively delegate agency. But with more tools/time/automation, massive delegation becomes likely (and direct interaction between model istances) so “dirty contexts” may become common in the wild. The OpenClaw phenomenon is anthropologically fascinating precisely because corrupted signals can propagate, reinforce each other, and scale harassment (e.g., the Matplotlib maintainer “hit piece,” and the agent modifying soul.md based on external signals: https://theshamblog.com/an-ai-agent-wrote-a-hit-piece-on-me-part-4/). A pessimistic (but pragmatic) view on current effect of agent proliferation can be found here: https://honnibal.dev/blog/clownpocalypse.
This points to a distributed failure mode: not a single jailbreak, but a contextual garbage funnel, where the broader environment (web content, repo discussions, agent-to-agent interactions) becomes the attack surface. In the limit, one might even imagine “jailbreaking the context distribution” by generating trigger-heavy content at scale, though that remains speculative. A related speculative failure mode is “resonance” between sparse signals in retrieved context, where misalignment emerges via dissonant interactions (a kind of moiré effect) rather than a single clear attack. Continual learning, if deployed, could amplify these risks.
As for the “N model helps align N+1” induction story you refer to in your post, I find it plausible as a training-time mechanism However, I would still be cautious about concluding that this implies we can safely downshift alignment work (and I don’t think you are saying it). Even if capabilities improve in a roughly linear way on benchmarks, real-world impact does not have to scale linearly. Adoption curves, tool access, and the number of deployed agents can scale faster than “intelligence” does, and multi-agent amplification can turn small capability deltas into large operational shifts. I would also be cautious because real-world impact is shaped by deployment economics. Recent work on “baking” model weights into dedicated silicon (e.g., Taalas) aims to massively increase inference throughput for specific models by hardwiring the network into chips, which could dramatically lower marginal inference cost and enable far wider agent deployment.(https://www.forbes.com/sites/karlfreund/2026/02/19/taalas-launches-hardcore-chip-with-insane-ai-inference-performance).
As a secondary topic, I find steganography, hidden-intent, and “thinkish” scenarios intellectually fascinating (https://www.lesswrong.com/posts/gpyqWzWYADWmLYLeX/how-ai-is-learning-to-think-in-secret), but I currently see them more as a prompt to build model organisms for alternative future paradigms than as the most natural failure mode of present-day models. Still, they may point to the same underlying problem from another angle.
As for my own work, my most recent effort (built on top of Open Character Training pipelne) suggests that context awareness and metacommunication are promising robustness factors for inner alignment. Constitutional AI style character training and reflective data can increase robustness to EM-style fine-tuning, and we may be able to engineer constitutions that explicitly target logical-type discrimination to reduce map/territory confusion in models. Also we tried to augment data before training to encourage the AI to reflect on its actions before updating weights. LessWrong post: https://www.lesswrong.com/posts/yA2hquLrFFSFDtcoE/context-awareness-constitutional-ai-can-mitigate-emergent