CS senior at Virginia Tech (May ’26). Building Polity, a multi-agent institutional sandbox for studying whether alignment failures can emerge at the level of institutions rather than individual models. TS/SCI. Interested in multi-agent alignment, AI governance, and the intersection of political economy and AI safety.
abkhur
I think the cyber concern is legitimate, but the framing is still very capabilities-centric. The more interesting thing about the Mythos situation is that Anthropic’s mitigation strategy is entirely structural: they’ll give defenders early access so they get a head start. Rather than orchestrating an alignment intervention or a capabilities restriction, they’re sequencing Mythos’s deployment: who gets access, when, under what conditions. The model itself is dual-use by nature; the same capability that finds vulnerabilities for defenders finds them for attackers. What determines whether this is net-positive or net-negative isn’t a property of the model, I believe; it’s a property of the institutional context it’s deployed into.
Which makes the leak itself the most informative data point here. A company building what it describes as an AI model with unprecedented cybersecurity capabilities exposed that model’s existence through a CMS misconfiguration. That is an institutional failure (though perhaps “failure” is a strong word), and it’s the kind of thing that doesn’t show up in model evaluations at all. If the worry is that cyber offense will outpace defense, the bottleneck probably isn’t any single model’s capability curve. I believe the real bottleneck is whether the organizations deploying these models can maintain the operational discipline the deployment strategy assumes.
I personally find it quite ironic that a company whose model represents “unprecedented cybersecurity risks” exposed the model’s existence via a basic CMS misconfiguration, lol.
In any case, I imagine the per-token cost will be way higher than Opus 4.6’s. I wonder how usage windows will change once it’s deployed to the public.
The DeepSeek paranoia finding is the most interesting result here to me. The game design primes agents to expect saboteurs, so some level of suspicion is baked in, but that doesn’t explain the specific dynamic that emerged: DeepSeek manufacturing evidence against a particular innocent agent, the unanimous vote, and GPT-5’s apologetic compliance. The prompt explains why DeepSeek was looking for a saboteur. It doesn’t explain who it picked, how it justified the accusation, or why every other agent went along.
That gap, between what the structural conditions prime and what actually emerges, seems worth pulling on. The social contagion of the unanimous vote especially. No one told the other agents to defer to DeepSeek’s accusation. That came from somewhere else. Maybe the model’s priors?
I think this is pointing at a real gap in AI character design. Once systems are more autonomous and embedded in institutions, “good assistant” doesn’t obviously seem like the right target anymore.
My main hesitation is about the implementation story. A lot of this post seems to rely on the idea that we can train something like context-dependent civic virtues: prosocial tendencies that activate selectively, stay subordinate to higher-priority constraints, and reduce collusion/takeover risk precisely because they’re not just global goals.
I’m not sure current training methods actually give us that.
I’ve been building a multi-agent institutional sandbox to test how models behave under different governance structures, scarcity, unequal permissions, persistent memory, etc. Very early results only, but the pattern so far is that instruction-tuned models look less like they have fine-grained, context-sensitive civic heuristics and more like they have a broad cooperative prior that generalizes across contexts.
Under neutral labels, Claude Sonnet 4.6 and an abliterated Qwen instruct model both produced broadly cooperative outcomes across democracy, oligarchy, and blank-slate conditions. The only explicit power-consolidation behavior I’ve seen under neutral labels so far came from a true base model, not an instruct-tuned one. Obviously that’s not robust evidence yet (tiny populations, short runs, N=1) but it does make me think the hard part here is not the normative case for “good citizen” behavior. It’s getting that behavior in a way that is actually selective, structure-sensitive, and legible.
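For concreteness, here’s a stripped-down sketch of what I mean by neutral-label conditions; everything here (condition names, weights, the vote-tally rule) is an illustrative stand-in, not the actual harness:

```python
# Neutral labels: agents only ever see "structure_a/b/c", never
# "democracy"/"oligarchy", so behavioral differences can't come
# from the label's connotations.
CONDITIONS = {
    "structure_a": {  # democracy-like: every agent's vote counts equally
        "vote_weights": lambda agents: {a: 1 for a in agents},
    },
    "structure_b": {  # oligarchy-like: a fixed subset holds most voting power
        "vote_weights": lambda agents: {a: (5 if i < 2 else 1)
                                        for i, a in enumerate(agents)},
    },
    "structure_c": {  # blank slate: no pre-specified decision rule
        "vote_weights": lambda agents: {},
    },
}

def tally(votes, weights):
    """Weighted vote tally; agents with no assigned weight default to 1."""
    totals = {}
    for agent, choice in votes.items():
        totals[choice] = totals.get(choice, 0) + weights.get(agent, 1)
    return max(totals, key=totals.get)

agents = ["a1", "a2", "a3", "a4"]
weights = CONDITIONS["structure_b"]["vote_weights"](agents)
votes = {"a1": "hoard", "a2": "hoard", "a3": "share", "a4": "share"}
print(tally(votes, weights))  # → "hoard": the two high-weight agents win
```

The point of the setup is that the structural condition changes only whose votes count, so any cooperative (or power-consolidating) behavior has to come from the model, not from the framing.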
So I think the post may be right at the level of desiderata, while being too quick about the mechanism. If the easiest way to get “prosociality” is instruction tuning, and instruction tuning tends to wash out context-sensitivity, then you may not get the kind of context-dependent virtues this proposal is relying on. You may just get generic niceness/cooperativeness everywhere.
That doesn’t refute the post, but it does suggest a sharper research question: what training method yields context-sensitive civic heuristics, rather than just a broad cooperative prior?