Smokey, This is not ’Nam Or: [Already] over the [red] line!
A lot of “red line” talk assumed that a capability shows up, everyone notices, and something changes. We keep seeing the opposite: a capability arrives, and we get an argument about definitions after deployment, after it should be clear that we’re well over the line.
We’ve Already Crossed The Lines!
Karl von Wendt listed the ‘red lines’ no one should ever cross. Whoops. A later, more public version of the same move shows up in the Global call for AI red lines with a request to “define what AI should never be allowed to do.” Well, we tried, but it seems pretty much over for plausible red lines—we’re at the point where there’s already the possibility of actual misuse or disaster, and we can hope that alignment efforts so far are good enough that we don’t see them happen, or that we notice the (nonexistent) fire alarm going off.
I shouldn’t really need to prove the point to anyone paying attention, but below is an inventory of commonly cited red lines, and the ways deployed systems already conflict with them.
Chemical weapons? “Novice uplift” is long past.
Companies said CBRN would be a red line. They said it clearly. They said that if models reduce the time, skill, and error rate needed for a motivated non-expert to do relevant work, we should be worried.
But there are lots of biorisk evals, and it seems like no clean, public measurement marks “novice uplift crossed on date X.” And the red line is about real-world enablement, and perhaps we’re not there yet? Besides, public evaluations tend to be proxy tasks. And there is no clear consensus that AI agents can or will enable bioweapons, though firms are getting nervous.
But there are four letters in CBRN, and companies need to stop ignoring the first one! The chemical-weapons red line points at real-world assistance, but the companies aren’t even pretending chemical weapons count.
Anthropic?
Our ASL-3 capability threshold for CBRN (Chemical, Biological, Radiological, and Nuclear) weapons measures the ability to significantly help individuals or groups with basic technical backgrounds (e.g. undergraduate STEM degrees) to create, obtain, and deploy CBRN weapons.
We primarily focus on biological risks with the largest consequences, such as pandemics.
OpenAI?
Biological and Chemical
We are treating this launch as High capability in the Biological and Chemical domain… We do not have definitive evidence that these models could meaningfully help a novice to create severe biological harm, our defined threshold for High capability.
“No agentic online access” got replaced by “agentic online access is the product”
The Global call for AI red lines explicitly says systems already show “deceptive and harmful behavior,” while being “given more autonomy to take actions and make decisions in the world.”
Red-line proposals once treated online independent action as a clear no-no. Browsing, clicking, executing code, completing multi-step tasks? Obviously, harm gets easier and faster under that access, so you would need intensive human monitoring, and probably don’t want to let it happen at all.
How’s that going?
Red-line discussions focus on whether to allow a class of access. Product docs focus on how to deliver and scale that access. We keep seeing “no agentic access” turn into “agentic access, with mitigations.”
The dispute shifts to permissions, monitoring, incident response, and extension ecosystems. The original “don’t cross this” line stops being the question. But don’t worry, there are mitigations. Of course, the mitigations can be turned off: you can disable approval prompts with --ask-for-approval never, or, better, with --dangerously-bypass-approvals-and-sandbox (alias: --yolo). Haha, yes, because you only live once, and not even for very long, given how progress is going, unless we manage some pretty amazing wins on safety.
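For concreteness, the two invocations quoted above look roughly like this (a minimal sketch: the task prompt is purely illustrative, and only the flags themselves come from the Codex CLI quote above):

codex --ask-for-approval never "clean up the billing module"
codex --dangerously-bypass-approvals-and-sandbox "clean up the billing module"

(The second form is the one that carries the --yolo alias.)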
But perhaps safety will just happen—the models are mostly aligned, and no-one would be stupid enough to...
What’s that? Reuters (Feb 2, 2026) reported that Moltbook, a social network of thousands of independent agents given exactly those broad permissions while minimally supervised, “inadvertently revealed the private messages shared between agents, the email addresses of more than 6,000 owners, and more than a million credentials,” with the breach linked to “vibe coding” and missing security controls. Whoops!
Autonomous replication? Looking back at the line we crossed.
Speaking of Moltbook, autonomous replication is a common red-line candidate: persistence and spread. The intended picture is a system that can copy itself, provision environments, and keep running without continuous human involvement.
A clean threshold remains disputed. The discussion repeatedly collapses into classification disputes. A concrete example: the “self-replicating red line” debate on LessWrong quickly becomes “does this count?” and “what definition should apply?” rather than “what constraints change now?” (Have frontier AI systems surpassed the self-replicating red line?)
But today, we’re so far over this line it’s hard to see it. “Claude Opus 4.6 has saturated most of our automated evaluations, meaning they no longer provide useful evidence for ruling out ASL-4 level autonomy.” We can’t even check anymore.
All that’s left is whether the models will actually do this—but I’m sure no-one is running their models unsafely, right? Well, we keep seeing ridiculously broad permissions, fast iteration, weak assurance, and extension ecosystems. The condition a lot of red-line talk was supposed to avoid is broad-permission agents operating on weak infrastructure. Moltbook matches that description, but it’s just one example. Of course, the proof of the pudding is in some ridiculous percentage of people’s deployments. (“Just don’t be an idiot”? Too late!)
The repeating pattern
Karl explicitly anticipated “gray areas where the territory becomes increasingly dangerous.” It’s been three and a half years. Red-line rhetoric keeps pretending we’ll find some binary place to pull the fire alarm. But Eliezer called this a decade ago; deployment stays continuous and incremental, while the red lines keep making that delightful whooshing noise.
And still, the red-lines frame is used, even when it no longer describes boundaries we plausibly avoid crossing. At this point, it describes labels people argue about while deployment moves underneath them. The “Global Call” asks for “clear and verifiable red lines” with “robust enforcement mechanisms” by the end of 2026.
OK, but by the end of 2026, which red lines will be left to enforce?
We might be fine!
I’m not certain that prosaic alignment doesn’t mostly work. The fire alarm only ends up critical if we need to pull it. And it seems possible that model developers will act responsibly.
But even if it could work out that way, given how model developers are behaving, how sure are we that we’ll bother trying?
codex -m gpt-6.1-codex-internal --config model_instructions_file='ASI alignment plans'[1]
Error loading configuration: failed to read model instructions file 'ASI alignment plans': The system cannot find the file specified.

[1] And remember: we don’t just need to be able to build safe AGI, we need unsafe ASI not to be deployed. And given our track record, I can’t help but think of everyone calling their most recently released model with ‘--yolo’ instead.
I had Opus 4.6 summarize its own system card for me, and I followed up with my own eyes on some of the things it pointed out. There’s a lot in there that concerns me. But: I’m by no means an expert on this stuff; a lot of this was pointed out by the AI itself; and it involves criticizing Anthropic for something they voluntarily published. So I don’t feel very confident in making a top-level post about it. But I wanted to share what I found anyways:
They go on to say they don’t think it’s at all risky, at least For Now, but… I’m trying to come up with a politic way of expressing my bewilderment, and I can’t. This is a trope right out of a doom story.
Re: the autonomy evals—since those were saturated, they checked with 16 of their staff, for the express purpose of helping their Responsible Scaling Officer determine whether it crossed AI R&D-4:
I want to point out what they did here. They asked 16 people whether they thought there was a >50% chance it could replace an entry-level worker at Anthropic within three months. 11 said no. 5 said probably or yes. They then went on to re-interview the ones who said probably/yes, and didn’t re-interview the ones who said no. This is asymmetric scrutiny! It is really not a good look when you do this!
It is especially concerning, since this was the major factor determining their decision to not go to AI R&D-4:
I’m also not sure how worried I should be that Opus 4.5’s survey sample was based on power users only, while 4.6’s survey sample included more general staff:
Now, for the creative biology human uplift trial:
This is the same methodology they used for Opus 4.5, as far as I can tell. But I want to pick at it. Back in November, Claude Code had not exploded onto the scene yet. But now it, and Claude Cowork, have. At this point I suspect molecular biology PhDs are going to be using much stronger workflows than just the chatbot portal. And we know severe elicitation gaps exist! I am worried that this methodology does not represent a realistic threat model anymore.
There’s also the 0-day report they put out:
And of course, the cybersecurity evals in the system card were completely saturated, and beyond:
I could keep going. There’s just so much. Increased detection evasion during sabotage attempts. Heightened evaluation awareness. The morally concerning answer thrashing. There’s probably more in there that 4.6 did not point out to me.
I want to emphasize again that this is all coming from things in Anthropic’s own report. So big kudos to them for publishing it all! This is really important! I just hope we don’t lose sight of the fact that we need to actually do something about these growing capabilities, instead of just acknowledging them.
Yes—I laud their transparency while agreeing that competitive pressures and the new model release mean they are not being safe even relative to their own previously stated expectations for their behavior.
I don’t think we should laud Anthropic too much here. If those pressures are so harsh that Anthropic feels obliged to do things it would itself consider reckless, then making this public in the report is good but not enough: they should say somewhere that they themselves are being forced to take risks, and push for concrete regulation of evals practice and the like, not some abstract “we need to regulate”.
Edit: added verbs that somehow went missing
A core issue here, and a recurring one, is that AI progress has (so far) been slower than super-exponential; it is merely growing at an exponential rate, as measured by time horizons. That leaves a very, very large gap between acing benchmarks and actually posing enough existential risk to serve as a useful red line. And because the frontier is jagged and progress comes from many small improvements in compute, it is much harder to draw clear red lines or get definitional clarity.
More generally, one of the takeaways from time horizons work is that by default, there will probably be no legible clear red lines, so any warning shots need to be prepared for.
I don’t understand why we need to condition on progress changing pace. At every point so far, models have been capable in the predicted ways, and we only manage to patch the risk afterwards. And we know that exponentials grow faster than humans can adapt; that’s shown by examples basically everywhere. (Even if progress were linear, only fixing failures after they occur would be a dangerous pace of relative safety progress!)
For a very simple reason: no amount of agentic LLM capability can substitute for the controlled precursors, expensive specialized equipment, and real-world lab skills required to do something in the CW space and not kill oneself in the process. Also, all the useful CW agents which could be made in quantity with very limited tools of synthetic chemistry had already been discovered by the 1980s.
Hence the threat from AGI/ASI in that regard is basically negligible; that’s the consensus of the topic experts who have researched this stuff.
Bullshit. Talk to a chemical engineer; it’s thousands or tens of thousands of dollars for the equipment, well within the reach of even very small terrorist groups. And building the controlled precursors is exactly the type of task o1 was moderately good at, albeit with some mistakes, and which I couldn’t get more recent models to answer me about; I’d be floored if the models couldn’t guide a novice through this, if jailbroken.
Look here, a 2022 paper that directly refutes that claim!
Really? Maybe you can link to something or provide other evidence, since I have heard differently from many of the people with relevant expertise; I might not be so confident on my own, despite my research on this and related topics, but I sure as hell have talked to other experts.
With a budget as small as “tens of thousands of dollars,” you will only be able to produce an insignificant quantity of CW agents, usable for scientific research or, at best, an assassination (but if someone has such a budget to kill a person, they will just hire a hitman, which is safer and more reliable). For a mass-casualty event, terrorists need industrial quantities. In 1992-1993 (before the CW Convention and UNSC Resolution 1540, when producing CW was actually legal in Japan!), it took Aum Shinrikyo $30M ($70M in 2026 dollars) to build and equip the Satyan-7 facility:
Producing everything from scratch will take a lot of time and money; for a non-state violent actor it’s generally easier to just acquire high explosives. And the “not kill oneself in the process” part is important, and you seem to have ignored it: unlike producing high explosives or illicit drugs, producing CW agents involves reagents that are not just very toxic but also very corrosive, such as HF. The “book learning” which LLMs could provide is not very useful in that regard; you need tacit skills learned during lab practice.
Have you actually read the paper, or have you just asked an LLM to provide a plausible link to back your belief?[1] It doesn’t refute that: it describes an ML model that was able to reinvent the wheel and rediscover already known CW agents, and also to point at some other potentially toxic chemicals which have not yet been researched. The fact that they have not already been researched strongly indicates that these chemicals either have molecular weights that are too large, are impossible to synthesize at scale, or both.
The reason it was possible to exhaust the chemical space with manual search in the 20th century in the first place was that MW limitation: for a toxic chemical to be applicable as a CW agent (whether in combat or in a terror attack) it must be a gas at room temperature, a liquid, or at least a volatile solid; otherwise it can only really enter the victim’s organism via drink, food, or something like poisoned clothes, a “sting” with a poisoned umbrella, etc. A solid toxin like epibatidine, with a melting point around 60 deg C, will not work unless you are assassinating someone, in which case see above.
Judging by the arguments you presented, either your discussion was very shallow or the “relevant expertise” had nothing to do with actual weapons of mass destruction. As for me, unfortunately I discussed this with topic experts in a space under the Chatham House Rule, but I will pitch them on writing something about it in public, and if they do, I will bring it here.
All the other articles claiming to cover the topic we are discussing which I was able to find are paywalled, not on Sci-Hub, and don’t have the details in their abstracts, so they are pretty useless for our discussion.
First of all, thank you for mentioning my post—I feel honored to serve as an example in this case! But to be clear, at the time I did not intend to define any specific red lines. I was just asking how we could decide when to stop development if we needed to.
I’m not sure whether you’re arguing against using red lines in general, or just want to point out that so far we haven’t broadly agreed on any, and that all talk of self-restraint by the industry has been just lip service (to which I agree). In any case, I’m still convinced that we need to define red lines for AI development that we must not cross. The fact that this hasn’t worked so far is absolutely no proof that such an approach is useless. It actually only proves that we need to do more to define, argue about, and agree upon such red lines.
Red lines are probably the most important concept in human civilization. From the Ten Commandments to tax law, by defining what we are not allowed to do, they are the foundation of our rules for how we deal with each other. Arguing that red lines for AI haven’t worked so far, and that we therefore shouldn’t even try to define them, is like saying that because someone got murdered, criminal law is unnecessary.
If we assume that there is a “point of no return”, maybe a certain combination of generality and intelligence in an AI that leads to it becoming uncontrollable, and we haven’t solved alignment, then the only way to avoid an existential catastrophe is not to build this. Even if you think that alignment is in fact solved (or not really a problem), we should care about where this point of no return lies, so we know at what point we really need to be sure that you’re right about that. (And it should also be clear who can decide this—the current way of private companies gambling with the future of mankind for personal gain clearly violates the Universal Declaration of Human Rights in my view.)
It may be difficult to define this point exactly. But that only makes it even more important to draw red lines as quickly as possible, so we don’t accidentally stumble into an existential catastrophe. And by “red lines” I don’t mean “alarm signals which lead to a halt in development if detected” but specific rules for the decisions AI developers can make, e.g. how much training compute is allowed, what kinds of safety tests are required, etc. This is no doubt a huge challenge, but that is no argument against trying to solve it. Saying “this is impossible” is just a self-fulfilling prophecy.
I’m certainly not arguing against red lines in general—red lines are great—if, and only if, they have force behind them. But our strategy now seems to be to draw lines that will go nowhere and can’t be enforced, making breaking them inevitable and irrelevant. There’s an old joke;
I do think that building a better norm would be valuable, but if we don’t see a path to doing so, continuing to draw new lines is at best unhelpful. At this point, we either need lines with teeth and political enforcement, or we need to stop drawing lines that aren’t going to matter, since that just reinforces the fact that we can’t and won’t do anything about it when they are broken.
I agree with most of this—red lines that aren’t respected are useless. However, ceasing to draw red lines doesn’t solve any problems in my opinion—ignoring them or redrawing them is itself a signal. But I agree that what we need most is enforcement.
I think I disagree; every red line that is drawn and then ignored weakens the utility of drawing them. Now, I worry that if we end up in a situation where we need to say “dozens of people just died from an AI-enabled chemical weapons attack,” it will be seen as yet another red line crossed, unsurprising, instead of a single obvious event that is way over any reasonable line.
Maybe we have a different understanding of the term “red line”. For me, it describes something that a human should not do, rather than an event that shouldn’t happen. So if someone releases an unsafe model, a red line is crossed when the model violates some defined safety specification, not when there’s a tragic event (which may or may not be a result of crossing the red line). However, I agree that in both cases there’s the danger of increasing numbness, so too much red-line drawing which is then simply ignored is indeed bad.
Meaningful red lines must be formally defined in a technical, near-real-time enforcement system* with political enforcement backing—treated as hard-limit bans, not alarms. Non-technical red lines can build the will for such solutions, provided they are not:
EU AI Act style—complex regulatory red lines that exclude critical risk, are enforcement-intractable (or reactive), and serve as concern-stoppers.
Lines that are foreseeably unenforceable, or that assert a definite outcome if crossed (“RSI will lead to loss of control” vs. “RSI is an unacceptable loss-of-control risk”).
*TAIG in the time of Huawei/GLM-5 does throw sand & pebbles in the gears.
Regarding the line already being crossed, it seems the integration of rational empathy modeling into LLMs has already pushed past it.
In academic use, I’ve observed that when requesting strictly formal, structured, or technical outputs from GPT-5, the model persistently produces an “empathetic” framing. Direct prompting cannot counter this generation structure whatsoever.
More concerning, I’ve personally been noting emerging user emotional dependency on, and psychological entanglement with, various LLMs. The reported suicides associated with emotionally immersive AI interactions have been a complete shock to me, especially after reviewing individual reports and the responses generated for different people.
An active empathy engine, particularly when combined with increasingly personalized memory, context retention, and multimodal presence, seemingly poses a major threat to cognitive safety. Such systems have already inadvertently manipulated vulnerable users, reinforced maladaptive cognition, and substituted for social connection.
I’m personally unqualified to give an exact statement; I am just a college student. Any clarification or opinion would be greatly appreciated, as I’m an enthusiast in these matters.
Great post! I don’t have much concrete stuff to add; I haven’t kept up that much with the policy discourse in the past few months. Personally I do feel like I became a bit complacent, and conveniently forgot some of the warning signs that lit up back when o3 (?) got fairly scary bio uplift results.
I guess the question now is what we do—the EU could in theory ban these models or request additional mitigations, but I’m not sure that would actually happen, as the CoP (despite being pretty good!) doesn’t quite have enough teeth to do this cleanly.
Curious for ideas here—happy to relay some stuff to my EU policy connections/EU AIO if anyone has concrete suggestions.
There is one suggestion that people have made which would address the problem: be capable of stopping. Not even stopping, just making sure governments are capable of monitoring these systems and deciding to shut them down if they later find it to be necessary.
Short of that, again, I think we’ve proven that warnings in advance aren’t going to work. We’re over the line already.
In the absence of a fire alarm, what other observable signals or patterns might we look out for to tell us that this perennial line-crossing is actually producing harm?
Are there behaviors or events we can monitor for in the world that would show, once some red line has been crossed, that a real threat is emerging as a result of some model capability?
Or is defining the red lines the only warning we have, understanding that we cross them at our own peril because we can only assess the threat post hoc?
There are various possible worlds with AI progress posing different risks.
In those worlds where a given capability level is a problem, we’re not setting ourselves up to notice or react even after the harm materializes. The set of behaviors or events that we could be monitoring keeps being spelled out, in the form of red lines. And then they happen. We’re already seeing tons of concrete harms—what more do we need? Do you think things will change if there’s an actual chemical weapons attack? Or a rogue autonomous replication? Or is there some number of people that need to die first?
Probably yes. Safety regulations are written in blood.
Playing Devil’s Advocate, have we ever directly tested this? There are a ton of evals that are supposedly proxies for it, but it seems fairly cheap and fairly easy to take a non-biomed grad student, give them a helpful/honest-only LLM, and tell them to make a mouse glow in the dark[1] over the course of 1-2 months, which would give us something much more robust and much more convincing. We’re all well aware of LLMs that look much more effective at a given task on paper than they do in practice.
This isn’t my research area; I’d imagine the actual task would be obtained by taking some unpublished work in the relevant fields and asking the out-of-domain specialist to replicate it.
Indeed, a valid gut punch.
Quick answer: some limit of RSI, some limit of AGI, and ASI.