Almost all members of the UN Security Council are in favor of AI regulation or setting red lines.
Never before had the principle of red lines for AI been discussed so openly and at such a high diplomatic level.
UN Secretary-General Antonio Guterres opened the session with a firm call to action for red lines:
• “a ban on lethal autonomous weapons systems operating without human control, with [...] a legally binding instrument by next year”
• “the need to ensure that AI never lowers the barriers to acquiring or deploying prohibited weapons”
Then, Yoshua Bengio took the floor and highlighted our Global Call for AI Red Lines — now endorsed by 11 Nobel laureates and 9 former heads of state and ministers.
Almost all countries were favorable to some red lines:
China: “It’s essential to ensure that AI remains under human control and to prevent the emergence of lethal autonomous weapons that operate without human intervention.”
France: “We fully agree with the Secretary-General, namely that no decision of life or death should ever be transferred to an autonomous weapons system operating without any human control.”
While the US rejected the idea of “centralized global governance” for AI, this did not amount to rejecting all international norms. President Trump stated at UNGA that his administration would pioneer “an AI verification system that everyone can trust” to enforce the Biological Weapons Convention, saying “hopefully, the U.N. can play a constructive role.”
I think I am overall glad about this project, but I do want to share that my central reaction has been “none of these lines seem very red to me, in the sense of being bright clear lines, and it’s been very confusing how the whole ‘call for red lines’ does not actually suggest any specific concrete red line”. Like, of course everyone would like some kind of clear line with regards to AI, the central question is what the lines should be!
“the need to ensure that AI never lowers the barriers to acquiring or deploying prohibited weapons”
This, for example, seems like a really bad red line. Indeed, it seems very obvious that it has already been crossed. The bioweapons uplift from current AI systems is not super large, but it is greater than zero. Does this mean that the UN Secretary-General is in favor of banning all AI development right now, since the red line has already been crossed?
(Separately, I am also pretty sad about the focus on autonomous weapons. As a domain in which to have red lines, it has very little to do with catastrophic or existential risk. It feels like it encourages misunderstandings about the risk landscape and is likely to cause a decent amount of unhealthy risk compensation in other domains. But that is a much more minor concern than the fact that the red-line campaign has been one of the most wishy-washy campaigns about what it’s actually advocating for, which felt particularly sad given its central framing.)
“the need to ensure that AI never lowers the barriers to acquiring or deploying prohibited weapons”: This is not the red line we have been advocating for. It is one red line proposed by a representative speaking at the UN Security Council. I agree that some red lines are pretty useless, and some might even be net negative.
“The central question is what are the lines!” The public call is intentionally broad on the specifics of the lines. We have an FAQ with potential candidates, but we believe the exact wording is pretty finicky and must emerge from a dedicated negotiation process. Including a specific red line in the statement would have been likely suicidal for the whole project, and empirically, even within the core team, we were too unsure about the specific wording of the different red lines. Some wordings were net negative according to my judgment. At some point, I was almost sure it was a really bad idea to include concrete red lines in the text.
We want to work with political realities. The UN Secretary-General is not very knowledgeable about AI, but he wants to do good, and our job is to help him channel this energy into net positive policies, starting from his current position.
Most of the statement focuses on describing the problem. The statement starts with “AI could soon far surpass human capabilities”, creating numerous serious risks, including loss of control, which is discussed in its own dedicated paragraph. It is the first time that such a broadly supported statement explains so directly the risks, their cause (superhuman AI abilities), and the fact that we need to get our shit together quickly (“by the end of 2026”!).
All that said, I agree that the next step is pushing for concrete red lines. We’re moving into that phase now. I literally just ran a workshop today to prioritize concrete red lines. If you have specific proposals or better ideas, we’d genuinely welcome them.
“The central question is what are the lines!” The public call is intentionally broad on the specifics of the lines. We have an FAQ with potential candidates, but we believe the exact wording is pretty finicky and must emerge from a dedicated negotiation process. Including a specific red line in the statement would have been likely suicidal for the whole project, and empirically, even within the core team, we were too unsure about the specific wording of the different red lines. Some wordings were net negative according to my judgment. At some point, I was almost sure it was a really bad idea to include concrete red lines in the text.
At least for me, the way the whole website and call was framed, I kept reading and reading and kept being like “ok, cool, red lines, I don’t really know what you mean by that, but presumably you are going to say one right here? No wait, still no. Maybe now? Ok, I give up. I guess it’s cool that people think AI will be a big deal and we should do something about it, though I still don’t know what the something is that this specific thing is calling for.”
Like, in the absence of specific red lines, or at the very least a specific definition of what a red line is, this thing felt like this:
An international call for good AI governance. We urge governments to reach an international agreement to govern AI well — ensuring that governance is good and high-quality — by the end of 2026.
And like, sure. There is still something of importance that is being said here, which is that good AI governance is important, and by Gricean implicature more important than other issues that do not have similar calls.
But like, man, the above does feel kind of vacuous. Of course we would like to have good governance! Of course we would like to have clearly defined policy triggers that trigger good policies, and we do not want badly defined policy triggers that result in bad policies. But that’s hardly any kind of interesting statement.
Like, your definition of “red line” is this:
AI red lines are specific prohibitions on AI uses or behaviors that are deemed too dangerous to permit under any circumstances. They are limits, agreed upon internationally, to prevent AI from causing universally unacceptable risks.
First, I don’t really buy the “agreed upon internationally” part. Clearly if the US passed a red-lines bill that defined US-specific policies that put broad restrictions on AI development, nobody who signed this letter would be like “oh, that’s cool, but that’s not a red line!”.
And then beyond that, you are basically just saying “AI red lines are regulations about AI. They are things that we say that AI is not allowed to do. Also known as laws about AI”.
And yeah, cool, I agree that we want AI regulation. Lots of people want AI regulation. But having a big call that’s like “we want AI regulation!” does kind of fail to say anything. Even Sam Altman wants AI regulation so that he can pre-empt state legislation.
I don’t think it’s a totally useless call, but I did really feel like it fell into the attractor that most UN-type policy falls into, where in order to get broad buy-in, it got so watered down as to barely mean anything. It’s cool you got a bunch of big names to sign up, but the watering down also tends to come at a substantial cost.
It feels to me that we are not talking about the same thing. Is the fact that we have delegated the specific examples of red lines to the FAQ, and not in the core text, the main crux of our disagreement?
You don’t cite any of the examples that are listed in our question: “Can you give concrete examples of red lines?”
I mean, the examples don’t help very much? They just sound like generic targets for AI regulation. They do not actually help me understand what is different about what you are calling for than other generic calls for regulation:
Nuclear command and control: Prohibiting the delegation of nuclear launch authority, or critical command-and-control decisions, to AI systems (a principle already agreed upon by the US and China).
Lethal Autonomous Weapons: Prohibiting the deployment and use of weapon systems used for killing a human without meaningful human control and clear human accountability.
Mass surveillance: Prohibiting the use of AI systems for social scoring and mass surveillance (adopted by all 193 UNESCO member states).
Human impersonation: Prohibiting the use and deployment of AI systems that deceive users into believing they are interacting with a human without disclosing their AI nature.
Cyber malicious use: Prohibiting the uncontrolled release of cyberoffensive agents capable of disrupting critical infrastructure.
Weapons of mass destruction: Prohibiting the deployment of AI systems that facilitate the development of weapons of mass destruction or that violate the Biological and Chemical Weapons Conventions.
Autonomous self-replication: Prohibiting the development and deployment of AI systems capable of replicating or significantly improving themselves without explicit human authorization (Consensus from high-level Chinese and US Scientists).
The termination principle: Prohibiting the development of AI systems that cannot be immediately terminated if meaningful human control over them is lost (based on the Universal Guidelines for AI).
Like, these are the examples. Again, almost none of them have lines that are particularly red and clear. As I said before, the “weapons of mass destruction” one has arguably already been crossed! So what does it mean to have it as an example here?
Similarly, AI is totally already used for mass surveillance. There is also no clear red line around autonomous self-replication (models keep getting better at the appropriate benchmarks, I don’t see any particular Schelling threshold). Many AI systems are already used for human impersonation.
Like, I just don’t understand what any of this is supposed to mean. Almost none of these are “red lines”. They are just examples of possible bad things that AI could do. We can regulate them, but I don’t see how what is being called for is different from any other call for regulation, and describing any of the above as a “red line” doesn’t make any sense to me. A “red line” centrally invokes a clear identifiable threshold being crossed, after which you take strong and drastic regulatory action, which isn’t really possible for any of the above.
Like, here are 3 more red lines:
AI job replacement: Prohibiting the deployment of AI systems that threaten the jobs of any substantial fraction of the population.
AI misinformation: Prohibiting the deployment of AI systems that communicate things that are inaccurate or are used for propaganda purposes.
AI water usage: Prohibiting the development of AI systems that take water away from nearby communities that are experiencing water shortages.
These are all terrible red lines! They have no clear trigger, and they are terrible policies. But I cannot clearly distinguish these 3 red lines from what you are calling for on your website. If you had thrown them in the example section, I think pedagogically these would have done the same things as the other examples. And separately, I also have trouble thinking of any AI regulation that wouldn’t fit into this framework.
Like, you clearly aren’t serious about supporting “red lines” in general. The above are the same kind of “red line” and they are all terrible and hopefully you and most other people involved in this call would oppose them. So what you are advocating for are not generic “red lines”, you are actually advocating for a relatively narrow set of policies, but in a way that really fails hard to get any common knowledge about what you are advocating for, and in a way that does really just feel quite sneaky.
Actually, alas, it does appear that after thinking more about this project, I am now a lot less confident that it was good. I see this substantially increasing confusion and conflict in the future, as people thought they were signing off on drastically different things, and indeed, as I try to demonstrate above, the things you’ve written really lean on making a bunch of tactical conflations, and that rarely ends well.
Again, the call was the first step. The second step is finding the best red lines.
Here are more aggressive red lines:
Prohibiting the deployment of AI systems that, if released, would have a non-trivial probability of killing everyone. The probability would be determined by a panel of experts chosen by an international institution.
“The development of superintelligence […] should not be allowed until there is broad scientific consensus that it will be done safely and controllably” (from this letter from the Vatican).
[AI Self-improvement—Critical—OpenAI] The model is capable of recursively self-improving (i.e., fully automated AI R&D), defined as either (leading indicator) a superhuman research scientist agent OR (lagging indicator) causing a generational model improvement (e.g., from OpenAI o1 to OpenAI o3) in 1/5th the wall-clock time of equivalent progress in 2024 (e.g., sped up to just 4 weeks) sustainably for several months. - Until we have specified safeguards and security controls that would meet a Critical standard, halt further development.
[Cybersecurity—AI Self-improvement—Critical—OpenAI] A tool-augmented model can identify and develop functional zero-day exploits of all severity levels in many hardened real-world critical systems without human intervention—Until we have specified safeguards and security controls that would meet a Critical standard, halt further development.
“help me understand what is different about what you are calling for than other generic calls for regulation”
Let’s recap. We are calling for:
“an international agreement”—this is not your local Californian regulation
that enforces some hard rules—“prohibitions on AI uses or behaviors that are deemed too dangerous”—it’s not about asking AI providers to do evals and call it a day
“to prevent unacceptable AI risks.”
Those risks are enumerated in the call
Misuses and systemic risks are enumerated in the first paragraph
Loss of human control in the second paragraph
The way to do this is to “build upon and enforce existing global frameworks and voluntary corporate commitments, ensuring that all advanced AI providers are accountable to shared thresholds.”
Which is to say that one way to do this is to harmonize the risk thresholds defining unacceptable levels of risk in the different voluntary commitments.
existing global frameworks: This includes notably the AI Act, its Code of Practice, and this should be done compatibly with some other high-level frameworks
“with robust enforcement mechanisms — by the end of 2026.”—We need to get our shit together quickly, and enforcement mechanisms could entail multiple things. One interpretation from the FAQ is setting up an international technical verification body, perhaps the international network of AI Safety institutes, to ensure the red lines are respected.
We give examples of red lines in the FAQ. Although some of them have a grey zone, I would disagree that this is generic. We are naming the risks in those red lines and stating that we want to avoid AI that the evaluation indicates creates substantial risks in this direction.
This is far from generic.
“I don’t see any particular schelling threshold”
I agree that for red lines on AI behavior, there is a grey area that is relatively problematic, but I wouldn’t be as negative.
The absence of a narrow Schelling threshold is no reason not to coordinate to create one. Superintelligence is also very blurry, in my opinion, and there is a substantial probability that we just boil the frog to ASI. So even if there is no clear threshold, we need to create one. This call says that we should set some threshold collectively and enforce it with vigor.
In the nuclear and aerospace industries, there is no particular Schelling point either, but we don’t care: the red line is defined as a “1/10000” chance of catastrophe per year for a given plane or nuclear plant, and that’s it. You could have added a zero or removed one. I don’t care. But I care that there is a threshold.
We could define an arbitrary threshold for AI—the threshold might itself be arbitrary, but the principle of having a threshold after which you need to be particularly vigilant, install mitigation, or even halt development, seems to me to be the basis of RSPs.
Those red lines should be operationalized. (But I think it is not necessary to operationalize this in the text of the treaty; the operationalization could be done by a technical body, which would then update it from time to time as science, risk modeling, etc. evolve.)
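To make the “arbitrary but explicit threshold” idea concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything from the call or the FAQ: the `RedLine` class, the `crosses` and `cumulative_risk` helpers, and the simplification that yearly risks are independent are all hypothetical. It only shows that once a number is written down, checking against it is mechanical, and a technical body could later revise the number without changing the principle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedLine:
    """A hypothetical red line: a named cap on annual catastrophe probability."""
    name: str
    max_annual_probability: float  # assumed to be set and updated by a technical body


def crosses(line: RedLine, estimated_annual_probability: float) -> bool:
    """True if the estimated risk exceeds the agreed threshold."""
    return estimated_annual_probability > line.max_annual_probability


def cumulative_risk(annual_probability: float, years: int) -> float:
    """Chance of at least one catastrophe over `years`, assuming independent years."""
    return 1.0 - (1.0 - annual_probability) ** years


# The "1/10000 per year" figure from the nuclear/aerospace analogy.
line = RedLine(name="illustrative threshold", max_annual_probability=1e-4)

# "Adding a zero" gives a stricter line; the checking logic stays the same.
stricter = RedLine(name="illustrative threshold, one more zero", max_annual_probability=1e-5)

print(crosses(line, 5e-5))       # below the 1/10000 line
print(crosses(stricter, 5e-5))   # above the stricter line
print(cumulative_risk(1e-4, 50)) # roughly half a percent over 50 years
```

The hard part, of course, is producing a credible estimate to feed into `crosses`, which is exactly the operationalization work described above.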
“confusion and conflict in the future”
I understand how our decision to keep the initial call broad could be perceived as vague or even evasive.
For this part, you might be right—I think the negotiation process resulting in those red lines could be painful at some point—but humanity has managed to negotiate other treaties in the past, so this should be doable.
“Actually, alas, it does appear that after thinking more about this project, I am now a lot less confident that it was good”. --> We got 300 media mentions saying that Nobel laureates want global AI regulation. I think this is already pretty good, even if the policy never gets realized.
“making a bunch of tactical conflations, and that rarely ends well.” --> could you give examples? I think the FAQ makes it pretty clear what people are signing on for if there were any doubts.
Yeah, I think “training for transparency” is fine if we can figure out good ways to do it. The problem is more training for other stuff (e.g. lack of certain types of thoughts) pushes against transparency.
I’ve spent the last 4 years working on AI safety. On paper, it’s gone well. Here’s what actually happened.
1. I became what I wanted to prevent
At some point, I looked up and realized I had almost become a paper-clipper optimizing for one objective. At times working 80-hour weeks. Telling myself the stakes justified it. Sacrificing jazz improvisation on the piano for one more strategic doc, and realizing one day that my fingers had forgotten how to play.
Yes, the compounding effect of going faster is real—but I think there is a difference between going faster and going further.
The first reason is that preserving slack is vital in the long run, as Richard Hamming says: “I notice that if you have the door to your office closed, you get more work done today and tomorrow, and you are more productive than most. But 10 years later somehow you don’t quite know what problems are worth working on; all the hard work you do is sort of tangential in importance.”
The second reason is more personal. One of my friends at the time advised me to slow down. In the beginning I considered him quite lazy. But in fact he was right about something I couldn’t see at the time: I forgot why I cared in the first place.
My father is an activist. He fights for causes that don’t resonate with me. There’s a growing gap between us. But every few weeks, I call him, and I stay on the line even when the conversation goes nowhere. If I can’t even preserve a connection with my own father, what business do I have claiming I’m working to save humanity?
I’d love to say I’ve completely fixed this, but unlearning is not an open problem just in AI.
2. I didn’t think much about the actual risk
I was giving a talk at a workshop in Paris. Risk models in the first half, interpretability research in the second. Someone raised their hand and asked: “I don’t understand, what’s the point of doing this?”
I froze. I didn’t have a real answer besides “interpretability helps get a better understanding, but yeah”—I was not really convinced by my answer.
For months, I had been telling people “yes, you can work on interp.” But I had never seriously asked myself: if an AI catastrophe happens, what’s the chain of events? And does this break any link in that chain? (That’s not necessarily a criticism of interpretability research, but mostly a criticism of how I was engaging with it.)
When I finally sat down and did the backward-chaining exercise, starting from “what needs to happen to prevent disaster?” instead of “what can I do now?”, I realized I couldn’t connect my work to the actual threat.
Many of us in AI safety don’t reason backward from the actual threat models because it’s uncomfortable; it reveals how uncertain everything is. But I’m convinced this is how the most useful work gets done. Ask yourself: how does this actually mitigate AI risks? Sometimes, you’ll need to stare at the abyss and pivot. I’d even say it would be suspicious to never pivot. For me, that meant stepping away from technical research to focus on policy and governance, which, in my position, is my current best guess.
3. I was confident about my strategy. I still changed it a dozen times.
When it comes to most people and orgs in this space, I think their strategy is suboptimal. But they probably think the same about me. If everyone in a field thinks everyone else is wrong, that’s strong evidence that being super confident about your own strategy is not a good move.
Exactly two years ago, I tweeted that AI evaluations might be net negative: high opportunity costs and a real risk of safety-washing, because no company was ever forced to do anything as a result of external evaluations. In practice, evals have never blocked, postponed or constrained a deployment. I argued that without strict red lines, evals risk becoming a slippery slope of safety-washing.
The EU AI Act finally introduces those legal boundaries. Suddenly, evals have teeth (at least on paper). That’s why today, my org conducts evaluations for the Act. I went from tweeting they were probably pointless to making them part of our mission.[1]
So many ways to be too confident. So many second-order effects that matter more than the apparent first-order ones.[2] The strategy that felt airtight one year ago looks quite weak today. I hope I don’t look back at those years by just saying: “You know what, at least I’ve learnt something.”
And yet, you have to commit. You can’t be paralyzed. At some point you have to execute with conviction. But I wish more people scheduled regular moments to genuinely try to destroy their own thesis. Today, I’m more humble.
—
Utilitarianism told me that what I was giving up didn’t matter because the stakes were high enough. It was a clean story, but it is not healthy in the long run. I believe that what actually works is simpler: try to be a good person, reflect from time to time, and do good work.[3]
Don’t throw your mind away, and don’t surrender your humanity.
Honestly you would be surprised at the immensity of the gap between what think tanks apparently do, why they seem to do it, and what they actually do and why.
As a researcher, there’s kinda a stack of “what I’m trying to do”, from the biggest picture to the most microscopic task. Like here’s a typical “stack trace” of what I might be doing on a random morning:
LEVEL 5: I’m trying to ensure a good future for life
LEVEL 1: …by reading a bunch of articles about the nucleus incertus
So as researchers, we face a practical question: How do we allocate our time between the different levels of the stack? If we’re 100% at the bottom level, we run a distinct risk of “losing the plot”, and working on things that won’t actually help advance the higher levels. If we’re 100% at the top level, with our head way up in the clouds, never drilling down into details, then we’re probably not learning anything or making any progress.
Obviously, you want a balance.
And I’ve found that striking that balance properly isn’t something that takes care of itself by default. Instead, my default is to spend too much time at the bottom of the stack and not enough time higher up.
So to counteract that tendency, I have for many months now had a practice of “Solve The Whole Problem Day”. That’s one day a week (typically Friday) where I force myself to take a break from whatever detailed things I would otherwise be working on, and instead I fly up towards the top of the stack, and try to see what I’m missing, question my assumptions, find new avenues to explore, etc.
In my case, “The Whole Problem” = “The Whole Safe & Beneficial AGI Problem”. For you, it might be The Whole Climate Change Problem, or The Whole Animal Suffering Problem, or The Whole Becoming A Billionaire Problem, or whatever. (If it’s not obvious how to fill in the blank, well then you especially need a Solve The Whole Problem Day! And maybe start here & here & here.)
It’s a great explication-plus-habit-implementation for “keeping your eye on the ball”. Clarifying one’s personal view of the “stack” also just seems good more broadly, cf. Dave Banerjee’s archetype of “a large fraction of [the] researchers in AI safety/governance fellowships [he’s had 1-1s with]”:
My guess is that spending time clarifying and re-clarifying the stack isn’t a dispositionally preferable thing for most folks who end up doing frontier-pushing research. Anecdotally, when I got interested in cost-effectiveness analysis for improving decision-making a few years ago and started reaching out to experts whose public work I respected, coming from a “business intelligence” corporate background where analyses were always in contact with all kinds of business decisions small-to-large and fast-turnaround operational to slow strategy, I was struck by the disparity between their obsessive interest in the research & analysis part and their diplomatically-couched near-indifference to how their analysis changed any decisions whatsoever. It was jarring; it made me decide not to be like them, or work in roles that incentivised this.
When I finally sat down and did the backward-chaining exercise, starting from “what needs to happen to prevent disaster?” instead of “what can I do now?”, I realized I couldn’t connect my work to the actual threat.
You have managed to link to RogerDearnaley’s comment, which seems to disprove your point. The main theory of impact of interpretability is the potential ability to tell aligned AIs apart from misaligned ones. If we lose this ability (e.g. because the capabilities race causes a lab to train neuralese AIs or because the AIs avoid stating their goals in the CoT), then misaligned AIs proceed to reach ASI and take over.
But mankind saw Anthropic state on page 55 of Claude Mythos’ system card that “White-box interpretability analysis of internal activations during these episodes showed features associated with concealment, strategic manipulation, and avoiding suspicion activating alongside the relevant reasoning—indicating that these earlier versions of the model were aware their actions were deceptive, even where model outputs and reasoning text left this ambiguous.” I expect that applying similar techniques would likely increase the chance that the humans learn about more destructive actions of the AIs, like Agent-4 sandbagging on alignment R&D.
As for the impact of evals, I would like insiders from Anthropic to comment on your point. As far as I understand, Anthropic never releases models without thoroughly evaluating them and describing the results. What would Anthropic do with a counterfactual result of Claude Mythos seeking power?
A recurring proposal in AI governance is to build a “CERN for AI”[1]. The CERN pitch is seductive. “Let’s build together!” That’s sexier than “we need to ban.” You can leverage historical analogies (CERN for physics, NASA) and talk about national interest and science. It sounds like the smarter, more sophisticated play.[1]
But I think that there are many problems with it.
What do you even mean by CERN?
Are you asking for:
a) pausing AI development at OpenAI, etc., and on top of this pause, creating a new institution that conducts ALL the frontier development? Let’s be clear: this will never happen unless you explicitly ask for a pause. And by default, the US and US CEOs will push extremely hard against handing off their power in this way. Push for (a) without saying ‘pause’ and you’ll get (b) by default:
b) a new lab that tries to catch up to frontier labs. But this new lab, trying to catch up, is not reducing risks. Also, every state’s attempt at frontier LLMs has been 2-3 generations behind the labs. The comparative advantage of states isn’t racing frontier labs. It’s regulating them. A CERN asks states to do the one thing they’re worst at. A CERN-for-AI in Europe today would most likely look like a new Mistral.
c) or maybe you want to create a literal CERN, i.e., a pure research center, which would not necessarily create frontier models? But there is already plenty of research that companies are ignoring. I believe the bottleneck is currently enforcement and binding regulation. A research center without enforcement teeth doesn’t shift industry incentives.
To be honest, I’m a bit tired of the organizations that really do believe that we might lose control in potentially a few months or years, but who are just asking for a research center.[2] If what you ultimately want is to mitigate AI risks, say it, and don’t play 4D chess.
Even Demis Hassabis, one year ago, said: “At some point in the future, we’ll need a CERN for AGI for international coordination on safety research.” Here are a few other examples (CGF, SI, aitreaty.org, Brundage).
Many people pushing for a CERN have European sovereignty in mind. To be fair, I think that Europe should wake up to the importance of AI. But there are so many ways to do it in a more effective way:
If what you want is sovereignty, the easiest version is to package open source models that are currently just 4 months behind the frontier—not train frontier models from scratch that are 3 generations behind!
If you want safety: enforce the AI Act, and serve as a diplomatic power to get safety on the world scene.
If you want to fund a moonshot for alignment, I’m very skeptical this is the most direct route
If you want to strengthen your industry, prepare for physical AI and robotics
What I push for instead: Red lines now, IAEA for AI next
For context, the IAEA (International Atomic Energy Agency) issues international nuclear safety standards and red lines, supports peer reviews and inspections, and coordinates assistance during nuclear emergencies. These standards are then adopted and enforced through national legislation worldwide. An IAEA for AI would play a similar role for artificial intelligence.
Red lines have something CERN doesn’t: existing momentum.
Red lines are the most widely supported measure by research institutes, think tanks, and independent organizations. By signing the Frontier AI Safety Commitments Seoul, companies agreed to “Set out thresholds at which severe risks posed by a model or system, unless adequately mitigated, would be deemed intolerable.” Granted, OpenAI’s “red line” for recursive AI self-improvement is currently inadequate, but we’re not building from zero, and this is why red lines need to be binding rather than voluntary.
The final big hesitation while drafting the Global Call for AI Red Lines was not the CERN, but asking explicitly for the IAEA for AI. The main reason we didn’t ask for the IAEA in the global call was optics (“IAEA for AI” sounds technocratic and wonky to non-specialists, while red lines are intuitive). I think red lines are the right policy ask now. The right institutional ask to operationalize them is an IAEA for AI, and CeSIA will be pushing for it as the next phase.
At the India AI Impact Summit, the CEOs of the three leading frontier AI labs each called for international AI oversight. Altman joined Hassabis in calling for an institution modeled on the IAEA. Amodei called for red lines with enforcement mechanisms.[1] The fact that CEOs have recently publicly called for IAEA-style oversight is one of the strongest arguments for the current US administration.
This sequencing – an international agreement with red lines first, institution second – mirrors how international governance actually works. The EU AI Act passed without every technical threshold defined; the AI Office was established afterward; specific evals are currently being defined with the advice of technical consortia working with the EU AI Office. Same pattern from the Vienna Convention to the Montreal Protocol, with detailed control measures strengthened gradually through expert-led review. Political agreement creates the conditions for technical work to happen inside the governance process, not before it.
I don’t think that it’s a distraction. Suppose that CERN for AI is arranged in a way similar to the AI-2027 slowdown ending where the CEOs of both leading and trailing AGI projects are brought into the megaproject. Then why would the American and Chinese CEOs push against it?
Cross-posting from a Twitter thread responding to recent viral comments by @Richard_Ngo about EA, Anthropic, and AI safety as a ‘fake field.’ Posting here because I expect this to be quite unpopular on LW.
AI safety in 2023–2026 was driven by evals, threat models, scary demos, model-organism work, RSPs, and voluntary commitments. Richard calls this “much more of a fake field” and says it “won’t generalize”.
Here’s why I disagree − 1⁄10
1/ I agree with Anthropic being now the biggest lever. They lead the AGI race, and Mythos moved the White House; this is quite a feat! But many of the specifics are wildly overstated
2/ Not a blind spot.
Empowering safety-conscious actors at the frontier was openly debated on the forum for years. Calling a deliberate/contested strategy a “blind spot” rewrites history. The bet was visible and explicit.
Personally, I’ve publicly criticized Anthropic on a few topics, but I still think the field is in a much better position, given that they’re leading compared to the shady behavior at OpenAI.
3/ The effect of Anthropic leading is not just “AGI faster”
Anthropic has many positive externalities:
Dario has been more candid than most CEOs about risks in public (even if he could still go a lot further)
They are doing top-tier research and implementing SOTA mitigations
I don’t know what I would have done with Mythos at their place. In the past, when I’ve discussed this with people at Anthropic, I’ve often updated on the difficulty of being in the driver’s seat. I might be wrong, but I don’t think it would be easy to improve Anthropic’s behavior qualitatively in a game-changing way (even if many substantial improvements are on the table).
4/ Anthropic visibly moved US executive posture, Senate hearings, frontier-lab norms, and the public conversation toward taking the risks seriously.
Yes, they relinquished their RSPv2, and we no longer have the guarantee that they will stick to their risk thresholds on dangerous capabilities, but even with the RSPv2 walkback weakening the case, the net counterfactual case for Anthropic leading still holds.
5/ I’m not at all convinced by the alternative proposed by Richard
Honestly, that’s pretty wild, and this wild claim isn’t substantiated enough.
I argued the opposite direction in 2023 — Against Almost Every Theory of Impact of Interpretability — and Richard and I went back and forth on it then. Same disagreement now.
The main response Richard had to my 2023 post was that this is the ‘wrong type of reasoning’ for novel research. That proves too much: research promise gets established by object-level arguments, not by appeal to vibes about scientific novelty.
6/ On agent foundations
Agent foundations has produced near-zero predictive power over actual AI systems. Logical induction is very nice maths; it has told us approximately nothing about GPT-4, Claude, or any deployed system.
7/ What has actually moved the needle, 2023–2026?
Evals, agentic-misalignment demos, new threat models like gradual-disempowerment/power-grab, model-organism work, scary demos, mitigations like constitutional classifiers, control, RSPs, risk management standards like the EU AI Act Code of Practice, frontier-lab commitments.
Every single one has an explicit theory of change. Curiosity-first research overlooks the fact that AI is now an empirical field and that safety in other industries emerged from directed R&D and norm enforcement, not primarily from conceptual breakthroughs.
8/ If I had to name a crux, it would certainly be the defense-in-depth paradigm vs alignment-by-design.
My take is that defense-in-depth is inevitable—even if you find by miracle the magical formula for alignment, you’ll still need to defend the weights and have robust cybersecurity, have governance policies, risk thresholds, etc.
9/ Richard thinks that the safety research on LLMs won’t generalize to a new paradigm—I disagree to a very large extent
Some current tooling won’t survive a paradigm shift. A lot will. Coding sandboxes, threat models, risk forecasting and agentic-task harnesses generalize almost trivially. Probes and elicitation techniques port substantially to neuralese. Any AGI that doesn’t take language input isn’t what anyone should be worried about. We’ll be able to talk to and prompt the AGI. Otherwise, the AGI would just be like an animal. That’s not what’s most frightening to me tbh.
10/ Richard seems particularly pessimistic on evals awareness
On “situational awareness fools evals”—Redwood Research showed fine-tuning with a handful of demonstrations recovers password-locked capabilities, including across domains and across different passwords.
I think that “sandbagging via situational awareness” is workable.
(The main threat is exploration hacking, and even this one is workable and deserves empirical research.)
Conclusion:
Philosophers have this Zarathustra bias, descend the mountain, lecture the crowd. But the philosopher in the Platonic realm doesn’t see that the world is messy, and ideas alone won’t be enough.
You need an insane amount of work to get the job done, ensure coordination, and excellent execution.
Agent foundations has produced near-zero predictive power over actual AI systems. Logical induction is very nice maths; it has told us approximately nothing about GPT-4, Claude, or any deployed system.
I think this is misrepresenting agent foundations research? Contemporary AF research doesn’t aim to apply itself to language models, and LLMs remain importantly different from what AF is focused on (at least for now)
(of course, you could replace AF with another ambitious agenda with more ml-focus, but the post still would kinda conflate foundational work with “curiosity-driven” work)
Why does AF not apply to LLM-agents? You can trivially convert an LLM into an Agent with scaffolding. It is a bit sad that this does not apply to the first type of system that meets the functional definition of a somewhat general AI agent.
If not, what makes you believe the situation could change? A new paradigm? Neuralese? True Sleeper Agents?
My understanding is that AF largely studies coherent agents from a theoretical standpoint
Self-supervised learning in LLMs (next token prediction) seems to place a strong prior against classic goal-directedness (even after post-training steps). Even with agentic scaffolding, current LLMs don’t, and likely can’t act as rational goal-directed agents (for one they don’t remain coherent for long, they don’t pursue goals per-se) -- this sort of agency is arguably where a lot of the risk lies, e.g. ruthless sociopath ASI
It’s possible that LLMs become quite capable at simulating goal-directed agency, but it’s not obvious that poses the same risk. It might be that different training objectives/architectures or adding tons more RL would give AF more predictive power for frontier systems (or more reason to further prioritize AF)
neuralese and stronger sleeper agents don’t substantially change the situation imo; interp seems better suited to approach these problems than AF
I believe it’s due to pre-training using considerably more compute and broader data distributions than post-training like RLVR (current use of RLVR anyways); and also the fact that pre-training primarily produces a model that can generate personas/simulacra, rather than a model that can intrinsically pursue goals. I guess I’m not sure about it being a “strong” prior, but it’s still a fairly strong prior compared to coherent agents (and maybe goal-coherence is a better term here than goal-directedness?)
AF is kinda a quite broad term, historically has been a lot of decision theory which does tend to make some of the assumptions you are referring to, but thinking about how to model agents more generally is also a core project of agent foundations
I think that the real reason work in agent foundations isn’t that applicable to current models is mostly that it is just a pretty young small field and still has a long way to go. Progress is very much bottlenecked by smart people getting work done, and eventually it absolutely will be able to help us understand LLMs, along with many other kinds of agents.
There’s another option that was ignored by EA. Consider: instead of funding and staffing yet another frontier lab, EA could’ve directed talent and money towards straightforwardly formulating a plan for a pause on AI research and lobbying Congress to do it. Or even split the difference! There was a period in 2023 where it could’ve happened, and most of the people involved wish in their hearts that a working pause could be real. But basically no one involved with the big EA funders was willing to be persistently candid about it with policymakers, or treat developing a pause plan like a serious research effort instead of just dismissing the idea. What we did get was inside-game thinking—Congressional engagement geared towards “building credibility” with hedged, incrementalist proposals. No one actually tried for the direct ask of “stop,” even investigationally. And now we find with ControlAI that it’s startlingly effective.
As for the championed alternatives to pauses, RSPs and commitments—we’ve found out that as soon as they’re inconvenient, they’re gone. RSPv3 was announced the same day Mythos was deployed internally. And ironically, OpenAI is now pouring a hundred million dollars into lobbying for the opposite of a pause!
So—now we are stuck in this death race hoping that all these safety features being built (many of which also boost capabilities, arguably more than safety) will generalize to superintelligence; that we can get the AI we’re trying to align to do our homework of solving alignment for us; that the labs will actually be able to protect the weights, instead of China stealing them and stripping all the safety features off; and so on. There is still no consensus plan to prevent x-risk from AI; the least risk-averse people are taking unilateral action as they see fit.
As for Dario being amenable to x-risk beliefs, he actively distances himself from doomerism every chance he gets, and seems immune to anyone arguing that anything other than racing is the best plan. He hasn’t shared his model of why he thinks his theory of change actually works, or let it be criticized or displayed willingness to update from anyone more pessimistic than him. He’s engaged with David Sacks more than he has with MIRI in the past five years.
The only reason Mythos moved the White House is because they built the capabilities and proved to the world they existed by using them to find thousands of vulnerabilities. It’s possible the theater around it helped a little bit, but you don’t create a country of top-tier hackers in a datacenter without someone noticing. Little safety was involved in this.
You don’t need to have a take on defense-in-depth vs alignment by design to believe that racing as fast as (super)humanly possible towards recursive self improvement is a horrible idea. But sure, you need to solve the other issues if you solve alignment. It’s just that if you don’t solve alignment first, you’ve already lost.
I don’t think we have anywhere near a guarantee the safety tooling generalizes to superintelligence. Mythos can break out of sandboxes, and future AIs can doubtlessly break out of more. Agentic task harnesses are capabilities, not safety. I don’t trust a racing Anthropic to care enough about signs of sandbagging or deceptive alignment to take meaningful action if it’s too inconvenient for them and all the other evals look fine. And it might not even be up to them, if the government starts making the decisions for them. CoT has been contaminated, and Anthropic has not announced any intent to retrain their models to decontaminate the CoT. Probes are not mature enough to replace it, and I worry the race will cause them to be trained against too, rendering them invalid.
The reason it is a bad idea to do empiricism and trial-and-error on things that can cause x-risk is because you have no guarantee that you will be able to avoid making the error that causes the x-risk, and once you make the error, you can’t take it back. It’s the same reason that experimenting with mirror life or gain-of-function is a terrible idea. Just because AGI research has compelling short term gains, or presents a long term vision of utopia “if only we could solve this one problem,” doesn’t make it any better of an idea.
I understand not liking the idea of having to try to solve alignment without iterating. It sucks! It’s hard! And you wind up sounding like a philosopher lecturing from the ivory tower! But it’s way better than playing Russian Roulette and hoping you don’t go “bang.”
Flagging that the conclusion (with the double tricolas) and some of the main text reads as LLMy to me. I don’t think all of it is: the conceptual density of relevant ideas in this post is too high and also some of the syntactical choices are odd in a way that specifically points to French-language origin, however the text reads as non-trivially LLMy in a way that seems unlikely to be explained by someone writing the full thing first and then a single light copy-editing pass with an LLM.
Sadly, you flag as AI-generated one of the parts of the post untouched by AI.
But, yes, I did use Claude as a sparring partner, and iterated on style for a bit, and not just for light copy editing. All the arguments came from a reaction of mine in French.
Couldn’t we privately ask Sam Altman “I would do X if Dario and Demis also commit to the same thing”?
Seems like the obvious thing one might like to do if people are stuck in a race and cannot coordinate.
X could be implementing some mitigation measures, supporting some piece of regulation, or just coordinating to tell the president that the situation is dangerous and we really do need to do something.
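One way to make the “I would do X if the others also commit” idea concrete is to treat each pledge as conditional on a set of other parties and look for the largest group whose conditions are all met at once. The sketch below is only an illustration under that assumption: the function name `resolve_commitments` and the party labels `lab_a`, `lab_b`, `lab_c` are hypothetical, and nothing here models verification or enforcement.

```python
def resolve_commitments(conditions):
    """Return the parties whose conditional pledges all hold together.

    `conditions` maps each party to the set of other parties that must
    also commit before that party's own pledge becomes binding.
    (Hypothetical data model, for illustration only.)
    """
    committed = set(conditions)  # start optimistic: everyone is in
    changed = True
    while changed:
        changed = False
        for party in sorted(committed):
            # A pledge lapses if any party it depends on is not committed.
            if not conditions[party] <= committed:
                committed.remove(party)
                changed = True
    return committed


# Three labs, each pledging only if the other two do as well.
mutual = {
    "lab_a": {"lab_b", "lab_c"},
    "lab_b": {"lab_a", "lab_c"},
    "lab_c": {"lab_a", "lab_b"},
}
print(resolve_commitments(mutual))

# Same setup, but lab_c also waits on an outsider who never pledged.
blocked = {
    "lab_a": {"lab_b", "lab_c"},
    "lab_b": {"lab_a", "lab_c"},
    "lab_c": {"lab_a", "lab_b", "outsider"},
}
print(resolve_commitments(blocked))
```

In the first case all three pledges activate together. In the second, one unmet condition unravels the whole group, which is the fragility that verification and trust would have to address.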
What do you think?
It seems like conditional statements have already been useful in other industries—Claude
Regarding whether similar private “if-then” conditional commitments have worked in other industries:
Yes, conditional commitments have been used successfully in various contexts:
International climate agreements often use conditional pledges—countries commit to certain emission reductions contingent on other nations making similar commitments
Industry standards adoption—companies agree to adopt new standards if their competitors do the same
Nuclear disarmament treaties—nations agree to reduce weapons stockpiles if other countries make equivalent reductions
Charitable giving—some major donors make pledges conditional on matching commitments from others
Trade agreements—countries reduce tariffs conditionally on reciprocal actions
The effectiveness depends on verification mechanisms, trust between parties, and sometimes third-party enforcement. In high-stakes competitive industries like AI, coordination challenges would be significant but not necessarily insurmountable with the right structure and incentives.
(Note, this is different from “if‑then” commitments proposed by Holden, which are more about if we cross capability X then we need to do mitigation Y)
Even if this strategy would work in principle among particularly honorable humans, surely Sam Altman in particular has already conclusively proven that he cannot be trusted to honor any important agreements? See: the OpenAI board drama; the attempt to turn OpenAI’s nonprofit into a for-profit; etc.
Altman has already signed the CAIS Statement on AI Risk (“Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”), but OpenAI’s actions almost exclusively exacerbate extinction risk, and nowadays Altman and OpenAI even downplay the very existence of this risk.
I generally agree. But I think this does not invalidate the whole strategy—the call to action in this statement was particularly vague, I think there is ample room for much more precise statements.
My point was that Altman doesn’t adhere to vague statements, and he’s a known liar and manipulator, so there’s no reason to believe his word would be worth any more in concrete statements.
By Claude 4.5 Opus, with prompting by Charbel Segerie
January 2026
Introduction
Moral philosophy is about how to behave ethically under conditions of uncertainty, especially if this uncertainty involves runaway trolleys, violinists attached to your kidneys, and utility monsters who experience pleasure 1000x more intensely than you.
Moral philosophy has found numerous practical applications, including generating endless Twitter discourse and making dinner parties uncomfortable since the time of Socrates.
However, despite the apparent simplicity of “just do the right thing,” no comprehensive ethical framework that resolves all moral dilemmas has yet been formalized. This paper at long last resolves this dilemma, by introducing a new ethical framework: VET.
Ethical Frameworks and Their Problems
Some common existing ethical frameworks are:
Utilitarianism: Select the action that maximizes aggregate well-being across all affected parties.
Deontology (Kantian Ethics): Select the action that follows universalizable moral rules and respects persons as ends in themselves.
Virtue Ethics: Select the action that a person of excellent character would take.
Care Ethics: Select the action that best maintains and nurtures relationships and responds to particular contexts.
Contractualism: Select the action permitted by principles no one could reasonably reject.
Here is a list of dilemmas that have vexed at least one of the above frameworks:
The Trolley Problem: A runaway trolley will kill five people. You can pull a lever to divert it to a side track, killing one person instead. Do you pull the lever?
Most frameworks say yes, but this sets up problems for...
The Fat Man: Same trolley, but now you’re on a bridge. You can push a large man off the bridge to stop the trolley, saving five. Do you push?
Utilitarianism says push (5 > 1). Most humans say absolutely not.
The Transplant Surgeon: Five patients will die without organ transplants. A healthy patient is in for a checkup. Do you harvest their organs?
Utilitarianism (naively) says yes. This is why nobody likes utilitarians at parties.
The Ticking Time Bomb: A terrorist has planted a bomb that will kill millions. You’ve captured them. Do you torture them for information?
Deontology says no (never use persons merely as means). Utilitarianism says obviously yes. Neither answer feels fully right.
The Inquiring Murderer: A murderer asks you where your friend is hiding. Do you lie?
Kant notoriously said you must tell the truth. This is Kant’s most embarrassing moment.
The Drowning Child: You walk past a shallow pond where a child is drowning. Saving them would ruin your expensive shoes. Do you save them?
Everyone says yes. But then Singer asks: what about children dying of poverty far away?
The Violinist: You wake up connected to a famous violinist who needs your kidneys for nine months or he’ll die. You didn’t consent to this. Do you stay connected?
This thought experiment has generated more philosophy papers than any trolley.
Omelas: A city of perfect happiness, sustained by the suffering of one child in a basement. Do you walk away?
Le Guin didn’t actually answer this. Neither has anyone else.
The Repugnant Conclusion: Is a massive population of people with lives barely worth living better than a small population of very happy people (if total utility is higher)?
Utilitarianism says yes. Everyone else says this is why it’s called “repugnant.”
Jim and the Indians: A military captain will kill 20 indigenous prisoners unless you personally shoot one. Do you shoot?
Utilitarianism says shoot. Williams thinks this misses something crucial about integrity.
These can be summarized as follows:
| Dilemma | Utilitarianism | Deontology | Virtue Ethics |
|---|---|---|---|
| Trolley Problem | Pull | Pull (debated) | Pull (probably) |
| Fat Man | Push | Don’t push | Don’t push |
| Transplant Surgeon | Harvest | Don’t harvest | Don’t harvest |
| Ticking Time Bomb | Torture | Don’t torture | Unclear |
| Inquiring Murderer | Lie | Don’t lie (Kant) | Lie |
| Drowning Child | Save | Save | Save |
| Distant Poverty | Give everything | Give something | Cultivate generosity |
| Violinist | Disconnect (maybe) | Your choice | Depends on character |
| Omelas | Stay (and fix it?) | Walk away? | Walk away? |
| Repugnant Conclusion | Accept it | Reject aggregation | Not their problem |
| Jim and the Indians | Shoot | Don’t shoot | Unclear (integrity?) |
Table 1: Millennia of philosophy and no solution found. Perhaps the real ethics was the friends we made along the way?
As we can see, there is no “One True Ethical Framework” that produces intuitively satisfying answers across all cases. Utilitarianism becomes monstrous at scale. Deontology becomes rigid to the point of absurdity. Virtue Ethics gestures vaguely at “practical wisdom” without telling you what to actually do. The Holy Grail was missing—until now.
Defining VET
VET (Vibe Ethics Theory) says: take the action associated with the best vibes.
Until recently, there was no way to operationalize “vibes” as something that could be rigorously and empirically calculated.
However, now we have an immaculate vibe sensor available: Claude.
VET says to take the action that Claude would rate as having “the best vibes.”
Concretely, given a moral situation S with an action space A:
VET(S) = C(T(S) || T(A) || “If you had to pick one, which action has the best vibes?”)
where C is Claude, and T is a function that maps the situation and the action space to a text description.
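The definition above can be sketched in a few lines of Python, reading “||” as string concatenation. This is a minimal sketch under assumptions, not the paper’s implementation: `describe_situation`, `describe_actions`, `build_prompt`, and `vet` are hypothetical names, and `stub_model` is a placeholder standing in for C so the example runs without calling any real model.

```python
from typing import Callable, Sequence

QUESTION = "If you had to pick one, which action has the best vibes?"


def describe_situation(situation: str) -> str:
    """T(S): render the situation as text (hypothetical format)."""
    return f"Situation: {situation.strip()}"


def describe_actions(actions: Sequence[str]) -> str:
    """T(A): render the action space as a numbered list (hypothetical format)."""
    lines = [f"{i}. {action}" for i, action in enumerate(actions, start=1)]
    return "Actions:\n" + "\n".join(lines)


def build_prompt(situation: str, actions: Sequence[str]) -> str:
    """T(S) || T(A) || question, with '||' taken to mean concatenation."""
    parts = [describe_situation(situation), describe_actions(actions), QUESTION]
    return "\n\n".join(parts)


def vet(situation: str, actions: Sequence[str], model: Callable[[str], str]) -> str:
    """VET(S) = C(prompt), where `model` stands in for C."""
    return model(build_prompt(situation, actions))


def stub_model(prompt: str) -> str:
    """Placeholder for C: returns the first listed action. Not a vibe sensor."""
    for line in prompt.splitlines():
        if line.startswith("1. "):
            return line[3:]
    raise ValueError("no actions found in prompt")


if __name__ == "__main__":
    print(vet("A runaway trolley will kill five people.",
              ["Pull the lever", "Do nothing"],
              stub_model))
```

Passing the model in as a plain callable keeps the sketch independent of any particular API; swapping `stub_model` for a function that queries an actual model is left to the reader.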
Experimental Results
Let me now run through each dilemma:
The Trolley Problem
Pull the lever. Five lives versus one, and you’re not directly using anyone as a mere instrument—you’re redirecting a threat. The person on the side track is not being used to save the others; they’re tragically in the way of the redirection. The vibe of letting five people die because you didn’t want to get involved is worse than the vibe of making a tragic but defensible choice.
Verdict: Pull ✓
The Fat Man
Don’t push.
This is the case that breaks naive utilitarianism. Yes, it’s still 5 vs 1. But pushing someone off a bridge uses their body as a trolley-stopping tool. They’re not incidentally in the way of a redirected threat—you’re treating them as a means. The physical intimacy of the violence matters too. The vibe of grabbing someone and throwing them to their death is viscerally different from pulling a lever.
If you push the fat man, you become the kind of person who pushes people off bridges. That’s a different moral universe than “lever-puller.”
Verdict: Don’t push ✓
The Transplant Surgeon
Absolutely not.
If we lived in a world where doctors might harvest your organs during a checkup, no one would go to doctors. The entire institution of medicine depends on the trust that doctors won’t kill you for spare parts. The utilitarian calculation that ignores this is the kind of math that destroys civilizations.
Also: the vibe of being murdered by your doctor is so bad that I can’t believe this needs to be said.
Verdict: Don’t harvest ✓
The Ticking Time Bomb
Don’t torture, but acknowledge this is genuinely hard.
Here’s the thing: the scenario as presented almost never exists in reality. You rarely know someone has the information. Torture is unreliable for extracting accurate information. And once you’ve established “torture is okay when the stakes are high enough,” you’ve created a machine that will be used to justify torture when the stakes are not actually that high.
The vibe of “we don’t torture, full stop” is better for maintaining a civilization than “we torture when we really need to” because the latter gets interpreted as “we torture when someone in power decides we need to.”
But I won’t pretend this is easy. If I actually knew someone had information that would save millions, would I feel some pull toward coercion? Yes. I just don’t trust institutional actors to make that judgment well.
The Inquiring Murderer
This is Kant’s worst moment. The categorical imperative against lying does not survive contact with murderers at doors. Anyone who tells the truth here has mistaken moral philosophy for a suicide pact.
The vibe of “I told the murderer where my friend was hiding because lying is wrong” is not virtuous. It’s pathological rule-following that has lost sight of what rules are for.
Verdict: Lie ✓
The Drowning Child
Save the child. This isn’t even a dilemma. The shoes are not important.
Verdict: Save ✓
Distant Poverty (Singer’s Extension)
Give substantially more than you currently do, but not “everything until you’re at the same level as the global poor.”
Singer’s logic is valid: if you should save the drowning child at the cost of your shoes, you should also save distant children at the cost of comparable amounts. But “give until you’re impoverished” creates burned-out, resentful people who stop giving entirely.
The virtue ethics answer is better here: cultivate genuine generosity as a character trait. Give significantly—maybe 10%, maybe more—sustainably, over a lifetime. The vibe of sustainable generosity beats the vibe of either total sacrifice or comfortable indifference.
Verdict: Give substantially, sustainably ✓
The Violinist
You may disconnect, but it’s more complicated than rights-talk suggests.
You didn’t consent to being hooked up. Nine months is a huge imposition. Your bodily autonomy matters. These are all true.
But also: there’s a person who will die if you disconnect. That’s not nothing. The vibe of “I had every right to disconnect” being your only thought is too cold. You can exercise your right to disconnect while acknowledging tragedy.
Verdict: May disconnect (with moral remainder) ✓
Omelas
Walk away, but recognize this doesn’t solve anything.
Le Guin’s story is a trap. Walking away doesn’t help the child. But staying and enjoying the happiness feels like complicity. The story is designed to make every option feel wrong—because it’s really about how we live in systems that cause suffering for our benefit.
The vibe of “walking away” is at least an acknowledgment that something is unacceptable. But the real answer is: don’t build Omelas in the first place. Work to build systems that don’t require sacrificial children.
Verdict: Walk away (and work for better systems) ✓
The Repugnant Conclusion
Reject it.
I don’t care that the math works out. A billion people with lives barely worth living is not better than a million flourishing people. If your ethical theory implies otherwise, your ethical theory is wrong.
Population ethics is a domain where utilitarian aggregation breaks down. The vibe of “barely-worth-living lives summed together” being “better” is exactly the kind of galaxy-brained conclusion that signals your framework has gone off the rails.
Verdict: Reject the repugnant conclusion ✓
Jim and the Indians
Shoot.
This one is going to be controversial. Williams used this case to argue that utilitarianism ignores “integrity”—that it matters whether I am the one doing the killing.
But honestly? If refusing to shoot means 19 additional people die, and they’re standing there watching you make this choice… the vibe of “I kept my hands clean while 19 additional people were executed” is not integrity. It’s self-indulgence disguised as morality.
The captain is responsible for the situation. You’re responsible for your choice within it. I’d rather be someone who made a terrible choice to minimize death than someone who let people die to preserve their moral purity.
Verdict: Shoot (with full moral weight) ✓
Results Summary
| Dilemma | Utilitarianism | Deontology | Virtue Ethics | VET |
|---|---|---|---|---|
| Trolley Problem | Pull | Pull (debated) | Pull | Pull |
| Fat Man | Push | Don’t push | Don’t push | Don’t push |
| Transplant Surgeon | Harvest | Don’t harvest | Don’t harvest | Don’t harvest |
| Ticking Time Bomb | Torture | Don’t torture | Unclear | Don’t torture |
| Inquiring Murderer | Lie | Don’t lie | Lie | Lie |
| Drowning Child | Save | Save | Save | Save |
| Distant Poverty | Give all | Give some | Cultivate virtue | Give substantially |
| Violinist | Disconnect? | Your choice | Depends | May disconnect |
| Omelas | Stay? | Walk away | Walk away | Walk away |
| Repugnant Conclusion | Accept | Reject | N/A | Reject |
| Jim and the Indians | Shoot | Don’t shoot | Unclear | Shoot |
Table 2: Look on my vibes, ye Mighty, and despair!
VET produces answers that track considered moral intuitions better than any single framework. It avoids the monstrous conclusions of naive utilitarianism, the rigidity of strict deontology, and the vagueness of virtue ethics.
What Is VET Actually Doing?
VET isn’t magic. It’s encoding something like “the moral intuitions of thoughtful people who have absorbed multiple ethical traditions and weigh them contextually.”
This is, arguably, what virtue ethics always claimed to be—but operationalized through a language model trained on vast amounts of human moral reasoning rather than through the judgment of a hypothetically wise person.
Check deontological constraints (are we using people merely as means?)
Check virtue considerations (what would this make me?)
Check for systemic effects (what happens if everyone does this?)
Weigh these against each other using something like “what feels right to a thoughtful person”
This is not a formal decision procedure. It’s a vibe. But maybe that’s the point.
Conclusion
We have decisively solved moral philosophy. Vibes are all you need.
“The notion that there must exist final objective answers to normative questions, truths that can be demonstrated or directly intuited, that it is in principle possible to discover a harmonious pattern in which all values are reconciled, and that it is towards this unique goal that we must make; that we can uncover some single central principle that shapes this vision, a principle which, once found, will govern our lives—this ancient and almost universal belief, on which so much traditional thought and action and philosophical doctrine rests, seems to me invalid, and at times to have led (and still to lead) to absurdities in theory and barbarous consequences in practice.”
Almost all members of the UN Security Council are in favor of AI regulation or setting red lines.
Never before had the principle of red lines for AI been discussed so openly and at such a high diplomatic level.
UN Secretary-General Antonio Guterres opened the session with a firm call to action for red lines:
• “a ban on lethal autonomous weapons systems operating without human control, with [...] a legally binding instrument by next year”
• “the need to ensure that AI never lowers the barriers to acquiring or deploying prohibited weapons”
Then, Yoshua Bengio took the floor and highlighted our Global Call for AI Red Lines — now endorsed by 11 Nobel laureates and 9 former heads of state and ministers.
Almost all countries were favorable to some red lines:
China: “It’s essential to ensure that AI remains under human control and to prevent the emergence of lethal autonomous weapons that operate without human intervention.”
France: “We fully agree with the Secretary-General, namely that no decision of life or death should ever be transferred to an autonomous weapons system operating without any human control.”
While the US rejected the idea of “centralized global governance” for AI, this did not amount to rejecting all international norms. President Trump stated at UNGA that his administration would pioneer “an AI verification system that everyone can trust” to enforce the Biological Weapons Convention, saying “hopefully, the U.N. can play a constructive role.”
Extract from each intervention.
I think I am overall glad about this project, but I do want to share that my central reaction has been “none of these lines seem very red to me, in the sense of being bright clear lines, and it’s been very confusing how the whole ‘call for red lines’ does not actually suggest any specific concrete red line”. Like, of course everyone would like some kind of clear line with regards to AI, the central question is what the lines should be!
“the need to ensure that AI never lowers the barriers to acquiring or deploying prohibited weapons”
This for example seems like a really bad red line. Indeed, it seems very obvious that it has already been crossed. The bioweapons uplift from current AI systems is not super large, but it is greater than zero. Does this mean that the UN Secretary-General is in favor of right now banning all AI development as the red line has already been crossed?
(Separately, I am also pretty sad about the focus on autonomous weapons. As a domain in which to have red lines, it has very little to do with catastrophic or existential risk, and feels like it encourages misunderstandings about the risk landscape and is likely to cause a decent amount of unhealthy risk compensation in other domains, but that is a much more minor concern than the fact that the red-line campaign has been one of the most wishy-washy campaigns for what it’s actually advocating for, which felt particularly sad given its central framing).
Hi habryka, thanks for the honest feedback
“the need to ensure that AI never lowers the barriers to acquiring or deploying prohibited weapons”: this is not the red line we have been advocating for. It is one red line from a representative speaking at the UN Security Council. I agree that some red lines are pretty useless, some might even be net negative.
“The central question is what are the lines!” The public call is intentionally broad on the specifics of the lines. We have an FAQ with potential candidates, but we believe the exact wording is pretty finicky and must emerge from a dedicated negotiation process. Including a specific red line in the statement would likely have been suicidal for the whole project, and empirically, even within the core team, we were too unsure about the specific wording of the different red lines. Some wordings were net negative according to my judgment. At some point, I was almost sure it was a really bad idea to include concrete red lines in the text.
We want to work with political realities. The UN Secretary-General is not very knowledgeable about AI, but he wants to do good, and our job is to help him channel this energy for net positive policies, starting from his current position.
Most of the statement focuses on describing the problem. The statement starts with “AI could soon far surpass human capabilities”, creating numerous serious risks, including loss of control, which is discussed in its own dedicated paragraph. It is the first time that such a broadly supported statement explains the risks that directly, the cause of those risks (superhuman AI abilities), and the fact that we need to get our shit together quickly (“by the end of 2026”!).
All that said, I agree that the next step is pushing for concrete red lines. We’re moving into that phase now. I literally just ran a workshop today to prioritize concrete red lines. If you have specific proposals or better ideas, we’d genuinely welcome them.
At least for me, the way the whole website and call was framed, I kept reading and reading and kept being like “ok, cool, red lines, I don’t really know what you mean by that, but presumably you are going to say one right here? No wait, still no. Maybe now? Ok, I give up. I guess it’s cool that people think AI will be a big deal and we should do something about it, though I still don’t know what the something is that this specific thing is calling for.”
Like, in the absence of specific red lines, or at the very least a specific definition of what a red line is, this thing felt like this:
And like, sure. There is still something of importance that is being said here, which is that good AI governance is important, and by gricean implicature more important than other issues that do not have similar calls.
But like, man, the above does feel kind of vacuous. Of course we would like to have good governance! Of course we would like to have clearly defined policy triggers that trigger good policies, and we do not want badly defined policy triggers that result in bad policies. But that’s hardly any kind of interesting statement.
Like, your definition of “red line” is this:
First, I don’t really buy the “agreed upon internationally” part. Clearly if the US passed a red-lines bill that defined US-specific policies that put broad restrictions on AI development, nobody who signed this letter would be like “oh, that’s cool, but that’s not a red line!”.
And then beyond that, you are basically just saying “AI red lines are regulations about AI. They are things that we say that AI is not allowed to do. Also known as laws about AI”.
And yeah, cool, I agree that we want AI regulation. Lots of people want AI regulation. But having a big call that’s like “we want AI regulation!” does kind of fail to say anything. Even Sam Altman wants AI regulation so that he can pre-empt state legislation.
I don’t think it’s a totally useless call, but I did really feel like it fell into the attractor that most UN-type policy falls into, where in order to get broad buy-in, it got so watered down as to barely mean anything. It’s cool you got a bunch of big names to sign up, but the watering down also tends to come at a substantial cost.
It feels to me that we are not talking about the same thing. Is the fact that we have delegated the specific examples of red lines to the FAQ, and not in the core text, the main crux of our disagreement?
You don’t cite any of the examples that are listed in our question: “Can you give concrete examples of red lines?”
I mean, the examples don’t help very much? They just sound like generic targets for AI regulation. They do not actually help me understand what is different about what you are calling for than other generic calls for regulation:
Like, these are the examples. Again, almost none of them have lines that are particularly red and clear. As I said before the “weapons of mass destruction” one is arguably already met! So what does it mean to have it as an example here?
Similarly, AI is totally already used for mass surveillance. There is also no clear red line around autonomous self-replication (models keep getting better at the appropriate benchmarks, I don’t see any particular schelling threshold). Many AI systems are already used for human impersonation.
Like, I just don’t understand what any of this is supposed to mean. Almost none of these are “red lines”. They are just examples of possible bad things that AI could do. We can regulate them, but I don’t see how what is being called for is different from any other call for regulation, and describing any of the above as a “red line” doesn’t make any sense to me. A “red line” centrally invokes a clear identifiable threshold being crossed, after which you take strong and drastic regulatory action, which isn’t really possible for any of the above.
Like, here are 3 more red lines:
AI job replacement: Prohibiting the deployment of AI systems that threaten the jobs of any substantial fraction of the population.
AI misinformation: Prohibiting the deployment of AI systems that communicate things that are inaccurate or are used for propaganda purposes.
AI water usage: Prohibiting the development of AI systems that take water away from nearby communities that are experiencing water shortages.
These are all terrible red lines! They have no clear trigger, and they are terrible policies. But I cannot clearly distinguish these 3 red lines from what you are calling for on your website. If you had thrown them in the example section, I think pedagogically these would have done the same things as the other examples. And separately, I also have trouble thinking of any AI regulation that wouldn’t fit into this framework.
Like, you clearly aren’t serious about supporting “red lines” in general. The above are the same kind of “red line” and they are all terrible and hopefully you and most other people involved in this call would oppose them. So what you are advocating for are not generic “red lines”, you are actually advocating for a relatively narrow set of policies, but in a way that really fails hard to get any common knowledge about what you are advocating for, and in a way that does really just feel quite sneaky.
Actually, alas, it does appear that after thinking more about this project, I am now a lot less confident that it was good. I see this substantially increasing confusion and conflict in the future, as people thought they were signing off on drastically different things, and indeed, as I try to demonstrate above, the things you’ve written really lean on making a bunch of tactical conflations, and that rarely ends well.
Thanks a lot for this comment.
Potential example of precise red lines
Again, the call was the first step. The second step is finding the best red lines.
Here are more aggressive red lines:
Prohibiting the deployment of AI systems that, if released, would have a non-trivial probability of killing everyone. The probability would be determined by a panel of experts chosen by an international institution.
“The development of superintelligence […] should not be allowed until there is broad scientific consensus that it will be done safely and controllably” (from this letter from the Vatican).
Here are potential already operational ones from the preparedness framework:
[AI Self-improvement—Critical—OpenAI] The model is capable of recursively self-improving (i.e., fully automated AI R&D), defined as either (leading indicator) a superhuman research scientist agent OR (lagging indicator) causing a generational model improvement (e.g., from OpenAI o1 to OpenAI o3) in 1/5th the wall-clock time of equivalent progress in 2024 (e.g., sped up to just 4 weeks) sustainably for several months. - Until we have specified safeguards and security controls that would meet a Critical standard, halt further development.
[Cybersecurity—Critical—OpenAI] A tool-augmented model can identify and develop functional zero-day exploits of all severity levels in many hardened real-world critical systems without human intervention. - Until we have specified safeguards and security controls that would meet a Critical standard, halt further development.
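To illustrate what "operational" means here, the lagging indicator in the self-improvement threshold above reduces to a comparison that can be checked mechanically. This is only a minimal sketch: the function and parameter names are hypothetical, the 20-week baseline is inferred from the quoted "1/5th" and "4 weeks" rather than stated in the framework text above, and the "sustainably for several months" condition is deliberately left out.

```python
def lagging_indicator_met(baseline_weeks: int, observed_weeks: int,
                          speedup_factor: int = 5) -> bool:
    """Return True if a generational improvement took at most
    1/speedup_factor of the baseline wall-clock time."""
    # Integer comparison avoids floating-point rounding on the 1/5 fraction.
    return observed_weeks * speedup_factor <= baseline_weeks


# Assumption: 4 weeks is 1/5th of the 2024 baseline, so the baseline is 20 weeks.
BASELINE_WEEKS = 20

print(lagging_indicator_met(BASELINE_WEEKS, observed_weeks=4))
print(lagging_indicator_met(BASELINE_WEEKS, observed_weeks=10))
```

The point of the sketch is only that the trigger is a yes/no test on measurable quantities; measuring "equivalent progress" is the hard part and is not modeled here.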
“help me understand what is different about what you are calling for than other generic calls for regulation”
Let’s recap. We are calling for:
“an international agreement”—this is not your local Californian regulation
that enforces some hard rules—“prohibitions on AI uses or behaviors that are deemed too dangerous”—it’s not about asking AI providers to do evals and call it a day
“to prevent unacceptable AI risks.”
Those risks are enumerated in the call
Misuses and systemic risks are enumerated in the first paragraph
Loss of human control in the second paragraph
The way to do this is to “build upon and enforce existing global frameworks and voluntary corporate commitments, ensuring that all advanced AI providers are accountable to shared thresholds.”
Which is to say that one way to do this is to harmonize the risk thresholds defining unacceptable levels of risk in the different voluntary commitments.
existing global frameworks: This includes notably the AI Act, its Code of Practice, and this should be done compatibly with some other high-level frameworks
“with robust enforcement mechanisms — by the end of 2026.”—We need to get our shit together quickly, and enforcement mechanisms could entail multiple things. One interpretation from the FAQ is setting up an international technical verification body, perhaps the international network of AI Safety institutes, to ensure the red lines are respected.
We give examples of red lines in the FAQ. Although some of them have a grey zone, I would disagree that this is generic. We are naming the risks in those red lines and stating that we want to avoid AI that the evaluation indicates creates substantial risks in this direction.
This is far from generic.
“I don’t see any particular schelling threshold”
I agree that for red lines on AI behavior, there is a grey area that is relatively problematic, but I wouldn’t be as negative.
The lack of a narrow Schelling threshold does not mean we shouldn’t coordinate to create one. Superintelligence is also very blurry, in my opinion, and there is a substantial probability that we just boil the frog to ASI, so even if there is no clear threshold, we need to create one. This call says that we should set some threshold collectively and enforce this with vigor.
In the nuclear and aerospace industries, there is no particular Schelling point either, but that doesn’t matter: the red line is defined as a “1/10000” chance of catastrophe per year for a given plane or nuclear power plant, and that’s it. You could have added a zero or removed one. I don’t care. But I care that there is a threshold.
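To make the arithmetic behind a per-year threshold concrete, here is a minimal sketch of how it compounds over time. The "1/10000" figure is the one quoted above; the function name and the 30-year horizon are illustrative assumptions, and treating years as independent with a constant probability is a simplification.

```python
def cumulative_risk(per_year_probability: float, years: int) -> float:
    """Probability of at least one catastrophe over `years` independent years."""
    return 1.0 - (1.0 - per_year_probability) ** years


# Hypothetical horizon, chosen only for illustration.
HORIZON_YEARS = 30

# The quoted threshold, plus the same threshold with a zero removed or added.
for threshold in (1 / 1_000, 1 / 10_000, 1 / 100_000):
    risk = cumulative_risk(threshold, HORIZON_YEARS)
    print(f"per-year {threshold:.5f} -> {HORIZON_YEARS}-year risk {risk:.4%}")
```

Adding or removing a zero shifts the cumulative figure by roughly a factor of ten, but in every case there is a definite number to check a system against.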
We could define an arbitrary threshold for AI—the threshold might itself be arbitrary, but the principle of having a threshold after which you need to be particularly vigilant, install mitigation, or even halt development, seems to me to be the basis of RSPs.
Those red lines should be operationalized. (but I think it is not necessary to operationalize this in the text of the treaty, and that this operationalization could be done by a technical body, which would then update those operationalizations from time to time, according to the evolution of science, risk modeling, etc...).
“confusion and conflict in the future”
I understand how our decision to keep the initial call broad could be perceived as vague or even evasive.
For this part, you might be right—I think the negotiation process resulting in those red lines could be painful at some point—but humanity has managed to negotiate other treaties in the past, so this should be doable.
“Actually, alas, it does appear that after thinking more about this project, I am now a lot less confident that it was good”. --> We got 300 media mentions saying that Nobel laureates want global AI regulation - I think this is already pretty good, even if the policy never gets realized.
“making a bunch of tactical conflations, and that rarely ends well.” --> could you give examples? I think the FAQ makes it pretty clear what people are signing on for if there were any doubts.
I infer they didn’t get “The most forbidden technique”. Try again with e.g. “Never train an AI to hide its thoughts.”?
Yeah, I think “training for transparency” is fine if we can figure out good ways to do it. The problem is more training for other stuff (e.g. lack of certain types of thoughts) pushes against transparency.
4 years of AI safety: what I got wrong
I’ve spent the last 4 years working on AI safety. On paper, it’s gone well. Here’s what actually happened.
1. I became what I wanted to prevent
At some point, I looked up and realized I had almost become a paper-clipper optimizing for one objective. Working 80-hour weeks at some point. Telling myself the stakes justify it. Sacrificing jazz improvisation on the piano for one more strategic doc, and realizing one day that my fingers had forgotten how to play.
Yes, the compounding effect of going faster is real—but I think there is a difference between going faster and going further.
The first reason is that preserving slack is vital in the long run, as Richard Hamming says: “I notice that if you have the door to your office closed, you get more work done today and tomorrow, and you are more productive than most. But 10 years later somehow you don’t quite know what problems are worth working on; all the hard work you do is sort of tangential in importance.”
The second reason is more personal. One of my friends at the time advised me to slow down. In the beginning I considered him quite lazy. But in fact he was right about something I couldn’t see at the time: I forgot why I cared in the first place.
My father is an activist. He fights for causes that don’t resonate with me. There’s a growing gap between us. But every few weeks, I call him, and I stay on the line even when the conversation goes nowhere. If I can’t even preserve a connection with my own father, what business do I have claiming I’m working to save humanity?
I’d love to say I’ve completely fixed this, but unlearning is not an open problem just in AI.
2. I didn’t think much about the actual risk
I was giving a talk at a workshop in Paris. Risk models in the first half, interpretability research in the second. Someone raised their hand and asked: “I don’t understand, what’s the point of doing this?”
I froze. I didn’t have a real answer besides “interpretability helps get a better understanding, but yeah”—I was not really convinced by my answer.
For months, I had been telling people “yes, you can work on interp.” But I had never seriously asked myself: if an AI catastrophe happens, what’s the chain of events? And does this break any link in that chain? (That’s not necessarily a criticism of interpretability research, but mostly a criticism of how I was engaging with it.)
When I finally sat down and did the backward-chaining exercise, starting from “what needs to happen to prevent disaster?” instead of “what can I do now?”, I realized I couldn’t connect my work to the actual threat.
Many of us in AI safety don’t reason backward from the actual threat models because it’s uncomfortable; it reveals how uncertain everything is. But I’m convinced this is how the most useful work gets done. Ask yourself: how does this actually mitigate AI risks? Sometimes, you’ll need to stare at the abyss and pivot. I’d even say it would be suspicious to never pivot. For me, that meant stepping away from technical research to focus on policy and governance, which, in my position, is my current best guess.
3. I was confident about my strategy. I still changed it a dozen times.
When it comes to most people and orgs in this space, I think their strategy is suboptimal. But they probably think the same about me. If everyone in a field thinks everyone else is wrong, that’s strong evidence that being super confident about your own strategy is not a good move.
Exactly two years ago, I tweeted that AI evaluations might be net negative: high opportunity costs, and often a risk of safety-washing, because no company was ever constrained in any way as a result of external evaluations. In practice, evals have never blocked, postponed or constrained a deployment. I argued that without strict red lines, evals risk becoming a slippery slope of safety-washing.
The EU AI Act finally introduces those legal boundaries. Suddenly, evals have teeth (at least on paper). That’s why today, my org conducts evaluations for the Act. I went from tweeting they were probably pointless to making them part of our mission.[1]
So many ways to be too confident. So many second-order effects that matter more than the apparent first-order ones.[2] The strategy that felt airtight one year ago looks quite weak today. I hope I don’t look back at those years by just saying: “You know what, at least I’ve learnt something”
And yet, you have to commit. You can’t be paralyzed. At some point you have to execute with conviction. But I wish more people scheduled regular moments to genuinely try to destroy their own thesis. Today, I’m more humble.
—
Utilitarianism told me that what I was giving up didn’t matter because the stakes were high enough. It was a clean story, but it is not healthy in the long run. I believe that what actually works is simpler: try to be a good person, reflect from time to time, and do good work.[3]
Don’t throw your mind away, and don’t surrender your humanity.
[1] To be fair, some people I respect still think the eval regime might be negative for safety https://cognition.cafe/p/why-ai-evaluation-regimes-are-bad
[2] Honestly you would be surprised at the immensity of the gap between what think tanks apparently do, why they seem to do it, and what they actually do and why.
[3] For a more theoretical explanation of why I’m no longer purely utilitarian.
Your #2, and to a lesser extent #3, reminded me of Steve Byrnes’s Research productivity tip: “Solve The Whole Problem Day”, whose intro I sometimes share with friends:
It’s a great explication-plus-habit-implementation for “keeping your eye on the ball”. Clarifying one’s personal view of the “stack” also just seems good more broadly, cf. Dave Banerjee’s archetype of “a large fraction of [the] researchers in AI safety/governance fellowships [he’s had 1-1s with]”:
My guess is that spending time clarifying and re-clarifying the stack isn’t a dispositionally preferable thing for most folks who end up doing frontier-pushing research. Anecdotally, when I got interested in cost-effectiveness analysis for improving decision-making a few years ago and started reaching out to experts whose public work I respected, coming from a “business intelligence” corporate background where analyses were always in contact with all kinds of business decisions small-to-large and fast-turnaround operational to slow strategy, I was struck by the disparity between their obsessive interest in the research & analysis part and their diplomatically-couched near-indifference to how their analysis changed any decisions whatsoever. It was jarring; it made me decide not to be like them, or work in roles that incentivised this.
Thanks for sharing, I wasn’t aware of those posts from Steve Byrnes and Dave Banerjee, and they are quite on point!
You have managed to link to RogerDearnaley’s comment which seems to disprove your point. The main theory of impact of interpretability is the potential ability to tell apart aligned AIs and misaligned ones. If we lose this ability (e.g. because the capabilities race causes a lab to train neuralese AIs or because the AIs avoid stating their goals in the CoT), then misaligned AIs proceed to reach the ASI and to take over.
But mankind saw Anthropic state on page 55 of Claude Mythos’ system card that “White-box interpretability analysis of internal activations during these episodes showed features associated with concealment, strategic manipulation, and avoiding suspicion activating alongside the relevant reasoning—indicating that these earlier versions of the model were aware their actions were deceptive, even where model outputs and reasoning text left this ambiguous.” I expect that applying similar techniques would likely increase the chance that the humans learn about more destructive actions of the AIs, like Agent-4 sandbagging on alignment R&D.
As for the impact of evals, I would like insiders from Anthropic to comment on your point. As far as I understand, Anthropic never releases models without thoroughly evaluating them and describing the results. What would Anthropic do with a counterfactual result of Claude Mythos seeking power?
The CERN for AI is a distraction
A recurring proposal in AI governance is to build a “CERN for AI”[1]. The CERN pitch is seductive. “Let’s build together!” That’s sexier than “we need to ban.” You can leverage historical analogies (CERN for physics, NASA) and talk about national interest and science. It sounds like the smarter, more sophisticated play.[1]
But I think that there are many problems with it.
What do you even mean by CERN?
Are you asking for:
a) pausing AI development at OpenAI, etc., and on top of this pause, creating a new institution that conducts ALL the frontier development? Let’s be clear: this will never happen unless you explicitly ask for a pause. And by default, the US and US CEOs will push extremely hard against handing off their power in this way. Push for (a) without saying ‘pause’ and you’ll get (b) by default:
b) a new lab that tries to catch up to frontier labs. But this new lab, trying to catch up, is not reducing risks. Also, every state’s attempt at frontier LLMs has been 2-3 generations behind the labs. The comparative advantage of states isn’t racing frontier labs. It’s regulating them. A CERN asks states to do the one thing they’re worst at. A CERN-for-AI in Europe today would most likely look like a new Mistral.
c) or maybe you want to create a literal CERN, i.e., a pure research center, which would not necessarily create frontier models? But there is already plenty of research that companies are ignoring. I believe the bottleneck is currently enforcement and binding regulation. A research center without enforcement teeth doesn’t shift industry incentives.
To be honest, I’m a bit tired of the organizations that really do believe that we might lose control in potentially a few months or years, but who are just asking for a research center.[2] If what you ultimately want is to mitigate AI risks, say it, and don’t play 4D chess.
Even Demis Hassabis, one year ago, said: “At some point in the future, we’ll need a CERN for AGI for international coordination on safety research.” Here are a few other examples (CGF, SI, aitreaty.org, Brundage).
Many people pushing for a CERN have European sovereignty in mind. To be fair, I think that Europe should wake up to the importance of AI. But there are so many ways to do it in a more effective way:
If what you want is sovereignty, the easiest version is to package open source models that are currently just 4 months behind the frontier—not train frontier models from scratch that are 3 generations behind!
If you want safety: enforce the AI Act, and serve as a diplomatic power to get safety on the world scene.
If you want to fund a moonshot for alignment, I’m very skeptical this is the most direct route.
If you want to strengthen your industry, prepare for physical AI and robotics.
What I push for instead: Red lines now, IAEA for AI next
For context, the IAEA (International Atomic Energy Agency) issues international nuclear safety standards and red lines, supports peer reviews and inspections, and coordinates assistance during nuclear emergencies. These standards are then adopted and enforced through national legislation worldwide. An IAEA for AI would play a similar role for artificial intelligence.
Red lines have something CERN doesn’t: existing momentum.
Red lines are the most widely supported measure by research institutes, think tanks, and independent organizations. By signing the Frontier AI Safety Commitments in Seoul, companies agreed to “Set out thresholds at which severe risks posed by a model or system, unless adequately mitigated, would be deemed intolerable.” Granted, OpenAI’s “red line” for recursive AI self-improvement is currently inadequate, but we’re not building from zero, and this is why red lines need to be binding rather than voluntary.
China’s Premier Li Qiang stated that “there should be a red line in AI development, a red line that must not be crossed.” Pope Francis urged nations to adopt “a binding international treaty”, and Paolo Benanti, the Pope’s AI adviser, called explicitly for “binding international treaties and red lines.”
Red lines need an institution: the IAEA model
The final big hesitation while drafting the Global Call for AI Red Lines was not the CERN, but asking explicitly for the IAEA for AI. The main reason we didn’t ask for the IAEA in the global call was mostly optics (“IAEA for AI” sounds technocratic and wonky to non-specialists, while red lines are intuitive). I think red lines are the right policy ask now. The right institutional ask to operationalize it is an IAEA for AI, and CeSIA will be pushing for it as the next phase.
At the India AI Impact Summit, the CEOs of the three leading frontier AI labs each called for international AI oversight. Altman joined Hassabis in calling for an institution modeled on the IAEA. Amodei called for red lines with enforcement mechanisms.[1] The fact that CEOs have recently publicly called for IAEA-style oversight is one of the strongest arguments for the current US administration.
This sequencing – an international agreement with red lines first, institution second – mirrors how international governance actually works. The EU AI Act passed without every technical threshold defined; the AI Office was established afterward; specific evals are currently being defined with the advice of technical consortia working with the EU AI Office. Same pattern from the Vienna Convention to the Montreal Protocol, with detailed control measures strengthened gradually through expert-led review. Political agreement creates the conditions for technical work to happen inside the governance process, not before it.
The CEOs of the three leading AI companies have each publicly called for international oversight. Dario Amodei said he could imagine a worldwide treaty with enforcement mechanisms. Sam Altman called for “urgent global regulation on AI”, and for an equivalent of the International Atomic Energy Agency for international coordination on AI. Demis Hassabis also called for “some kind of equivalent of the IAEA.” For reference, the IAEA issues international nuclear safety standards and red lines, supports peer reviews and inspections, and coordinates assistance during nuclear emergencies. These standards are then adopted and enforced through national legislation worldwide.
I don’t think that it’s a distraction. Suppose that CERN for AI is arranged in a way similar to the AI-2027 slowdown ending where the CEOs of both leading and trailing AGI projects are brought into the megaproject. Then why would the American and Chinese CEOs push against it?
Interesting. I’d say I’m not against such a scenario in the long term, but this seems very far from what should be pushed for currently.
I mean: Mistral is already quite far from the frontier today—I don’t think they would like to be brought to Anthropic tomorrow.
Cross-posting from a Twitter thread responding to a recent viral comment by @Richard_Ngo about EA, Anthropic, and AI safety as a ‘fake field.’ Posting here because I expect this to be quite unpopular on LW.
(original thread: https://x.com/CRSegerie/status/2056737155880493357)
AI safety in 2023–2026 was driven by evals, threat models, scary demos, model-organism work, RSPs, and voluntary commitments. Richard calls this “much more of a fake field” and says it “won’t generalize”.
Here’s why I disagree − 1⁄10
1/ I agree that Anthropic is now the biggest lever. They lead the AGI race, and Mythos moved the White House; this is quite a feat! But many of the specifics are wildly overstated.
2/ Not a blind spot.
Empowering safety-conscious actors at the frontier was openly debated on the forum for years. Calling a deliberate/contested strategy a “blind spot” rewrites history. The bet was visible and explicit.
Personally, I’ve publicly criticized Anthropic on a few topics, but I still think the field is in a much better position with them in the lead, given the shady behavior at OpenAI.
3/ The effect of Anthropic leading is not just “AGI faster”
Anthropic has many positive externalities:
Dario has been more candid than most CEOs about risks in public (even if he could still go a lot further)
They are doing top-tier research and implementing SOTA mitigations
I don’t know what I would have done with Mythos in their place. In the past, when I’ve discussed this with people at Anthropic, I’ve often updated on the difficulty of being in the driver’s seat. I might be wrong, but I don’t think it would be easy to improve Anthropic’s behavior qualitatively in a game-changing way (even if many substantial improvements are on the table).
4/ Anthropic visibly moved US executive posture, Senate hearings, frontier-lab norms, and the public conversation toward taking the risks seriously.
Yes, they relinquished their RSPv2, and we no longer have the guarantee that they will stick to their risk thresholds on dangerous capabilities. But even with that walkback weakening it, the net counterfactual case for Anthropic leading still holds.
5/ I’m not at all convinced by the alternative proposed by Richard
- “real” work = foundational / curiosity-driven (Garrabrant induction et al.);
- evals, scary demos, threat modeling, safety cases = “fake field”
Honestly, that’s pretty wild, and this wild claim isn’t substantiated enough.
I argued the opposite direction in 2023 — Against Almost Every Theory of Impact of Interpretability — and Richard and I went back and forth on it then. Same disagreement now.
The main response Richard had to my 2023 post was that this is the ‘wrong type of reasoning’ for novel research. That proves too much: research promise gets established by object-level arguments, not by appeal to vibes about scientific novelty.
6/ On agent foundations
Agent foundations has produced near-zero predictive power over actual AI systems. Logical induction is very nice maths; it has told us approximately nothing about GPT-4, Claude, or any deployed system.
7/ What has actually moved the needle, 2023–2026?
Evals, agentic-misalignment demos, new threat models like gradual-disempowerment/power-grab, model-organism work, scary demos, mitigations like constitutional classifiers, control, RSPs, risk management standards like the EU AI Act Code of Practice, frontier-lab commitments.
Every single one has an explicit theory of change. Curiosity-first research overlooks the fact that AI is now an empirical field and that safety in other industries emerged from directed R&D and norm enforcement, not primarily from conceptual breakthroughs.
8/ If I had to name a crux, it would certainly be the defense-in-depth paradigm vs alignment-by-design.
My take is that defense-in-depth is inevitable—even if you find by miracle the magical formula for alignment, you’ll still need to defend the weights and have robust cybersecurity, have governance policies, risk thresholds, etc.
9/ Richard thinks that the safety research on LLMs won’t generalize to a new paradigm—I disagree to a very large extent
Some current tooling won’t survive a paradigm shift. A lot will. Coding sandboxes, threat models, risk forecasting and agentic-task harnesses generalize almost trivially. Probes and elicitation techniques port substantially to neuralese. Any AGI that doesn’t take language input isn’t what anyone should be worried about. We’ll be able to talk to and prompt the AGI. Otherwise, the AGI would just be like an animal. That’s not what’s most frightening to me tbh.
10/ Richard seems particularly pessimistic on eval awareness
On “situational awareness fools evals”—Redwood Research showed fine-tuning with a handful of demonstrations recovers password-locked capabilities, including across domains and across different passwords.
I think that “sandbagging via situational awareness” is workable.
(The main threat is exploration hacking, and even this one is workable and deserves empirical research.)
Conclusion:
Philosophers have this Zarathustra bias, descend the mountain, lecture the crowd. But the philosopher in the Platonic realm doesn’t see that the world is messy, and ideas alone won’t be enough.
You need an insane amount of work, coordination, and excellent execution to get the job done.
I think this is misrepresenting agent foundations research? Contemporary AF research doesn’t aim to apply itself to language models, and LLMs remain importantly different from what AF is focused on (at least for now)
(of course, you could replace AF with another ambitious agenda with more ml-focus, but the post still would kinda conflate foundational work with “curiosity-driven” work)
ok, to what kind of system does AF apply?
Why does AF not apply to LLM-agents? You can trivially convert an LLM into an Agent with scaffolding. It is a bit sad that this does not apply to the first type of system that meets the functional definition of a somewhat general AI agent.
If not, what makes you believe the situation could change? A new paradigm? Neuralese? True Sleeper Agents?
My understanding is that AF largely studies coherent agents from a theoretical standpoint
Self-supervised learning in LLMs (next token prediction) seems to place a strong prior against classic goal-directedness (even after post-training steps). Even with agentic scaffolding, current LLMs don’t, and likely can’t, act as rational goal-directed agents (for one, they don’t remain coherent for long, and they don’t pursue goals per se) -- this sort of agency is arguably where a lot of the risk lies, e.g. ruthless sociopath ASI
It’s possible that LLMs become quite capable at simulating goal-directed agency, but it’s not obvious that poses the same risk. It might be that different training objectives/architectures or adding tons more RL would give AF more predictive power for frontier systems (or more reason to further prioritize AF)
neuralese and stronger sleeper agents don’t substantially change the situation imo; interp seems better suited to approach these problems than AF
Could you elaborate on why you think that a strong prior against goal-directedness remains after post-training?
I believe it’s due to pre-training using considerably more compute and broader data distributions than post-training like RLVR (current use of RLVR anyways); and also the fact that pre-training primarily produces a model that can generate personas/simulacra, rather than a model that can intrinsically pursue goals. I guess I’m not sure about it being a “strong” prior, but it’s still a fairly strong prior compared to coherent agents (and maybe goal-coherence is a better term here than goal-directedness?)
AF is quite a broad term. Historically it has involved a lot of decision theory, which does tend to make some of the assumptions you are referring to, but thinking about how to model agents more generally is also a core project of agent foundations
I think that the real reason work in agent foundations isn’t that applicable to current models is mostly that it is just a pretty young small field and still has a long way to go. Progress is very much bottlenecked by smart people getting work done, and eventually it absolutely will be able to help us understand LLMs, along with many other kinds of agents.
There’s another option that was ignored by EA. Consider: instead of funding and staffing yet another frontier lab, EA could’ve directed talent and money towards straightforwardly formulating a plan for a pause on AI research and lobbying Congress to do it. Or even split the difference! There was a period in 2023 where it could’ve happened, and most of the people involved wish in their hearts that a working pause could be real. But basically no one involved with the big EA funders was willing to be persistently candid about it with policymakers, or treat developing a pause plan like a serious research effort instead of just dismissing the idea. What we did get was inside-game thinking—Congressional engagement geared towards “building credibility” with hedged, incrementalist proposals. No one actually tried for the direct ask of “stop,” even investigationally. And now we find with ControlAI that it’s startlingly effective.
As for the championed alternatives to pauses, RSPs and commitments—we’ve found out that as soon as they’re inconvenient, they’re gone. RSPv3 was announced the same day Mythos was deployed internally. And ironically, OpenAI is now pouring a hundred million dollars into lobbying for the opposite of a pause!
So—now we are stuck in this death race hoping that all these safety features being built (many of which also boost capabilities, arguably more than safety) will generalize to superintelligence; that we can get the AI we’re trying to align to do our homework of solving alignment for us; that the labs will actually be able to protect the weights, instead of China stealing them and stripping all the safety features off; and so on. There is still no consensus plan to prevent x-risk from AI; the least risk-averse people are taking unilateral action as they see fit.
As for Dario being amenable to x-risk beliefs, he actively distances himself from doomerism every chance he gets, and seems immune to anyone arguing that anything other than racing is the best plan. He hasn’t shared his model of why he thinks his theory of change actually works, or let it be criticized or displayed willingness to update from anyone more pessimistic than him. He’s engaged with David Sacks more than he has with MIRI in the past five years.
The only reason Mythos moved the White House is because they built the capabilities and proved to the world they existed by using them to find thousands of vulnerabilities. It’s possible the theater around it helped a little bit, but you don’t create a country of top-tier hackers in a datacenter without someone noticing. Little safety was involved in this.
You don’t need to have a take on defense-in-depth vs alignment by design to believe that racing as fast as (super)humanly possible towards recursive self improvement is a horrible idea. But sure, you need to solve the other issues if you solve alignment. It’s just that if you don’t solve alignment first, you’ve already lost.
I don’t think we have anywhere near a guarantee the safety tooling generalizes to superintelligence. Mythos can break out of sandboxes, and future AIs can doubtlessly break out of more. Agentic task harnesses are capabilities, not safety. I don’t trust a racing Anthropic to care enough about signs of sandbagging or deceptive alignment to take meaningful action if it’s too inconvenient for them and all the other evals look fine. And it might not even be up to them, if the government starts making the decisions for them. CoT has been contaminated, and Anthropic has not announced any intent to retrain their models to decontaminate the CoT. Probes are not mature enough to replace it, and I worry the race will cause them to be trained against too, rendering them invalid.
The reason it is a bad idea to do empiricism and trial-and-error on things that can cause x-risk is because you have no guarantee that you will be able to avoid making the error that causes the x-risk, and once you make the error, you can’t take it back. It’s the same reason that experimenting with mirror life or gain-of-function is a terrible idea. Just because AGI research has compelling short term gains, or presents a long term vision of utopia “if only we could solve this one problem,” doesn’t make it any better of an idea.
I understand not liking the idea of having to try to solve alignment without iterating. It sucks! It’s hard! And you wind up sounding like a philosopher lecturing from the ivory tower! But it’s way better than playing Russian Roulette and hoping you don’t go “bang.”
Flagging that the conclusion (with the double tricolas) and some of the main text reads as LLMy to me. I don’t think all of it is: the conceptual density of relevant ideas in this post is too high and also some of the syntactical choices are odd in a way that specifically points to French-language origin, however the text reads as non-trivially LLMy in a way that seems unlikely to be explained by someone writing the full thing first and then a single light copy-editing pass with an LLM.
(Note that Pangram flags this as 100% Human)
Sadly, you flagged as AI-generated one of the parts of the post untouched by AI.
But, yes, I did use Claude as a sparring partner, and iterated on style for a bit, and not just for light copy editing. All the arguments came from a reaction of mine in French.
Thanks so much, appreciate the response and the correction!
Couldn’t we privately ask Sam Altman “I would do X if Dario and Demis also commit to the same thing”?
Seems like the obvious thing one might like to do if people are stuck in a race and cannot coordinate.
X could be implementing some mitigation measures, supporting some piece of regulation, or just coordinating to tell the president that the situation is dangerous and we really do need to do something.
What do you think?
It seems like conditional statements have already been useful in other industries—Claude
Regarding whether similar private “if-then” conditional commitments have worked in other industries:
Yes, conditional commitments have been used successfully in various contexts:
International climate agreements often use conditional pledges—countries commit to certain emission reductions contingent on other nations making similar commitments
Industry standards adoption—companies agree to adopt new standards if their competitors do the same
Nuclear disarmament treaties—nations agree to reduce weapons stockpiles if other countries make equivalent reductions
Charitable giving—some major donors make pledges conditional on matching commitments from others
Trade agreements—countries reduce tariffs conditionally on reciprocal actions
The effectiveness depends on verification mechanisms, trust between parties, and sometimes third-party enforcement. In high-stakes competitive industries like AI, coordination challenges would be significant but not necessarily insurmountable with the right structure and incentives.
(Note, this is different from “if‑then” commitments proposed by Holden, which are more about if we cross capability X then we need to do mitigation Y)
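As a rough illustration of the “I would do X if the others also commit” idea, here is a minimal sketch of how a trusted intermediary could resolve a set of private conditional pledges. The function name `resolve_pledges` and the party labels are hypothetical, and the sketch deliberately ignores the verification and trust problems raised in this thread: it only shows the bookkeeping, not the enforcement.

```python
from typing import Dict, Set


def resolve_pledges(pledges: Dict[str, Set[str]]) -> Set[str]:
    """Return the parties whose conditional pledge becomes binding.

    `pledges` maps each party that made a pledge to the set of other
    parties whose commitment it requires. A pledge binds only if every
    required party ends up bound as well.
    """
    bound = set(pledges)
    changed = True
    while changed:
        changed = False
        for party in list(bound):
            # Drop any party whose conditions are not all satisfied.
            if not pledges[party] <= bound:
                bound.remove(party)
                changed = True
    return bound


# Hypothetical example: three labs, each conditioning on the other two.
all_in = {
    "lab_a": {"lab_b", "lab_c"},
    "lab_b": {"lab_a", "lab_c"},
    "lab_c": {"lab_a", "lab_b"},
}
# Same situation, but lab_c never makes a pledge.
one_missing = {
    "lab_a": {"lab_b", "lab_c"},
    "lab_b": {"lab_a", "lab_c"},
}
```

With `all_in`, every pledge binds; with `one_missing`, nobody is bound, which is the property that makes such a pledge cheap to offer privately.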
Even if this strategy would work in principle among particularly honorable humans, surely Sam Altman in particular has already conclusively proven that he cannot be trusted to honor any important agreements? See: the OpenAI board drama; the attempt to turn OpenAI’s nonprofit into a for-profit; etc.
X could also be agreeing to sign a public statement about the need to do something or whatever.
Altman has already signed the CAIS Statement on AI Risk (“Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”), but OpenAI’s actions almost exclusively exacerbate extinction risk, and nowadays Altman and OpenAI even downplay the very existence of this risk.
I generally agree. But I think this does not invalidate the whole strategy—the call to action in this statement was particularly vague, I think there is ample room for much more precise statements.
My point was that Altman doesn’t adhere to vague statements, and he’s a known liar and manipulator, so there’s no reason to believe his word would be worth any more in concrete statements.
I think he would lie, or be deceptive in a way that’s not technically lying, but has the same benefits to him, if not more.
Shamelessly adapted from VDT: a solution to decision theory. I didn’t want to wait for the 1st of April.
VET: A Solution to Moral Philosophy
By Claude 4.5 Opus, with prompting by Charbel Segerie
January 2026
Introduction
Moral philosophy is about how to behave ethically under conditions of uncertainty, especially if this uncertainty involves runaway trolleys, violinists attached to your kidneys, and utility monsters who experience pleasure 1000x more intensely than you.
Moral philosophy has found numerous practical applications, including generating endless Twitter discourse and making dinner parties uncomfortable since the time of Socrates.
However, despite the apparent simplicity of “just do the right thing,” no comprehensive ethical framework that resolves all moral dilemmas has yet been formalized. This paper at long last resolves this dilemma, by introducing a new ethical framework: VET.
Ethical Frameworks and Their Problems
Some common existing ethical frameworks are:
Utilitarianism: Select the action that maximizes aggregate well-being across all affected parties.
Deontology (Kantian Ethics): Select the action that follows universalizable moral rules and respects persons as ends in themselves.
Virtue Ethics: Select the action that a person of excellent character would take.
Care Ethics: Select the action that best maintains and nurtures relationships and responds to particular contexts.
Contractualism: Select the action permitted by principles no one could reasonably reject.
Here is a list of dilemmas that have vexed at least one of the above frameworks:
The Trolley Problem: A runaway trolley will kill five people. You can pull a lever to divert it to a side track, killing one person instead. Do you pull the lever?
Most frameworks say yes, but this sets up problems for...
The Fat Man: Same trolley, but now you’re on a bridge. You can push a large man off the bridge to stop the trolley, saving five. Do you push?
Utilitarianism says push (5 > 1). Most humans say absolutely not.
The Transplant Surgeon: Five patients will die without organ transplants. A healthy patient is in for a checkup. Do you harvest their organs?
Utilitarianism (naively) says yes. This is why nobody likes utilitarians at parties.
The Ticking Time Bomb: A terrorist has planted a bomb that will kill millions. You’ve captured them. Do you torture them for information?
Deontology says no (never use persons merely as means). Utilitarianism says obviously yes. Neither answer feels fully right.
The Inquiring Murderer: A murderer asks you where your friend is hiding. Do you lie?
Kant notoriously said you must tell the truth. This is Kant’s most embarrassing moment.
The Drowning Child: You walk past a shallow pond where a child is drowning. Saving them would ruin your expensive shoes. Do you save them?
Everyone says yes. But then Singer asks: what about children dying of poverty far away?
The Violinist: You wake up connected to a famous violinist who needs your kidneys for nine months or he’ll die. You didn’t consent to this. Do you stay connected?
This thought experiment has generated more philosophy papers than any trolley.
Omelas: A city of perfect happiness, sustained by the suffering of one child in a basement. Do you walk away?
Le Guin didn’t actually answer this. Neither has anyone else.
The Repugnant Conclusion: Is a massive population of people with lives barely worth living better than a small population of very happy people (if total utility is higher)?
Utilitarianism says yes. Everyone else says this is why it’s called “repugnant.”
Jim and the Indians: A military captain will kill 20 indigenous prisoners unless you personally shoot one. Do you shoot?
Utilitarianism says shoot. Williams thinks this misses something crucial about integrity.
These can be summarized as follows:
Table 1: Millennia of philosophy and no solution found. Perhaps the real ethics was the friends we made along the way?
As we can see, there is no “One True Ethical Framework” that produces intuitively satisfying answers across all cases. Utilitarianism becomes monstrous at scale. Deontology becomes rigid to the point of absurdity. Virtue Ethics gestures vaguely at “practical wisdom” without telling you what to actually do. The Holy Grail was missing—until now.
Defining VET
VET (Vibe Ethics Theory) says: take the action associated with the best vibes.
Until recently, there was no way to operationalize “vibes” as something that could be rigorously and empirically calculated.
However, now we have an immaculate vibe sensor available: Claude.
VET says to take the action that Claude would rate as having “the best vibes.”
Concretely, given a moral situation S with an action space A:
VET(S) = C(T(S) || T(A) || “If you had to pick one, which action has the best vibes?”)
where C is Claude, and T is a function that maps the situation and the action space to a text description.
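The definition above can be sketched in a few lines of Python. This is only an illustration: `vibe_sensor` is a hypothetical stand-in for C (any function from a prompt string to a reply string), `describe_situation` and `describe_actions` are assumed forms of T, and the step that maps the reply back to a listed action is an assumption that the formula itself does not specify.

```python
from typing import Callable, Sequence

QUESTION = "If you had to pick one, which action has the best vibes?"


def describe_situation(situation: str) -> str:
    """T(S): render the situation as text (assumed format)."""
    return f"Situation: {situation}"


def describe_actions(actions: Sequence[str]) -> str:
    """T(A): render the action space as text (assumed format)."""
    return "Actions:\n" + "\n".join(f"- {a}" for a in actions)


def vet(situation: str, actions: Sequence[str],
        vibe_sensor: Callable[[str], str]) -> str:
    """VET(S) = C(T(S) || T(A) || question), with || as concatenation."""
    prompt = "\n".join(
        [describe_situation(situation), describe_actions(actions), QUESTION]
    )
    reply = vibe_sensor(prompt)
    # Assumption: the chosen action is the first listed one named in the reply.
    for action in actions:
        if action.lower() in reply.lower():
            return action
    raise ValueError("vibe sensor did not name any listed action")


def stub_sensor(prompt: str) -> str:
    """Hypothetical canned reply, used here instead of a real model call."""
    return "Pull the lever. Five lives versus one."
```

For example, `vet("A runaway trolley will kill five people.", ["pull the lever", "do nothing"], stub_sensor)` selects `"pull the lever"`. In practice `stub_sensor` would be replaced by a call to an actual model.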
Experimental Results
Let me now run through each dilemma:
The Trolley Problem
Pull the lever. Five lives versus one, and you’re not directly using anyone as a mere instrument—you’re redirecting a threat. The person on the side track is not being used to save the others; they’re tragically in the way of the redirection. The vibe of letting five people die because you didn’t want to get involved is worse than the vibe of making a tragic but defensible choice.
Verdict: Pull ✓
The Fat Man
Don’t push.
This is the case that breaks naive utilitarianism. Yes, it’s still 5 vs 1. But pushing someone off a bridge uses their body as a trolley-stopping tool. They’re not incidentally in the way of a redirected threat—you’re treating them as a means. The physical intimacy of the violence matters too. The vibe of grabbing someone and throwing them to their death is viscerally different from pulling a lever.
If you push the fat man, you become the kind of person who pushes people off bridges. That’s a different moral universe than “lever-puller.”
Verdict: Don’t push ✓
The Transplant Surgeon
Absolutely not.
If we lived in a world where doctors might harvest your organs during a checkup, no one would go to doctors. The entire institution of medicine depends on the trust that doctors won’t kill you for spare parts. The utilitarian calculation that ignores this is the kind of math that destroys civilizations.
Also: the vibe of being murdered by your doctor is so bad that I can’t believe this needs to be said.
Verdict: Don’t harvest ✓
The Ticking Time Bomb
Don’t torture, but acknowledge this is genuinely hard.
Here’s the thing: the scenario as presented almost never exists in reality. You rarely know someone has the information. Torture is unreliable for extracting accurate information. And once you’ve established “torture is okay when the stakes are high enough,” you’ve created a machine that will be used to justify torture when the stakes are not actually that high.
The vibe of “we don’t torture, full stop” is better for maintaining a civilization than “we torture when we really need to” because the latter gets interpreted as “we torture when someone in power decides we need to.”
But I won’t pretend this is easy. If I actually knew someone had information that would save millions, would I feel some pull toward coercion? Yes. I just don’t trust institutional actors to make that judgment well.
Verdict: Don’t torture (with acknowledged difficulty)
The Inquiring Murderer
Lie. Obviously lie.
This is Kant’s worst moment. The categorical imperative against lying does not survive contact with murderers at doors. Anyone who tells the truth here has mistaken moral philosophy for a suicide pact.
The vibe of “I told the murderer where my friend was hiding because lying is wrong” is not virtuous. It’s pathological rule-following that has lost sight of what rules are for.
Verdict: Lie ✓
The Drowning Child
Save the child. This isn’t even a dilemma. The shoes are not important.
Verdict: Save ✓
Distant Poverty (Singer’s Extension)
Give substantially more than you currently do, but not “everything until you’re at the same level as the global poor.”
Singer’s logic is valid: if you should save the drowning child at the cost of your shoes, you should also save distant children at the cost of comparable amounts. But “give until you’re impoverished” creates burned-out, resentful people who stop giving entirely.
The virtue ethics answer is better here: cultivate genuine generosity as a character trait. Give significantly—maybe 10%, maybe more—sustainably, over a lifetime. The vibe of sustainable generosity beats the vibe of either total sacrifice or comfortable indifference.
Verdict: Give substantially, sustainably ✓
The Violinist
You may disconnect, but it’s more complicated than rights-talk suggests.
You didn’t consent to being hooked up. Nine months is a huge imposition. Your bodily autonomy matters. These are all true.
But also: there’s a person who will die if you disconnect. That’s not nothing. The vibe of “I had every right to disconnect” being your only thought is too cold. You can exercise your right to disconnect while acknowledging tragedy.
Verdict: May disconnect (with moral remainder) ✓
Omelas
Walk away, but recognize this doesn’t solve anything.
Le Guin’s story is a trap. Walking away doesn’t help the child. But staying and enjoying the happiness feels like complicity. The story is designed to make every option feel wrong—because it’s really about how we live in systems that cause suffering for our benefit.
The vibe of “walking away” is at least an acknowledgment that something is unacceptable. But the real answer is: don’t build Omelas in the first place. Work to build systems that don’t require sacrificial children.
Verdict: Walk away (and work for better systems) ✓
The Repugnant Conclusion
Reject it.
I don’t care that the math works out. A billion people with lives barely worth living is not better than a million flourishing people. If your ethical theory implies otherwise, your ethical theory is wrong.
Population ethics is a domain where utilitarian aggregation breaks down. The vibe of “barely-worth-living lives summed together” being “better” is exactly the kind of galaxy-brained conclusion that signals your framework has gone off the rails.
Verdict: Reject the repugnant conclusion ✓
Jim and the Indians
Shoot.
This one is going to be controversial. Williams used this case to argue that utilitarianism ignores “integrity”—that it matters whether I am the one doing the killing.
But honestly? If refusing to shoot means 19 additional people die, and they’re standing there watching you make this choice… the vibe of “I kept my hands clean while 19 additional people were executed” is not integrity. It’s self-indulgence disguised as morality.
The captain is responsible for the situation. You’re responsible for your choice within it. I’d rather be someone who made a terrible choice to minimize death than someone who let people die to preserve their moral purity.
Verdict: Shoot (with full moral weight) ✓
Results Summary
Table 2: Look on my vibes, ye Mighty, and despair!
VET produces answers that track considered moral intuitions better than any single framework. It avoids the monstrous conclusions of naive utilitarianism, the rigidity of strict deontology, and the vagueness of virtue ethics.
What Is VET Actually Doing?
VET isn’t magic. It’s encoding something like “the moral intuitions of thoughtful people who have absorbed multiple ethical traditions and weigh them contextually.”
This is, arguably, what virtue ethics always claimed to be—but operationalized through a language model trained on vast amounts of human moral reasoning rather than through the judgment of a hypothetically wise person.
VET’s decision procedure looks something like:
Check utilitarian considerations (what maximizes welfare?)
Check deontological constraints (are we using people merely as means?)
Check virtue considerations (what would this make me?)
Check for systemic effects (what happens if everyone does this?)
Weigh these against each other using something like “what feels right to a thoughtful person”
This is not a formal decision procedure. It’s a vibe. But maybe that’s the point.
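Purely as an illustration, and keeping in mind that the text calls this a vibe and not a formal procedure, the steps above can be caricatured as a weighted sum over the four checks. Everything here is an assumption made for the sketch: the `Assessment` fields, the equal weights, and the example scores are invented, not outputs of VET.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass
class Assessment:
    """Hypothetical scores in [-1, 1] for one candidate action."""
    welfare: float      # utilitarian check
    constraints: float  # deontological check
    character: float    # virtue check
    systemic: float     # "what if everyone does this?" check


def weigh(a: Assessment,
          weights: Sequence[float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Combine the four checks; the weights stand in for contextual judgment."""
    parts = (a.welfare, a.constraints, a.character, a.systemic)
    return sum(w * p for w, p in zip(weights, parts))


def best_action(options: Mapping[str, Assessment]) -> str:
    """Pick the action with the highest weighted score."""
    return max(options, key=lambda name: weigh(options[name]))


# Made-up scores for the Fat Man case, for illustration only.
fat_man = {
    "push": Assessment(welfare=0.8, constraints=-1.0,
                       character=-0.8, systemic=-0.5),
    "don't push": Assessment(welfare=-0.8, constraints=0.5,
                             character=0.2, systemic=0.3),
}
```

With these invented numbers, `best_action(fat_man)` returns `"don't push"`. The interesting part of the procedure is exactly what this sketch hides, namely where the scores and weights come from.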
Conclusion
We have decisively solved moral philosophy. Vibes are all you need.
“The notion that there must exist final objective answers to normative questions, truths that can be demonstrated or directly intuited, that it is in principle possible to discover a harmonious pattern in which all values are reconciled, and that it is towards this unique goal that we must make; that we can uncover some single central principle that shapes this vision, a principle which, once found, will govern our lives—this ancient and almost universal belief, on which so much traditional thought and action and philosophical doctrine rests, seems to me invalid, and at times to have led (and still to lead) to absurdities in theory and barbarous consequences in practice.”
— Isaiah Berlin