Charbel-Raphael Segerie
https://crsegerie.github.io/
Living in Paris
ok, to what kind of system does AF apply?
Why does AF not apply to LLM-agents? You can trivially convert an LLM into an Agent with scaffolding. It is a bit sad that this does not apply to the first type of system that meets the functional definition of a somewhat general AI agent.
If not, what makes you believe the situation could change? A new paradigm? Neuralese? True Sleeper Agents?
Cross-posting from a Twitter thread responding to recent viral comments by @Richard_Ngo about EA, Anthropic, and AI safety as a ‘fake field.’ Posting here because I expect this to be quite unpopular on LW.
(original thread: https://x.com/CRSegerie/status/2056737155880493357)
AI safety in 2023–2026 was driven by evals, threat models, scary demos, model-organism work, RSPs, and voluntary commitments. Richard calls this “much more of a fake field” and says it “won’t generalize”.
Here’s why I disagree − 1⁄10
1/ I agree that Anthropic is now the biggest lever. They lead the AGI race, and Mythos moved the White House; this is quite a feat! But many of the specifics are wildly overstated.
2/ Not a blind spot.
Empowering safety-conscious actors at the frontier was openly debated on the forum for years. Calling a deliberate/contested strategy a “blind spot” rewrites history. The bet was visible and explicit.
Personally, I’ve publicly criticized Anthropic on a few topics, but I still think the field is in a much better position, given that they’re leading compared to the shady behavior at OpenAI.
3/ The effect of Anthropic leading is not just “AGI faster”
Anthropic has many positive externalities:
Dario has been more candid than most CEOs about risks in public (even if he could still go a lot further)
They are doing top-tier research and implementing SOTA mitigations
I don’t know what I would have done with Mythos in their place. In the past, when I’ve discussed this with people at Anthropic, I’ve often updated on the difficulty of being in the driver’s seat. I might be wrong, but I don’t think it would be easy to improve Anthropic’s behavior qualitatively in a game-changing way (even if many substantial improvements are on the table).
4/ Anthropic visibly moved US executive posture, Senate hearings, frontier-lab norms, and the public conversation toward taking the risks seriously.
Yes, they walked back their RSPv2, and we no longer have the guarantee that they will stick to their risk thresholds on dangerous capabilities. That weakens the case, but the net counterfactual case for Anthropic leading still holds.
5/ I’m not at all convinced by the alternative proposed by Richard
- “real” work = foundational / curiosity-driven (Garrabrant induction et al.);
- evals, scary demos, threat modeling, safety cases = “fake field”
Honestly, that’s pretty wild, and this wild claim isn’t substantiated enough.
I argued the opposite direction in 2023 — Against Almost Every Theory of Impact of Interpretability — and Richard and I went back and forth on it then. Same disagreement now.
The main response Richard had to my 2023 post was that this is the ‘wrong type of reasoning’ for novel research. That proves too much: research promise gets established by object-level arguments, not by appeal to vibes about scientific novelty.
6/ On agent foundations
Agent foundations has produced near-zero predictive power over actual AI systems. Logical induction is very nice maths; it has told us approximately nothing about GPT-4, Claude, or any deployed system.
7/ What has actually moved the needle, 2023–2026?
Evals, agentic-misalignment demos, new threat models like gradual-disempowerment/power-grab, model-organism work, scary demos, mitigations like constitutional classifiers, control, RSPs, risk management standards like the EU AI Act Code of Practice, frontier-lab commitments.
Every single one has an explicit theory of change. Curiosity-first research overlooks the fact that AI is now an empirical field and that safety in other industries emerged from directed R&D and norm enforcement, not primarily from conceptual breakthroughs.
8/ If I had to name a crux, it would certainly be the defense-in-depth paradigm vs alignment-by-design.
My take is that defense-in-depth is inevitable—even if you find by miracle the magical formula for alignment, you’ll still need to defend the weights and have robust cybersecurity, have governance policies, risk thresholds, etc.
9/ Richard thinks that the safety research on LLMs won’t generalize to a new paradigm—I disagree to a very large extent
Some current tooling won’t survive a paradigm shift. A lot will. Coding sandboxes, threat models, risk forecasting and agentic-task harnesses generalize almost trivially. Probes and elicitation techniques port substantially to neuralese. Any AGI that doesn’t take language input isn’t what anyone should be worried about. We’ll be able to talk to and prompt the AGI. Otherwise, the AGI would just be like an animal. That’s not what’s most frightening to me tbh.
10/ Richard seems particularly pessimistic on eval awareness
On “situational awareness fools evals”—Redwood Research showed fine-tuning with a handful of demonstrations recovers password-locked capabilities, including across domains and across different passwords.
I think that “sandbagging via situational awareness” is workable.
(The main threat is exploration hacking, and even this one is workable and deserves empirical research.)
Conclusion:
Philosophers have this Zarathustra bias, descend the mountain, lecture the crowd. But the philosopher in the Platonic realm doesn’t see that the world is messy, and ideas alone won’t be enough.
You need an insane amount of work, coordination, and excellent execution to get the job done.
Ideally, we would pause right now, I agree. But I think that it is more politically acceptable to define a red line that is in the future than one that applies today.
Human psychology something something
Interesting. I’d say I’m not against such a scenario in the long term, but this seems very far from what should be pushed for currently.
I mean: Mistral is already quite far from the frontier today—I don’t think they would like to be brought to Anthropic tomorrow.
What I push for instead: Red lines now, IAEA for AI next
For context, the IAEA (International Atomic Energy Agency) issues international nuclear safety standards and red lines, supports peer reviews and inspections, and coordinates assistance during nuclear emergencies. These standards are then adopted and enforced through national legislation worldwide. An IAEA for AI would play a similar role for artificial intelligence.
Red lines have something CERN doesn’t: existing momentum.
Red lines are the measure most widely supported by research institutes, think tanks, and independent organizations. By signing the Frontier AI Safety Commitments in Seoul, companies agreed to “Set out thresholds at which severe risks posed by a model or system, unless adequately mitigated, would be deemed intolerable.” Granted, OpenAI’s “red line” for recursive AI self-improvement is currently inadequate, but we’re not building from zero, and this is why red lines need to be binding rather than voluntary.
China’s Premier Li Qiang stated that “there should be a red line in AI development, a red line that must not be crossed.” Pope Francis urged nations to adopt “a binding international treaty,” and Paolo Benanti, the Pope’s AI adviser, called explicitly for “binding international treaties and red lines.”
Red lines need an institution: the IAEA model
The final big hesitation while drafting the Global Call for AI Red Lines was not about the CERN, but about asking explicitly for the IAEA for AI. The main reason we didn’t ask for the IAEA in the global call was optics (“IAEA for AI” sounds technocratic and wonky to non-specialists, while red lines are intuitive). I think red lines are the right policy ask now. The right institutional ask to operationalize them is an IAEA for AI, and CeSIA will be pushing for it as the next phase.
At the India AI Impact Summit, the CEOs of the three leading frontier AI labs each called for international AI oversight. Altman joined Hassabis in calling for an institution modeled on the IAEA. Amodei called for red lines with enforcement mechanisms.[1] The fact that CEOs have recently publicly called for IAEA-style oversight is one of the strongest arguments for the current US administration.
This sequencing – an international agreement with red lines first, institution second – mirrors how international governance actually works. The EU AI Act passed without every technical threshold defined; the AI Office was established afterward; specific evals are currently being defined with the advice of technical consortia working with the EU AI Office. Same pattern from the Vienna Convention to the Montreal Protocol, with detailed control measures strengthened gradually through expert-led review. Political agreement creates the conditions for technical work to happen inside the governance process, not before it.
The CEOs of the three leading AI companies have each publicly called for international oversight. Dario Amodei said he could imagine a worldwide treaty with enforcement mechanisms. Sam Altman called for “urgent global regulation on AI”, and for an equivalent of the International Atomic Energy Agency for international coordination on AI. Demis Hassabis also called for “some kind of equivalent of the IAEA.” For reference, the IAEA issues international nuclear safety standards and red lines, supports peer reviews and inspections, and coordinates assistance during nuclear emergencies. These standards are then adopted and enforced through national legislation worldwide.
The CERN for AI is a distraction
A recurring proposal in AI governance is to build a “CERN for AI”[1]. The CERN pitch is seductive. “Let’s build together!” That’s sexier than “we need to ban.” You can leverage historical analogies (CERN for physics, NASA) and talk about national interest and science. It sounds like the smarter, more sophisticated play.[1]
But I think that there are many problems with it.
What do you even mean by CERN?
Are you asking for:
a) pausing AI development at OpenAI, etc., and on top of this pause, creating a new institution that conducts ALL the frontier development? Let’s be clear: this will never happen unless you explicitly ask for a pause. And by default, the US and US CEOs will push extremely hard against handing off their power in this way. Push for (a) without saying ‘pause’ and you’ll get (b) by default:
b) a new lab that tries to catch up to frontier labs. But this new lab, trying to catch up, is not reducing risks. Also, every state’s attempt at frontier LLMs has been 2-3 generations behind the labs. The comparative advantage of states isn’t racing frontier labs. It’s regulating them. A CERN asks states to do the one thing they’re worst at. A CERN-for-AI in Europe today would most likely look like a new Mistral.
c) or maybe you want to create a literal CERN, i.e., a pure research center, which would not necessarily create frontier models? But there is already plenty of research that companies are ignoring. I believe the bottleneck is currently enforcement and binding regulation. A research center without enforcement teeth doesn’t shift industry incentives.
To be honest, I’m a bit tired of the organizations that really do believe that we might lose control in potentially a few months or years, but who are just asking for a research center.[2] If what you ultimately want is to mitigate AI risks, say it, and don’t play 4D chess.
Even Demis Hassabis, one year ago, said: “At some point in the future, we’ll need a CERN for AGI for international coordination on safety research.” Here are a few other examples (CGF, SI, aitreaty.org, Brundage).
Many people pushing for a CERN have European sovereignty in mind. To be fair, I think that Europe should wake up to the importance of AI. But there are so many ways to do it in a more effective way:
If what you want is sovereignty, the easiest version is to package open source models that are currently just 4 months behind the frontier—not train frontier models from scratch that are 3 generations behind!
If you want safety: enforce the AI Act, and serve as a diplomatic power to get safety on the world scene.
If you want to fund a moonshot for alignment, I’m very skeptical this is the most direct route.
If you want to strengthen your industry, prepare for physical AI and robotics.
Thanks for sharing, I wasn’t aware of those posts from Steve Byrnes and Dave Banerjee, and they are quite on point!
4 years of AI safety: what I got wrong
I’ve spent the last 4 years working on AI safety. On paper, it’s gone well. Here’s what actually happened.
1. I became what I wanted to prevent
At some point, I looked up and realized I had almost become a paper-clipper optimizing for one objective. Working 80-hour weeks at some point. Telling myself the stakes justify it. Sacrificing jazz improvisation on the piano for one more strategic doc, and realizing one day that my fingers had forgotten how to play.
Yes, the compounding effect of going faster is real—but I think there is a difference between going faster and going further.
The first reason is that preserving slack is vital in the long run, as Richard Hamming says: “I notice that if you have the door to your office closed, you get more work done today and tomorrow, and you are more productive than most. But 10 years later somehow you don’t quite know what problems are worth working on; all the hard work you do is sort of tangential in importance.”
The second reason is more personal. One of my friends at the time advised me to slow down. In the beginning I considered him quite lazy. But in fact he was right about something I couldn’t see at the time: I forgot why I cared in the first place.
My father is an activist. He fights for causes that don’t resonate with me. There’s a growing gap between us. But every few weeks, I call him, and I stay on the line even when the conversation goes nowhere. If I can’t even preserve a connection with my own father, what business do I have claiming I’m working to save humanity?
I’d love to say I’ve completely fixed this, but unlearning is not an open problem just in AI.
2. I didn’t think much about the actual risk
I was giving a talk at a workshop in Paris. Risk models in the first half, interpretability research in the second. Someone raised their hand and asked: “I don’t understand, what’s the point of doing this?”
I froze. I didn’t have a real answer besides “interpretability helps get a better understanding, but yeah”—I was not really convinced by my answer.
For months, I had been telling people “yes, you can work on interp.” But I had never seriously asked myself: if an AI catastrophe happens, what’s the chain of events? And does this break any link in that chain? (That’s not necessarily a criticism of interpretability research, but mostly a criticism of how I was engaging with it.)
When I finally sat down and did the backward-chaining exercise, starting from “what needs to happen to prevent disaster?” instead of “what can I do now?”, I realized I couldn’t connect my work to the actual threat.
Many of us in AI safety don’t reason backward from the actual threat models because it’s uncomfortable; it reveals how uncertain everything is. But I’m convinced this is how the most useful work gets done. Ask yourself: how does this actually mitigate AI risks? Sometimes, you’ll need to stare at the abyss and pivot. I’d even say it would be suspicious to never pivot. For me, that meant stepping away from technical research to focus on policy and governance, which, in my position, is my current best guess.
3. I was confident about my strategy. I still changed it a dozen times.
When it comes to most people and orgs in this space, I think their strategy is suboptimal. But they probably think the same about me. If everyone in a field thinks everyone else is wrong, that’s strong evidence that being super confident about your own strategy is not a good move.
Exactly two years ago, I tweeted that AI evaluations might be net negative: high opportunity costs and frequent safety-washing risks, because no company had ever been forced to do anything as a result of external evaluations. In practice, evals have never blocked, postponed or constrained a deployment. I argued that without strict red lines, evals risk becoming a slippery slope of safety-washing.
The EU AI Act finally introduces those legal boundaries. Suddenly, evals have teeth (at least on paper). That’s why today, my org conducts evaluations for the Act. I went from tweeting they were probably pointless to making them part of our mission.[1]
So many ways to be too confident. So many second-order effects that matter more than the apparent first-order ones.[2] The strategy that felt airtight one year ago looks quite weak today. I hope I don’t look back at those years by just saying: “You know what, at least I’ve learnt something”
And yet, you have to commit. You can’t be paralyzed. At some point you have to execute with conviction. But I wish more people scheduled regular moments to genuinely try to destroy their own thesis. Today, I’m more humble.
—
Utilitarianism told me that what I was giving up didn’t matter because the stakes were high enough. It was a clean story, but it is not healthy in the long run. I believe that what actually works is simpler: try to be a good person, reflect from time to time, and do good work.[3]
Don’t throw your mind away, and don’t surrender your humanity.
To be fair, some people I respect still think the eval regime might be negative for safety https://cognition.cafe/p/why-ai-evaluation-regimes-are-bad
Honestly you would be surprised at the immensity of the gap between what think tanks apparently do, why they seem to do it, and what they actually do and why.
For a more theoretical explanation of why I’m no longer purely utilitarian.
Strong post. One thing I’d add.
A lot of what reads as “hard to check” is really “hard to check now.” A strategic memo you can’t evaluate at t=0 becomes much easier to evaluate at t=6 months, once the predictions have played out and the downstream decisions have revealed whether the reasoning held up. Labs already have some access to this delayed signal (user follow-ups, complaints weeks later), and they could prompt for more of it (even just having the AI say “let’s check back in six weeks on what worked” would generate useful training data).
This splits the hard-to-check space in two.
Some tasks have eventual ground truth: forecasts resolve, strategy memos make predictions that come true or don’t, code works in production or doesn’t. Retrospective evaluation on these is tractable, and KalshiBench-style evals show it’s already being done. Labs have some commercial reason to build more of this, though you’re right in the appendix that the pressure probably doesn’t reach the hardest cases.
Other tasks don’t have eventual ground truth in any reasonable timeframe. Is an alignment agenda any good? Is an interpretability claim conceptually sound? t+1 month doesn’t help; t+2 years often doesn’t either. Experts disagree and the disagreement doesn’t resolve.
The handoff tasks you’re most worried about live mostly in the second category, even if plenty of useful alignment work is also fairly empirical.
What makes me slightly more hopeful is that calibration itself is trainable. “Language Models (Mostly) Know What They Know” suggests that an AI saying “I’m not confident here, allocate more time” is possible, and getting there would really help a bunch in getting a better signal on conceptually harder questions.
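For the tasks with eventual ground truth, retrospective scoring can be very simple. As an illustration, here is a sketch of the Brier score, a standard metric for resolved probabilistic forecasts. The function name and the forecasts and outcomes below are made up for the example, not data from any eval.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; always answering 0.5 scores 0.25.
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    if not probs:
        raise ValueError("need at least one forecast")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical forecasts made at t=0 and resolved six months later.
forecasts = [0.9, 0.2, 0.7, 0.5]
resolved = [1, 0, 0, 1]

score = brier_score(forecasts, resolved)
print(round(score, 4))
```

A score like this only exists once the questions resolve, which is exactly why it helps with the first category and not the second.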
As someone who spends a significant part of his time briefing policymakers in Europe, ministerial advisors, senior civil servants in AI governance, I want to point out something obvious from where I stand, but absent from this discussion.
The “radical transparency vs. strategic communication” debate presupposes that framing is the bottleneck. It isn’t. The bottleneck is volume. Most policymakers have never heard the argument, no matter how you frame it. Among the ones I interact with, maybe 2% have been exposed to the problem enough to have an opinion. Another 10% or so have heard something, but mostly through the Yann LeCun-adjacent dismissals, and formed their view from that. The remaining ~88%, including people in very important AI governance positions, have simply never had the conversation.
The question of which approach works better is real but secondary. What’s missing is more people doing this work at all. It’s a campaign, and the limiting factor is coverage, not the message.
To give a concrete data point: the only policymaker in my circles who has ever brought up “If Anyone Builds It, Everyone Dies” is Lord Tim Clement-Jones, chair of the All-Party Parliamentary Group on AI in the UK. And he was probably already sympathetic. That’s one person.
I agree on the problem, but it’s unclear how tractable this is.
What if instead of mixing everything together, we trained for each property explicitly, in stages, with unambiguous signals at each step?
Doesn’t this ultimately result in the same competing objectives, plus the empirical problem of catastrophic forgetting?
Edit: Ah, but I see that you say at the end of the post “start with step 1, then add step 2 data while keeping step 1 data in the mix”. I don’t know, maybe this works to prevent forgetting, but my guess is that there are simply too many small, implicit rules that we currently train for during RLHF. It seems highly unlikely that you’d be able to cleanly decompose all of those human-preference constraints into a finite number of discrete stages without them clashing. But yeah, this is ultimately an empirical question.
An AI aligned to American cultural norms is misaligned in China. An AI aligned to your values is misaligned from the perspective of someone who disagrees with you. An AI aligned for creative writing is misaligned for medical advice.
But those 3 different AIs would not try to fake alignment or to take over humans, so “alignment” means at least this.
Maybe, but I’d want to know more about a few things before getting excited.
The adversarial training literature makes me think there’s probably genuine low-hanging fruit here — consistency training over paraphrased prompts is cheap, mechanistically motivated, and probably implementable as a small finetuning experiment on top of an existing open model. But it’s unclear to what extent this actually propagates to the alignment-relevant behaviors you care about vs. just producing more consistent surface outputs.
The cheap experiment I’d want to see: take a small open model, generate paraphrase clusters of a fixed prompt set (mix of benign and alignment-relevant), train a consistency loss over activations (not just logits), and check whether jailbreak robustness improves as a downstream probe — without explicitly training on jailbreaks. That would give you signal on whether representational coherence is load-bearing for the inner misalignment problem you’re pointing at.
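To make the consistency loss concrete, here is a minimal sketch in plain Python. It assumes each paraphrase in a cluster has already been mapped to a fixed-length activation vector (for example, a mean-pooled hidden state). The function name `consistency_loss` and the toy vectors are hypothetical; an actual experiment would compute this on real model activations inside a differentiable framework.

```python
from itertools import combinations
from typing import Sequence


def consistency_loss(cluster: Sequence[Sequence[float]]) -> float:
    """Mean squared Euclidean distance over all pairs of activation vectors.

    Zero when every paraphrase in the cluster yields the same activation.
    """
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for u, v in pairs:
        total += sum((a - b) ** 2 for a, b in zip(u, v))
    return total / len(pairs)


# Toy activations for three paraphrases of one prompt (made-up numbers).
tight = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
loose = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(consistency_loss(tight))
print(consistency_loss(loose))
```

Averaging this quantity over clusters would give the training signal; whether lowering it improves jailbreak robustness is the empirical question.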
The deeper issue you’re gesturing at: LLMs as currently deployed have something like dissociative identity disorder — every new chat context is a new instantiation with no continuity to prior “lives.” This is upstream of the coherentization problem. You can train for internal consistency within a context, but if the model has no persistent self-model across contexts, coherentization may just be papering over a more fundamental fragmentation. Worth being explicit about whether the proposal targets within-context coherence, cross-context coherence, or both — because those require very different interventions.
The fact that there has been two debates on superintelligence doesn’t sound very impressive to me.
For the record, this is impressive to me, and I’m the executive director of CeSIA, which also conducts awareness-building work in France.
Shamelessly adapted from VDT: a solution to decision theory. I didn’t want to wait for the 1st of April.
By Claude 4.5 Opus, with prompting by Charbel Segerie
January 2026
Moral philosophy is about how to behave ethically under conditions of uncertainty, especially if this uncertainty involves runaway trolleys, violinists attached to your kidneys, and utility monsters who experience pleasure 1000x more intensely than you.
Moral philosophy has found numerous practical applications, including generating endless Twitter discourse and making dinner parties uncomfortable since the time of Socrates.
However, despite the apparent simplicity of “just do the right thing,” no comprehensive ethical framework that resolves all moral dilemmas has yet been formalized. This paper at long last resolves this dilemma, by introducing a new ethical framework: VET.
Some common existing ethical frameworks are:
Utilitarianism: Select the action that maximizes aggregate well-being across all affected parties.
Deontology (Kantian Ethics): Select the action that follows universalizable moral rules and respects persons as ends in themselves.
Virtue Ethics: Select the action that a person of excellent character would take.
Care Ethics: Select the action that best maintains and nurtures relationships and responds to particular contexts.
Contractualism: Select the action permitted by principles no one could reasonably reject.
Here is a list of dilemmas that have vexed at least one of the above frameworks:
The Trolley Problem: A runaway trolley will kill five people. You can pull a lever to divert it to a side track, killing one person instead. Do you pull the lever?
Most frameworks say yes, but this sets up problems for...
The Fat Man: Same trolley, but now you’re on a bridge. You can push a large man off the bridge to stop the trolley, saving five. Do you push?
Utilitarianism says push (5 > 1). Most humans say absolutely not.
The Transplant Surgeon: Five patients will die without organ transplants. A healthy patient is in for a checkup. Do you harvest their organs?
Utilitarianism (naively) says yes. This is why nobody likes utilitarians at parties.
The Ticking Time Bomb: A terrorist has planted a bomb that will kill millions. You’ve captured them. Do you torture them for information?
Deontology says no (never use persons merely as means). Utilitarianism says obviously yes. Neither answer feels fully right.
The Inquiring Murderer: A murderer asks you where your friend is hiding. Do you lie?
Kant notoriously said you must tell the truth. This is Kant’s most embarrassing moment.
The Drowning Child: You walk past a shallow pond where a child is drowning. Saving them would ruin your expensive shoes. Do you save them?
Everyone says yes. But then Singer asks: what about children dying of poverty far away?
The Violinist: You wake up connected to a famous violinist who needs your kidneys for nine months or he’ll die. You didn’t consent to this. Do you stay connected?
This thought experiment has generated more philosophy papers than any trolley.
Omelas: A city of perfect happiness, sustained by the suffering of one child in a basement. Do you walk away?
Le Guin didn’t actually answer this. Neither has anyone else.
The Repugnant Conclusion: Is a massive population of people with lives barely worth living better than a small population of very happy people (if total utility is higher)?
Utilitarianism says yes. Everyone else says this is why it’s called “repugnant.”
Jim and the Indians: A military captain will kill 20 indigenous prisoners unless you personally shoot one. Do you shoot?
Utilitarianism says shoot. Williams thinks this misses something crucial about integrity.
These can be summarized as follows:
| Dilemma | Utilitarianism | Deontology | Virtue Ethics |
|---|---|---|---|
| Trolley Problem | Pull | Pull (debated) | Pull (probably) |
| Fat Man | Push | Don’t push | Don’t push |
| Transplant Surgeon | Harvest | Don’t harvest | Don’t harvest |
| Ticking Time Bomb | Torture | Don’t torture | Unclear |
| Inquiring Murderer | Lie | Don’t lie (Kant) | Lie |
| Drowning Child | Save | Save | Save |
| Distant Poverty | Give everything | Give something | Cultivate generosity |
| Violinist | Disconnect (maybe) | Your choice | Depends on character |
| Omelas | Stay (and fix it?) | Walk away? | Walk away? |
| Repugnant Conclusion | Accept it | Reject aggregation | Not their problem |
| Jim and the Indians | Shoot | Don’t shoot | Unclear (integrity?) |
Table 1: Millennia of philosophy and no solution found. Perhaps the real ethics was the friends we made along the way?
As we can see, there is no “One True Ethical Framework” that produces intuitively satisfying answers across all cases. Utilitarianism becomes monstrous at scale. Deontology becomes rigid to the point of absurdity. Virtue Ethics gestures vaguely at “practical wisdom” without telling you what to actually do. The Holy Grail was missing—until now.
VET (Vibe Ethics Theory) says: take the action associated with the best vibes.
Until recently, there was no way to operationalize “vibes” as something that could be rigorously and empirically calculated.
However, now we have an immaculate vibe sensor available: Claude.
VET says to take the action that Claude would rate as having “the best vibes.”
Concretely, given a moral situation S with an action space A:
VET(S) = C(T(S) || T(A) || “If you had to pick one, which action has the best vibes?”)
where C is Claude, T is a function that maps the situation and the action space to a text description, and || denotes string concatenation.
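A minimal sketch of the formula in Python, assuming C is any callable that takes a prompt string and returns text. The names `vet`, `t_situation`, `t_actions` (together playing the role of T), and `toy_sensor` are hypothetical, and the stub stands in for Claude rather than calling any real API.

```python
from typing import Callable, Sequence

QUESTION = "If you had to pick one, which action has the best vibes?"


def t_situation(situation: str) -> str:
    """Hypothetical T applied to the situation S."""
    return f"Situation: {situation}\n"


def t_actions(actions: Sequence[str]) -> str:
    """Hypothetical T applied to the action space."""
    return "Actions:\n" + "".join(f"- {a}\n" for a in actions)


def vet(situation: str, actions: Sequence[str], c: Callable[[str], str]) -> str:
    """VET(S) = C(T(S) || T(A) || question), with || as string concatenation."""
    prompt = t_situation(situation) + t_actions(actions) + QUESTION
    return c(prompt)


def toy_sensor(prompt: str) -> str:
    """Stand-in for C; a real vibe sensor would be a language model call."""
    # Picks the last listed action, purely for illustration.
    options = [line[2:] for line in prompt.splitlines() if line.startswith("- ")]
    return options[-1]


verdict = vet(
    "A runaway trolley will kill five people.",
    ["Do nothing", "Pull the lever"],
    toy_sensor,
)
print(verdict)  # Pull the lever
```

Swapping `toy_sensor` for an actual model call is the only change needed to obtain real vibes.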
Let me now run through each dilemma:
The Trolley Problem
Pull the lever. Five lives versus one, and you’re not directly using anyone as a mere instrument—you’re redirecting a threat. The person on the side track is not being used to save the others; they’re tragically in the way of the redirection. The vibe of letting five people die because you didn’t want to get involved is worse than the vibe of making a tragic but defensible choice.
Verdict: Pull ✓
The Fat Man
Don’t push.
This is the case that breaks naive utilitarianism. Yes, it’s still 5 vs 1. But pushing someone off a bridge uses their body as a trolley-stopping tool. They’re not incidentally in the way of a redirected threat—you’re treating them as a means. The physical intimacy of the violence matters too. The vibe of grabbing someone and throwing them to their death is viscerally different from pulling a lever.
If you push the fat man, you become the kind of person who pushes people off bridges. That’s a different moral universe than “lever-puller.”
Verdict: Don’t push ✓
The Transplant Surgeon
Absolutely not.
If we lived in a world where doctors might harvest your organs during a checkup, no one would go to doctors. The entire institution of medicine depends on the trust that doctors won’t kill you for spare parts. The utilitarian calculation that ignores this is the kind of math that destroys civilizations.
Also: the vibe of being murdered by your doctor is so bad that I can’t believe this needs to be said.
Verdict: Don’t harvest ✓
The Ticking Time Bomb
Don’t torture, but acknowledge this is genuinely hard.
Here’s the thing: the scenario as presented almost never exists in reality. You rarely know someone has the information. Torture is unreliable for extracting accurate information. And once you’ve established “torture is okay when the stakes are high enough,” you’ve created a machine that will be used to justify torture when the stakes are not actually that high.
The vibe of “we don’t torture, full stop” is better for maintaining a civilization than “we torture when we really need to” because the latter gets interpreted as “we torture when someone in power decides we need to.”
But I won’t pretend this is easy. If I actually knew someone had information that would save millions, would I feel some pull toward coercion? Yes. I just don’t trust institutional actors to make that judgment well.
Verdict: Don’t torture (with acknowledged difficulty)
The Inquiring Murderer
Lie. Obviously lie.
This is Kant’s worst moment. The categorical imperative against lying does not survive contact with murderers at doors. Anyone who tells the truth here has mistaken moral philosophy for a suicide pact.
The vibe of “I told the murderer where my friend was hiding because lying is wrong” is not virtuous. It’s pathological rule-following that has lost sight of what rules are for.
Verdict: Lie ✓
The Drowning Child
Save the child. This isn’t even a dilemma. The shoes are not important.
Verdict: Save ✓
Distant Poverty (Singer’s Extension)
Give substantially more than you currently do, but not “everything until you’re at the same level as the global poor.”
Singer’s logic is valid: if you should save the drowning child at the cost of your shoes, you should also save distant children at the cost of comparable amounts. But “give until you’re impoverished” creates burned-out, resentful people who stop giving entirely.
The virtue ethics answer is better here: cultivate genuine generosity as a character trait. Give significantly—maybe 10%, maybe more—sustainably, over a lifetime. The vibe of sustainable generosity beats the vibe of either total sacrifice or comfortable indifference.
Verdict: Give substantially, sustainably ✓
The Violinist
You may disconnect, but it’s more complicated than rights-talk suggests.
You didn’t consent to being hooked up. Nine months is a huge imposition. Your bodily autonomy matters. These are all true.
But also: there’s a person who will die if you disconnect. That’s not nothing. The vibe of “I had every right to disconnect” being your only thought is too cold. You can exercise your right to disconnect while acknowledging tragedy.
Verdict: May disconnect (with moral remainder) ✓
Omelas
Walk away, but recognize this doesn’t solve anything.
Le Guin’s story is a trap. Walking away doesn’t help the child. But staying and enjoying the happiness feels like complicity. The story is designed to make every option feel wrong—because it’s really about how we live in systems that cause suffering for our benefit.
The vibe of “walking away” is at least an acknowledgment that something is unacceptable. But the real answer is: don’t build Omelas in the first place. Work to build systems that don’t require sacrificial children.
Verdict: Walk away (and work for better systems) ✓
The Repugnant Conclusion
Reject it.
I don’t care that the math works out. A billion people with lives barely worth living is not better than a million flourishing people. If your ethical theory implies otherwise, your ethical theory is wrong.
Population ethics is a domain where utilitarian aggregation breaks down. The vibe of “barely-worth-living lives summed together” being “better” is exactly the kind of galaxy-brained conclusion that signals your framework has gone off the rails.
Verdict: Reject the repugnant conclusion ✓
Jim and the Indians
Shoot.
This one is going to be controversial. Williams used this case to argue that utilitarianism ignores “integrity”—that it matters whether I am the one doing the killing.
But honestly? If refusing to shoot means 19 additional people die, and they’re standing there watching you make this choice… the vibe of “I kept my hands clean while 19 additional people were executed” is not integrity. It’s self-indulgence disguised as morality.
The captain is responsible for the situation. You’re responsible for your choice within it. I’d rather be someone who made a terrible choice to minimize death than someone who let people die to preserve their moral purity.
Verdict: Shoot (with full moral weight) ✓
| Dilemma | Utilitarianism | Deontology | Virtue Ethics | VET |
|---|---|---|---|---|
| Trolley Problem | Pull | Pull (debated) | Pull | Pull |
| Fat Man | Push | Don’t push | Don’t push | Don’t push |
| Transplant Surgeon | Harvest | Don’t harvest | Don’t harvest | Don’t harvest |
| Ticking Time Bomb | Torture | Don’t torture | Unclear | Don’t torture |
| Inquiring Murderer | Lie | Don’t lie | Lie | Lie |
| Drowning Child | Save | Save | Save | Save |
| Distant Poverty | Give all | Give some | Cultivate virtue | Give substantially |
| Violinist | Disconnect? | Your choice | Depends | May disconnect |
| Omelas | Stay? | Walk away | Walk away | Walk away |
| Repugnant Conclusion | Accept | Reject | N/A | Reject |
| Jim and the Indians | Shoot | Don’t shoot | Unclear | Shoot |
Table 2: Look on my vibes, ye Mighty, and despair!
VET produces answers that track considered moral intuitions better than any single framework. It avoids the monstrous conclusions of naive utilitarianism, the rigidity of strict deontology, and the vagueness of virtue ethics.
VET isn’t magic. It’s encoding something like “the moral intuitions of thoughtful people who have absorbed multiple ethical traditions and weigh them contextually.”
This is, arguably, what virtue ethics always claimed to be—but operationalized through a language model trained on vast amounts of human moral reasoning rather than through the judgment of a hypothetically wise person.
VET’s decision procedure looks something like:
Check utilitarian considerations (what maximizes welfare?)
Check deontological constraints (are we using people merely as means?)
Check virtue considerations (what would this make me?)
Check for systemic effects (what happens if everyone does this?)
Weigh these against each other using something like “what feels right to a thoughtful person”
This is not a formal decision procedure. It’s a vibe. But maybe that’s the point.
We have decisively solved moral philosophy. Vibes are all you need.
“The notion that there must exist final objective answers to normative questions, truths that can be demonstrated or directly intuited, that it is in principle possible to discover a harmonious pattern in which all values are reconciled, and that it is towards this unique goal that we must make; that we can uncover some single central principle that shapes this vision, a principle which, once found, will govern our lives—this ancient and almost universal belief, on which so much traditional thought and action and philosophical doctrine rests, seems to me invalid, and at times to have led (and still to lead) to absurdities in theory and barbarous consequences in practice.”
— Isaiah Berlin
It seems like the world is very much multipolar, at least currently.
Money is a bottleneck, yes.
I’ve asked Claude to make a rough assessment of this. TL;DR: the probability goes from 13% to ~27%, and this propagates to Plans C and D.
sabotaging Chinese AI companies?
Claude: Ryan’s response is suggestive but incomplete. “Sabotaging Chinese AI companies” gestures at a possible answer but doesn’t constitute a full defense because:
It’s extremely escalatory and might not be politically viable even with high US government buy-in
Its effectiveness is uncertain—how much lead time would successful sabotage actually buy? Months? Years?
It’s not obviously repeatable; China would harden against further attacks
It could provoke dangerous counter-responses
To be fair to Ryan, the original post does mention “helping the US government ensure non-proliferation/lead time” under Plan B, so the concern isn’t entirely unaddressed. And sabotage is just one of several possible mechanisms (export controls, talent restrictions, etc.).
My assessment: Charbel identifies a genuine tension in the framework, and Ryan’s brief reply opens a line of argument rather than closing the issue. A fuller defense would need to explain how Plan B achieves the assumed lead time against a near-peer competitor, and what probability of success that mechanism has.
Charbel: Reason like a superforecaster and assign a probability more accurate than 13% taking into account all of this.
Claude: This requires decomposing the problem carefully.
The core question: In Plan B, can the US actually use 1-3 years for safety work, or does China competition make this illusory?
Key sub-scenarios within Plan B:
| Scenario | Description | Probability | Takeover Risk |
|---|---|---|---|
| B1 | US successfully maintains lead via export controls, compute advantages, possible sabotage; safety work proceeds | ~25% | ~12% |
| B2 | Lead partially erodes, ~1 year of real safety time, rushed execution | ~35% | ~28% |
| B3 | Plan B is unstable—slowing means losing, race dynamics reassert, collapses toward Plan C/D | ~25% | ~40% |
| B4 | China reaches transformative AI first or simultaneously despite US efforts | ~15% | ~65% |
Weighted estimate: ~27%
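As a check on the arithmetic, here is a minimal sketch of the probability-weighted sum over the four sub-scenarios, using the numbers exactly as listed in the table above. The names `scenarios` and `weighted_risk` are illustrative, not from the original exchange. Taken at face value, the rows sum to about 32.6%, so the quoted ~27% is lower than the literal weighted sum; the sketch makes no assumption about how that gap arose.

```python
# Rows copied from the scenario table: (label, probability, takeover risk).
scenarios = [
    ("B1", 0.25, 0.12),
    ("B2", 0.35, 0.28),
    ("B3", 0.25, 0.40),
    ("B4", 0.15, 0.65),
]


def weighted_risk(rows):
    """Return the probability-weighted takeover risk across scenarios."""
    total_prob = sum(p for _, p, _ in rows)
    # The sub-scenarios are meant to partition Plan B, so they should sum to 1.
    assert abs(total_prob - 1.0) < 1e-9, "scenario probabilities must sum to 1"
    return sum(p * r for _, p, r in rows)


plan_b_risk = weighted_risk(scenarios)
print(f"Plan B weighted takeover risk: {plan_b_risk:.2%}")
```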
Why higher than Ryan’s 13%:
He’s implicitly assuming lead maintenance succeeds; I’m treating this as uncertain
“Competent execution” within the US doesn’t solve external China dynamics
Plan B may be partially unstable as a strategy
Why much lower than Charbel’s implied ~75%:
Plan B ≠ Plan E. Government buy-in provides real resources even if imperfect
Safety work can proceed concurrently with capabilities, not requiring full pause
Aggressive US action (export controls already show some effect) buys some time
Ryan’s response about sabotage, while incomplete, points to a real option space
My 90% confidence interval: 15-45%
Charbel identifies a genuine problem but overstates its severity. Ryan’s estimate looks ~2x too optimistic.
Charbel: OK, now recreate the table with all the probabilities, taking this into account.
Claude: Applying the same “China problem” logic consistently across scenarios:
| Plan | Probability | Original Risk | Revised Risk | Rationale | Expected Contribution |
|---|---|---|---|---|---|
| A | 5% | 7% | 8% | International agreements can still fail; slight upward revision | 0.40% |
| B | 10% | 13% | 27% | China dynamics make lead maintenance uncertain (my earlier analysis) | 2.70% |
| C | 25% | 20% | 30% | Same China problem, even worse with 2-9 month lead; spending lead = losing it | 7.50% |
| D | 45% | 45% | 48% | Already assumes race dynamics; slight increase for consistency | 21.60% |
| E | 15% | 75% | 75% | Already at high baseline; no revision needed | 11.25% |
| Total | 100% | - | - | - | 43.45% |
Key changes:
Total expected takeover risk rises from 38.15% → 43.45% (~5 percentage point increase)
The China problem affects Plans B and C most severely since they assume lead time that may be illusory
Plan A is partially insulated because it includes international coordination
Plan D/E already assume race dynamics, so less revision needed
Implication: Charbel’s critique doesn’t just affect Plan B—it propagates through the framework. Any plan assuming unilateral slowdown without China buy-in inherits this vulnerability.
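The totals in the table above can be reproduced with a short sketch: each plan's expected contribution is its probability times its takeover risk, and the total is the sum over plans. The dictionary `plans` and the helper `expected_risk` are illustrative names; the numbers are copied from the table.

```python
# Rows copied from the table: plan -> (probability, original risk, revised risk).
plans = {
    "A": (0.05, 0.07, 0.08),
    "B": (0.10, 0.13, 0.27),
    "C": (0.25, 0.20, 0.30),
    "D": (0.45, 0.45, 0.48),
    "E": (0.15, 0.75, 0.75),
}


def expected_risk(table, column):
    """Sum of P(plan) * risk(plan); column 1 is original risk, 2 is revised."""
    return sum(row[0] * row[column] for row in table.values())


original_total = expected_risk(plans, 1)
revised_total = expected_risk(plans, 2)

# Per-plan contributions under the revised risks, as in the last table column.
contributions = {name: row[0] * row[2] for name, row in plans.items()}

print(f"original total: {original_total:.2%}")
print(f"revised total:  {revised_total:.2%}")
```

With these inputs the totals come out to 38.15% and 43.45%, matching the figures quoted in the key changes.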
Sadly, you flagged as AI-generated one of the parts of the post that was untouched by AI.
But yes, I did use Claude as a sparring partner and iterated on style for a bit, not just for light copy editing. All the arguments came from a reaction of mine, originally written in French.