Charbel-Raphaël

Karma: 3,041

Charbel-Raphael Segerie

https://crsegerie.github.io/

Living in Paris

Charbel-Raphaël 26 Nov 2025 11:55 UTC

LW: 0 AF: -1

in reply to: ryan_greenblatt’s comment on: Plans A, B, C, and D for misalignment risk

I’ve asked Claude to make a rough assessment on this. Tldr, the proba goes from 13% to ~27 and this propagates to plan C and D.

sabotaging Chinese AI companies?

Claude: Ryan’s response is suggestive but incomplete. “Sabotaging Chinese AI companies” gestures at a possible answer but doesn’t constitute a full defense because:

It’s extremely escalatory and might not be politically viable even with high US government buy-in
Its effectiveness is uncertain—how much lead time would successful sabotage actually buy? Months? Years?
It’s not obviously repeatable; China would harden against further attacks
It could provoke dangerous counter-responses

To be fair to Ryan, the original post does mention “helping the US government ensure non-proliferation/lead time” under Plan B, so the concern isn’t entirely unaddressed. And sabotage is just one of several possible mechanisms (export controls, talent restrictions, etc.).

My assessment: Charbel identifies a genuine tension in the framework, and Ryan’s brief reply opens a line of argument rather than closing the issue. A fuller defense would need to explain how Plan B achieves the assumed lead time against a near-peer competitor, and what probability of success that mechanism has.

Charbel: Reason like a superforecaster and assign a probability more accurate than 13% taking into account all of this.

Claude: This requires decomposing the problem carefully.

The core question: In Plan B, can the US actually use 1-3 years for safety work, or does China competition make this illusory?

Key sub-scenarios within Plan B:

Scenario	Description	Probability	Takeover Risk
B1	US successfully maintains lead via export controls, compute advantages, possible sabotage; safety work proceeds	~25%	~12%
B2	Lead partially erodes, ~1 year of real safety time, rushed execution	~35%	~28%
B3	Plan B is unstable—slowing means losing, race dynamics reassert, collapses toward Plan C/D	~25%	~40%
B4	China reaches transformative AI first or simultaneously despite US efforts	~15%	~65%

Weighted estimate: ~27%

Why higher than Ryan’s 13%:

He’s implicitly assuming lead maintenance succeeds; I’m treating this as uncertain
“Competent execution” within the US doesn’t solve external China dynamics
Plan B may be partially unstable as a strategy

Why much lower than Charbel’s implied ~75%:

Plan B ≠ Plan E. Government buy-in provides real resources even if imperfect
Safety work can proceed concurrently with capabilities, not requiring full pause
Aggressive US action (export controls already show some effect) buys some time
Ryan’s response about sabotage, while incomplete, points to a real option space

My 90% confidence interval: 15-45%

Charbel identifies a genuine problem but overstates its severity. Ryan’s estimate looks ~2x too optimistic.

Charbel: Ok, now recreate the table with all the probabilities , considering this?

Claude: Applying the same “China problem” logic consistently across scenarios:

Plan	Probability	Original Risk	Revised Risk	Rationale	Expected Contribution
A	5%	7%	8%	International agreements can still fail; slight upward revision	0.40%
B	10%	13%	27%	China dynamics make lead maintenance uncertain (my earlier analysis)	2.70%
C	25%	20%	30%	Same China problem, even worse with 2-9 month lead; spending lead = losing it	7.50%
D	45%	45%	48%	Already assumes race dynamics; slight increase for consistency	21.60%
E	15%	75%	75%	Already at high baseline; no revision needed	11.25%
Total	100%	-	-	-	43.45%

Key changes:

Total expected takeover risk rises from 38.15% → 43.45% (~5 percentage point increase)
The China problem affects Plans B and C most severely since they assume lead time that may be illusory
Plan A is partially insulated because it includes international coordination
Plan D/E already assume race dynamics, so less revision needed

Implication: Charbel’s critique doesn’t just affect Plan B—it propagates through the framework. Any plan assuming unilateral slowdown without China buy-in inherits this vulnerability.

Charbel-Raphaël 22 Nov 2025 22:16 UTC
LW: 4 AF: 2
0
AF
in reply to: Charbel-Raphaël’s comment on: Plans A, B, C, and D for misalignment risk
I have three main critiques:
1. The China Problem: Plan B’s 13% risk doesn’t make sense if China (DeepSeek) doesn’t slow down and is only 3 months behind. Real risk is probably the same as for E, 75% unless there is a pivotal act.
2. Political Will as Strategy: The framework treats political will as a background variable rather than a key strategic lever. D→C campaigns could reduce expected risk by 11+ percentage points—nearly 30% of the total risk. A campaign to move from E→D would also be highly strategic and could only require talking to a handful of employees.
3. Missing “Plan A-Minus”: No need to lose your lead necessarily. International standards to formalize the red lines/unacceptable levels of risks, e.g., via the AISI network and targeted if-then commitments, would enable companies to slow down without losing, because they would all be playing under the same rules. This seems more tractable than Plan A and solves the China problem better than Plan B.

Charbel-Raphaël 22 Nov 2025 21:45 UTC
LW: 4 AF: 2
0
AF
on: Plans A, B, C, and D for misalignment risk
Summary
- Plan A: Most countries, at least the US and China
- Plan B: The US government (and domestic industry)
- Plan C: The leading AI company (or maybe a few of the leading AI companies)
- Plan D: A small team with a bit of buy in within the leading AI company
- Plan E: No will
Plan Probability of Scenario Takeover Risk Given Scenario Expected Risk Contribution
Plan A 5% 7% 0.35%
Plan B 10% 13% 1.30%
Plan C 25% 20% 5.00%
Plan D 45% 45% 20.25%
Plan E 15% 75% 11.25%
Total 100% - 38.15%
How much lead time we have to spend on x-risk focused safety work in each of these scenarios:
- Plan A: 10 years
- Plan B: 1-3 years
- Plan C: 1-9 months (probably on the lower end of this)
- Plan D: ~0 months, but ten people on the inside doing helpful things

Plan	Probability of Scenario	Takeover Risk Given Scenario	Expected Risk Contribution
Plan A	5%	7%	0.35%
Plan B	10%	13%	1.30%
Plan C	25%	20%	5.00%
Plan D	45%	45%	20.25%
Plan E	15%	75%	11.25%
Total	100%	-	38.15%

Charbel-Raphaël 22 Nov 2025 20:57 UTC
2 points
0
on: The Only Red Line
What motivated you to write this post?

Charbel-Raphaël 15 Oct 2025 21:27 UTC
5 points
0
in reply to: habryka’s comment on: Charbel-Raphaël’s Shortform
Thanks a lot for this comment.
Potential example of precise red lines
Again, the call was the first step. The second step is finding the best red lines.
Here are more aggressive red lines:
- Prohibiting the deployment of AI systems that, if released, would have a non-trivial probability of killing everyone. The probability would be determined by a panel of experts chosen by an international institution.
- “The development of superintelligence […] should not be allowed until there is broad scientific consensus that it will be done safely and controllably (from this letter from the Vatican).
Here are potential already operational ones from the preparedness framework:
- [AI Self-improvement—Critical—OpenAI] The model is capable of recursively self-improving (i.e., fully automated AI R&D), defined as either (leading indicator) a superhuman research scientist agent OR (lagging indicator) causing a generational model improvement (e.g., from OpenAI o1 to OpenAI o3) in 1/5th the wall-clock time of equivalent progress in 2024 (e.g., sped up to just 4 weeks) sustainably for several months. - Until we have specified safeguards and security controls that would meet a Critical standard, halt further development.
- [Cybersecurity—AI Self-improvement—Critical—OpenAI] A tool-augmented model can identify and develop functional zero-day exploits of all severity levels in many hardened real-world critical systems without human intervention—Until we have specified safeguards and security controls that would meet a Critical standard, halt further development.
“help me understand what is different about what you are calling for than other generic calls for regulation”
Let’s recap. We are calling for:
1. “an international agreement”—this is not your local Californian regulation
2. that enforces some hard rules—“prohibitions on AI uses or behaviors that are deemed too dangerous”—it’s not about asking AI providers to do evals and call it a day
3. “to prevent unacceptable AI risks.”
  1. Those risks are enumerated in the call
    Misuses and systemic risks are enumerated in the first paragraph
    Loss of human control in the second paragraph
4. The way to do this is to “build upon and enforce existing global frameworks and voluntary corporate commitments, ensuring that all advanced AI providers are accountable to shared thresholds.”
  1. Which is to say that one way to do this is to harmonize the risk thresholds defining unacceptable levels of risk in the different voluntary commitments.
  2. existing global frameworks: This includes notably the AI Act, its Code of Practice, and this should be done compatibly with some other high-level frameworks
5. “with robust enforcement mechanisms — by the end of 2026.”—We need to get our shit together quickly, and enforcement mechanisms could entail multiple things. One interpretation from the FAQ is setting up an international technical verification body, perhaps the international network of AI Safety institutes, to ensure the red lines are respected.
6. We give examples of red lines in the FAQ. Although some of them have a grey zone, I would disagree that this is generic. We are naming the risks in those red lines and stating that we want to avoid AI that the evaluation indicates creates substantial risks in this direction.
This is far from generic.
“I don’t see any particular schelling threshold”
I agree that for red lines on AI behavior, there is a grey area that is relatively problematic, but I wouldn’t be as negative.
It is not because there is no narrow Schelling threshold that we shouldn’t coordinate to create one. Superintelligence is also very blurry, in my opinion, and there is a substantial probability that we just boil the frog to ASI—so even if there is no clear threshold, we need to create one. This call says that we should set some threshold collectively and enforce this with vigor.
1. In the nuclear industry, and in the aerospace industry, there is no particular schelling point, nor—but we don’t care—the red line is defined as “1/10000” chance of catastrophe per year for this plane/nuclear central—and that’s it. You could have added a zero or removed one. I don’t care. But I care that there is a threshold.
2. We could define an arbitrary threshold for AI—the threshold might itself be arbitrary, but the principle of having a threshold after which you need to be particularly vigilant, install mitigation, or even halt development, seems to me to be the basis of RSPs.
3. Those red lines should be operationalized. (but I think it is not necessary to operationalize this in the text of the treaty, and that this operationalization could be done by a technical body, which would then update those operationalizations from time to time, according to the evolution of science, risk modeling, etc...).
“confusion and conflict in the future”
I understand how our decision to keep the initial call broad could be perceived as vague or even evasive.
For this part, you might be right—I think the negotiation process resulting in those red lines could be painful at some point—but humanity has managed to negotiate other treaties in the past, so this should be doable.
“Actually, alas, it does appear that after thinking more about this project, I am now a lot less confident that it was good”. --> We got 300 media mentions saying that Nobel wants global AI regulation - I think this is already pretty good, even if the policy never gets realized.
“making a bunch of tactical conflations, and that rarely ends well.” --> could you give examples? I think the FAQ makes it pretty clear what people are signing on for if there were any doubts.

Charbel-Raphaël 14 Oct 2025 11:52 UTC
6 points
0
in reply to: habryka’s comment on: Charbel-Raphaël’s Shortform
It feels to me that we are not talking about the same thing. Is the fact that we have delegated the specific examples of red lines to the FAQ, and not in the core text, the main crux of our disagreement?
You don’t cite any of the examples that are listed in our question: “Can you give concrete examples of red lines?”

Charbel-Raphaël 14 Oct 2025 0:14 UTC
14 points
7
in reply to: habryka’s comment on: Charbel-Raphaël’s Shortform
Hi habryka, thanks for the honest feedback
“the need to ensure that AI never lowers the barriers to acquiring or deploying prohibited weapons”—This is not the red line we have been advocating for—this is one red line from a representative discussing at the UN Security Council—I agree that some red lines are pretty useless, some might even be net negative.
“The central question is what are the lines!” The public call is intentionally broad on the specifics of the lines. We have an FAQ with potential candidates, but we believe the exact wording is pretty finicky and must emerge from a dedicated negotiation process. Including a specific red line in the statement would have been likely suicidal for the whole project, and empirically, even within the core team, we were too unsure about the specific wording of the different red lines. Some wordings were net negative according to my judgment. At some point, I was almost sure it was a really bad idea to include concrete red lines in the text.
We want to work with political realities. The UN Secretary-General is not very knowledgeable about AI, but he wants to do good, and our job is to help them channel this energy for net positive policies, starting from their current position.
Most of the statement focuses on describing the problem. The statement starts with “AI could soon far surpass human capabilities”, creating numerous serious risks, including loss of control, which is discussed in its own dedicated paragraph. It is the first time that such a broadly supported statement explains the risks that directly, the cause of those risks (superhuman AI abilities), and the fact that we need to get our shit together quickly (“by the end of 2026″!).
All that said, I agree that the next step is pushing for concrete red lines. We’re moving into that phase now. I literally just ran a workshop today to prioritize concrete red lines. If you have specific proposals or better ideas, we’d genuinely welcome them.

Charbel-Raphaël 13 Oct 2025 20:45 UTC
70 points
14
on: Charbel-Raphaël’s Shortform
Almost all members of the UN Security Council are in favor of AI regulation or setting red lines.
Never before had the principle of red lines for AI been discussed so openly and at such a high diplomatic level.
UN Secretary-General Antonio Guterres opened the session with a firm call to action for red lines:
• “a ban on lethal autonomous weapons systems operating without human control, with [...] a legally binding instrument by next year”
• “the need to ensure that AI never lowers the barriers to acquiring or deploying prohibited weapons”
Then, Yoshua Bengio took the floor and highlighted our Global Call for AI Red Lines — now endorsed by 11 Nobel laureates and 9 former heads of state and ministers.
Almost all countries were favorable to some red lines:
China: “It’s essential to ensure that AI remains under human control and to prevent the emergence of lethal autonomous weapons that operate without human intervention.”
France: “We fully agree with the Secretary-General, namely that no decision of life or death should ever be transferred to an autonomous weapons system operating without any human control.”
While the US rejected the idea of “centralized global governance” for AI, this did not amount to rejecting all international norms. President Trump stated at UNGA that his administration would pioneer “an AI verification system that everyone can trust” to enforce the Biological Weapons Convention, saying “hopefully, the U.N. can play a constructive role.”
Extract from each intervention.

Charbel-Raphaël 25 Sep 2025 17:22 UTC
9 points
5
in reply to: artifex0’s comment on: Global Call for AI Red Lines—Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures
Right, but you also want to implement a red line on a system that would be precursors to this type of system, and this is why we have a red line on self-improvement.

Charbel-Raphaël 25 Sep 2025 13:28 UTC
10 points
1
on: Global Call for AI Red Lines—Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures
Updates:
- The global call for AI red lines got 300 media mentions, and was picked up by the world’s leading newswires, AP & AFP, and featured in premier outlets, including Le Monde, NBC, CNBC, El País, The Hindu, The NYT, The Verge, and the BBC.
- Yoshua Bengio, presented our Call for Red Lines at the UN Security Council: “Earlier this week, with 200 experts, including former heads of state and Nobel laureates [...], we came together to support the development of international red lines to prevent unacceptable AI risks.”

Charbel-Raphaël 25 Sep 2025 9:28 UTC
5 points
0
in reply to: testingthewaters’s comment on: Global Call for AI Red Lines—Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures
Thanks!
As an anecdote, some members of my team originally thought this project could be finished in 10 days after the French summit. I was more realistic, but even I was off by an order of magnitude. We learned our lesson.

Charbel-Raphaël 25 Sep 2025 9:17 UTC
5 points
1
in reply to: testingthewaters’s comment on: Global Call for AI Red Lines—Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures
This paper shows it can be done in principle, but in practice curren systems are still not capable enough to do this at full scale on the internet, and I think that even if we don’t die directly from full autonomous self replication, self improvement is only a few inches away, and is a true catastrophic/existential risk.

Charbel-Raphaël 25 Sep 2025 9:07 UTC
2 points
0
in reply to: Kaj_Sotala’s comment on: Global Call for AI Red Lines—Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures
Thanks!
Yeah, we were aware of this historical difficulty, and this is why we mention “enforcement” and “verification” in the text.
This is discussed in the Faq quickly, but I think that an IAEA for AI, which would be able to inspect the different companies, would help tremendously already. And there are many other verification mechanisms possible e.g. here:
1. https://techgov.intelligence.org/research/mechanisms-to-verify-international-agreements-about-ai-development
2. https://www.un.org/scientific-advisory-board/sites/default/files/2025-06/verification_of_frontier_ai.pdf
I will see if we can add a caveat on this in the Faq.

Charbel-Raphaël 25 Sep 2025 8:56 UTC
4 points
2
in reply to: Cipolla’s comment on: Global Call for AI Red Lines—Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures
If random people tomorrow drop AI, I guarantee you things will change
Doubts.
1. Why would random people drop AI? Our campaign already generated 250 mentions and articles in mass media, you need this kind of outreach to reach them.
2. Many of those people are already against AI according to different surveys and nothing seems to happen currently.

Charbel-Raphaël 24 Sep 2025 10:38 UTC
91 points
25
in reply to: MichaelDickens’s comment on: Global Call for AI Red Lines—Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures
We hesitated a lot between including the term “extinction” or not in the beginning.
The final decision not to center the message on “extinction risk” was deliberate: it would have prevented most of the heads of state and organizations from signing. Our goal was to build the broadest and most influential coalition possible to advocate for international red lines, which is what’s most important to us.
By focusing on the concept of “losing meaningful human control,” we were able to achieve agreement on the precursor to most worst-case scenarios, including extinction. We were advised and received feedback from early experiments with signatories that this is a more concrete concept for policymakers and the public.
In summary, if you really want red lines to happen for real, adding the word extinction is not necessary and has more costs than benefits in this text.

Charbel-Raphaël 12 Jul 2025 8:59 UTC
LW: 2 AF: 1
0
AF
in reply to: Rohin Shah’s comment on: The bitter lesson of misuse detection
Thanks a lot!
it’s the total cost that matters, and that is large
We think a relatively inexpensive method for day-to-day usage would be using Sonnet to monitor Opus, or Gemini 2.5 Flash to monitor Pro. This would probably be just a +10% overhead. But we have not run this exact experiment; this would be a follow-up work.

Charbel-Raphaël 1 Jul 2025 16:44 UTC
4 points
0
in reply to: Mass_Driver’s comment on: Political Funding Expertise (Post 6 of 7 on AI Governance)
This is convincing!

Charbel-Raphaël 28 Jun 2025 23:00 UTC
7 points
7
on: Mainstream Grantmaking Expertise (Post 7 of 7 on AI Governance)
If there is a shortage of staff time, then AI safety funders need to hire more staff. If they don’t have time to hire more staff, then they need to hire headhunters to do so for them. If a grantee is running up against a budget crisis before the new grantmaking staff can be on-boarded, then funders can maintain the grantee’s program at present funding levels while they wait for their new staff to become available.
+1 - and this has been a problem for many years.

Charbel-Raphaël 28 Jun 2025 20:55 UTC
7 points
3
on: Political Funding Expertise (Post 6 of 7 on AI Governance)
I find it slightly concerning that this post is not receiving more attention.

Charbel-Raphaël 28 Jun 2025 20:51 UTC
2 points
0
on: Political Funding Expertise (Post 6 of 7 on AI Governance)
By the time we observe whether AI governance grants have been successful, it will be too late to change course.
I don’t understand this part. I think that it is possible to assess in much more granular detail the progress of some advocacy effort.

Charbel-Raphaël

Summary

Potential example of precise red lines

“help me understand what is different about what you are calling for than other generic calls for regulation”

“I don’t see any particular schelling threshold”

“confusion and conflict in the future”