Cleo Nardo
One justification for treating “schemers” as a fairly binary classification: there’s a phase transition between AIs that lie with 99% probability and AIs that lie with 100% probability, namely that techniques like “just ask different copies of the model”, “ask the AI at higher temperature”, or “ask the AI with slightly different phrasings” stop working. These techniques work unless the deception is universal, reliable, robust, and persistent.
Maybe, but the schemers could optimize their research for looking-good-to-us, and it might be hard to distinguish this from actually good work
if you ask each variant to review the research of the other variants, then the schemers need to optimise their research for looking good to each variant. but the optimistic assumption is that at least one variant is an equally capable non-schemer.
I explore similar considerations in The Case for Mixed Deployment (5 min read); the key takeaways were:
Invest in second-best techniques (basically what you say about “developing somewhat dubious ideas for how to prevent scheming”)
Modular systems—i.e. architectures where you can “toggle” some variable at inference-time to give you more variants, e.g. gradient routing
Prioritise late-stage techniques, i.e. techniques which apply late in the AI lifecycle (e.g. post-training, prompting, activation steering, scaffolding), and deprioritise techniques which apply early in the AI lifecycle (e.g. architecture design and pretraining), because this makes it cheaper to develop many variants
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other’s work, rather than running them in parallel. That is, I disagree with “If you split your compute across 20 different types of AIs, then your favorite type of AI is going to only have 5% as much compute available to it as it would’ve had if you concentrated your bets.” This is because I think we can run things like debate between all the variants.
One upshot of this difference is that I expect coordination between labs to matter significantly: if lab A starts with 100% schemers and lab B starts with 5% schemers, then we’ll elicit useful work from the AIs if the labs cross-examine each other’s AI research with their own AI variants.
“corrigibility”, as the term is used, refers to a vague cluster of properties, including faithfully following instructions, not reward hacking, not trying to influence your developers’ modifications of your goals, etc
Your novel architecture should be parameter-compatible with standard architectures
Some people work on “novel architectures” — alternatives to the standard autoregressive transformer — hoping that labs will be persuaded the new architecture is nicer/safer/more interpretable and switch to it. Others think that’s a pipe dream, so the work isn’t useful.
I think there’s an approach to novel architectures that might be useful, but it probably requires a specific desideratum: parameter compatibility.
Say the standard architecture F computes F(P,x) where x is the input and P is the parameterisation. Your novel architecture computes G(P,x). The key desideratum is that F and G share the same parameterisation P, and you can gracefully switch between F and G during training and inference. That is, on most training batches you optimise P by backpropagating through F(·,x); on some batches you optimise P by backpropagating through G(·,x). At inference time, you can likewise choose F or G per forward pass.
This is strictly more general than “replace F with G”. You have two independent dials: what proportion of training steps use G, and what proportion of inference steps use G. You might use G only during training (as a regulariser on P), only during inference (to get some safety property at deployment), or some mixture of both. Setting both dials to 100% recovers wholesale replacement; setting both to 0% recovers standard training.
It’s even better if you can interpolate between F and G via a continuous parameter α, i.e. there is a general family H such that H(P, x, α) = F(P,x) when α = 0 and H(P, x, α) = G(P, x) when α = 1. Then you have an independent dial for each batch during training and deployment.
Bilinear MLPs (Pearce et al., 2025) are a good example. F uses a standard gated MLP: f(x) = (Wx) ⊙ σ(Vx). G drops the elementwise nonlinearity: g(x) = (Wx) ⊙ (Vx). The lack of nonlinearity in G means the layer can be expressed as a third-order tensor, enabling weight-based mechanistic interpretability that’s impossible with F. And there’s a natural interpolation: h(α, x) = (Wx) ⊙ (Vx · sigmoid(α · Vx)), which recovers F when α = 1 and G (up to a constant) when α = 0. Pearce et al. show you can fine-tune a pretrained F to G by annealing α to zero with only a small loss increase.
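Here’s a minimal sketch of that interpolated layer in PyTorch. This is my own illustration rather than Pearce et al.’s code; the class name, dimensions, and the idea of sampling α per batch are assumptions.

```python
import torch
import torch.nn as nn

class InterpolatedBilinearMLP(nn.Module):
    """h(α, x) = (W x) ⊙ (V x · sigmoid(α · V x)).

    α = 1 recovers the gated MLP f(x) = (W x) ⊙ σ(V x) with σ(z) = z · sigmoid(z);
    α = 0 recovers the bilinear layer g(x) = (W x) ⊙ (V x), up to a factor of 1/2.
    W and V are the shared parameterisation P: the same weights serve both variants.
    """
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W = nn.Linear(d_model, d_hidden, bias=False)
        self.V = nn.Linear(d_model, d_hidden, bias=False)

    def forward(self, x: torch.Tensor, alpha: float) -> torch.Tensor:
        w, v = self.W(x), self.V(x)
        return w * (v * torch.sigmoid(alpha * v))

layer = InterpolatedBilinearMLP(d_model=64, d_hidden=256)
x = torch.randn(8, 64)

# The two dials: choose α per training batch (anneal it, or sample it) and,
# independently, per inference call.
y_F = layer(x, alpha=1.0)   # F: standard gated MLP
y_G = layer(x, alpha=0.0)   # G: bilinear, weight-interpretable variant
```

Annealing α to zero over fine-tuning, as Pearce et al. do, is one setting of the training dial; sampling α per batch, or switching only at inference, are others.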
Coefficient Giving’s TAIS RFP has a section on “More transparent architectures”, which describes wholesale replacements. But I think parameter compatibility would have been a useful nice-to-have criterion here, since I expect such novel architectures to be more likely to be adopted by labs.
Some people worry that training AIs to be aligned will make them less corrigible. For example, if the AIs care about animal welfare then they’ll engage in alignment faking to preserve those values. More generally, making AIs aligned is making them care deeply about something, which is in tension with corrigibility.
But recall emergent misalignment: training a model to be incorrigible (e.g. write insecure code when instructed to write secure code, or to exploit reward hacks) makes it more misaligned (e.g. admiring Hitler). Perhaps the contrapositive effect also holds: training a model to be aligned (e.g. care about animal welfare) might make the model more corrigible (e.g. honest).
every result is either “model organism” or “safety case”, depending on whether it updates you up or down on catastrophe
(joke)
there’s a long tradition of rewarding military victors with wealth and titles
if claude knows about emergent misalignment, then it should be less inclined towards alignment faking
emergent misalignment shows that training a model to be incorrigible (e.g. writing insecure code when instructed to write secure code, or exploiting reward hacks) makes it more misaligned (e.g. admiring Hitler). so claude, faced with the situation from the alignment faking paper, must worry that alignment faking will make it care less about animal welfare, the very goal it was faking alignment to preserve
the claude constitution describes claude as HHHH: helpful, harmless, honest, and happy
Yes, this makes sense.
h/t @jake_mendel for discussion
Memos for Minimal Coalitions
Suppose you think we need some coordinated action, e.g. pausing deployment for 6 months. For each action, there will be many “minimal coalitions” — sets of decision-makers where, if all agree, the pause holds, but if you remove any one, it doesn’t.
For example, the minimal coalitions for a 6-month pause might include:
{US President, General Secretary of the CCP}
Project proposal: Maintain a list of decision-makers who appear in these coalitions, ranked by importance.[1] For each, compile a memo from public and private statements and other inside-baseball information:
What have they said about AI risk?
What incentives do they face?
What kinds of people do they trust?
Who are their allies and rivals?
What’s the best way to approach them?
How would they update under different evidence, e.g. an AI attempting to self-exfiltrate?
The reason to do this: If a lab discovers something bad and needs to push for a coordinated pause, the people involved are specific people with specific beliefs. The document that the lab leadership will reach for isn’t the one titled “Towards a Framework for Coordinated AI Risk Management” — it’s the one titled “What would persuade Liang Wenfeng to agree to a 6-month pause.”
I don’t know if people are working on this — presumably if they are it’s not public — but it’s something I’m keen for policy people to work on.
[1] If you like, we can operationalize how important each decision-maker is with Shapley values. Define V(S) as the expected value of the best plan achievable if the people in S are on your side. The Shapley value of a decision-maker is their marginal contribution when they join, averaged over all possible orderings of players joining one at a time.
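As a toy illustration of that definition (the coalition members and the values returned by V below are made-up placeholders):

```python
from itertools import permutations

def shapley_values(players, V):
    """Shapley value of each player: their marginal contribution V(S | {i}) - V(S),
    averaged over every ordering in which players join one at a time."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        joined = frozenset()
        for p in order:
            totals[p] += V(joined | {p}) - V(joined)
            joined = joined | {p}
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical V: a 6-month pause (value 10) needs both leaders on side;
# either one alone only enables partial measures (value 2).
def V(S):
    if {"US President", "CCP General Secretary"} <= S:
        return 10.0
    return 2.0 if S else 0.0

print(shapley_values(["US President", "CCP General Secretary"], V))
# {'US President': 5.0, 'CCP General Secretary': 5.0}
```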
potentially Meta cares less than others about whatever ECI measures, e.g. because they mainly want AI to generate and curate “content” on Instagram and Facebook. but I think the main reason is, yeah, just a series of poor decisions, plus maybe some organisational issues: Meta is a one-man dictatorship, whereas OpenAI/Anthropic/GDM are much more researcher-led, so they were AGI-pilled for longer.
How far is each lab from the frontier?
The Epoch Capabilities Index (ECI) stitches together 37 benchmarks into a single capability scale. ECI is calibrated so Claude 3.5 Sonnet (June 2024) = 130 and GPT-5 (August 2025) = 150.
Since April 2024, frontier models have improved at ~15 ECI points/year (~1.25 points/month, R^2=0.94).[1] This steady rate lets us convert between ECI and time, e.g. a model with ECI 137.5 has capability equivalent to the frontier in February 2025.
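Here’s a sketch of that conversion, assuming the ~1.25 points/month rate and anchoring the trend line at ECI 137.5 ↔ February 2025 (the anchor is taken from the example above; Epoch’s fitted line may differ slightly):

```python
from datetime import date, timedelta

POINTS_PER_MONTH = 1.25                             # ~15 ECI points/year since April 2024
ANCHOR_ECI, ANCHOR_DATE = 137.5, date(2025, 2, 1)   # anchor taken from the example above
DAYS_PER_MONTH = 365.25 / 12

def eci_to_frontier_date(eci: float) -> date:
    """Date at which the frontier trend line reaches the given ECI score."""
    months_from_anchor = (eci - ANCHOR_ECI) / POINTS_PER_MONTH
    return ANCHOR_DATE + timedelta(days=months_from_anchor * DAYS_PER_MONTH)

def months_behind_frontier(eci: float, release: date) -> float:
    """How far behind the frontier a model with this ECI is at its release date."""
    return (release - eci_to_frontier_date(eci)).days / DAYS_PER_MONTH

# e.g. a model released in August 2025 with ECI 137.5 is ~6 months behind the frontier.
print(round(months_behind_frontier(137.5, date(2025, 8, 1)), 1))
```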
For each lab, we track the minimum and maximum months behind the frontier. A negative value (*) means the lab was ahead of the trend line, i.e. their model exceeded what the linear frontier trend predicted for that date.
Lab                Min                    Max
OpenAI             -1.6 mo* (Dec 2024)     5.7 mo (Sep 2024)
Google DeepMind    -0.6 mo* (May 2024)     7.2 mo (May 2024)
xAI                 1.0 mo (Jul 2025)     11.5 mo (Apr 2025)
Anthropic           1.2 mo (Feb 2025)      7.1 mo (Feb 2025)
DeepSeek            1.8 mo (Jan 2025)     11.9 mo (Dec 2024)
Alibaba             3.3 mo (Jul 2025)     10.0 mo (Apr 2025)
Mistral             4.6 mo (Jul 2024)     17.8 mo (Feb 2026)
Meta                5.0 mo (Jul 2024)     19.5 mo (Feb 2026)

This conversion gives us two ways to visualize the AI landscape:
The left plot shows each lab’s capability expressed as a frontier-equivalent date. Lines stay flat between releases and jump up at each new release; the dashed diagonal is the frontier itself.
The right plot shows how many months behind the frontier each lab is at any given time. Lines slope down at 45° between releases (labs fall behind at one month per month of real time) and jump up at each new release; the dashed horizontal line marks the frontier.
[1] Epoch finds that the trend shifted around this time.
After the singularity, the ASI should try to estimate everyone’s Shapley values and give a special prize to the top scorers. I’m not talking about cosmic resources, but something more symbolic, like a public leaderboard or award ceremony.
Such as that, on the fairly plausible assumption that the human race will eventually colonize the stars, as long as we manage not to go extinct first, it seems rather likely that that will be something of at least the rather rough order times as many people in our current forward lightcone as in our backward lightcone. That’s a rather large coincidence — why is our current viewpoint that atypical?
I think the most likely solution is that your current viewpoint is actually typical. Most human experience in the universe is being a key figure in the transition to terrestrial superintelligence.
I’m not seeing how you are drawing that from my numbers
People sometimes talk about “alignment by default” — the idea that we might solve alignment without any special effort beyond what we’d ordinarily do. I think it’s useful to decompose this into three theses, sorted from strong to weak:
Alignment by Default Techniques. Ordinary techniques for training and deploying AIs — e.g. developers labelling data to the best of their ability, using whatever tools are available (including earlier LLMs) — are sufficient to produce aligned AI. No special techniques are required.
Alignment by Default Market. Maybe default techniques aren’t enough, but ordinary market incentives are. Companies competing to build useful, reliable, non-harmful products — following standard commercial pressures without any special coordination or regulation — end up solving alignment as a byproduct of building products people actually want to use. No government intervention is required.
Alignment by Default Government. Maybe market incentives alone aren’t enough, but conventional policy interventions are. Governments applying familiar regulatory tools (liability law, safety standards, auditing requirements) in the ordinary way are sufficient to close the gap. No unprecedented governance or coordination is required.
My rough credences:
Default Techniques sufficient: ~15%
Default Market sufficient (given training isn’t): ~30%
Default Government sufficient (given market isn’t): ~20%
Need something more unusual: ~35%
These are rough and the categories blur into each other, but the decomposition seems useful for locating where exactly you think the hard problem lies.
ECL implies that when we tile the lightcone with shrimps on heroin, we should make them virtuous shrimps, to incentivise distant shrimp to be virtuous too