I’m broadly interested in AI strategy and want to figure out the most effective interventions to get good AI outcomes.
Thomas Larsen
Some meandering thoughts on alignment
A nearcast of how we might go about solving alignment using basic current techniques, assuming little/no substantive government intervention, is:
During the beginning of takeoff, we do control, attempting to prevent catastrophic actions (e.g. major rogue internal deployments (RIDs)), while trying to elicit huge amounts of AI labour.
At some point, the value of scaling AI capabilities while trying to maintain control will be very low, because the main bottleneck on elicitation of useful AI labour is human oversight. If you are pausing the AI all the time to wait for the humans to understand what the AIs are doing, it doesn’t help to make the AI smarter. At this point AI companies pause / slow down as much as possible (e.g. unilaterally or via coordinating with other companies), but this isn’t for very long. During the pause, try to build an AI system that you would trust to manage the intelligence explosion.
Handoff to the AI system, i.e. allow the AI to manage the training run, conduct large, open-ended experiments, etc., that humans cannot effectively oversee. The AIs should be corrigible and attempt to ensure that their successors are corrigible, and should still check in with humans as much as possible post-handoff, keeping them informed about the situation. At this point, it’s correct to hand off to the AIs because the alternative—scaling to vastly superhuman AI without letting the human-level / slightly superhuman AIs do a huge amount of research—is even more dangerous.
My best guess (overly detailed story) for how this plan goes if you try to implement it in the real world is:
High-stakes control basically works (during steps 1-2). We prevent AIs from doing major RIDs or substantially subverting safety / human oversight.
Elicitation will totally fail on open-ended, hard-to-check tasks (like preparadigmatic agent foundations research, etc.). We’ll get some useful work on the more prosaic agendas, e.g. misalignment analogies, control, making RL less hackable, etc.
We’ll iterate towards a handoff MVP that looks reasonable. It’ll be much nicer-seeming than current models; it’ll have a better understanding of morality and of what we mean by corrigibility, it’ll be more adversarially robust, etc. And it won’t be deceptively aligned, i.e. during training, it won’t be thinking that it should act aligned in training in order to achieve long-term power. But it will be playing the training game to some extent: it’ll be thinking about the human oversight process, it’ll be very aware it’s in training for the vast majority of training because most RLVR envs will obviously be training envs, etc. Then the leading AI company will (hopefully deliberately) give up on control in order to let their AIs do more research.
After this, three things could happen:
The AIs actually manage the situation well. They aren’t perfectly aligned, but they are able to manage the situation with their successors. One way this might go: the human-level AIs, which aren’t smart enough to solve scalable alignment, do control, except way better than biological humans could, because they have dramatically more auditing capacity. Then, using the AIs they are controlling that are somewhat smarter than the best humans, they make vast progress on mech interp / GOFAI / agent foundations / uploads / some other alignment approach.
The AIs were adversarially misaligned. As soon as the AIs became smart enough to subvert the control measures without the humans noticing, they did so successfully (and perhaps they also tried earlier and got caught, but alas, race dynamics and motivated reasoning led people to continue).
The AIs try to align their successors but fail. Probably the AIs will have extremely little time and be under huge pressure from their leadership to continue the intelligence explosion. Under these circumstances, they might do motivated reasoning, or they might be basically corrigible and just follow the instructions of the AI company. Also, one option in this category is that the AIs technically weren’t “adversarially misaligned” when you handed off to them, they genuinely were nice. But the distribution shift of “now they can take over the world if they want to, get way more thinking time, know all of that, and the AIs get much smarter” is enough to break the corrigibility of the original model. But it’s not really clear what the distinction is between the original model we handed off to and its successor, so I’m just including this as one category.
#3 seems like the most plausible outcome to me right now, but I don’t think 1 or 2 are implausible; conditional on the above nearcast, I’m something like 25/25/50% on 1/2/3 respectively. However, the split between failure from adversarial misalignment vs incompetence depends dramatically on when you hand off; earlier handoff leads to higher risk from incompetence → future misaligned AIs, while later handoff leads to higher risk from misalignment (and, to be clear, it’s not just that there’s a tradeoff here; it may be that there is no point at which we can hand off with current alignment techniques + progress generated between now and then).
Maybe the biggest missing piece for me, when people tell the above story for how this goes well, is the low-stakes / elicitation side (even though I’m very compelled by the high-stakes / catastrophe-prevention side of control in the human range). The main techniques that I know about for elicitation are:
Just make the models generally smarter. This works great until you get into the regime where the AIs are adversarially misaligned, and then they start sandbagging (intentionally giving you garbage work). To mitigate the sandbagging we’ll need another technique.
Train the models on labeled human data of good conceptual reasoning / research taste. We don’t have a ton of examples of this, it’s off policy, etc.
Use held-out examples of good human alignment research / open-ended, taste-load-bearing research as a validation set, and then iterate against that.
Something something debate
Humans sampling a small fraction of the research and looking into it in depth to validate. (A big worry with this is that for most research, especially paradigm-shifting research, there are long serial chains of reasoning that are only understandable after understanding the precursors, e.g., how basically any math textbook works. This makes me way more optimistic about elicitation in domains where humans can factor all the work into independent subcomponents, which seems much more true for things like “build uploads” than “solve alignment”.)
Relatedly, a huge question in my mind is the extent to which we’ll need paradigm-shifting alignment research to prevent adversarial misalignment before the handoff capability threshold. I pretty strongly believe that we’ll need these techniques before superintelligence, but I don’t see a strong reason for adversarial misalignment emerging at a particular capability level (though “IDK, probably somewhere in the human range” seems like a pretty good default guess). If adversarial misalignment emerges before handoff, the elicitation problem becomes more load-bearing because the AIs will probably be sandbagging/sabotaging research. And if it isn’t mitigatable with prosaic techniques, you also need to elicit paradigm-shifting research work out of the AIs, which seems really tough.
On your argument that object-level reasoning obsoletes base rates: This is somewhat circular. You have inside views about what it means to reason well about AI progress, and superforecasters disagree. You’re partially bootstrapping from your own beliefs to dismiss theirs.
Oops, “object level reasoning obsoletes base rates” is not what I was trying to argue… my view is that the action is mostly in selecting the right base rate, i.e. that AI is more analogous to a new species than a normal technology.
Also I don’t agree that it’s circular. I think one of the correct reasons to defer to someone is them making correct arguments (as evaluated by my inside view), and that doesn’t apply. I definitely agree that I’m bootstrapping from my views to dismiss theirs. Now, there might be other reasons to defer to someone (for example, the other reasons I gave above), but I was arguing specifically against reason #3 above here.
Thanks, I hadn’t looked through those before.
https://goodjudgment.io/AI/Question_4_High-Impact_Failures.html jumped out to me:
Seems… obviously crazy?
I think that the views of superforecasters on AI / AI risk should provide basically no update.
It seems to me like the main reasons to defer to someone are:
They have a visibly good track record on the relevant domain. It has to be the literal domain, because people often have good views on their area of expertise, but crazy views elsewhere.
They are highly selected for having good beliefs in the domain. For example, if a mathematician tells me something that seems surprising about their area of expertise, I will tend to strongly believe them, despite not being able to evaluate their reasoning. The general reason for this is that mathematics is a verifiable domain, so mathematicians are strongly selected for being correct about math. Other domains I’d basically defer to people in are historians about literal historical facts, physicists about well-established physics results, engineers about how cars work, etc. This consideration weakens as disciplines become less verifiable: I’m not very inclined to defer to philosophers, sociologists, psychologists, etc.
They make correct arguments about the domain (and very few incorrect arguments). If it’s the case that you can talk to someone and they can consistently make clear rock-solid arguments that change your mind regularly, it is justified to defer to them on bottom line conclusions, even if you can’t follow the arguments all the way through.
They are much smarter than you and are probably being honest. If someone (or, eventually, an AI) is clearly much smarter than you, and they are being honest (e.g. because they seem like an honest person), then you should probably defer to them substantially. (Of course, this isn’t even fully general, e.g. a few hundred years ago, many of the smartest people around were superstitious, which would have led you astray.)
Now I’ll go through and argue why these don’t apply.
I think the track record of superforecasters on AI looks quite bad. Superforecasters consistently and massively underestimate AI progress on benchmarks; see, e.g., here, here, here, and here. On open-ended forecasting competitions, e.g. AI 2025, the top people (whom I recognized, which is a biased sample) all seemed like AI risk domain experts, not superforecasters (though I’m also not sure if any superforecasters participated). You might object that AI benchmarks don’t track real-world impact, and that the real-world impact was much lower. I would doubt that superforecasters would have made reasonable predictions of real-world impact (such as revenue) over the last few years of AI progress, but I’m not aware of any systematic predictions made by superforecasters on real-world impacts. Yes, superforecasters are often good at making geopolitical forecasts, but in practice this doesn’t seem to transfer well to the domain that I care about.
I think the selection effect for being a superforecaster is nowhere near as strong as for a mathematician / physicist, and I think this evidence gets swamped by the above observations of how well they actually seem to be doing in the domain, which looks quite poor.
I’ve talked with some superforecasters about AI, but not many. Generally, my sense is that they are smart and reasonable people, but don’t know very much about AI. It’s hard to make this argument in a reasoning-transparent way, but if you are someone in AI who thinks that it’s appropriate to defer to superforecasters, I think it would be a good idea to try to set up a meeting with one of the people you are deferring to and see if they are actually making reasonable arguments that seem grounded in technical reality. In my experience, the main arguments I’ve encountered are outside-view-y considerations like “the base rate of extinction from new tech is low”, which get obsoleted by strong object-level arguments (e.g. for the abnormality of superintelligence and how it’s more analogous to a new, smarter species).
I don’t think the selection effect for being a superforecaster on general intelligence is strong enough for this argument to apply any more than I should defer to e.g. someone who got a 1600 on the SAT.
I also think that it would be better if people deferred less in general, because I think that group epistemics go much better if each person in the group attempts to understand the situation as well as possible themselves. If everyone constantly tries to update on everyone else’s views, then (i) there’s way less novel intellectual thought, (ii) there are deference cascades / groupthink, and (iii) people end up with beliefs that are far less crisp / clearly justified than those of the person who originated the belief (because communication is hard to do at high fidelity). I generally think people should communicate based on their own inside view, without deference, even if they expect the average correctness of the things they are saying to go down, because the information value of the communication will be much higher than the alternative.
Thanks! I had read that post when it came out but hadn’t remembered it also drew this distinction.
Sorry for the slow response. I wrote up some of my thoughts on scheming here: https://www.lesswrong.com/posts/q8fdFZSdpruAYkhZi/thomas-larsen-s-shortform?commentId=P8GTDD5CLMxr9tczv
Scheming seems like an unnatural concept to me. I think we can do better. (Note: many/most of these thoughts are not original to me.)
Scheming is typically used as a binary, i.e. “is the AI scheming”, whereas the typical human usage of the word scheming is much more continuous. It’s not very useful to group humans into “schemers” vs “non-schemers”; most people attempt to achieve goals to some extent, and sometimes this involves deceiving other people.
Joe Carlsmith uses the following definition of scheming: a schemer is an AI which “plays the training game, without intrinsically caring about reward-on-the-episode, in order to get power for itself or other AIs”. This definition refers entirely to what the AI does in training, not evaluation or deployment, and seems very similar to “deceptive alignment”.
I think it’s plausible that AIs start misbehaving in the scary way during deployment without “scheming” according to the Carlsmith definition. A central way this might happen is that the AI is given longer to think during deployment than during training, puts the pieces together about wanting to gain power, and hence wants to explicitly subvert human oversight. Carlsmith’s definition of scheming thus ends up pretty far from the normal English meaning of the word schemer, which would prototypically include the above example.
I do think there is something important to talk about here, which scheming is trying to get at, and which I’ll try to lay out below.
It seems very important to distinguish the type of misalignment where the AI is deliberately thinking about and trying to undermine human oversight and safeguards, from all other types of misalignment, where the AI might happen to take misaligned actions without deliberately trying to obfuscate human understanding that they made a mistake.
Unlike “schemer”, I think “insider threat” draws a crisper dividing line for humans. Most employees of most companies are kinda normal people doing their jobs without trying to subvert the overall goals of the company. Of course, there are constantly myopic power struggles between people in many companies that don’t serve the best interests of the company, and many employees do something “schemey” during corporate politics in order to get what they want to happen. But they don’t have a hidden long-term objective that comes strongly apart from the interests of their employer, like literal spies would.
I think it’s unclear whether the standard use of scheming within the alignment community applies to the myopic powerseeking that managers in giant corporations do. These people are not “insider threats” with respect to the company; they just have some motivations which run at cross purposes to it, but aren’t systematically attempting to undermine it.
I think this suggests a more crisp concept here to describe AIs, which I like to call “adversarial misalignment”, which is basically “are the AIs acting like insider threats or not?”. This is different from “misalignment” more broadly, because I would say that the current batch of models (e.g. Opus 4.6) are misaligned in a weaker sense, i.e. they aren’t consistently honest with me and they often don’t try as hard as I would like them to on many tasks, despite them being capable of succeeding on those tasks. I think in practice people use “scheming” along these lines in many cases.
Adversarial misalignment wouldn’t include myopic powerseeking during deployment that wasn’t in service of a long run goal, didn’t entail also sabotaging oversight, etc, and so in the human case, wouldn’t include normal employees doing some amount of corporate politics.
This concept is different from Carlsmith’s notion of scheming because it’s not primarily about training, it’s primarily about deployment, though it also seems quite correlated with AIs intentionally acting nice during training/eval in order to get power later. I think we should call this (intentionally acting nice during training/eval in order to get power later) “deceptive alignment”.
Here are my largest disagreements with AI 2027.
I think the timelines are plausible but solidly on the shorter end; I think the exact AI 2027 timeline to fully automating AI R&D is around my 12th percentile outcome. So the timeline is plausible to me (in fact, similarly plausible to my views at the time of writing), but substantially faster than my median scenario (which would be something like early 2030s).
I think that the AI behaviour after the AIs are superhuman is a little wonky and, in particular, undersells how crazy wildly superhuman AI will be. I expect the takeoff to be extremely fast after we get AIs that are better than the best humans at everything, i.e., within a few months of AIs that are broadly superhuman, we have AIs that are wildly superhuman. I think wildly superhuman AIs would be somewhat more transformative, more quickly, than AI 2027 depicts. I think the exact dynamics aren’t possible to predict, but I expect craziness along the lines of (i) nanotechnology, leading to things like the biosphere being consumed by tiny self-replicating robots which double at speeds similar to the fastest biological doubling times (between hours (amoebas) and months (rabbits)), and (ii) extremely superhuman persuasion and political maneuvering, sufficient to let the AI steer policy to a substantially greater extent than it did in AI 2027. In AI 2027, the AI gained enough political power to prevent humans from interfering with the ongoing intelligence and industrial explosion (which the humans were basically on track to allow anyways), whereas my best guess is that the AI would gain enough political power to do de facto whatever it wanted, and would therefore consolidate power faster (and not keep up the charade of humans being in charge for a period of several years). I also think there are many unknown unknowns downstream of ASI which are really hard to account for in a scenario like AI 2027, but nonetheless are likely to change the picture a lot.
I think the slowdown ending is a bit unrealistic: I think it’s unlikely (e.g. maybe 40%) that we get a slowdown of a few months, and I think it’s unlikely that a few-month slowdown is sufficient to avoid misaligned AI takeover (e.g. maybe 30%). I do think a small slowdown would reduce risk, and so is good on the margin; I just don’t think it’s enough. I also don’t think the slowdown ending is what we should be aiming for (and AI 2027 doesn’t claim that).
I have various other small disagreements with AI 2027, but, overall, I stand by the bulk of it. I continue to think that the main contribution was sketching out a plausible concrete scenario for how AI takeoff could go where all the parts fit together and it makes sense end-to-end, and I continue to think that the picture is essentially correct. I think the overall sketch is much more important than the specific dates and numbers, which, as we’ve said many times, are just one hyper-specific way things could go. I continue to expect something basically like AI 2027 to happen, except with the above updates and with lots of additional, hard-to-predict texture and details occurring along the way.
I agree that AI successor-alignment is probably easier than the human AI alignment problem.
One additional difficulty for the AIs is that they need to solve the alignment problem in a way that humans won’t notice/understand (or else the humans could take the alignment solution and use it for themselves / shut down the AIs). During the regime before human obsolescence, if we do a reasonable job at control, I think it’ll be hard for them to pull that off.
Also, the above isn’t even mentioning bio x-risk mediated by humans, or by trailing AIs during the chaos of takeoff. My guess is those risks are substantially lower, e.g. maybe 1% and 2% respectively; again, I don’t feel confident.
Note: These are all rough numbers, I’d expect I’d shift substantially about all of this on further debate.
Suppose we made humanity completely robust to biorisk, i.e. we did sufficient preparation such that the risk of bio catastrophe (including AI mediated biocatastrophe) was basically 0.[1] How much would this reduce total x-risk?
The basic story for any specific takeover path not mattering much is that the AIs, conditional on wanting to take over, will self-improve until they find the next easiest takeover path and do that instead. I think that this is persuasive but not fully, because:
AIs need to worry about their own alignment problem, meaning that they may not be able to self improve in an unconstrained fashion. We can break down the possibilities into (i) the AIs are aligned with their successors (either by default or via alignment being pretty easy), (ii) the AIs are misaligned with their successors but they execute a values handshake, or (iii) the AIs are misaligned with their successors (and they don’t solve this problem or do a values handshake). At the point of full automation of the AI R&D process (which I currently think of as the point at which AIs become more useful than humans for making AI progress, i.e., if we remove all AIs, progress slows by more than if we remove the humans), conditional on the AIs being misaligned, I currently think the relative likelihood of (i), (ii) and (iii) is 1:1:3, and the probability flows from (iii) into (ii) and (i) as the AIs get smarter.[2]
Raising the capability threshold that the AIs need to takeover gives humanity more time. During this time, we might make alignment progress or notice that the AIs are misaligned and slow down / pause / add mitigations.
So it might be important for misaligned AIs to attempt a takeover early in the intelligence explosion. Specifically, we can ask “how much x-risk is averted if the probability of misaligned AI takeover before TED AI goes to 0?”, which attempts to capture all the worlds in which AIs attempt to take over before TED AI. I think my overall risk reduction is something like 1/4. In other words, there’s something like a 3/4 chance the AIs lurk (and can goal-guard or align successors), or can backdoor successors, or control their successors.
Now, conditional on the pre-TED AIs attempting a takeover, what are the different routes that they might use? The most salient options to me are:
Bioweapon, then outcompete rest of humanity (e.g. via robots or human cultists)
Political operation / persuasion to quickly capture labs / govts (sufficient to prevent them from building other AIs while retaining your ability to foom in the future, once you’ve solved alignment, and to prevent safety research from happening that could expose you).
Macro-scale robot uprising (e.g. with drones / humanoid robots).
Nanotech
Other exotic tech / unknown unknowns.
I think the bio path seems to me to be the most compelling path here by a fair amount; it maybe gets another 1/3 of the probability of this outcome. So, just from the risk of pre-TED AIs attempting a takeover, we have something like 1/3 * 1/4 = 1/12 probability. If you multiply that by my likelihood of AI takeover, which is around 70%, you get ~6% risk flowing from this route. Then, I update up to ~8% from other AIs, e.g. post-TED AIs relying on biorisk as a route to takeover.
So my overall view on how much x-risk flows through bio-catastrophe is around 8%.
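A minimal sketch of the arithmetic above (the inputs are just the rough guesses stated in this post, not measurements):

```python
# Rough restatement of the estimate above; all inputs are my own rough guesses.
p_takeover = 0.70        # overall P(misaligned AI takeover)
p_early_attempt = 1 / 4  # share of takeover risk from pre-TED AIs attempting takeover early
p_bio_route = 1 / 3      # conditional on an early attempt, probability the route is bio

risk_pre_ted_bio = p_takeover * p_early_attempt * p_bio_route
print(f"risk via pre-TED bio takeover: {risk_pre_ted_bio:.1%}")  # ~5.8%

extra_post_ted = 0.02    # rough bump for post-TED AIs also relying on bio as a takeover route
print(f"total x-risk through bio-catastrophe: ~{risk_pre_ted_bio + extra_post_ted:.0%}")  # ~8%
```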
[1] Note that what exactly counts as a bio x-risk is slightly unclear, e.g. at some point the AIs can build drones / nanotech to get into the bio-bunkers, and it’s unclear what counts.
[2] This breakdown isn’t exhaustive, because another salient possibility is that the AIs are clueless, e.g., they are misaligned with their successors but don’t realize it, similar to Agent 3 in AI 2027.
I think that people overrate Bayesian reasoning and underrate “figuring out the right ontology”.
Most good thinking, IMO, happens by finding and using a good ontology for thinking about some situation, not by probabilistic calculation. When I learned calculus, for example, it wasn’t mostly that I had uncertainty over a bunch of logical statements which I then strongly updated on when learning new theorems; it was instead that I learned a bunch of new concepts, which I then applied to reason about the world.
I think AI safety generally has much better concepts for thinking about the future of AI than other communities do, and this is a key source of alpha we have. But there are obviously still a huge number of disagreements remaining within AI safety. I would guess that debates would be more productive if we more explicitly focused on the ontology/framing that each side is using to reason about the situation, and then discussed to what extent that framing captures the dynamics we think are important.
I think it would be good if more people said things like “I think that’s a bad concept, because it obscures consideration X, which is important for thinking about the situation”.
Here are some widely used concepts I think are bad and I wish became less load bearing in AI safety discourse:
“Fast” and “slow” takeoff; takeoff speeds in general. I think these concepts are very unclear and not super natural. There are various operationalizations of these (e.g. Paul’s “slow takeoff” = a GDP doubling over the course of the 4 years before the first single year in which GDP doubles). This is obviously arbitrary, and I don’t see why worlds that meet this definition are worth reasoning about separately from worlds that don’t. I also think it’s easy to smuggle in lots of other correlations here, like slow takeoff = people are woken up, or slow takeoff = alignment is much easier, etc.
I feel more excited about talking about things like “I think milestone X will happen at date Y”, or “My median is that milestones X and Z are Y time apart”. For example, I think the concepts of automating coding, automating all AI research, automating ~the whole economy, and increasing Earth’s energy output 1000x are all useful capability milestones, and it’s useful to talk about these.
“scheming”. I think the definition of scheming is pretty unclear, and changes a bunch depending on the context.
Under some definitions I believe that the AIs are always going to be scheming; under others, it seems kind of narrow and unnatural.
I somewhat prefer the concepts from the “alignment over time” box in AI 2027.
“gradual disempowerment”
I think that this conflates a bunch of scenarios / threat models together, some of which don’t make sense, some of which do make sense, but I don’t think that the solutions are very related.
Gradual disempowerment is often presented as not requiring any misalignment on the part of the AIs. But if there are AIs that are aligned with their human principals, we get a situation where, even without any coordination, AIs compete on behalf of their principals, and then give the principals whatever surplus they are able to produce.
Given that framing, we can now talk about specific threat models. For example, maybe there is no surplus: warfare/competition eats away all the additional resources, and space is consumed purely by the optimal self-replicators. Alternatively, maybe the AIs weren’t actually acting in the interests of humanity. Finally, maybe the process of competing hard was existentially catastrophic early on, e.g., maybe it resulted in the oceans being boiled (and humans didn’t take appropriate countermeasures), resulting in extinction.
Note: I’m not saying that all concepts that I think are bad can be saved or crystallized into a good concept; often the best idea is to just forget about the original concept and use better/crisper concepts.
I should also maybe give some concepts I think are generally good: p(doom) from AI takeover, timelines, x-risk, s-risk, recursive self improvement, internal/external deployment, AI control, AI alignment. I don’t think any of these concepts are fully crisp, all have somewhat important edge cases, but I think they are good enough abstractions to be very useful for thinking about the future.
Relatedly, I’m also interested in the simple method of extrapolating AI revenue growth trends until AI revenue is most of the world economy. That seems like a decent proxy for when AGI will be achieved. I trust this method less than our model for obvious reasons, but I still put some weight on it. What does it say? Well, it says “Early 2030s.” OK.
I’m curious why you trust revenue extrapolation less than the model. Intuitively, revenue seems like a better thing to extrapolate than benchmarks or FLOPs or whatever, because it’s much less gameable and there’s a much clearer threshold for AGI (revenue is a similar size to GDP).
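To illustrate the kind of extrapolation I have in mind, here’s a minimal sketch; the starting revenue, growth multiple, and world-GDP threshold are all illustrative assumptions rather than measured figures, and the answer is quite sensitive to the assumed growth rate:

```python
import math

# Illustrative assumptions (not real data): annual AI revenue of ~$20B growing ~3x/year,
# with "AGI" proxied by AI revenue reaching the scale of world GDP (~$100T/year).
start_year = 2025
ai_revenue = 2e10          # $/year, hypothetical starting point
growth_multiple = 3.0      # hypothetical sustained year-over-year growth
world_gdp = 1e14           # $/year, rough scale of the world economy

years_to_agi = math.log(world_gdp / ai_revenue) / math.log(growth_multiple)
print(f"AI revenue reaches world-GDP scale around {start_year + years_to_agi:.0f}")  # early 2030s
```

With a 2x/year growth multiple instead, the same calculation lands in the late 2030s, which is part of why the choice of trend window matters a lot here.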
Agree with what Habryka said. Also, Daniel, I, and other AIFP people care about being cooperative and would update on feedback. If anyone is interested in giving feedback on our new scenario about a positive vision post-AGI (about either the content, or the name/branding), please email me.
Also to reiterate: AI 2027 was obviously not a confident prediction of AGI in 2027; it was a scenario in which AGI happened in 2027, which seems like a plausible and IMO ~modal timeline, and we clearly stated this on the website.
I think that to ban ASI you’d need to ban something like AGI because of intelligence explosion dynamics, so it’s not clear it makes a big difference.
>proper global UBI is *enormously* expensive (h/t @yelizarovanna)
This seems wrong. There will be huge amounts of wealth post-ASI. Even a relatively small UBI (e.g. 1% of AI companies) will be enough to support way better QOL for everyone on earth. Moreover, everything will become way cheaper because of efficiency gains downstream of AI. Even just at AGI, I think it’s plausible that physical labour is something like 10x cheaper and cognitive labour is something like 1000x cheaper.
So, Korea and Japan are building over 100x as much ship per worker-year as the US
Can you spell out why you think this? Do Korea and Japan produce ~100x more ships than the US?
I think a central consideration should be what can actually be enforced. I.e., if it were true that ASI could be built with 1e20 FLOPs / ~27 H100-hours (which I think is possible), this would be very impractical to enforce, and so we should build our plan around not having to enforce that.
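(As a sanity check on that equivalence, assuming roughly that one H100 delivers on the order of 1e15 FLOP/s of dense low-precision compute:)

```python
flops_needed = 1e20         # the hypothetical "ASI from 1e20 FLOPs" scenario above
h100_flops_per_sec = 1e15   # ~order-of-magnitude dense BF16 throughput of one H100 (assumption)
hours = flops_needed / h100_flops_per_sec / 3600
print(f"~{hours:.0f} H100-hours")  # ~28 H100-hours, consistent with the ~27 figure above
```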
Thanks for writing this paper.
Why do we need to halt for so long? In short, AI alignment is probably a difficult technical problem, and it is hard to be confident about solutions. Pausing for a substantial period gives humanity time to be careful in this domain rather than rushing. Pausing for a shorter amount of time (e.g., 5 years) might reduce risk substantially compared to the current race, but it also might not be enough. In general, world leaders should weigh the likelihood and consequence of different risks and benefits against each other for different lengths of a pause. Section 2 discusses some of the reasons why the AI alignment problem may be difficult. Generally, experts vary in their estimates of the difficulty of this problem and the likelihood of catastrophe, with some expecting the problem to be very hard [Grace et al., 2025, ControlAI, 2025, Wikipedia, 2025]. Given this uncertainty about how difficult this problem is, we should prepare to pause for a long time, in case more effort is needed. Our agreement would allow for a long halt, even if world leaders later came to believe a shorter one was acceptable. We also contend that there are other problems which need to be addressed during a halt even if one presumes that alignment can be quickly solved, and these problems are also of an uncertain difficulty. These include risks of power concentration, human misuse of AIs, mass-unemployment, and many more. World leaders will likely want at least years to understand and address these problems. The international agreement proposed in this paper is primarily motivated by risks from AI misalignment, but there are numerous other risks that it would also help reduce.
I agree with a lot of this, but I do think this paper equivocates a bit between “we need to halt for decades” and “we might need to halt for decades”. I agree with the latter but not the former.
I also think that in the cases where alignment is solvable sooner, then it might matter a lot that we accelerated alignment in the meantime.
I get that it’s scary to have to try to bifurcate alignment and capabilities progress because governments are bad at stuff, but I think it’s a mistake to ban AI research, because it will have very negative consequences for the rate of AI alignment research. I think that we should try hard to figure out what can be done safely (e.g. via things like control evals), and then do alignment work on models that we can empirically study that are as capable as possible while incurring minimal risks.
Serial time isn’t the only input that matters: smarter AIs are helpful as research assistants and as subjects to run experiments on directly, having lots of compute to do alignment experiments is nice, and having lots of money and talent going into AI alignment is helpful. I think you guys should think about, and state more clearly, the function you are trying to maximize (i.e. how much do you really care about marginal serial time vs marginal serial time with smart AIs to do experiments on).
Hypothesis: alignment-related properties of an ML model will be mostly determined by the part(s) of training that were most responsible for capabilities.
If you take a very smart AI model with arbitrary goals/values and train it to output any particular sequence of tokens using SFT, it’ll almost certainly work. So can we align an arbitrary model by training it to say “I’m a nice chatbot, I wouldn’t cause any existential risk, … ”? Seems like obviously not, because the model will just learn the domain-specific / shallow property of outputting those particular tokens in that particular situation.
On the other hand, if you train an AI model from the ground up with a hypothetical “perfect reward function” that always gives correct ratings to the behaviour of the AI (and you trained on a distribution of tasks similar to the one you are deploying it on), then I would guess that this AI, at least until around the human range, will behaviorally basically act according to the reward function.
A related intuition pump for the difference is the effect of training someone to say “I care about X” by punishing them until they say it consistently, vs raising them consistently with a large value set / ideology over time. For example, students are sometimes forced to write “I won’t do X” or “I will do Y” 100 times, and usually this doesn’t work at all. Similarly, randomly taking a single ethics class during high school usually doesn’t cause people to enduringly act according to their stated favorite moral theory. However, raising your child Catholic, taking them to Catholic school, taking them to church, taking them to Sunday school, and constantly talking to them about the importance of Catholic morality is in practice fairly likely to make them a pretty robust Catholic.
There are maybe two factors being conflated above: (1) the fraction of training / upbringing focused on goal X, and (2) the extent to which training on goal X was what produced the capabilities. The reason why I think (2) is a more important / better explanation than (1) is that the heuristics actually driving the long-horizon goal-directed behaviors of the model are probably going to be whatever arises from the long-horizon goal-directed capabilities training.
Regardless, there’s some sort of spectrum from deep to shallow alignment training for ML models / humans, ranging across:
idealized RL training with a perfect reward function that’s used to train the model in all circumstances
raising a human to consistently care about some set of values their parents have, constantly bringing it up / rewarding good behaviour according to them
High school ethics class
One-off writing tasks of “I won’t do X”
I think that current alignment techniques seem closest to high school ethics classes in their depth, because the vast majority of training is extremely unrelated to ethics / alignment / morality (like high school). Training is mostly RLVR on coding/math/etc. or pretraining, plus a bit of alignment training on the side. I’d feel more confident about the alignment sticking if training were closer to what a parent highly focused on raising an ethical child would do, and would start to feel pretty good about the situation if most of the ways that the AI learned capabilities were downstream of a good feedback signal (though I want to think about this a bit more).