One framing that I think might be helpful for thinking about “Plan A” vs “shut it all down” is: “Suppose that you have the political will for an n-year slowdown, i.e. after n years, you are forced to hand off trust to superhuman AI systems (e.g. for n = 5, 10, 30). What should the capability progression throughout the slowdown be?” This framing forces a focus on the exit condition / plan for doing the handoff, which I think is an underdiscussed weakness of the “shut it all down” plan.
I think my gut reaction is that the most important considerations are: (i) there are a lot of useful things you can do with the AIs, so I want more time with the smarter AIs, and (ii) I want to scale through the dangerous capability range slowly and with slack (as opposed to at the end of the slowdown).
This makes me think that, particularly for a shorter slowdown (e.g. 5 years), you want to go fast at the beginning (e.g. scale to ~max controllable AI over the first year or two), and then elicit lots of work out of those AIs for the rest of the time period.
A key concern for the above plan is that govts/labs botch the measurement of “max controllable AI”, and scale too far.
But it’s not clear to me how a further delay helps with this, unless you have a plan for making the institutions better over time, or pursuing a less risky path (e.g. ignoring ML and doing human intelligence augmentation).
Going slower, on the other hand, definitely does help, but requires not shutting it all down.
More generally, it seems good to do something like “extend takeoff evenly by a factor of n”, as opposed to something like “pause for n-1 years, and then do a 1-year takeoff” (there’s a toy numerical sketch of this comparison at the end of this comment).
I am sympathetic to “shut it all down and go for human augmentation”: I do think this reduces AI takeover risk a lot, but it requires a very long pause, and it requires our institutions to bet big on a very unpopular technology. I think that convincing governments to “shut it all down” without any exit strategy at all seems quite difficult as well.
Ofc this framing also ignores some important considerations, e.g. choices about the capability progression affect both the difficulty of enforcement/verification (in both directions: AI lie detectors/AI verification are helpful, while having AIs closer to the edge is a downside), and willingness to pay over time (e.g. scary demos or AI for epistemics might help increase WTP).
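A minimal toy sketch of the “extend evenly” vs “pause then sprint” comparison above; the linear capability curve, the 1-year baseline takeoff, and the 0.8 “dangerous range” threshold are all made-up illustrative assumptions rather than anything claimed in this comment:

```python
# Toy model (illustrative numbers only): how long do we spend in a "dangerous
# capability range" under two ways of spending an n-year slowdown?

def years_in_dangerous_range(n, danger_threshold=0.8, baseline_takeoff_years=1.0):
    """Assume a baseline takeoff where capability rises linearly from 0 to 1 over
    `baseline_takeoff_years`, and call capability >= danger_threshold "dangerous".
    Returns (even_extension_years, pause_then_sprint_years)."""
    # "Extend takeoff evenly by a factor of n": the same curve stretched n times,
    # so the dangerous tail also lasts n times longer.
    even_extension = (1.0 - danger_threshold) * baseline_takeoff_years * n
    # "Pause for n-1 years, then do a 1-year takeoff": the dangerous tail is exactly
    # as short as it would have been with no slowdown at all.
    pause_then_sprint = (1.0 - danger_threshold) * baseline_takeoff_years
    return even_extension, pause_then_sprint

for n in (5, 10, 30):
    even, sprint = years_in_dangerous_range(n)
    print(f"n={n:>2}: even extension ~{even:.1f} years in the dangerous range, "
          f"pause-then-sprint ~{sprint:.1f} years")
```

Under these toy assumptions, the even extension buys n times as much time (and slack) in the riskiest part of the curve, which is what point (ii) above is asking for.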
This framing feels reasonable-ish, with some caveats.[1]
I am assuming we’re starting the question at the first stage where either “shut it down” or “have a strong degree of control over global takeoff” becomes plausibly politically viable. (i.e. assume early stages of Shut It Down and Controlled Takeoff both include various partial measures that are more immediately viable and don’t give you the ability to steer capability-growth that hard)
But, once it becomes a serious question “how quickly should we progress through capabilities?”, then one thing to flag is that it’s not like you know “we get 5 years, therefore, we want to proceed through those years at X rate.” It’s “we seem to have this amount of buy-in currently...”, and the amount of buy-in could change (positively or negatively).
Some random thoughts on things that seem important:
I would want to do at least some early global pause on large training runs, to check if you are actually capable of doing that at all. (in conjunction with some efforts attempting to build international goodwill about it)
One of the more important things to do, as soon as it’s viable, is to stop production of more compute in an uncontrolled fashion. (I’m guessing this plays out with some kind of pork deals for nVidia and other leaders[2], where the early steps are ‘consolidate compute’, and then they produce chips that are more monitorable, which they get to make money from, but which are also sort of nationalized.) This prevents a big overhang.
Before I did a rapid growth of capabilities, I would want a globally set target of “we are able to make some kind of interpretability strides or evals that make us better able to predict the outcome of the next training run.”
If it’s not viable to do that, well, then we don’t. (But then we’re not really having a real convo about how slow the takeoff should ideally be, just riding the same incentive wave we’re currently riding with slightly more steering.) ((We can instead have a convo about how to best steer given various murky conditions, which seems like a real important convo; I’m just responding here to this comment’s framing.))[3]
If we reach a point where humanity has demonstrated the capability to “stop training on purpose, stop uncontrolled compute production, and noticeably improve our ability to predict the next training run”, then I’m not obviously opposed to doing relatively rapid advancement. But it’s not obviously better to do “rapid to the edge” than “do one round where there are predictions/incentives/prizes somehow for people to accurately predict how the next training rounds go, then evaluate that, then do it again.”
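A rough sketch of what one “predict, evaluate, then do it again” round could look like, written as Python pseudocode; the `collect_forecasts` / `train_and_evaluate` hooks, the scoring rule, and the 0.8 accuracy bar are hypothetical stand-ins rather than anything proposed above:

```python
# Hypothetical gating loop: each larger training run has to be "earned" by
# accurate advance predictions about the run before it.

def score_forecasts(forecasts, results, tolerance=0.1):
    """Placeholder scoring rule: fraction of eval metrics predicted within
    `tolerance` (relative) of the measured value."""
    hits = sum(abs(forecasts[k] - results[k]) <= tolerance * max(abs(results[k]), 1e-9)
               for k in results)
    return hits / len(results)

def scale_with_prediction_gates(run_sizes, collect_forecasts, train_and_evaluate,
                                required_accuracy=0.8):
    """run_sizes: increasing compute budgets. Stop scaling as soon as forecasters
    failed to predict the previous run well enough."""
    last_score = None
    for size in run_sizes:
        if last_score is not None and last_score < required_accuracy:
            print(f"Halting before the {size} run: forecast accuracy "
                  f"{last_score:.2f} < {required_accuracy}")
            return
        forecasts = collect_forecasts(size)   # predictions registered before the run
        results = train_and_evaluate(size)    # actually train and run the evals
        last_score = score_forecasts(forecasts, results)
        print(f"Run {size}: forecast accuracy {last_score:.2f}")
```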
I think there’s at least some confusion where people are imagining the simplest/dumbest version of Shut It Down, and imagining “Plan A” is nuanced and complicated. I think the actual draft treaty has levers that are approximately the same levers you’d want to do this sort of controlled takeoff.
I’m not sure how powerful nVidia is as an interest group. Maybe it is important to avoid them getting a deal like this so they’re less of an interest group with power at the negotiating table.
FYI my “Ray detects some political bs motivations in himself” alarm is tripping as I write this paragraph. It currently seems right to me but let me know if I’m missing something here.
I think I mostly am on board with this comment. Some thoughts:
Before I did a rapid growth of capabilities, I would want a globally set target of “we are able to make some kind of interpretability strides or evals that make us better able to predict the outcome of the next training run.”
This feels a bit overly binary to me. I think that understanding-based safety cases will be necessary for ASI, but behavioral methods seem like they might be sufficient beforehand.
I don’t know what you mean by “rapid growth”. It seems like you might be imagining the “shut it all down → solve alignment during pause → rapidly scale after you’ve solved alignment” plan. I think we probably should never do a “rapid scaleup”.
Another reaction I have is that a constraint on coordination will probably be “is the other guy doing a blacksite which will screw us over?”. So I think there’s a viability bump at the point of “allow legal capabilities scaling at least as fast as the max-size blacksite that you would have a hard time detecting”.
I would want to do at least some early global pause on large training runs, to check if you are actually capable of doing that at all. (in conjunction with some efforts attempting to build international goodwill about it)
So I think this paragraph isn’t really right, because “slowdown” != “pause”, and slowdowns might still be really really helpful and enough to get you a long way.
One of the more important things to do, as soon as it’s viable, is to stop production of more compute in an uncontrolled fashion. (I’m guessing this plays out with some kind of pork deals for nVidia and other leaders[2], where the early steps are ‘consolidate compute’, and then they produce chips that are more monitorable, which they get to make money from, but which are also sort of nationalized.) This prevents a big overhang.
I actually currently think that you want to accelerate compute production, because hardware scaling seems safer than software scaling. I’m not sure exactly what you mean by “in an uncontrolled fashion”. If you mean “have a bunch of inspectors making sure new chips aren’t being smuggled to illegal projects”, then I agree with this; on my initial read I thought you meant something like “pause chip production until they start producing GPUs with HEMs in them”, which I think is probably bad.
In other words I think that you want to create a big compute overhang during a pause. The downside is obvious, but the upsides are:
compute is controllable, far more than software, and so differentially advances legal projects.
more compute for safety: we want to be able to pay a big safety tax, and more compute straightforwardly helps.
extra compute progress funges against software progress, which is scarier.
compute is destroyable (e.g. we can reverse and destroy compute, if we want to eat an overhang), but software progress mostly isn’t (you can’t unpublish research).
(this comment might be confusing because I typed it quickly, happy to clarify if you want)
So I think this paragraph isn’t really right, because “slowdown” != “pause”, and slowdowns might still be really really helpful and enough to get you a long way.
I think “everyone agrees to a noticeably smaller next-run-size” seems like a fine thing to do as the first coordination attempt.
I think there is something good about having an early step (maybe after that one), which somehow forces people to actually orient on “okay, suppose we actually had to prioritize interpretability and evals now until they were able to keep pace with capabilities, how would we seriously do that?”
(I don’t currently have a good operationalization of this that seems robust, but it seems plausible that by the time we’re meaningfully able to decide to do anything like this, someone may have come up with a good policy with that effect. I can definitely see this backfiring and causing people to get better at some kind of software that is then harder to control.)
I actually currently think that you want to accelerate compute production, because hardware scaling seems safer than software scaling. I’m not sure exactly what you mean by “in an uncontrolled fashion”. If you mean “have a bunch of inspectors making sure new chips aren’t being smuggled to illegal projects”, then I agree with this; on my initial read I thought you meant something like “pause chip production until they start producing GPUs with HEMs in them”, which I think is probably bad.
In other words I think that you want to create a big compute overhang during a pause. The downside is obvious, but the upsides are:
compute is controllable, far more than software, and so differentially advances legal projects.
more compute for safety: we want to be able to pay a big safety tax, and more compute straightforwardly helps.
extra compute progress funges against software progress, which is scarier.
compute is destroyable (e.g. we can reverse and destroy compute, if we want to eat an overhang), but software progress mostly isn’t (you can’t unpublish research).
Mmm, nod, I can see it. I’d need to think more to figure out a considered opinion on this, but it seems a priori reasonable.
I think one of the things I want is to have executed each type of control you might want to exert, at least for a shorter period of time, to test whether you’re able to do it at all. But I’d want the early compute steps to be more focused on “they have remote-shutdown options but can continue production”, or at least a policy-level “there are enforcers sitting outside the compute centers who could choose to forcibly shut it down fairly quickly”.