My big concern is: to what extent do capacity-building efforts create pipelines into AI capabilities work, thereby shortening timelines? This happens to at least some extent, but I don’t know how big the effect is.
See Ryan Kidd on How MATS addresses “mass movement building” concerns. One useful-ish framing from that post is that safety is more neglected than capabilities, so you should expect the marginal FTE on safety to add more value than the marginal FTE on capabilities does.
I’ll make up numbers to illustrate the point: let’s say there are 10,000 capabilities researchers and 1,000 AI safety researchers — roughly a 10:1 ratio. Now say MATS funnels 500 people into capabilities and 500 into AI safety. That gives a 10,500:1,500 ratio, or 7:1, which is much better. And assuming 50% of MATS mentees go into capabilities is very pessimistic.
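For concreteness, here is that back-of-the-envelope arithmetic as a tiny script (the figures are the made-up ones above, not real estimates):

```python
# Made-up illustrative numbers from the paragraph above.
capabilities, safety = 10_000, 1_000
print(f"before: {capabilities / safety:.1f} : 1")   # 10.0 : 1

# Pessimistic assumption: the program sends half its people to each side.
capabilities += 500
safety += 500
print(f"after:  {capabilities / safety:.1f} : 1")   # 7.0 : 1
```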
According to the MATS Alumni Impact Analysis, roughly 78% of alumni are doing alignment work:
49% are “Working/interning on AI alignment/control.”
29% are “Conducting alignment research independently.”
1.4% are “Working/interning on AI capabilities.”
If we (perhaps unfairly) ignore the independent researchers, then that’s about 3 going to capabilities for every 100 going to safety!
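The “3 per 100” figure follows directly from the survey percentages quoted above (a quick check, ignoring the independent researchers as the comment does):

```python
# Ratio implied by the survey percentages quoted above.
capabilities_pct = 1.4
safety_pct = 49.0
print(f"~{capabilities_pct / safety_pct * 100:.1f} to capabilities per 100 to safety")  # ~2.9
```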
This also ignores the effects of having capabilities researchers who are more clued up on / familiar with AI safety — you might think this is net-positive (because they are marginally more likely to update on new evidence and pivot to safety) or net-negative (because they can misuse our technical insights). I think this probably shakes out to be net-positive.
But if locally robust alignment makes faster progress on capabilities without being the kind of locally robust alignment that can become asymptotically robust alignment and give us a win node, then it’s not only less obviously good, it might well mostly backfire. I don’t believe the most extreme version of this claim, but I used to, and I know people who do. The “people going into capabilities” category also effectively includes people doing local alignment work that doesn’t generalize to asymptotic alignment.
The naive way to get asymptotic robustness, and possibly the only way, is to have locally robust alignment that reliably stays ahead of the impactfulness of deployment. But if the impactfulness of deployment is always operating at 110% of local robustness, that seems like a recipe for disaster. So the question is ultimately how to stay in the local validity window indefinitely, even as it gets harder to be sure we’ve actually stayed inside it.
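To make the “110%” worry concrete, here is a toy numerical sketch of my own (the growth rates are arbitrary assumptions for illustration; only the 110% figure comes from the comment above): even if local robustness keeps improving, deployment that persistently runs ahead of it leaves an ever-larger slice of impact that nothing vouches for.

```python
# Toy illustration only: assumed multiplicative growth, not anyone's real model.
robustness = 1.0
for gen in range(1, 6):
    robustness *= 1.5              # suppose local robustness keeps improving
    impact = 1.1 * robustness      # deployment always runs at ~110% of it
    unverified = impact - robustness
    print(f"gen {gen}: impact not covered by local robustness = {unverified:.2f}")
# The ratio never closes, and the absolute shortfall keeps growing,
# so we never get back inside the local validity window.
```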
Also, there’s the whole thing about how alignment is fundamentally about figuring out what to ask for, and what it even means to be robust about that is a confusing question at best. Most of MATS seems too focused on the (very hard!) problems of achieving local-alignment-good-enough-to-keep-going to spend the necessary effort to succeed at the really hard parts later (which is not obviously a mistake on MATS’s part, but might well be).
It’s much easier to get reliable robustness for local alignment when you don’t yet have to solve “what should a thing overwhelmingly smarter than us do?” But if there’s a hole coming where a thing moderately smarter than us can confuse us into choosing a plan that seems good to it but is not what we want, we may fall off the manifold we would have wanted to be on, had we been able to see things coming without getting confused.
I now think the net-neutral ratio of new median safety researchers to capabilities researchers is probably closer to 1:100 (though I think 1:10 isn’t crazy). However, the longitudinal MATS career data indicates our alumni are employed in safety vs. capabilities roles at roughly 8:1.
Loosely construed, the foundings of DeepMind, OpenAI, and Anthropic were the result of x-risk capacity-building. If you count those (and it’s not clear that you should), then capacity-building has probably been strongly net negative to date.
I’ve heard variants of this argument, and I overall haven’t found them that persuasive for reasons close to the ones Habryka gives—I think if you carve things up such that capacity-building work has been responsible for e.g. speeding up the creation of frontier AI labs, you should also credit it for the broader movement focused on catastrophic risks, and my intuition is the counterfactual world without that movement would be worse off overall. I don’t buy that the acceleration has been substantial enough that the unaccelerated world would have bought a lot of time for societal improvements useful for addressing catastrophic risks; instead, it feels to me like the unaccelerated world would be facing these risks more blindly and with less time to usefully prepare.
I also think that, given the massive amount of non-GCR-related interest and resources in AI now, the forward-looking acceleration effects are likely to be much smaller than any historical effect. I generally think the ratio of “meaningfully adding to the talent pool of people working on catastrophic risks” to “meaningfully accelerating AI capabilities” for most capacity-building programs will look extremely favorable.
Even if you count those, I don’t think it’s obvious that capacity-building has been strongly net-negative to date, but I do think it’s pretty plausible.
Like, if you were to count the costs as broadly as “all the labs are downstream of capacity-building work”, then you also need to count the benefits just as broadly. A broadly known public track record of having been concerned about these problems for a long time, of being motivated by altruism, of having tried to solve the problem for a long time, and of being one of the few memetic centers in the world that people draw on to figure out what to do about this whole AI situation is quite valuable, possibly more valuable than the acceleration effects of things like DeepMind, OpenAI, and Anthropic.
(that said, my actual take here is that the biggest issue with most capacity-building work is that it actively undermines the things that other capacity-building work has been highly successful at, so that ultimately some capacity building work is predictably extremely good for the world, and some is predictably extremely bad for the world).
Wait what does this mean? Is there some kind of dichotomy I’m not aware of?
Maybe? I am not saying the dichotomy is common knowledge, but I feel pretty confident predicting which capacity-building work will be quite bad in expectation and which will be quite good. (This doesn’t mean there isn’t variance within those categories, with many orgs or people having sign-flipped impact relative to their reference class, but I am happy to register predictions at the class level with reasonably high confidence.)
I would then like to know which is which (DM is okay if you feel that would be somewhat controversial; it’s also alright if you want to keep your opinions to yourself)
Sorry, I am not saying there is a classifier here that is one sentence long. At a high level, “is it largely funneling people into places where the incentives will point towards building more powerful AI systems and/or becoming personally more powerful, or is it putting people into positions where their primary incentives are to help other people make sense of what is going on, with some grounding in the accuracy of their beliefs?” is the best short classifier I have, but I didn’t intend to communicate that there is some super short description of the classifier!
No worries, thanks for elaborating