Suppose that we ended up with AIs that were ~perfectly aligned (to what the company/project that trained these AIs wanted) which had capabilities that dominate top human experts in AI R&D, AI safety research, philosophy, and most digital research fields in general. These AIs are sufficiently aligned that they try hard to do the right thing even on hard-to-check, conceptual, and open-ended tasks. Let’s say these AIs aren’t much more capable than needed for this capability bar (and the capability profile works out such that these AIs aren’t qualitatively wildly superhuman as seems likely to me).
These AIs are sufficiently aligned and capable that they dominate humans at doing safety work and have better epistemics than groups of human experts.
I think the situation would probably be fine if the AI company tasked these AIs with proceeding with AI development with 20% of resources spent on alignment/safety of future models and 80% on advancing capabilities.
I claim that either you have to think that takeoff is very fast or that alignment is very difficult to think things are likely to go wrong given this situation:
At the point of handoff, you maybe have ~3 million AI labor years / year and each instance maybe runs at effectively 50x speed. (Obviously the error bars here are very wide.) 20% goes to alignment.
So, holding this level of capability fixed, in 2 months, you get the equivalent of ~8 years of work from 10k people at the level of top human experts. In 6 months, you get 25 years of work from 10k people. (A rough version of this arithmetic is sketched below.) This work occurs with pretty limited compute per person-year, but anything purely conceptual/theoretical can be worked on for a very long time.
But, also capabilities are improving over this period, so you actually get substantially more labor than this, as long as the level of alignment can be preserved.
At some point, AIs end up scheming if you haven’t done enough additional work by this point. But, it seems pretty likely you’d still be safe after enough AI progress that it’s as though the AIs are 5x faster (taking into account qualitative intelligence improvements, I just mean that the AIs are as productive as roughly 5x faster versions of our prior AIs).
So, within a short period (e.g. 2 months) after this additional AI progress (such that it’s as though the AIs are 5x faster), you have an insane amount of alignment work done.
You could just think takeoff is very fast, or that alignment is very compute bottlenecked.
These AIs might also advise different actions than an 80/20 split, to be clear! Like trying to buy lead time to spend on alignment.
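For concreteness, here is a minimal sketch (in Python) of the back-of-the-envelope labor calculation above. The inputs are the figures assumed in the bullets (3 million AI labor-years per calendar year, a 20% alignment share, a 10k-person reference team of top human experts), not measured quantities; the raw arithmetic gives ~10 and ~30 years, so the ~8 and ~25 quoted above presumably fold in some additional discount.

```python
# Back-of-the-envelope check of the labor numbers above. All inputs are the
# assumptions from the discussion, not measured quantities.

TOTAL_AI_LABOR_YEARS_PER_YEAR = 3_000_000  # assumed total AI labor-years per calendar year
ALIGNMENT_SHARE = 0.20                     # fraction of that labor spent on alignment/safety
REFERENCE_TEAM_SIZE = 10_000               # size of the human expert team used for comparison

def equivalent_team_years(calendar_months: float) -> float:
    """Years of work by the reference team delivered in `calendar_months` of wall-clock time."""
    alignment_labor_years = (TOTAL_AI_LABOR_YEARS_PER_YEAR * ALIGNMENT_SHARE
                             * calendar_months / 12)
    return alignment_labor_years / REFERENCE_TEAM_SIZE

for months in (2, 6):
    print(f"{months} months -> ~{equivalent_team_years(months):.0f} years of work "
          f"from {REFERENCE_TEAM_SIZE:,} people")
# Prints ~10 years for 2 months and ~30 years for 6 months.
```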
This overall makes me pretty optimistic about scenarios where we reach this level of alignment in these not-yet-ASI-level systems, which sounds like a clear disagreement with your perspective. I don’t think this is all of the disagreement, but it might drive a bunch of it.
(To be clear, I think this level of alignment could totally fail to happen, but we seem to disagree even given this!)
I think my response heavily depends on the operationalization of alignment for the initial AIs, and I’m struggling to keep things from becoming circular in my decomposition of various operationalizations. The crude response is that you’re begging the question here by first positing aligned AIs, but I think your position is that techniques which are likely to descend from current techniques could work well-enough for roughly human-level systems, and that’s where I encounter this sense of circularity.
I think there’s a better-specified (from my end; you’re doing great) version of this conversation that focuses on three different categories of techniques, based on the capability level at which we expect each to be effective:
1. Current model-level
2. Useful autonomous AI researcher level
3. Superintelligence
However, I think that disambiguating between proposed agendas for 2 + 3 is very hard, and assuming agendas that plausibly work for 1 also work for 2 is a mistake. It’s not clear to me why the ‘it’s a god, it fucks you, there’s nothing you can do about that’ concerns don’t apply for models capable of:
hard-to-check, conceptual, and open-ended tasks
I feel pretty good about this exchange if you want to leave it here, btw! Probably I’ll keep engaging far beyond the point at which it’s especially useful (although we’re likely pretty far from the point where it stops being useful to me rn).
Ok, so it sounds like your view is “indeed if we got ~totally aligned AIs capable of fully automating safety work (but not notably more capable than the bare minimum requirement for this), we’d probably be fine (even if there is still a small fraction of effort spent on safety) and the crux is earlier than this”.
Is this right? If so, it seems notable if the problem can be mostly reduced to sufficiently aligning (still very capable) human-ish level AIs and handing off to these systems (which don’t have the scariest properties of an ASI from an alignment perspective).
I think your position is that techniques which are likely to descend from current techniques could work well-enough for roughly human-level systems, and that’s where I encounter this sense of circularity.
I’d say my position is more like:
Scheming might just not happen: It’s basically a toss up whether systems at this level of capability would end up scheming “by default” (as in, without active effort researching preventing scheming and just work motivated by commercial utility along the way). Maybe I’m at ~40% scheming for such systems, though the details alter my view a lot.
The rest of the problem if we assume no scheming doesn’t obviously seem that hard: It’s unclear how hard it will be to make non-scheming AIs of the capability level discussed above be sufficiently aligned for the strong sense of alignment I discussed above. I think it’s unlikely that the default course gets us there, but it seems pretty plausible to me that modest effort along the way does. It just requires some favorable generalization of the sort that doesn’t seem that surprising and we’ll have some AI labor along the way to help. And, for this part of the problem, we totally can get multiple tries and study things pretty directly with empiricism using behavioral tests (though we’re still depending on some cleverness and transfer as we can’t directly verify the things we ultimately want the AI to do).
Further prosaic effort seems helpful for both avoiding scheming and the rest of the problem: I don’t see strong arguments for thinking that, at the level of capability we’re discussing, scheming will be intractable to prosaic methods or experimentation. I can see why this might happen and I can certainly imagine worlds where no one really tries. Similarly, I don’t see a strong argument that further effort on relatively straightforward methods can’t help a bunch in getting you sufficiently aligned systems (supposing they aren’t scheming): we can measure what we want somewhat well with a bunch of effort and I can imagine many things which could make a pretty big difference (again, this isn’t to say that this effort will happen in practice).
This isn’t to say that I can’t imagine worlds where pretty high effort and well-orchestrated prosaic iteration totally fails. This seems totally plausible, especially given how fast this might happen, so risks seem high. And, it’s easy for me to imagine ways the world could be such that relatively prosaic methods and iteration are ~doomed without much more time than we can plausibly hope for; it’s just that these seem somewhat unlikely in aggregate to me.
So, I’d be pretty skeptical of someone claiming that the risk of this type of approach would be <3% (without at the very least preserving the optionality for a long pause during takeoff depending on empirical evidence), but I don’t see a case for thinking “it would be very surprising or wild if prosaic iteration sufficed”.