I’m not sure we’re worrying about the same regimes.
The regime I’m most worried about is:
AI systems which are much smarter than the smartest humans
...
It’s unclear to me whether the authors are discussing alignment in a regime like the one above, or a regime like “LLMs which are not much smarter than the smartest humans.” (I too am very optimistic about remaining safe in this latter regime.)
...
The AI Optimists don’t make this argument AFAICT, but I think optimism about effectively utilizing “human level” models should transfer to a considerable amount of optimism about smarter-than-human models, due to the potential for using these “human level” systems to develop considerably better safety technology (e.g. alignment research). AIs might have structural advantages (speed, cost, and standardization) which make it possible to heavily accelerate R&D[1] even at around qualitatively “human level” capabilities. (That said, my overall view is that even if we had the exact human capability profile while also having ML structural advantages, these systems would themselves pose substantial (e.g. 15%) catastrophic misalignment x-risk on the “default” trajectory, because we’ll want to run extremely large numbers of these systems at high speeds.)
The idea of using human level models like this has a bunch of important caveats which mean you shouldn’t end up being extremely optimistic overall IMO[2]:
It’s not clear that “human level” will be a good description at any point. AIs might be way smarter than humans in some domains while way dumber in other domains. This can cause the oversight issues mentioned in the parent comment to manifest prior to massive acceleration of alignment research. (In practice, I’m moderately optimistic here.)
Is massive effective acceleration enough? We need safety technology to keep up with capabilities, and capabilities might also be accelerated. There is the potential for arbitrarily scalable approaches to safety which should make us somewhat more optimistic. But, it might end up being the case that to avoid catastrophe from AIs which are one step smarter than humans we need the equivalent of having the 300 best safety researchers work for 500 years, and we won’t have enough acceleration and delay to manage this. (In practice I’m somewhat optimistic here so long as we can get a 1-3 year delay at a critical point; see the rough back-of-envelope sketch after this list.)
Will “human level” systems be sufficiently controlled to get enough useful work? Even if systems could hypothetically be very useful, it might be hard to quickly get them actually doing useful work (particularly in fuzzy domains like alignment). This objection holds even if we aren’t worried about catastrophic misalignment risk.

[1] At least R&D which isn’t very limited by physical processes.

[2] I think <1% doom seems too optimistic without more of a story for how we’re going to handle superhuman models.
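To make the “is massive effective acceleration enough?” question concrete, here is a minimal back-of-envelope sketch. The only figures taken from the comment above are the “300 best safety researchers for 500 years” requirement and the 1-3 year delay; the copy count, speedup, and quality discount are purely illustrative assumptions.

```python
# Back-of-envelope: can accelerated AI labor match a large, fixed research
# requirement within a short delay window? All workforce parameters below
# are illustrative assumptions, not claims from the thread.

required_researcher_years = 300 * 500  # "300 best safety researchers for 500 years"

num_ai_copies = 10_000     # assumed parallel "human level" AI researchers
speedup_per_copy = 10      # assumed serial speed multiple relative to a human
quality_discount = 0.5     # assumed output per copy vs. a top safety researcher

for delay_years in (1, 2, 3):  # the "1-3 year delay at a critical point"
    effective = num_ai_copies * speedup_per_copy * quality_discount * delay_years
    print(f"delay={delay_years}y: {effective:,.0f} effective researcher-years "
          f"({effective / required_researcher_years:.2f}x the requirement)")
```

Under these particular numbers the requirement is only met at the top of the 1-3 year range, which is the point of the caveat: whether acceleration is “enough” depends almost entirely on parameters we don’t yet know.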
Plans that rely on aligned AGIs working on alignment faster than humans would need to ensure that no AGIs work on anything else in the meantime. The reason humans have no time to develop alignment of superintelligence is that other humans develop misaligned superintelligence faster. Similarly, by default, very fast AGIs working on alignment end up having to compete with very fast AGIs working on other things that lead to misaligned superintelligence. Preventing aligned AGIs from building misaligned superintelligence is not clearly more manageable than preventing humans from building AGIs.
Plans that rely on aligned AGIs working on alignment faster than humans would need to ensure that no AGIs work on anything else in the meantime.
This isn’t true. It could be that making an arbitrarily scalable solution to alignment takes X cognitive resources and in practice building an uncontrollably powerful AI takes Y cognitive resources with X < Y.
(Also, this plan doesn’t necessarily require aligning “human level” AIs, just being able to get work out of them with sufficiently high productivity and low danger.)
I’m being a bit simplistic. The point is that it needs to stop being a losing or a close race, and all runners getting faster doesn’t obviously help with that problem. I guess there is some refactor-vs-rewrite feel to the distinction between the project of stopping humans from building AGIs right now, and the project of getting the first AGIs to work on alignment and global security in a post-AGI world before other AGIs overshadow such work. The former has near-term, concrete difficulties; the latter has nebulous difficulties that don’t as readily jump to attention. The whole problem is messiness and lack of coordination, so starting from scratch with AGIs seems more promising than reforming human society. But without strong coordination on the development and deployment of the first AGIs, the situation with the activities of AGIs is going to be just as messy and uncoordinated, only unfolding much faster, and that’s not even counting the risk of getting a superintelligence right away.
I’m on the Optimists Discord and I do make the above argument explicitly in this presentation (e.g. slide 4): Reasons for optimism about superalignment (though, fwiw, I don’t know if I’d go all the way down to 1% p(doom), but I have probably updated from something like 10% to <5%, and most of my uncertainty now comes from the governance / misuse side).
On your points ‘Is massive effective acceleration enough?’ and ‘Will “human level” systems be sufficiently controlled to get enough useful work?’: I think that, conditional on aligned-enough ~human-level automated alignment RAs, the answers are very likely yes, because it should be possible to get a very large amount of work out of those systems even in a very brief amount of time, e.g. a couple of months (feasible with a coordinated pause, or even with a sufficient lead). See e.g. slides 9 and 10 of the above presentation (and I’ll note that this argument isn’t new; it’s been made in broadly similar forms by e.g. Ajeya Cotra, Lukas Finnveden, and Jacob Steinhardt).
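As a rough sanity check on the “a couple of months” claim, the sketch below inverts the earlier calculation and asks how many wall-clock months of automated labor would be needed to deliver a given amount of research. The target and workforce parameters are again illustrative assumptions, not figures from the thread.

```python
# How many wall-clock months would a hypothetical automated-alignment workforce
# need to deliver a target amount of research? Parameters are illustrative.

target_researcher_years = 10_000   # assumed amount of useful alignment work
num_ai_copies = 10_000             # assumed parallel ~human-level alignment RAs
speedup_per_copy = 10              # assumed serial speed multiple vs. a human
quality_discount = 0.5             # assumed output per copy vs. a human researcher

researcher_years_per_year = num_ai_copies * speedup_per_copy * quality_discount
months_needed = 12 * target_researcher_years / researcher_years_per_year
print(f"~{months_needed:.1f} months of wall-clock time")  # ~2.4 months here
```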
I’m generally reasonably optimistic about using human level-ish systems to do a ton of useful work while simultaneously avoiding most risk from these systems. But, I think this requires substantial effort and won’t clearly go well by default.
Have you had any p(doom) updates since then or is it still around 5%?

Mostly the same, perhaps a minor positive update on the technical side (basically, from systems getting somewhat stronger, e.g. closer to automating AI safety research, while still not showing very dangerous capabilities like ASL-3 or prerequisites to scheming). My views are even more uncertain / unstable on the governance side though, which probably makes my overall p(doom) (including e.g. stable totalitarianism, s-risks, etc.) more like 20% than 5% (I was probably mostly intuitively thinking of extinction risk only when giving the 5% figure a year ago; overall my median probably hasn’t changed much, but I have more variance, coming from the governance side).
If it’s not a big ask, I’d really like to know your views on the control-by-power-hungry-humans side of AI risk.
For example, the first company to create intent-aligned AGI would be wielding incredible power over the rest of us. I don’t think I could trust any of the current leading AI labs to use that power fairly. I don’t think such a lab would voluntarily decide to give up control over it either (intuitively, it would take quite something for anyone to give up such a source of power). Is there anything that can be done to prevent such a scenario?
I’m very uncertain and feel somewhat out of depth on this. I do have quite some hope though from arguments like those in https://aiprospects.substack.com/p/paretotopian-goal-alignment.