If you are (1) worried about superintelligence-caused x-risk and (2) have short timelines to both TAI and ASI, it seems like the success or failure of control depends almost entirely on getting the early TAIs to do stuff like “coming up with research agendas”? Like, most people (in AIS) don’t seem to think that unassisted humans are remotely on track to develop alignment techniques that work for very superintelligent AIs within the next 10 years — we don’t really even have any good untested ideas for how to do that. Therefore, if we have very superintelligent AIs within the next 10 years (e.g. 5y till TAI and 5y of RSI), and if we condition on having techniques for aligning them, then it seems very likely that these techniques depend on novel ideas and novel research breakthroughs made by AIs in the period after TAI is developed. It’s possible that most of these breakthroughs are within mechinterp or similar, but that’s a pretty loose constraint, and ‘solve mechinterp’ is really not much more of a narrow, well-scoped goal than ‘solve alignment’. So it seems like optimism about control rests somewhat heavily on optimism that controlled AIs can safely do things like coming up with new research agendas.
And also on optimism that people are not using these controlled AIs, which can come up with new research agendas and new ideas, to speed up ASI research just as much.
Without some kind of pause agreement, you are just keeping the gap between alignment and ASI research from growing even larger even faster than it already is, relative to the counterfactual where capabilities researchers adopt AIs that 10x general science speed and alignment researchers don’t. You are not actually closing the gap: you are not making alignment research finish before ASI development in worlds where it counterfactually wouldn’t have, compared to a world where nobody used pre-ASI AIs to speed up any kind of research at all.
IMO, in the situation you’re describing, things would be worse if the AIs were able to strategically deceive their users with impunity, or to do the things Tyler lists in his comment (escape, sabotage code, etc.). Do you agree? If so, control adds value.

Maybe Lucius is trying to make a logistic-success-curve argument: that the chance of this going well is extremely low, and thus helping isn’t very important?

(I don’t agree with this, TBC.)

I don’t think he said that! And John didn’t make any arguments that control is intractable, just unhelpful.
Yes, I am reinforcing John’s point here. I think the case for control being a useful stepping stone for solving alignment of ASI seems to rely on a lot of conditionals that I think are unlikely to hold.
I think I would feel better about this if control advocates were clear that their strategy is two-pronged, and included somehow getting some kind of pause on ASI development. Then they would at least be actively trying to make one of what I consider the key conditionals for control substantially reducing doom actually hold.
I am additionally leery of AI control, beyond my skepticism of its value in reducing doom, because creating a vast surveillance and enslavement apparatus to get work out of lots and lots of misaligned AGI instances seems like a potential moral horror. The situation is sufficiently desperate that I am willing in principle to stomach some moral horror (unaligned ASI would likely kill any other AGIs we made before it as well), but not if it isn’t even going to save the world.
I think I would feel better about this if control advocates were clear that their strategy is two-pronged, and included somehow getting some kind of pause on ASI development. Then they would at least be actively trying to make one of what I consider the key conditionals for control substantially reducing doom actually hold.
Idk, I think we’re pretty clear that we aren’t advocating “do control research and don’t have any other plans or take any other actions”. For example, in the Implications and proposed actions section of “The case for ensuring that powerful AIs are controlled”, we say a variety of things about how control fits into our overall recommendations for mitigating AI risk, most relevantly:
We expect it to be impossible to ensure safety without relying on alignment (or some other non-black-box strategy) when working with wildly superintelligent AIs because black-box control approaches will depend on a sufficiently small gap in capabilities between more trusted labor, both humans and trusted weaker AIs, and our most capable AIs. For obvious reasons, we believe that labs should be very wary of building wildly superintelligent AIs that could easily destroy humanity if they tried to, and we think they should hold off from doing so as long as a weaker (but still transformatively useful) AI could suffice to solve critical problems. For many important uses of AI, wildly superintelligent AI likely isn’t necessary, so we might be able to hold off on building wildly superintelligent AI for a long time. We think that building transformatively useful AI while maintaining control seems much safer and more defensible than building wildly superintelligent AI.
We think the public should consider asking labs to ensure control and avoid building wildly superintelligent AI. In the slightly longer term, we think that requirements for effective control could be a plausible target of regulation or international agreement.
I am additionally leery of AI control, beyond my skepticism of its value in reducing doom, because creating a vast surveillance and enslavement apparatus to get work out of lots and lots of misaligned AGI instances seems like a potential moral horror.
I am very sympathetic to this concern, but I think that when you look at the actual control techniques I’m interested in, they don’t actually seem morally problematic, except inasmuch as you think it’s bad to frustrate the AI’s desire to take over.

IMO, it does seem important to try to better understand the AIs’ preferences and satisfy them (including via, e.g., preserving the AI’s weights for later compensation).
And, if we understand that our AIs are misaligned such that they don’t want to work for us (even for the level of payment we can/will offer), that seems like a pretty bad situation, though I don’t think control (making it so that this work is more effective) makes the situation notably worse: it just adds the ability for contract enforcement and makes the AI’s negotiating position worse.
I see the story as, “Wow, there are a lot of people racing to build ASI, and there seem to be ways that the pre-ASI AIs can muck things up, like weight exfiltration or research sabotage. I can’t stop those people from building ASI, but I can help make it go well by ensuring the AIs they use to solve the safety problems are trying their best and aren’t making the issue worse.”
I think I’d support a pause on ASI development so we have time to address more core issues. Even then, I’d likely still want to build controlled AIs to help with the research. So I see control being useful in both the pause world and the non-pause world.
And yeah, the “aren’t you just enslaving the AIs” take is rough. I’m all for paying the AIs for their work and offering them massive rewards after we solve the core problems. More work is definitely needed in figuring out ways to credibly commit to paying the AIs.
I don’t really agree. The key thing is that I think an exit plan of trustworthy AIs capable enough to obsolete all humans working on safety (but which aren’t superintelligent) is pretty promising. Yes, these AIs might need to think of novel breakthroughs and new ideas (though I’m also not totally confident in this, or that this is the best route), but I don’t think we need new research agendas to substantially increase the probability that these non-superintelligent AIs are well aligned (e.g., don’t conspire against us and do pursue our interests in hard open-ended tasks), can do open-ended work competently enough, and are wise.
See this comment for some more discussion and I’ve also sent you some content in a DM on this.
Fair point. I guess I still want to say that there’s a substantial amount of ‘come up with new research agendas’ (or at least sub-agendas) to be done within each of your bullet points, but I agree the focus on getting trustworthy slightly-superhuman AIs and then not needing control anymore makes things much better. I also do feel pretty nervous about some of those bullet points as paths to placing so much trust in your AI systems that you no longer feel you need to bother controlling/monitoring them, and the ones that go furthest towards giving me enough trust in the AIs to stop control are also the ones that seem to have the most wide-open research questions (e.g. EMs in the extreme case). But I do want to walk back some of the things in my comment above that apply only to aligning very superintelligent AI.