A typical crux is that I think we can increase our chances of “real alignment” using prosaic and relatively unenlightened ML research without any deep understanding.
I both think:
We can significantly accelerate prosaic ML safety research (e.g., of the sort people are doing today) using AIs that are importantly limited.
Prosaic ML safety research can be very helpful for increasing the chance of “real alignment” for AIs that we hand off to. (At least when this research is well executed and has access to powerful AIs to experiment on.)
This top level post is part of Josh’s argument for (2).
Yep, this is the third crux, I think. Perhaps the most important.
To me it looks like you’re making a wild guess that “prosaic and relatively unenlightened ML research” is a very large fraction of the necessary work for solving alignment, without any justification that I know of?
For all the pathways to solving alignment that I am aware of, this is clearly false. I think if you know of a pathway that mostly just involves “prosaic and relatively unenlightened ML research”, you should write out this plan and why you expect it to work, and then ask OpenPhil to throw a billion dollars toward every available ML-research-capable human to do this work right now. Surely it’d be better to get started already?
I don’t think “what is the necessary work for solving alignment” is a frame I really buy. My perspective on alignment is more like:
Avoiding egregious misalignment (where AIs intentionally act in ways that make our tests highly misleading or do pretty obviously unintended/dangerous actions) reduces risk once AIs are otherwise dangerous.
Additionally, we will likely need to hand over most near-term decision-making and most near-term labor to some AI systems at some point. This going well very likely requires being able to avoid egregious misalignment (in systems capable enough to obsolete us) and also requires some other stuff.
There is a bunch of “prosaic and relatively unenlightened ML research” which can make egregious misalignment much less likely and can resolve other problems needed for handover.
Much of this work is much easier once you already have powerful AIs to experiment on.
The risk reduction will depend on the amount of effort put in and the quality of the execution etc.
The total quantity of risk reduction is unclear, but seems substantial to me. I’d guess takeover risk goes from 50% to 5% if you do a very good job at executing on huge amounts of prosaic and relatively unenlightened ML research at the relevant time. (This requires more miscellaneous conceptual work, but nothing that requires deep understanding per se.)
I think if you know of a pathway that mostly just involves “prosaic and relatively unenlightened ML research”, you should write out this plan and why you expect it to work, and then ask OpenPhil to throw a billion dollars toward every available ML-research-capable human to do this work right now. Surely it’d be better to get started already?
I think my perspective is more like “here’s a long list of stuff which would help”. Some of this is readily doable to work on in advance and should be worked on, and some is harder to work on.
This work isn’t extremely easy to verify or scale up (such that I don’t think “throw a billion dollars at it” just works), though I’m excited for a bunch more work on this stuff. (“Relatively unenlightened” doesn’t mean “trivial to get the ML community to work on this using money”, and I also think that getting the ML community to work on things effectively is probably substantially harder than getting AIs to work on things effectively.) Note that I also think that control involves “prosaic and relatively unenlightened ML research” and I’m excited about scaling up research on this, but this doesn’t imply that what should happen is “OpenPhil throws a billion dollars toward every available ML-research-capable human to do this work right now”.
I’ve DM’d you my current draft doc on this, though it may be incomprehensible.
Thanks, I appreciate the draft. I see why it’s not plausible to get started on now, since much of it depends on having AGIs or proto-AGIs to play with.
I guess I shouldn’t respond too much in public until you’ve published the doc, but:
If I’m interpreting correctly, a number of the things you intend to try involve having a misaligned (but controlled) proto-AGI run experiments that involve training (or otherwise messing with in some way) an AGI. I hope you have some empathy for the internal screaming I have toward this category of things.
A bunch of the ideas do seem reasonable to want to try (given that you had AGIs to play with, and were very confident that doing so wouldn’t allow them to escape or otherwise gain influence). I am sympathetic to the various ideas that involve gaining understanding of how to influence goals better by training in various ways.
There are chunks of these ideas that definitely aren’t “prosaic and relatively unenlightened ML research”, and involve very-high-trust security stuff or non-trivial epistemic work.
I’d be a little more sympathetic to these kinda desperate last-minute things if I had no hope of literally just understanding how to build task-AGI properly, in a well-understood way. We can do this now. I’m baffled that almost all of the EA-alignment-sphere has given up on even trying to do this. From talking to people this weekend, this shift seems downstream of thinking that we can make AGIs do alignment work, without thinking this through in detail.
The total quantity of risk reduction is unclear, but seems substantial to me. I’d guess takeover risk goes from 50% to 5% if you do a very good job at executing on huge amounts of prosaic and relatively unenlightened ML research at the relevant time
Agree it’s unclear. I think the chance of most of the ideas being helpful depends on some variables that we don’t clearly know yet. I think a 90% risk reduction can’t be right, because there’s a lot of correlation between each of the things working or failing. And a lot of the risk comes from imperfect execution of the control scheme, which adds on top.
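To make the correlation point concrete, here’s a toy calculation (purely illustrative numbers, not anyone’s actual estimates): if each measure failed independently, stacking several would shrink residual risk geometrically, but if a shared failure mode (say, a form of deception that defeats every measure at once) carries even a modest chunk of probability, the residual risk is floored near that chunk and you can’t get a 90% reduction from stacking alone.

```python
# Toy illustration of correlated vs. independent failures -- made-up numbers only.
baseline_risk = 0.50        # assumed takeover risk with no extra measures
per_measure_failure = 0.40  # assumed chance any single measure fails
n_measures = 5

# If failures were independent, residual risk shrinks geometrically.
independent_residual = baseline_risk * per_measure_failure ** n_measures

# If a shared failure mode defeats every measure at once, only the
# remaining probability mass shrinks geometrically.
shared_failure = 0.20
correlated_residual = baseline_risk * (
    shared_failure + (1 - shared_failure) * per_measure_failure ** n_measures
)

print(f"independent: {independent_residual:.3f}")  # ~0.005 (>99% reduction)
print(f"correlated:  {correlated_residual:.3f}")   # ~0.104 (~79% reduction)
```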
One underlying intuition that I want to express: the world where we are making proto-AGIs run all these experiments is pure chaos, politically and epistemically and with all the work we need to do. I think pushing toward this chaotic world is much worse than other worlds we could push for right now.
But if I thought control was likely to work very well and saw a much more plausible path to alignment among the “stuff to try”, I’d think it was a reasonable strategy.
I also think that getting the ML community to work on things effectively is probably substantially harder than getting AIs to work on things effectively
On some axes, but won’t there also be axes where AIs are more difficult than humans? Sycophancy and slop being the most salient; misalignment issues being another.
This work isn’t extremely easy to verify or scale up (such that I don’t think “throw a billion dollars at it” just works),
This makes sense now. But I think this line should make you worry about whether you can make controlled AIs do it.
On some axes, but won’t there also be axes where AIs are more difficult than humans? Sycophancy and slop being the most salient; misalignment issues being another.
Yes, I just meant on net. (Relative to the current ML community and given a similar fraction of resources to spend on AI compute.)
It’s not entirely clear to me that the math works out for AIs being helpful on net relative to humans just doing it, because of the supervision required, and the trust and misalignment issues.
But on this question (for AIs that are just capable of “prosaic and relatively unenlightened ML research”) it feels like shot-in-the-dark guesses. It’s very unclear to me what is and isn’t possible.
I certainly agree it isn’t clear, just my current best guess.
Have you published this doc? If so, which one is it? If not, may I see it?