Thanks, I appreciate the draft. I see why it isn’t something that can plausibly be started on now, since much of it depends on having AGIs or proto-AGIs to play with.
I guess I shouldn’t respond too much in public until you’ve published the doc, but:
If I’m interpreting correctly, a number of the things you intend to try involve having a misaligned (but controlled) proto-AGI run experiments that involve training (or otherwise messing with) an AGI. I hope you have some empathy for the internal screaming I have toward this category of things.
A bunch of the ideas do seem reasonable to want to try (assuming you had AGIs to play with, and were very confident that doing so wouldn’t allow them to escape or otherwise gain influence). I am sympathetic to the various ideas that involve building a better understanding of how to influence goals through different kinds of training.
There are chunks of these ideas that definitely aren’t “prosaic and relatively unenlightened ML research”, and involve very-high-trust security stuff or non-trivial epistemic work.
I’d be a little more sympathetic to these kinda desperate last-minute things if I had no hope of literally just understanding how to build task-AGI properly, in a well-understood way. We can do this now. I’m baffled that almost all of the EA-alignment-sphere has given up on even trying to do this. From talking to people this weekend, this shift seems downstream of thinking that we can make AGIs do alignment work, without thinking this through in detail.
The total quantity of risk reduction is unclear, but seems substantial to me. I’d guess takeover risk goes from 50% to 5% if you do a very good job at executing on huge amounts of prosaic and relatively unenlightened ML research at the relevant time.
Agree it’s unclear. I think the chance of most of the ideas being helpful depends on some variables that we don’t clearly know yet. I don’t think a 90% risk reduction can be right, because there’s a lot of correlation between the individual things working or failing (see the toy sketch below). And a lot of the risk comes from imperfect execution of the control scheme, which adds on top.
One underlying intuition that I want to express: the world where we are making proto-AGIs run all these experiments is pure chaos, politically, epistemically, and in terms of all the work we need to do. I think pushing toward this chaotic world is much worse than other worlds we could push for right now.
But if I thought control was likely to work very well and saw a much more plausible path to alignment among the “stuff to try”, I’d think it was a reasonable strategy.
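To make the correlation point above concrete, here is a toy arithmetic sketch (my own illustration; every number and parameter name below is a made-up placeholder, not anything from the draft or this thread). The idea is that if the individual research bets all lean on one shared assumption, stacking more bets stops buying much, and imperfect execution of the control scheme adds risk on top.

```python
# Toy model (illustrative only; all numbers are invented): takeover risk after
# layering several research "bets" that share one common failure mode.

def residual_takeover_risk(baseline=0.5, n_bets=6, p_bet_works=0.5,
                           p_shared_assumption_holds=0.85, p_control_slip=0.05):
    """Crude estimate of post-mitigation takeover risk.

    p_shared_assumption_holds: chance the assumption every bet relies on holds
                               (set to 1.0 to model fully independent bets).
    p_control_slip: extra risk from imperfect execution of the control scheme,
                    modelled here as simply additive.
    """
    # If the shared assumption fails, every bet fails together.
    p_all_bets_fail = ((1 - p_shared_assumption_holds)
                       + p_shared_assumption_holds * (1 - p_bet_works) ** n_bets)
    return baseline * p_all_bets_fail + p_control_slip

print("independent bets:", residual_takeover_risk(p_shared_assumption_holds=1.0))  # ~0.06
print("correlated bets :", residual_takeover_risk())                               # ~0.13
```

With fully independent bets the 50% → ~5% story is arithmetically reachable; add one shared failure mode plus control-execution risk and the residual stays well above 5% even under fairly generous guesses.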
I also think that getting the ML community to work on things effectively is probably substantially harder than getting AIs to work on things effectively
On some axes, but won’t there also be axes where AIs are more difficult than humans? Sycophancy and slop being the most salient; misalignment issues being another.
This work isn’t extremely easy to verify or scale up (such that I don’t think “throw a billion dollars at it” just works) …
This makes sense now. But I think this line should make you worry about whether you can make controlled AIs do it.
On some axes, but won’t there also be axes where AIs are more difficult than humans? Sycophancy and slop being the most salient; misalignment issues being another.
Yes, I just meant on net. (Relative to the current ML community and given a similar fraction of resources to spend on AI compute.)
It’s not entirely clear to me that the math works out for AIs being helpful on net relative to humans just doing it, because of the supervision required, and the trust and misalignment issues.
But on this question (for AIs that are just capable of “prosaic and relatively unenlightened ML research”) it feels like we’re making shot-in-the-dark guesses. It’s very unclear to me what is and isn’t possible.
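For what it’s worth, here is a toy back-of-envelope version of that “does the math work out” question (my own illustration; every parameter is a made-up placeholder). AI research labor only wins if the output it adds exceeds the human labor burned on supervision, after discounting for slop, sycophancy, and misalignment-driven rework, and the answer flips depending on which shot-in-the-dark guesses you plug in.

```python
# Toy back-of-envelope (illustrative only; all parameters are invented guesses):
# net research output, in human-researcher-equivalents, with and without
# controlled AI labor.

def net_research_output(human_researchers=100,
                        ai_copies=1000,
                        ai_output_per_copy=0.3,          # vs. 1.0 per human
                        quality_discount=0.6,            # slop / sycophancy
                        supervision_cost_per_copy=0.05,  # human-years per AI copy
                        rework_fraction=0.1):            # redone due to trust issues
    humans_free = human_researchers - ai_copies * supervision_cost_per_copy
    if humans_free < 0:
        return float("-inf")  # not enough humans left to supervise this many copies
    ai_contribution = (ai_copies * ai_output_per_copy
                       * quality_discount * (1 - rework_fraction))
    return humans_free + ai_contribution

print("humans only        :", net_research_output(ai_copies=0))  # 100
print("optimistic guesses :", net_research_output())             # ~212
print("pessimistic guesses:", net_research_output(ai_output_per_copy=0.1,
                                                  quality_discount=0.4,
                                                  supervision_cost_per_copy=0.09,
                                                  rework_fraction=0.3))  # ~38
```

Nothing here is meant to settle the question; it just makes explicit which unknown parameters the “net helpful vs. humans just doing it” comparison hinges on.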
I certainly agree it isn’t clear, just my current best guess.