OK. The “parametrized family of continual learning algorithms” frame makes a lot of your earlier comments make more sense now. Thanks.
Next: I guess we’re assuming that (1) we have a parametrized family of continual learning algorithms, (2) human learning and thinking is part of that family (although we don’t know a priori which one), and (3) you can take some adult human “Joe”, search through the parametrized family to find one that matches his behavior, and thus wind up with a Joe-imitating algorithm.
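(To make (3) concrete, here’s a minimal toy sketch of what I mean by “search through the parametrized family”; everything in it — the family, the recorded behavior, the mismatch score — is a made-up stand-in, not a claim about how the real search would work.)

```python
# Toy illustration of assumption (3): search a parametrized family of
# algorithms for the member whose behavior best matches recorded "Joe" data.
# All names and quantities here are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Pretend log of "Joe's" behavior: actions y taken in situations x.
x_log = rng.normal(size=(200, 3))
true_theta = np.array([1.5, -0.7, 0.3])
y_log = np.tanh(x_log @ true_theta)  # stand-in for Joe's recorded actions

def family_member(theta):
    """One member of the parametrized family: a map from situations to actions."""
    return lambda x: np.tanh(x @ theta)

def mismatch(theta):
    """How badly this member's behavior diverges from the recorded log."""
    return np.mean((family_member(theta)(x_log) - y_log) ** 2)

# Brute-force random search over the family's parameters.
best_theta, best_score = None, np.inf
for _ in range(5000):
    theta = rng.normal(scale=2.0, size=3)
    score = mismatch(theta)
    if score < best_score:
        best_theta, best_score = theta, score

print("best-fit parameters:", best_theta, "mismatch:", best_score)
```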
I’ll set aside for now whether these assumptions are plausible, and ask a different question: If we make those assumptions then … aren’t we already done? Just make a Joe-imitation and run a million copies of it at 100× speed, and have them work together on AI x-risk (pivotal act, alignment research, whatever).
To me, this seems much simpler than the iterative protocol you discuss in the OP, and equally viable if not more so. What am I missing?
It may be hard to faithfully imitate a human for 1000 years (particularly since that sounds like quite a distributional shift / it’s not even clear what the right answer is since no human has lived for 1000 years). I believe we’re in agreement on this.
Simulating a human for shorter stretches on many problems in parallel is powerful, but presumably comes at a steep capabilities cost relative to other, easier options. So it is worth exploring what safe augmentation can buy beyond pure speedup.
Also, at some point we want a plan for scaling qualitatively past human level.