Presumably this is a learning algorithm with weights, and PyTorch code that updates the weights. My question is: how are the weights being updated? Are they being updated by a continual learning objective (e.g. RL, self-distillation, whatever), or are they being updated by an imitation-learning objective (self-supervised learning on the outputs of the “teacher”)? Or are you interspersing both? Or are there two different sets of weights, one for each type of update? Or what?
An imitation learning objective on the outputs of the teacher (starting from a very strong inductive bias) is the outer loop. During (online) generation, it should of course also simulate gradient updates, using either actual gradient updates (which it has meta-learned how to perform) or perhaps sufficiently rich residual activations. Probably context tokens aren’t enough, though I don’t know, possibly a vast number of internal reasoning tokens would be enough in principle.
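To make the two nested loops concrete, here is a minimal toy sketch (my own illustration, not the commenter's actual setup): the model is a 1-D linear map, each task's "teacher" is a different slope, the inner loop takes one simulated gradient step from a meta-learned initialization, and the outer loop adjusts that initialization so the post-update model imitates the teacher (a MAML-style pattern). All numbers and function names are made up for illustration.

```python
# Outer loop: imitation learning on teacher outputs.
# Inner loop: one simulated gradient update, whose effect the outer
# loop learns to exploit (differentiated through by hand here).
import random

random.seed(0)

INNER_LR = 0.05  # step size of the simulated inner gradient update
OUTER_LR = 0.05  # step size of the outer imitation-learning loop

def inner_update(w0, x_s, y_s):
    """One inner gradient step on the support example, from init w0."""
    grad = 2 * x_s * (w0 * x_s - y_s)      # d/dw of (w*x_s - y_s)^2
    return w0 - INNER_LR * grad

def outer_grad(w0, x_s, y_s, x_q, y_q):
    """Gradient of the imitation (query) loss through the inner step."""
    w = inner_update(w0, x_s, y_s)
    dw_dw0 = 1 - 2 * INNER_LR * x_s ** 2   # chain rule through the update
    return 2 * x_q * (w * x_q - y_q) * dw_dw0

def avg_loss(w0, tasks):
    """Mean imitation loss of the post-inner-update model vs. teacher."""
    total = 0.0
    for a, x_s, x_q in tasks:
        w = inner_update(w0, x_s, a * x_s)
        total += (w * x_q - a * x_q) ** 2
    return total / len(tasks)

# Each task: teacher slope a, a support input x_s, a query input x_q.
tasks = [(random.uniform(-1, 1),
          random.uniform(0.5, 1.5),
          random.uniform(0.5, 1.5)) for _ in range(50)]

w0 = 1.0
loss_before = avg_loss(w0, tasks)
for _ in range(200):
    g = sum(outer_grad(w0, x_s, a * x_s, x_q, a * x_q)
            for a, x_s, x_q in tasks) / len(tasks)
    w0 -= OUTER_LR * g
loss_after = avg_loss(w0, tasks)
print(loss_before, loss_after)  # imitation loss drops after meta-training
```

The "rich residual activations" alternative in the comment would replace the explicit inner gradient step with learned internal state; this sketch only covers the explicit-gradient case.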
My interpretation of this part is: you’re imagining that we have written down a parametrized family of continual learning algorithms, and you have black-box access to a “teacher” continual learning algorithm which we know is somewhere in this space of continual learning algorithms, but we don’t know where. Then I agree (in principle) that you can do imitation learning to home in on which element of your parametrized family of continual learning algorithms matches the teacher.
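The "black-box teacher somewhere in a parametrized family" picture can be sketched in a few lines (again, a toy illustration under assumptions I am inventing): suppose the family is online SGD on a linear model, parametrized by learning rate, and the teacher is one member with a hidden learning rate. Imitation learning on the teacher's outputs then homes in on the matching family member; here the "search" is just a grid search.

```python
# Black-box imitation: find which member of a parametrized family of
# learning algorithms reproduces the teacher's observed outputs.
import random

random.seed(1)

def run_learner(lr, stream):
    """Online SGD on y = w * x; returns the prediction made at each step."""
    w = 0.0
    preds = []
    for x, y in stream:
        preds.append(w * x)            # predict before updating
        w -= lr * 2 * x * (w * x - y)  # SGD step on squared error
    return preds

# A data stream generated by a fixed target map y = 1.5 * x.
stream = [(x, 1.5 * x) for x in (random.uniform(-1, 1) for _ in range(30))]

TRUE_LR = 0.1  # the teacher's hidden parameter (unknown to the imitator)
teacher_preds = run_learner(TRUE_LR, stream)

def imitation_loss(lr):
    """Squared mismatch between a candidate's outputs and the teacher's."""
    preds = run_learner(lr, stream)
    return sum((p - t) ** 2 for p, t in zip(preds, teacher_preds))

candidates = [0.01, 0.05, 0.1, 0.2, 0.5]
best = min(candidates, key=imitation_loss)
print(best)  # the search recovers the teacher's learning rate, 0.1
```

In the actual proposal the family would be far richer (and "whatever a human is doing to plan and learn" is assumed to be in it), but the identification logic is the same: minimize imitation loss on the teacher's behavior over the family's parameters.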
Does that match what you’re trying to say here?
Yes, that is one way of putting it, though “whatever a human is doing to plan and learn” counts as a continual learning algorithm for my purposes.
OK. The “parametrized family of continual learning algorithms” frame makes a lot of your earlier comments make more sense now. Thanks.
Next: I guess we’re assuming that (1) we have a parametrized family of continual learning algorithms, (2) human learning and thinking is part of that family (although we don’t know a priori which one), and (3) you can take some adult human “Joe”, search through the parametrized family to find one that matches his behavior, and thus wind up with a Joe-imitating algorithm.
I’ll set aside for now whether these assumptions are plausible, and ask a different question: If we make those assumptions then … aren’t we already done? Just make a Joe-imitation and run a million copies of it at 100× speed, and have them work together on AI x-risk (pivotal act, alignment research, whatever).
To me, this seems much simpler than the iterative protocol you discuss in the OP, and equally viable if not more so. What am I missing?
It may be hard to faithfully imitate a human for 1000 years (particularly since that sounds like quite a distributional shift / it’s not even clear what the right answer is since no human has lived for 1000 years). I believe we’re in agreement on this.
Simulating a human for shorter times on multiple problems in parallel is powerful, but presumably comes at a steep capabilities cost relative to other easier options. So it is worth exploring gains from safe augmentation beyond pure speedup.
Also, at some point we want a plan for scaling qualitatively past human level.