Why not? Well, actually, for an ideal imitation learning algorithm, i.e. Solomonoff induction on an imaginary hypercomputer, my answers would all be “yes”! But in the real world, we don’t have hypercomputers!
Your true objection seems to be that current imitation learning algorithms are not good enough. If you’re saying that pure “in-context learning” without weight updates will not cut it, I think I agree, and I have been one of the most prolific advocates of that view. However, the naive implications of that mental model have overall harmed my predictive performance. I now weakly prefer to bet that minor modifications on the level of (slightly more clever) intermittent finetuning, or distillation of the model’s own reasoning outputs, are sufficient for continual learning.
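To make “intermittent finetuning or distillation of own reasoning outputs” concrete, here is a deliberately cartoonish sketch (every component is a made-up stand-in, not a claim about any real training setup): the “model” is a single scalar weight, a “reasoning episode” improves its answer in-context without touching the weight, and a periodic distillation step folds those in-context gains back into the weight.

```python
# Toy sketch (all assumptions mine): continual learning via intermittent
# distillation of the model's own outputs. A noisy in-context refinement
# stands in for chain-of-thought reasoning.
import random

random.seed(0)

TARGET = 10.0          # quantity the model is trying to learn online
weight = 0.0           # the model's single "weight"

def reason(w):
    """One 'reasoning episode': a noisy in-context step toward the target.
    No weight update happens here."""
    return w + 0.5 * (TARGET - w) + random.gauss(0, 0.1)

buffer = []
for step in range(200):
    output = reason(weight)        # in-context improvement only
    buffer.append(output)
    if len(buffer) == 10:          # intermittent finetuning on own outputs
        distill_target = sum(buffer) / len(buffer)
        weight += 0.5 * (distill_target - weight)   # "gradient step"
        buffer.clear()

assert abs(weight - TARGET) < 1.0  # the weights have absorbed the gains
```

The point of the toy is only structural: the inner refinement alone never changes the weights, but the cheap outer distillation loop is enough for the weights to converge on what the “reasoning” discovered.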
Elsewhere, you suggest that imitation learning how to learn is actually impossible because you would need to simulate a learning algorithm, and you would really be running that learning algorithm, not imitation learning:
> The only practical way to know what happens after millions of steps of some scaled-up continual learning algorithm is to actually do millions of steps of that same scaled-up continual learning algorithm, with actual weights getting actually changed in specifically-designed ways via PyTorch code. And then that’s the scaled-up learning algorithm you’re running. Which means you’re not doing imitation learning.
I disagree; you would be imitation learning to run that learning algorithm, and I see no principled reason this cannot be practical.
I think there’s a bit of miscommunication here. You would in fact need a great continual learning algorithm in order to imitation learn how to continually learn. This may be what you mean by saying you can’t imitation-learn how to continually learn (that is, from scratch, without some base continual learning algorithm to start with). However, I see no principled reason that imitation learning cannot improve / distill a better continual learning algorithm than the one which is performing the imitation learning. But this isn’t very cruxy for me. The point is that given a great continual learning algorithm, you could imitation learn a human policy which includes both planning and (possibly weaker!) “inner loop” continual learning. That would be sufficient for my alignment plan, even if it were “impractical” in the sense of “weakening the continual learning engine.”
I’m pretty confused. This comment is just trying to get on the same page before I start arguing :-)
> I disagree; you would be imitation learning to run that learning algorithm, and I see no principled reason this cannot be practical.
Presumably this is a learning algorithm with weights, and PyTorch code that updates the weights. My question is: how are the weights being updated? Are they being updated by a continual-learning objective (e.g. RL, self-distillation, whatever), or by an imitation-learning objective (self-supervised learning on the outputs of the “teacher”)? Or are you interspersing both? Or are there two different sets of weights, one for each type of update? Or what?
> You would in fact need a great continual learning algorithm in order to imitation learn how to continually learn.
My interpretation of this part is: you’re imagining that we have written down a parametrized family of continual learning algorithms, and you have black-box access to a “teacher” continual learning algorithm which we know is somewhere in this space of continual learning algorithms, but we don’t know where. Then I agree (in principle) that you can do imitation learning to home in on which element of your parametrized family of continual learning algorithms matches the teacher.
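A minimal sketch of that frame, with everything toy by assumption: the “parametrized family” is just online SGD learners indexed by their learning rate, the black-box “teacher” is one such learner with a hidden learning rate, and “imitation learning” is picking the family member whose prediction trace best matches the teacher’s outputs.

```python
# Toy sketch of the "parametrized family" frame (all components assumed
# for illustration): identify a black-box teacher learner by imitation.
import random

random.seed(1)

def run_learner(eta, data):
    """A one-parameter continual learner: online SGD on y = w*x."""
    w, preds = 0.0, []
    for x, y in data:
        preds.append(w * x)            # predict before updating
        w += eta * (y - w * x) * x     # SGD step on squared error
    return preds

# Data stream from a ground-truth relation y = 3x (plus noise).
data = [(x, 3.0 * x + random.gauss(0, 0.01))
        for x in (random.uniform(-1, 1) for _ in range(100))]

eta_true = 0.3
teacher_preds = run_learner(eta_true, data)   # teacher's observable outputs

# Imitation learning as search over the family: minimize output mismatch.
candidates = [0.05, 0.1, 0.2, 0.3, 0.5]
def mismatch(eta):
    return sum((p - t) ** 2
               for p, t in zip(run_learner(eta, data), teacher_preds))
best = min(candidates, key=mismatch)

assert best == eta_true   # imitation recovers the teacher's learning rule
```

Note that the search never opens up the teacher; matching its behavior trace is enough to recover which learning rule it was running, which is the sense of “homing in” above.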
> Presumably this is a learning algorithm with weights, and PyTorch code that updates the weights. My question is: how are the weights being updated? Are they being updated by a continual-learning objective (e.g. RL, self-distillation, whatever), or by an imitation-learning objective (self-supervised learning on the outputs of the “teacher”)? Or are you interspersing both? Or are there two different sets of weights, one for each type of update? Or what?
An imitation learning objective on the outputs of the teacher (starting from a very strong inductive bias) is the outer loop. During (online) generation, it should of course also simulate gradient updates, using either actual gradient updates (which it has meta-learned how to perform) or perhaps sufficiently rich residual activations. Probably context tokens aren’t enough, though I don’t know, possibly a vast number of internal reasoning tokens would be enough in principle.
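To sketch that two-loop structure (a cartoon under my own assumptions, not a real architecture): the outer loop minimizes an imitation loss against teacher outputs, while the inner loop performs actual gradient updates during “generation”; the outer imitation objective meta-learns how those inner updates should be performed (here, just their step size).

```python
# Toy sketch (all assumptions mine): outer loop = imitation objective on
# teacher outputs; inner loop = real gradient updates during generation.
import random

random.seed(2)

def inner_adapt(w0, inner_lr, support):
    """Inner loop: actual gradient updates performed during 'generation'."""
    w = w0
    for x, y in support:
        w += inner_lr * (y - w * x) * x   # SGD step on squared error
    return w

def episode_loss(inner_lr, teacher_slope):
    """Adapt on a few examples of y = slope*x, then compare against the
    teacher (who knows the slope) on a query point: the imitation loss."""
    support = [(x, teacher_slope * x) for x in (0.5, -1.0, 0.8, -0.3)]
    w = inner_adapt(0.0, inner_lr, support)
    x_q = 1.0
    return (w * x_q - teacher_slope * x_q) ** 2

# Outer loop: the imitation objective updates the meta-parameter (the
# inner-loop step size) by finite-difference gradient descent.
inner_lr = 0.1
for step in range(300):
    slope = random.uniform(-2.0, 2.0)     # a fresh "task" each episode
    eps = 1e-4
    grad = (episode_loss(inner_lr + eps, slope)
            - episode_loss(inner_lr - eps, slope)) / (2 * eps)
    inner_lr -= 0.05 * grad

# After meta-training, the inner loop alone imitates a new teacher well.
test_loss = episode_loss(inner_lr, teacher_slope=1.7)
assert test_loss < 0.2
```

The outer objective is pure imitation, yet what it ends up shaping is the inner learning rule itself; this is the sense in which the imitation learner “meta-learns how to perform” the gradient updates.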
> My interpretation of this part is: you’re imagining that we have written down a parametrized family of continual learning algorithms, and you have black-box access to a “teacher” continual learning algorithm which we know is somewhere in this space of continual learning algorithms, but we don’t know where. Then I agree (in principle) that you can do imitation learning to home in on which element of your parametrized family of continual learning algorithms matches the teacher.
Does that match what you’re trying to say here?
Yes, that is one way of putting it, though “whatever a human is doing to plan and learn” counts as a continual learning algorithm for my purposes.
OK. The “parametrized family of continual learning algorithms” frame makes a lot of your earlier comments make more sense now. Thanks.
Next: I guess we’re assuming that (1) we have a parametrized family of continual learning algorithms, (2) human learning and thinking is part of that family (although we don’t know a priori which one), and (3) you can take some adult human “Joe”, search through the parametrized family to find one that matches his behavior, and thus wind up with a Joe-imitating algorithm.
I’ll set aside for now whether these assumptions are plausible, and ask a different question: If we make those assumptions then … aren’t we already done? Just make a Joe-imitation and run a million copies of it at 100× speed, and have them work together on AI x-risk (pivotal act, alignment research, whatever).
To me, this seems much simpler than the iterative protocol you discuss in the OP, and equally viable if not more so. What am I missing?
It may be hard to faithfully imitate a human for 1000 years (particularly since that sounds like quite a distributional shift / it’s not even clear what the right answer is since no human has lived for 1000 years). I believe we’re in agreement on this.
Simulating a human for shorter times on multiple problems in parallel is powerful, but presumably comes at a steep capabilities cost relative to other, easier options. So it is worth exploring gains from safe augmentation beyond pure speedup.
Also, at some point we want a plan for scaling qualitatively past human level.