Steven Byrnes comments on Alignment Proposal: Adversarially Robust Augmentation and Distillation

Steven Byrnes 27 May 2025 16:11 UTC
7 points
1
One reason this proposal doesn’t really work for me (AFAICT) is because I’m normally thinking of continuous learning, i.e. my opinion is:
Tagline: “AGI isn’t about knowing how to do lots of things. Instead, AGI is about not knowing how to do something, and then being able to figure it out.” (see also §1 of “Sharp Left Turn” discourse: An opinionated review)
When I read your post with this mental picture, they seem to clash for various reasons.
For starters, I agree that imitation learning is (or could be) great at capturing a snapshot of a person, but I’m skeptical that it could capture the way that a person learns and figures out new things over weeks and months. I think this is borne out by LLM base models, which are trained by imitation learning, and are really quite strong in areas that humans already understand, but don’t even have the capacity for true online learning (i.e. weight edits when they figure out something new … the bit of poor-man’s online learning that they can do inside their context window is IMO not a great substitute).
If that’s the case, then as soon as you do one step of distillation-via-imitation-learning, you’ve taken a giant and unrecoverable step backwards in capabilities.
Maybe you could say, “so much the worse for LLMs, but future AI approaches will be able to imitation-learn the way that humans grow and figure things out over weeks and months and years”. If so, I’m skeptical, but we can talk about that separately.
And then another issue is that if the AIs (and humans!) are “in motion”, gaining knowledge and competence just by running longer and thinking about new domains and making new connections, then the overshooting vs undershooting issue becomes much harder. This isn’t “capabilities evaluations” as we normally think of them. For example, you can’t know how good the AI will be at cybersecurity until it’s spent a long time studying cybersecurity, and even then it might figure out something new or come up with new ideas while it’s being used as an advisor.
What links here?
- You can’t imitation-learn how to continual-learn by Steven Byrnes (16 Mar 2026 21:20 UTC; 186 points)
- Steven Byrnes's comment on Why we should expect ruthless sociopath ASI by Steven Byrnes (22 Feb 2026 0:48 UTC; 2 points)
- Cole Wyeth 27 May 2025 18:08 UTC
  4 points
  −1
  Parent
  Maybe you could say, “so much the worse for LLMs, but future AI approaches will be able to imitation-learn the way that humans grow and figure things out over weeks and months and years”. If so, I’m skeptical, but we can talk about that separately.
  It seems like that is the level of capability where safety risks start to arise for other systems, so I don’t see it as a major problem to assume that level of capability at imitation learning.
  - Steven Byrnes 27 May 2025 21:12 UTC
    3 points
    1
    Parent
    I’m confused by your response. What do you mean by “other systems”?
    The only thing I can think of is that you might be trying to say is:
    (1) AGI is possible,
    (2) …Therefore, it must be possible, somehow or other, to imitation-learning the way that humans grow and figure things out over weeks and months and years.
    If that’s what you’re thinking, then I disagree with (2). Yes it’s possible to make an AGI that can learn grow and figure things out over weeks and months and years, but such an AGI algorithm need not involve any imitation learning. (And personally I expect it won’t involve imitation learning; bit more discussion in §2.3.2 here.)
    - Cole Wyeth 27 May 2025 21:26 UTC
      4 points
      0
      Parent
      My proposal cannot be carried out fully now because imitation learning is not faithful enough, because models distilled through imitation learning will not generalize sufficiently well OOD. However, I am mainly afraid of AGI systems that generalize OOD, which is why I want to solve alignment. If models never gain the capability to generalize OOD than there is much less risk from unaligned AGI. If they do gain that capability, I don’t see why it should lag significantly behind in imitation learning.
      I do expect the proposal to carry an alignment tax, but this is not the same concern.
      - Steven Byrnes 27 May 2025 21:53 UTC
        11 points
        0
        Parent
        It’s possible that “imitation learning will not generalize sufficiently well OOD” is an unsolvable problem, right? (In fact, my belief is that it’s unsolvable, at least in practice, if we include “humans learning new things over the course of years” as part of the definition of what constitutes successful OOD generalization.)
        But if it is unsolvable problem, it would not follow that “models will never gain the ability to generalize OOD”, nor would it follow that AGI will never be very powerful and scary.
        Rather, it would follow that imitation learning models will never gain the ability to generalize OOD—but non-imitation-learning models are still allowed to generalize OOD just fine!
        And it would follow that imitation learning models will not be powerful scary AGIs—but there will still be powerful scary AGIs, they just won’t be based on imitation learning.
        For example, suppose that no human had ever played Go. Imitation learning would be a very doomed way to make a Go-playing AI, right? But we could still make AlphaZero, which does not involve imitation learning, and it works great.
        Or better yet, suppose that no intelligent language-using animal has ever existed in the universe. Then imitation learning would be even more doomed. There’s nothing to imitate! But a well-chosen non-imitation-learning algorithm could still autonomously invent language and science and technology from scratch. We know this to be the case, because after all, that was the situation that our hominid ancestors were in.
        See what I mean? Sorry if we’re talking past each other somehow.
        Cole Wyeth 27 May 2025 22:46 UTC
        4 points
        0
        Parent
        I see no reason to think imitation learning is particularly unable to generalize OOD.
        After all, humans learn. A sufficiently good imitation of a human should also learn. Perhaps you are simply imaging imitation learning on a too-restricted dataset.
        Steven Byrnes 27 May 2025 23:34 UTC
        4 points
        0
        Parent
        You could equally well say: “AlphaZero learns, therefore a sufficiently good imitation of AlphaZero should also learn”. Right? But let’s think about what that would entail.
        AlphaZero learns via a quite complicated algorithm involving tracking the state of a Go board through self-play, and each step of the self-play involves a tree search with thousands of queries to a 30M-parameter ConvNet, and then at the end of the game a Go engine is called to see who won and then there’s a set of gradient descent steps on that 30M-parameter ConvNet. Then repeat that whole process fifty million times. And now you have a trained AlphaZero.
        Now, imagine taking some generic algorithm class (say, an RNN) and training it “to imitate the process by which AlphaZero learns”. It’s just not gonna work, right? Granted, RNNs are Turing complete, so perhaps one could prove that an astronomically large RNN trained on astronomically much data can emulate (in its astronomically large number of weights) this entire detailed process of running a self-play tree search and performing gradient descent on this 30M-parameter ConvNet. …But c’mon, that’s not gonna realistically work in practice, right? (Related: §3 here.)
        IMO, the only realistic way to make something that learns like AlphaZero learns is to build AlphaZero itself, or at least something awfully similar to it. I think the tree search etc. needs to be in the source code, not implicit in the learned weights of some generic algorithm class like RNNs, with no superficial relation to tree search. …But if you do that, then I would call it “reverse-engineering AlphaZero”, not “imitation learning from AlphaZero”.
        By the same token, I do think it’s possible to make something that learns like a human, but I think it would require reverse-engineering human brains, not just imitation-learning from human data.
        Cole Wyeth 28 May 2025 7:52 UTC
        4 points
        −1
        Parent
        I think your intuitions here are highly misguided. I don’t agree with your conclusions about AlphaZero at all. You could easily train a model by distilling AlphaZero. All the complicated steps are only necessary to bootstrap from nothing.
        Steven Byrnes 28 May 2025 13:59 UTC
        5 points
        1
        Parent
        Yes distilling a snapshot of AlphaZero is easy. The hard part is distilling the process by which AlphaZero improves—not just bootstrapping from nothing, but also turning an Elo-2500 AlphaZero into an Elo-3500 AlphaZero.
        Is this a way to operationalize our disagreement?
        CLAIM:
        Take AlphaZero-chess and train it (via self-play RL as usual) from scratch to Elo 2500 (grandmaster level), but no further.
        Now take a generic DNN like a transformer. Give it training data showing how AlphaZero-in-training developed from Elo 0, to Elo 1, … to Elo 1000, to Elo 1001, to Elo 1002, … to Elo 2500. [We can use any or all of those AlphaZero-snapshots however we want to build our training dataset.] And we now have a trained model M.
        Now use this trained model M by itself (no weight-updates, no self-play, just pure inference) to extrapolate this process of improvement forward.
        The claim is: If we do this right, we can wind up with an Elo-3500 chess-playing agent (i.e. radically superhuman, comparable to what you’d get by continuing the AlphaZero self-play RL training for millions more games).
        I feel very strongly that this claim is false. Do you think it’s true?
        (This is relevant because I think that “the process by which AlphaZero-in-training goes from Elo 2500 to Elo 3500” is in the same general category as “the process by which a human goes from mediocre and confused understanding of some novel domain to deep understanding and expertise, over the course of weeks and months and years”.)
        gwern 28 May 2025 20:32 UTC
        7 points
        0
        Parent
        As I’ve said before, I think you greatly overrate the difficulty of putting search into neural nets, and this is an example of it. It seems to me like it is entirely possible to make a generic LLM implement an equivalent to AlphaZero and be capable of expert iteration, without an elaborate tree scaffolding. A tree search is just another algorithm which can be reified as a sequence, like all algorithms (because they are implemented on a computer).
        
        All AlphaZero is, is a way of doing policy iteration/Newton updates by running a game state forward for a few plies, evaluating, and updating estimates. It’s not magic, and can obviously be encoded into a LLM’s generative process.
        
        Here’s a concrete example of how in-principle I think a LLM can do AlphaZero-style expert iteration for Go: A LLM can serialize a board with value estimates as simply a few hundred tokens (361 points, 361 value estimates, miscellaneous metadata); this means in a frontier LLM like Claude-4-opus with 200k ctx, you can fit in easily 200 board states; so you can serialize out the lookahead of a bunch of possible moves and resulting board states (eg. take the top 14 moves and imagine the resulting board state and then imagine their next 14 top moves, for comparison, TD-Gammon looked forward like 1 move); and can back-propagate an updated value estimate, and spit out the original board state with better value estimates. “Move #4 was better than it looked, so I will +0.01 to the value estimate for it.” This improved board is now in context, and can be dynamically-evaluated to update the LLM: now it has to predict the new board state with the final improved estimates, and that improves the policy. The LLM finishes by setting up the next planning step: pick a deeper board state to evaluate next, and if the next board state is the end of the game, then it starts over with a fresh game. Run this indefinitely.
        
        It repeatedly iterates through a possible game, evaluating each position to a certain depth, updating its weights to incorporate the policy improvement from the evaluation, and restarting with a fresh game. All serialized out as a long array/sequence, the tree just being implicitly represented by successive board states. (And then now that you have that in mind, you can imagine how to do things like deep rollouts: 200 moves is around a normal game of Go, so random rollouts are doable from most board states, and the LLM can just toggle between a shallow tree search and deep randomized rollouts if necessary eg by adding a ⁰⁄₁ token prefix.)
        
        At no point do you need explicit tree scaffolding as you bootstrap from a LLM clueless about playing Go to the high performance that we know LLMs trained by imitation learning on board states/values/policies can reach, and at no point have I invoked a cognitive operation which is not easier than a lot of things we see LLMs do routinely, or where it’s implausible that they could do it. It is probably a lot less efficient and has other practical issues like how you integrate the rules of Go akin to AlphaZero/MuZero, etc, but in principle I think this algorithm is well-defined, concrete, and would work.
        Steven Byrnes 28 May 2025 21:25 UTC
        2 points
        0
        Parent
        Hmm, I don’t particularly disagree with anything you wrote. I think you’re misunderstanding the context of this conversation.
        I wasn’t bringing up tree search because I think tree search is required for AGI. (I don’t think that.)
        Rather, I was making a point that there will need to be some system that updates the weights (not activations) of an AGI as it runs, just as adult humans learn and figure out new things over time as they work on a project.
        What is this system that will update the weights? I have opinions, but in general, there are lots of possible approaches. Self-play-RL with tree search is one possibility. RL without tree search is another possibility. The system you described in your comment is yet a third possibility. Whatever! I don’t care, that’s not my point here.
        What is my point? How did this come up? Well, Cole’s OP is relying on the fact that “[pure] imitation learning is probably existentially safe”. And I was saying that pure imitation learning imposes a horrific capability tax that destroys his whole plan, because a human has open-ended autonomous learning, whereas a model trained by pure imitation learning (on that same human) does not. So you cannot simply swap out the former for the latter.
        In Cole’s most recent reply, it appears that what he has in mind is actually a system that’s initialized by being trained to imitate humans, but then it also has some system for open-ended continuous learning from that starting point.
        And then I replied that this would solve the capability issue, but only by creating a new problem that “[pure] imitation learning is probably existentially safe” can no longer function as part of his safety argument, because the continuous learning may affect alignment.
        For example, if you initialize a PacMan RL agent on human imitation (where the humans were all very nice to the ghosts during play), and then you set up that agent to continuously improve by RL policy optimization, using the score as the reward function, then it’s gonna rapidly stop being nice to the ghosts.
        Does that help explain where I’m coming from?
        Expand this thread
        Cole Wyeth 28 May 2025 21:48 UTC
        2 points
        0
        Parent
        That’s not what I have in mind, see my more most recent reply.
        
        Also, I am not sure that removing the imitation learning step would actually “destroy my whole plan.” It would perhaps prevent it from scaling past a certain point, but I think we would still be left in a much more tractable position.
        Cole Wyeth 28 May 2025 14:18 UTC
        4 points
        0
        Parent
        The claim is certainly false.
        Before LLMs reach AGI, someone will have to solve efficient, online continual learning. This is an open technical problem, which is why I doubt that the current paradigm scales to superintelligence. It seems that an appropriate solution for general-purpose agents would also lead to a solution for agents trained through imitation learning.
        What links here?
        Steven Byrnes's comment on Alignment Proposal: Adversarially Robust Augmentation and Distillation by Cole Wyeth (29 May 2025 13:25 UTC; 3 points)
        Steven Byrnes's comment on Alignment Proposal: Adversarially Robust Augmentation and Distillation by Cole Wyeth (28 May 2025 21:25 UTC; 2 points)
        Steven Byrnes 28 May 2025 14:49 UTC
        6 points
        1
        Parent
        Great, glad we agree on that!
        Next: If we take an “agent trained through imitation learning”, and glue on a “solution to efficient, online continual learning”, then the result (after it runs a while) is NOT
        “an agent trained through imitation learning”,
        but rather
        “an agent that is partly trained through imitation learning, and partly trained through [however the online continual learning works]”.
        Right?
        And now your proposal requires an assumption that this online continual learning system, whatever it is, does not undermine the agent’s alignment. Right?
        What links here?
        Steven Byrnes's comment on Alignment Proposal: Adversarially Robust Augmentation and Distillation by Cole Wyeth (28 May 2025 21:25 UTC; 2 points)
        Expand this thread
        Cole Wyeth 28 May 2025 21:38 UTC
        4 points
        0
        Parent
        I’m not suggesting an agent that is partly trained through imitation learning, and then partly trained through continual learning on some other objective. I am suggesting an agent that is trained solely through imitation learning, using improved algorithms that more faithfully imitate humans over longer timescales, including by learning because humans learn—but by learning as humans learn! I think that the obstacles to doing this are very similar to the obstacles to continual learning in LLMs, though they are not exactly the same, and it’s certainly conceivable that LLM algorithms for continual learning will be invented which are not transferable to pure imitation learning. In particular, LLMs may start some kind of feedback loop of recursive self-improvement before faithful imitation learning becomes technically feasible. However, I see no fundamental reason to expect that is the only or most likely path. And all alignment plans are sunk by recursive self-improvement happening tomorrow.
        Explicitly, LLMs are not perfect assistants or agents because their in-context learning is limited. This problem is not specific to fine tuned models though—even base models have limited in-context learning. The most direct solutions to this problem would allow them to perform in-context learning with the same objective as they already do (sequence prediction) but for longer. The analogue of this for imitation learning should similarly perform imitation learning, and then imitate faithfully for longer—including “in-context” learning as necessary.
        Steven Byrnes 29 May 2025 13:25 UTC
        3 points
        1
        Parent
        (Thanks for your patient engagement!)
        If you believe
        it is probably true that future pure imitation learning techniques can capture the process by which humans figure out new scientific ideas over millions of seconds, AND
        it is “certainly false” that future pure imitation learning techniques can capture the process by which AlphaZero figures out new chess strategies over millions of games
        then I’m curious what accounts for the difference, in your mind?
        More detail, just to make sure we’re on the same page: The analogy I’m suggesting is:
        (A1) AlphaZero goes from Elo 0 to Elo 2500
        (A2) …via self-play RL
        (A3) Future pure imitation learner extrapolates this process forward to get Elo 3500 chess skill
        -versus-
        (B1) Human civilization goes from “totally clueless about nanotech design principles / technical alignment / whatever” in 1900 to “somewhat confused about nanotech design principles / technical alignment / whatever” in 2025
        (B2) …via whatever human brains are doing (which I claim centrally involves RL)
        (B3) Future pure imitation learner extrapolates this process forward to get crystal-clear understanding of nanotech design principles / technical alignment / whatever
        You think that (A3) is “certainly false” while (B3) is plausible, and I’m asking what you see as the disanalogy.
        (For the record, I think both (A3) and (B3) are implausible. I think that LLM in-context learning can capture the way that humans figure out new things over seconds, but not the way that humans figure out new things over weeks and months. And I don’t think that’s a solvable problem, but rather points to a deep deficiency in imitation learning, a deficiency which is only solvable by learning algorithms with non-imitation-learning objectives.)
        Cole Wyeth 29 May 2025 13:56 UTC
        3 points
        −1
        Parent
        I didn’t realize you intended A3 to refer to future imitation learning systems. In that case, yes, it will work. You might have to use some tricks similar to gwern’s suggestions—e.g. the imitation learner should (for fair comparison) also have access to the simulation platform that AlphaZero uses, and would have to play about as many games as AlphaZero plays. But it does not have to do the same search and policy distillation training process that AlphaZero does.
- Steven Byrnes 17 Mar 2026 12:57 UTC
  2 points
  0
  Parent
  FYI I just wrote a post You can’t imitation-learn how to continual-learn which is related to this thread.
  - Cole Wyeth 17 Mar 2026 15:53 UTC
    4 points
    0
    Parent
    Why not? Well, actually, for an ideal imitation learning algorithm, i.e. Solomonoff induction on an imaginary hypercomputer, my answers would all be “yes”! But in the real world, we don’t have hypercomputers!
    Your true objection seems to be that current imitation learning algorithms are not good enough. If you’re saying that pure “in-context learning” without weight updates will not cut it, I think I agree, and I have been one of the most prolific advocates of that view. However, the naive implications of that mental model have overall harmed my predictive performance. I now weakly prefer to bet that minor modifications on the level of (slightly more clever) intermittent finetuning or distillation of own reasoning outputs are sufficient for continual learning.
    Elsewhere, you suggest that imitation learning how to learn is actually impossible because you would need to simulate a learning algorithm, and you would really be running that learning algorithm, not imitation learning:
    The only practical way to know what happens after millions of steps of some scaled-up continual learning algorithm is to actually do millions of steps of that same scaled-up continual learning algorithm, with actual weights getting actually changed in specifically-designed ways via PyTorch code. And then that’s the scaled-up learning algorithm you’re running. Which means you’re not doing imitation learning.
    I disagree; you would be imitation learning to run that learning algorithm, and I see no principled reason this cannot be practical.
    I think there’s a bit of miscommunication here. You would in fact need a great continual learning algorithm in order to imitation learn how to continually learn. This may be what you mean by saying you can’t imitation-learn how to continually learn (that is, from scratch, without some base continual learning algorithm to start with). However, I see no principled reason that imitation learning cannot improve / distill a better continual learning algorithm than the one which is performing the imitation learning. But this isn’t very cruxy for me. The point is that given a great continual learning algorithm, you could imitation learn a human policy which includes both planning and (possibly weaker!) “inner loop” continual learning. That would be sufficient for my alignment plan, even if it were “impractical” in the sense of “weakening the continual learning engine.”
    - Steven Byrnes 17 Mar 2026 19:13 UTC
      2 points
      0
      Parent
      I’m pretty confused. This comment is just trying to get on the same page before I start arguing :-)
      I disagree; you would be imitation learning to run that learning algorithm, and I see no principled reason this cannot be practical.
      Presumably this is a learning algorithm with weights, and PyTorch code that updates the weights. My question is: how are the weights being updated? Are they being updated by a continual learning objective (e.g. RL, self-distillation, whatever), or are the weights are being updated by an imitation-learning objective (self-supervised learning on the outputs of the “teacher”)? Or are you interspersing both? Or are there two different sets of weights, one for each type of update? Or what?
      You would in fact need a great continual learning algorithm in order to imitation learn how to continually learn.
      My interpretation of this part is: you’re imagining that we have written down a parametrized family of continual learning algorithms, and you have black-box access to a “teacher” continual learning algorithm which we know is somewhere in this space of continual learning algorithms, but we don’t know where. Then I agree (in principle) that you can do imitation learning to home in on which element of your parametrized family of continual learning algorithms matches the teacher.
      Does that match what you’re trying to say here?
      - Cole Wyeth 17 Mar 2026 20:40 UTC
        4 points
        0
        Parent
        Presumably this is a learning algorithm with weights, and PyTorch code that updates the weights. My question is: how are the weights being updated? Are they being updated by a continual learning objective (e.g. RL, self-distillation, whatever), or are the weights are being updated by an imitation-learning objective (self-supervised learning on the outputs of the “teacher”)? Or are you interspersing both? Or are there two different sets of weights, one for each type of update? Or what?
        An imitation learning objective on the outputs of the teacher (starting from a very strong inductive bias) is the outer loop. During (online) generation, it should of course also simulate gradient updates, using either actual gradient updates (which it has meta-learned how to perform) or perhaps sufficiently rich residual activations. Probably context tokens aren’t enough, though I don’t know, possibly a vast number of internal reasoning tokens would be enough in principle.
        My interpretation of this part is: you’re imagining that we have written down a parametrized family of continual learning algorithms, and you have black-box access to a “teacher” continual learning algorithm which we know is somewhere in this space of continual learning algorithms, but we don’t know where. Then I agree (in principle) that you can do imitation learning to home in on which element of your parametrized family of continual learning algorithms matches the teacher.
        Does that match what you’re trying to say here?
        Yes, that is one way of putting it, though “whatever a human is doing to plan and learn” counts as a continual learning algorithm for my purposes.
        Steven Byrnes 18 Mar 2026 13:06 UTC
        2 points
        0
        Parent
        OK. The “parametrized family of continual learning algorithms” frame makes a lot of your earlier comments make more sense now. Thanks.
        Next: I guess we’re assuming that (1) we have a parametrized family of continual learning algorithms, and that (2) human learning and thinking is part of that family (although we don’t know a priori which one), and that (3) that you can take some adult human “Joe”, and search through the parametrized family to find one that matches his behavior, and thus wind up with a Joe-imitating algorithm.
        I’ll set aside for now whether these assumptions are plausible, and ask a different question: If we make those assumptions then … aren’t we already done? Just make a Joe-imitation and run a million copies of it at 100× speed, and have them work together on AI x-risk (pivotal act, alignment research, whatever).
        To me, this seems much simpler than the iterative protocol you discuss in the OP, and equally viable if not more so. What am I missing?
        Cole Wyeth 18 Mar 2026 15:56 UTC
        4 points
        0
        Parent
        It may be hard to faithfully imitate a human for 1000 years (particularly since that sounds like quite a distributional shift / it’s not even clear what the right answer is since no human has lived for 1000 years). I believe we’re in agreement on this.
        Simulating a human for shorter times on multiple problems in parallel is powerful, but presumably comes at a steep capabilities cost relative to other easier options. So it is worth exploring gains from safe augmentation beyond pure speedup.
        Also, at some point we want a plan for scaling qualitatively past human level.