Next: If we take an “agent trained through imitation learning”, and glue on a “solution to efficient, online continual learning”, then the result (after it runs a while) is NOT
“an agent trained through imitation learning”,
but rather
“an agent that is partly trained through imitation learning, and partly trained through [however the online continual learning works]”.
Right?
And now your proposal requires an assumption that this online continual learning system, whatever it is, does not undermine the agent’s alignment. Right?
I’m not suggesting an agent that is partly trained through imitation learning and then partly trained through continual learning on some other objective. I am suggesting an agent trained solely through imitation learning, using improved algorithms that imitate humans more faithfully over longer timescales. That includes learning, because humans learn; but it learns as humans learn! I think the obstacles to doing this are very similar to the obstacles to continual learning in LLMs, though they are not exactly the same, and it’s certainly conceivable that continual-learning algorithms for LLMs will be invented that do not transfer to pure imitation learning. In particular, LLMs may start some kind of feedback loop of recursive self-improvement before faithful imitation learning becomes technically feasible. However, I see no fundamental reason to expect that to be the only or most likely path. And every alignment plan is sunk if recursive self-improvement happens tomorrow.
Explicitly: LLMs are not perfect assistants or agents because their in-context learning is limited. This problem is not specific to fine-tuned models, though; even base models have limited in-context learning. The most direct solutions would let them perform in-context learning with the same objective they already have (sequence prediction), just for longer. The analogue for imitation learning should likewise keep performing imitation learning, and imitate faithfully for longer, including whatever “in-context” learning that requires.
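As a concrete illustration of the “same objective, just for longer” claim above, here is a minimal sketch in PyTorch-flavored Python. It is purely illustrative, not any real system’s code; `model` and `policy` are assumed stand-ins for some sequence model. The point is that neither objective changes as the window grows; only the horizon does.

```python
import torch.nn.functional as F

def sequence_prediction_loss(model, tokens):
    # LLM objective: predict token t+1 from tokens up to t (plain cross-entropy).
    logits = model(tokens[:, :-1])                      # (batch, T-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

def imitation_loss(policy, observations, human_actions):
    # Imitation objective: predict the human's next action from the history so far.
    logits = policy(observations)                       # (batch, T, num_actions)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           human_actions.reshape(-1))

# "Imitating faithfully for longer" = the same losses applied over longer windows
# (larger T), so the model also has to reproduce how the human's behavior changes
# as the human learns; it is not a new objective bolted on top.
```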
(Thanks for your patient engagement!)
If you believe
it is probably true that future pure imitation learning techniques can capture the process by which humans figure out new scientific ideas over millions of seconds, AND
it is “certainly false” that future pure imitation learning techniques can capture the process by which AlphaZero figures out new chess strategies over millions of games,
then I’m curious what accounts for the difference, in your mind?
More detail, just to make sure we’re on the same page: The analogy I’m suggesting is:
(A1) AlphaZero goes from Elo 0 to Elo 2500
(A2) …via self-play RL
(A3) Future pure imitation learner extrapolates this process forward to get Elo 3500 chess skill
-versus-
(B1) Human civilization goes from “totally clueless about nanotech design principles / technical alignment / whatever” in 1900 to “somewhat confused about nanotech design principles / technical alignment / whatever” in 2025
(B2) …via whatever human brains are doing (which I claim centrally involves RL)
(B3) Future pure imitation learner extrapolates this process forward to get crystal-clear understanding of nanotech design principles / technical alignment / whatever
You think that (A3) is “certainly false” while (B3) is plausible, and I’m asking what you see as the disanalogy.
(For the record, I think both (A3) and (B3) are implausible. I think that LLM in-context learning can capture the way that humans figure out new things over seconds, but not the way that humans figure out new things over weeks and months. And I don’t think that’s a solvable problem, but rather points to a deep deficiency in imitation learning, a deficiency which is only solvable by learning algorithms with non-imitation-learning objectives.)
I didn’t realize you intended (A3) to refer to future imitation learning systems. In that case, yes, it will work. You might have to use some tricks similar to gwern’s suggestions: for a fair comparison, the imitation learner should also have access to the simulation platform that AlphaZero uses, and would have to play about as many games as AlphaZero plays. But it does not have to run the same search and policy-distillation training process that AlphaZero does.
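To spell out the contrast drawn in that last sentence, here is a hedged sketch of the two loops. Everything in it is a hypothetical stand-in (trivial stubs, not real APIs or any actual implementation): the imitation learner gets the same simulator and a comparable game budget, but its inner step is always “predict what the improving player would do next”, never search plus policy distillation.

```python
# Hypothetical sketch only: the bodies below are trivial stubs, just to make the
# structural difference between the two loops explicit and executable.
import random

def self_play_with_search(net, env):
    # Stand-in for "self-play game guided by MCTS using the current net".
    return [random.choice(env) for _ in range(40)]

def update_policy_and_value(net, game):
    # Stand-in for "distill search visit counts / outcomes into the net".
    net["updates"] += 1

def alphazero_style_training(net, env, num_games):
    # AlphaZero's recipe: search-guided self-play, then policy/value distillation.
    for _ in range(num_games):
        game = self_play_with_search(net, env)
        update_policy_and_value(net, game)

class Imitator:
    # Stand-in for a future pure imitation learner modeling an improving player.
    def __init__(self):
        self.history = []           # its ever-growing record of past games

    def play(self, env):
        # Acts only by predicting "what would the player I model do next",
        # conditioned on self.history; no tree search here.
        return [random.choice(env) for _ in range(40)]

    def condition_on(self, game):
        self.history.append(game)   # extend its context; nothing is distilled

def pure_imitation_run(imitator, env, num_games):
    # Same simulator, comparable game budget, but the inner step is always
    # next-move prediction rather than search + distillation.
    for _ in range(num_games):
        game = imitator.play(env)
        imitator.condition_on(game)

# Toy usage: "env" here is just a list of move tokens, purely illustrative.
moves = ["e4", "d4", "Nf3", "c4"]
alphazero_style_training({"updates": 0}, moves, num_games=10)
pure_imitation_run(Imitator(), moves, num_games=10)
```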
Great, glad we agree on that!