(Thanks for your patient engagement!)
If you believe
it is probably true that future pure imitation learning techniques can capture the process by which humans figure out new scientific ideas over millions of seconds, AND
it is “certainly false” that future pure imitation learning techniques can capture the process by which AlphaZero figures out new chess strategies over millions of games
then I’m curious what accounts for the difference, in your mind?
More detail, just to make sure we’re on the same page: The analogy I’m suggesting is:
(A1) AlphaZero goes from Elo 0 to Elo 2500
(A2) …via self-play RL
(A3) Future pure imitation learner extrapolates this process forward to get Elo 3500 chess skill
-versus-
(B1) Human civilization goes from “totally clueless about nanotech design principles / technical alignment / whatever” in 1900 to “somewhat confused about nanotech design principles / technical alignment / whatever” in 2025
(B2) …via whatever human brains are doing (which I claim centrally involves RL)
(B3) Future pure imitation learner extrapolates this process forward to get crystal-clear understanding of nanotech design principles / technical alignment / whatever
You think that (A3) is “certainly false” while (B3) is plausible, and I’m asking what you see as the disanalogy.
(For the record, I think both (A3) and (B3) are implausible. I think that LLM in-context learning can capture the way that humans figure out new things over seconds, but not the way that humans figure out new things over weeks and months. And I don’t think that’s a solvable problem; rather, it points to a deep deficiency in imitation learning, one that can only be fixed by learning algorithms with non-imitation-learning objectives.)
I didn’t realize you intended A3 to refer to future imitation learning systems. In that case, yes, it will work. You might have to use some tricks similar to gwern’s suggestions—e.g. the imitation learner should (for fair comparison) also have access to the simulation platform that AlphaZero uses, and would have to play about as many games as AlphaZero plays. But it does not have to do the same search and policy distillation training process that AlphaZero does.
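To make the contrast under discussion concrete, here is a deliberately tiny Python sketch of the two training signals. It is purely illustrative: the names (imitation_step, self_play_step, toy_game), the table-of-scores “policy,” and the one-move fake game are all stand-ins invented for this sketch, and self_play_step is a crude outcome-reinforcement loop rather than AlphaZero’s actual MCTS-search-plus-distillation pipeline.

```python
# Hypothetical sketch of the two training signals being debated.
# "policy" is just a table of (state, move) scores, not a network.

import random
from collections import defaultdict

def imitation_step(policy, corpus, lr=0.1):
    # Pure imitation: push the policy toward whatever a *fixed* corpus of
    # recorded games did. The target never gets better than the corpus.
    for state, move in corpus:
        policy[(state, move)] += lr * (1.0 - policy[(state, move)])

def self_play_step(policy, play_game, lr=0.1):
    # Self-play RL (schematic): generate a game with the current policy,
    # then reinforce the winner's moves and penalize the loser's.
    # The training data improves as the policy improves.
    trajectory, winner = play_game(policy)
    for player, state, move in trajectory:
        policy[(state, move)] += lr * (1.0 if player == winner else -1.0)

# Toy usage with stand-in data, only to show the two signals side by side.
policy = defaultdict(float)
fixed_corpus = [("opening", "e4"), ("opening", "d4")]

def toy_game(policy):
    # A fake one-move "game": both players pick a random move; player 0 wins.
    moves = [(0, "opening", random.choice(["e4", "d4"])),
             (1, "opening", random.choice(["e4", "d4"]))]
    return moves, 0

imitation_step(policy, fixed_corpus)
self_play_step(policy, toy_game)
```

The point of the sketch is only that the imitation objective’s targets come from a fixed corpus, whereas the self-play objective’s targets are generated by the current policy and shift as it improves; whether an imitation learner given the same simulator and the same number of games can extrapolate past its corpus is exactly the question at issue above.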