Hmm, I don’t particularly disagree with anything you wrote. I think you’re misunderstanding the context of this conversation.
I wasn’t bringing up tree search because I think tree search is required for AGI. (I don’t think that.)
Rather, I was making a point that there will need to be some system that updates the weights (not activations) of an AGI as it runs, just as adult humans learn and figure out new things over time as they work on a project.
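To be concrete about what I mean by "weights, not activations," here's a minimal toy sketch (PyTorch, purely illustrative, obviously not a claim about how an actual AGI would be built): an ordinary forward pass only produces transient activations, whereas an online gradient step persistently changes the parameters themselves.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x, target = torch.randn(1, 8), torch.randn(1, 1)

# 1) Ordinary inference: activations flow through the network and are then gone;
#    the parameters are untouched.
with torch.no_grad():
    prediction = model(x)

# 2) "Learning as it runs": a gradient step edits the parameters themselves, so the
#    change persists after the current episode / context window is gone.
loss = nn.functional.mse_loss(model(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()  # the weights are now permanently different
```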
What is this system that will update the weights? I have opinions, but in general, there are lots of possible approaches. Self-play-RL with tree search is one possibility. RL without tree search is another possibility. The system you described in your comment is yet a third possibility. Whatever! I don’t care, that’s not my point here.
What is my point? How did this come up? Well, Cole’s OP is relying on the claim that “[pure] imitation learning is probably existentially safe”. And I was saying that pure imitation learning imposes a horrific capability tax that destroys his whole plan, because a human has open-ended autonomous learning, whereas a model trained by pure imitation learning (on that same human) does not. So you cannot simply swap out the former for the latter.
In Cole’s most recent reply, it appears that what he actually has in mind is a system that’s initialized by being trained to imitate humans, and then has some additional mechanism for open-ended continuous learning from that starting point.
And then I replied that this would solve the capability issue, but only by creating a new problem: “[pure] imitation learning is probably existentially safe” can no longer function as part of his safety argument, because the continuous learning may affect alignment.
For example, if you initialize a PacMan RL agent on human imitation (where the humans were all very nice to the ghosts during play), and then you set up that agent to continuously improve by RL policy optimization, using the score as the reward function, then it’s gonna rapidly stop being nice to the ghosts.
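To spell out that toy scenario, here's a stand-in sketch (stubbed environment and stubbed human data, not anyone's actual system): behavioral cloning on human play, followed by policy-gradient fine-tuning where the only reward is the score, so nothing in the objective keeps the ghost-friendly behavior around.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_OBS, N_ACTIONS = 32, 5  # toy observation / action sizes (placeholders)
policy = nn.Sequential(nn.Linear(N_OBS, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

# --- Phase 1: initialize by imitation (behavioral cloning on human play) ---
# "Nice to the ghosts" lives only implicitly in these demonstrations.
human_obs = torch.randn(256, N_OBS)                  # stub for recorded human game states
human_actions = torch.randint(0, N_ACTIONS, (256,))  # stub for the humans' ghost-friendly moves
for _ in range(100):
    bc_loss = F.cross_entropy(policy(human_obs), human_actions)
    opt.zero_grad()
    bc_loss.backward()
    opt.step()

# --- Phase 2: continuous improvement by RL, reward = game score only ---
def play_episode(policy):
    """Stub rollout: a real version would step an actual PacMan environment."""
    obs = torch.randn(20, N_OBS)
    dist = torch.distributions.Categorical(logits=policy(obs))
    actions = dist.sample()
    score = torch.randn(())  # stand-in for the score; the ghosts' welfare appears nowhere in it
    return dist.log_prob(actions).sum(), score

for _ in range(1000):
    log_prob, score = play_episode(policy)
    # REINFORCE: reinforce whatever earns score. Since niceness to ghosts is not in the
    # reward, the imitation-seeded behavior gets optimized away wherever it costs points.
    rl_loss = -(score * log_prob)
    opt.zero_grad()
    rl_loss.backward()
    opt.step()
```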
Does that help explain where I’m coming from?
That’s not what I have in mind; see my most recent reply.
Also, I am not sure that removing the imitation learning step would actually “destroy my whole plan.” It would perhaps prevent it from scaling past a certain point, but I think we would still be left in a much more tractable position.