I don’t think I get what phenomenon you’re pointing to.
Your first bullet point makes it sound like AlphaGo wasn’t trained using self-play, in contrast to AlphaZero. However, AlphaGo was trained with a combination of supervised learning and self-play. For AlphaZero, they removed the supervised learning part to make the approach simpler and more general.
DreamerV3 also fits the pattern where previous SOTA approaches used a combination of imitation learning and reinforcement learning, while DreamerV3 was able to remove the imitation learning part.[1]
To my understanding, AlphaProof was trained by translating a bunch of math problems to Lean, and using “correct proof” as reward for AlphaZero. This approach also combines human data (our math problems) with reinforcement learning (AlphaZero).
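To make the “correct proof as reward” part concrete: once a problem is stated in Lean, the kernel’s verdict on a candidate proof gives a clean binary signal, accepted or rejected. A toy illustration (my own example, not anything from AlphaProof’s actual training data or setup):

```lean
-- Toy illustration: a formalized statement plus a candidate proof.
-- If this elaborates and the kernel accepts it, the prover would get reward 1;
-- if the proof term is wrong, elaboration fails and the reward would be 0.
-- (Statement chosen for brevity; nothing here comes from AlphaProof.)
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```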
Your final example feels close to AlphaProof if you finish it with “Then you finetune CoT with reinforcement learning to yield impressive performance on reasoning benchmarks”, but I don’t think that’s what you were going for.
The first two examples seem covered by “when reinforcement learning works well, imitation learning is no longer needed”. Idk about the rest.
Could you clarify by giving more examples or otherwise explain what you’re looking for?
I got curious how DreamerV3 figures out Minecraft with nothing to imitate and no intermediate reward, so I checked the paper. There are intermediate rewards: +1 for each of 12 ordered milestones leading up to the diamond, −0.01 for each lost heart, and +0.01 for each restored heart. Additionally, they use “the block breaking setting of prior work[19] because the provided action space would make it challenging for stochastic policies to keep a key pressed for a prolonged time”. So, to get started, the agent probably manages to randomly break a tree block and collects its first reward.
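As a toy sketch of that reward scheme (my own code, not from the paper or its codebase; the milestone names are my guess at the usual log-to-diamond item chain, and the DiamondRewardShaper class is purely illustrative):

```python
# Minimal sketch of the reward shaping described above: +1 the first time each
# of the 12 ordered milestones is reached, plus -0.01 per lost heart and
# +0.01 per restored heart. Milestone names are assumed, not copied from the paper.

MILESTONES = [
    "log", "planks", "stick", "crafting_table", "wooden_pickaxe",
    "cobblestone", "stone_pickaxe", "iron_ore", "furnace", "iron_ingot",
    "iron_pickaxe", "diamond",
]

class DiamondRewardShaper:
    """Tracks which milestones have already been rewarded within an episode."""

    def __init__(self):
        self.collected = set()
        self.prev_hearts = 10.0  # assumed starting health in hearts

    def step_reward(self, inventory: dict, hearts: float) -> float:
        reward = 0.0
        # +1 for each milestone reached for the first time this episode.
        for item in MILESTONES:
            if item not in self.collected and inventory.get(item, 0) > 0:
                self.collected.add(item)
                reward += 1.0
        # +/-0.01 per heart restored/lost since the previous step.
        reward += 0.01 * (hearts - self.prev_hearts)
        self.prev_hearts = hearts
        return reward

# Example: breaking the first tree block yields the first sparse reward.
shaper = DiamondRewardShaper()
print(shaper.step_reward({"log": 1}, hearts=10.0))  # 1.0
print(shaper.step_reward({"log": 1}, hearts=9.0))   # -0.01 (one heart lost)
```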
With AlphaProof, the relevant piece is that the solver network generates its own proofs and disproofs to train against. There’s no imitation learning after formalization. There is a slight disanalogy where, for formalization, we mostly jumped straight to self-play/search, and I don’t think there was ever a major imitation-learning-based approach (though I did find at least one example).
Your quote “when reinforcement learning works well, imitation learning is no longer needed” is pretty close to what I mean. What I’m actually trying to get at is a stronger statement: we often bootstrap using imitation learning to figure out how to get the reinforcement learning component working initially, but once we do, we can usually discard the imitation learning entirely.
I think AlphaProof is pretty far from being just RL from scratch:
- they use a pretrained language model; I think the model is trained on human math in particular ( https://archive.is/Cwngq#selection-1257.0-1272.0:~:text=Dr. Hubert’s team,frequency was reduced. )
- do we have good reason to think they didn’t specifically train it on human Lean proofs? It seems plausible to me that they did, but idk
- the curriculum of human problems teaches it human tricks
- Lean sorta “knows” a bunch of human tricks
We could argue about whether AlphaProof “is mostly human imitation or mostly RL”, but I feel like it’s pretty clear that it’s more analogous to AlphaGo than to AlphaZero.
(a relevant thread: https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce?commentId=ZKuABGnKf7v35F5gp )
Okay, great, then we just have to wait a year for AlphaProofZero to get a perfect score on the IMO.