I have a short attention span. I kept wasting all my time constantly refreshing pages in the hopes of the tiniest dopamine hit. This sucked. I couldn’t read anything longer than a page without getting distracted, I couldn’t practice skills for more than a moment, or write anything longer than a tweet.
Was this less of a problem in the past? It is hard for me to imagine getting, e.g., a master’s in physics under those circumstances!
If it has been better in the past, any ideas as to why/how it changed for the worse?
Any updates on the dopamine addiction and efficacy of interventions a few months later?
Also, thanks! Sat down and quickly found a handful of things that require little work to address but reduce friction regarding things that matter, in a small but real way.
Good to be reminded that this is possible :)
Will keep it up and report back in a week.
Confusion about inaccuracies in “AlphaGo Zero and capability amplification”
There is an Alignment Forum post by Paul Christiano titled “AlphaGo Zero and capability amplification”, part of the Iterated Amplification sequence, that bugged me when I came across it a few years ago. Shortly before, I had skimmed “Mastering the game of Go without human knowledge” (henceforth “the AGZ paper”), and I felt that Christiano’s post had a few weird inaccuracies.
Since I’m not an ML person, I assumed that I was probably just missing something (context, lingo, etc.) or misremembering. So I went back to the AGZ paper, but that failed to resolve my confusion. I took a few notes and left it at that. When I came across those notes by chance today, I decided to look into it a bit more.
There are a few pretty minor things that bug me but are really not worth the trouble, and two larger ones that I do want to highlight.
Supervised learning?
The post begins:
with “AlphaGo Zero” being a link to a blog post by DeepMind. Makes sense.
This is followed by a section titled “How AlphaGo Zero works”:
This section sounds wrong to me. In isolation, “Both are trained with supervised learning” can be read as referring to the inner loop, where “inner loop” is the part where the “weak” policy is improved. [1] That would be fine.
The problem is the sentence right after, “Once we have these two functions…”
In AlphaGo Zero, henceforth “AGZ”, there is no point at which we do not have p and v! We start with random play! It is ~the defining feature of AlphaGo Zero, that it starts from Zero!
Thus, to me, the passage reads naturally as something like “p and v are first trained using supervised learning followed by RL self-play using MCTS”, which is not the case.
My next thought was that “Trained by SL, then amplified at game time via MCTS” would have been a reasonable fit for its predecessor AlphaGo Fan [2] (see the first page of the AGZ paper):
Maybe the distinction between the two does not really matter for the analogy the post is making? [3]
Maybe those two got mixed up but no harm no foul since they work equally well?
That is not the case because in AlphaGo Fan MCTS comes in only at play time. It lacks lookahead via MCTS inside the training loop which was first introduced, in the AlphaGo lineage, with AGZ. From the AGZ paper (page 1):
See also Reinforcement Learning: An Introduction (second edition, chapter 16, page 447, link):
It lacks the whole “moving target” idea that makes AGZ interesting to Christiano here in the first place!
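To make the structural point concrete, here is a minimal toy sketch in Python. It is not AGZ: the game (one-pile Nim), the tabular `p` and `v`, the one-ply lookahead standing in for MCTS, and all names and hyperparameters are my own assumptions for illustration. In this toy, `p` is only distilled and does not guide the lookahead the way the policy prior guides MCTS in AGZ. What it does show is search sitting inside the training loop, so the targets that `p` and `v` are trained toward keep moving as `v` improves.

```python
import math
import random

MAX_TAKE = 3  # toy game (assumption): one-pile Nim, taking the last stone wins


def legal_moves(stones):
    return list(range(1, min(MAX_TAKE, stones) + 1))


def search_policy(stones, v, temperature=0.5):
    """Improvement operator: one-ply lookahead (a stand-in for MCTS) that
    scores each resulting position with the value table v, not a rollout."""
    moves = legal_moves(stones)
    # A move that takes the last stone wins; otherwise the mover's value is
    # the negation of the opponent's value in the resulting position.
    q = [1.0 if stones == m else -v[stones - m] for m in moves]
    weights = [math.exp(x / temperature) for x in q]
    total = sum(weights)
    return moves, [w / total for w in weights]


def self_play_game(start, v, rng):
    """Play one game with the search policy and return (state, pi, z) tuples,
    where z is +1 or -1 from the perspective of the player to move."""
    records, stones, player = [], start, 0
    while stones > 0:
        moves, pi = search_policy(stones, v)
        records.append((stones, player, pi))
        stones -= rng.choices(moves, weights=pi)[0]
        player = 1 - player
    winner = 1 - player  # the player who made the final move
    return [(s, pi, 1.0 if mover == winner else -1.0) for s, mover, pi in records]


def train(start=10, games=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    v = [0.0] * (start + 1)  # uninformative initial values
    p = {s: [1.0 / len(legal_moves(s))] * len(legal_moves(s))
         for s in range(1, start + 1)}  # uniform initial policy
    for _ in range(games):
        # The search inside this loop uses the current v, so the training
        # targets move as v improves.
        for s, pi, z in self_play_game(start, v, rng):
            v[s] += lr * (z - v[s])  # pull v toward the game outcome z
            p[s] = [(1 - lr) * old + lr * new for old, new in zip(p[s], pi)]
    return p, v
```

Calling `p, v = train()` should leave `v[4]` clearly negative (4 stones is a lost position for the player to move) and `p[5]` concentrated on taking one stone. A pipeline in the style I attribute to AlphaGo Fan above would instead fit `p` and `v` first and only call something like `search_policy` when actually playing.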
Even setting all this aside, Christiano’s choice clearly clashes with the terminology used by DeepMind and others, who insist on describing AGZ as “trained solely by self-play reinforcement learning”. Again from the AGZ paper:
This insistence on only reinforcement learning is repeated in many places:
Rollouts?
Later on, in the section “Iterated capability amplification”, he writes:
“MCTS + using a rollout to see who wins” sounds an awful lot like the description of AlphaGo Fan I quoted earlier:
It does not work well as a description of AGZ. Here is a description taken from Reinforcement Learning: An Introduction (second edition, chapter 16, page 447, link):
Or another taken from the AGZ paper:
If he uses “rollout” to refer to something different, that is fine, but the blog post by DeepMind that he himself links uses the term as in the quote above! See here
where it is listed as one of the notable differences to prior versions. So at the very least he is breaking with the terminology as it is used in this context.
What else might “rollout” refer to?
A thought that occurred to me was that he might take “rollout” to refer to the fact that the games generated by self-play are played to some end. [4] This sense of “rollout” belongs to what the AGZ paper calls the policy evaluation operator, which is then used in the distillation/optimization step:
The problem is that this interpretation is ruled out by the immediate context of the “rollout” in Christiano’s post.
It occurs in a description of the amplification scheme which, in the Iterated Amplification sequence, is the part where a weak policy A is amplified by some means (see for instance the post Capability amplification).
The AGZ paper uses the term policy improvement operator:
And it is this part which does not use rollouts in AGZ!
(As contrasted with AlphaGo Fan, where rollouts are used in the policy improvement operator/amplification scheme.)
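The two senses of leaf evaluation at issue can be sketched as follows. This is again a hypothetical toy on one-pile Nim, not code from either system; the function names and the random rollout policy are my own assumptions. The first function estimates a position by playing it out to the end, the second by a single call to a learned value function.

```python
import random


def legal_moves(stones, max_take=3):
    return list(range(1, min(max_take, stones) + 1))


def leaf_value_by_rollout(stones, rng):
    """Estimate a leaf (stones > 0) by playing random moves until the game
    ends. Returns +1 if the player to move at the leaf wins, else -1."""
    player = 0  # 0 is the player to move at the leaf
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            # Whoever took the last stone wins.
            return 1.0 if player == 0 else -1.0
        player = 1 - player


def leaf_value_from_value_function(stones, value_fn):
    """Estimate a leaf with one call to a value function; no game is played."""
    return value_fn(stones)
```

On my reading, the amplification step in AGZ corresponds to the second function only, while a search that “uses a rollout to see who wins” needs something like the first.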
Conclusion
Both of these seem like weird things to get wrong, and weirder still to leave uncorrected, even if this sort of stuff is not relevant to the point the post is trying to make!
But maybe that is because I am simply misunderstanding something? Would love to know!
[1] In the case of AGZ, the way the neural network’s parameters are updated to make p and v more closely match the improved search play probabilities π and self-play winner z (using the notation in the AGZ paper).
[2] Though this is also not really a perfect fit!
[3] This seemed all the more likely at first, given the question of rollouts below.
[4] “both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length” (AGZ paper).