That sounds really cool, but it would be even cooler if someone had the time to summarize the main results of the 69-page paper and publish them in a post/comment here.
The paper is really only 28 pages plus lots of graphs in the appendices! If you want to skim, I’d suggest just reading the abstract and then sections 5 and 6 (pp 16--21). But to summarize:
Do neural networks learn the same concepts as humans, or at least human-legible concepts? A “yes” would be good news for interpretability (and alignment). Let’s investigate AlphaZero and Chess as a case study!
Yes, over the course of training AlphaZero learns many concepts (and develops behaviours) which have clear correspondence with human concepts.
Low-level / ground-up interpretability seems very useful here. Learned summaries are also great for chess but rely on a strong ground-truth (e.g. “Stockfish internals”).
Details about where in the network and when in the training process things are represented and learned.
The analysis of differences between the timing and order of developments in human scholarship and AlphaZero training is pretty cool if you play chess. For example, human experts have diversified their openings (beyond just 1.e4) since 1700, while AlphaZero narrows down from random play to pretty much the modern distribution over GM openings. AlphaZero also tends to learn material values before positional ideas and standard openings.
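For anyone wondering what "where in the network things are represented" means in practice, here is a minimal sketch of a linear probe. It is not the paper's code: the activations are synthetic, and `probe_r2`, `early`, `late`, and `concept` are hypothetical names made up for illustration. The idea is just to regress a human-defined concept score on a layer's activations and see how well it predicts on held-out positions.

```python
import numpy as np

def probe_r2(acts, concept, l2=1e-3, train_frac=0.8, seed=0):
    """Fit a ridge-regularised linear probe and return held-out R^2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(acts))
    n_train = int(train_frac * len(acts))
    tr, te = idx[:n_train], idx[n_train:]
    # Append a bias column so the probe has an intercept.
    X = np.hstack([acts, np.ones((len(acts), 1))])
    A = X[tr].T @ X[tr] + l2 * np.eye(X.shape[1])
    w = np.linalg.solve(A, X[tr].T @ concept[tr])
    pred = X[te] @ w
    ss_res = np.sum((concept[te] - pred) ** 2)
    ss_tot = np.sum((concept[te] - concept[te].mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins (assumptions, not real AlphaZero data).
rng = np.random.default_rng(0)
n, d = 2000, 32
concept = rng.normal(size=n)             # e.g. a material-balance score
early = rng.normal(size=(n, d))          # layer with no trace of the concept
direction = rng.normal(size=d)
late = rng.normal(size=(n, d)) + np.outer(concept, direction)  # concept is linearly encoded

r2_early = probe_r2(early, concept)
r2_late = probe_r2(late, concept)
print(f"early layer R^2: {r2_early:.2f}, late layer R^2: {r2_late:.2f}")
```

Repeating this per layer and per training checkpoint would give a "what, where, when" picture of the kind the summary describes, with the caveat that a high probe score shows the information is linearly decodable, not that the network uses it.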
Thanks for the summary! Your first bullet point was my motivation for doing this. I think it’s important to test out interpretability ideas in more challenging domains.
We didn’t really do much interpretability in this paper; it is more meta-interpretability in a sense (i.e. studying whether interpretability should in principle be possible). I’d say section 4 is worth a look, especially section 4.5, which covers fundamental and practical challenges to probing. Section 7 has some NMF analysis, and we open-sourced the NMF factors, which you might find interesting.
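For readers who haven't met NMF (non-negative matrix factorization) before, here is a rough sketch of the technique. This is not the paper's implementation or its released factors. It uses the standard multiplicative-update rule on a synthetic non-negative matrix, which I am assuming stands in for something like post-ReLU activations with one row per position. The names `nmf`, `V`, `W`, and `H` are just illustrative.

```python
import numpy as np

def nmf(V, k, n_iter=500, eps=1e-9, seed=0):
    """Factor non-negative V (n x m) as W (n x k) @ H (k x m), with W, H >= 0.

    Uses multiplicative updates for the squared Frobenius reconstruction error.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Each update rescales entries by a non-negative ratio, so signs never flip.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic non-negative data built from 3 hidden factors (an assumption for the demo).
rng = np.random.default_rng(1)
true_W = rng.random((200, 3))   # 200 hypothetical positions
true_H = rng.random((3, 64))    # 64 hypothetical activation units
V = true_W @ true_H

W, H = nmf(V, k=3)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Each row of `H` is a non-negative "part" and each row of `W` says how strongly a given input uses each part. Because nothing can cancel out, the parts are often easier to eyeball than PCA components.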
I enjoyed the whole paper! It’s just that “read sections 1 through 8” doesn’t reduce the length much, and 5-6 have some nice short results that can be read alone :-)
Zac says “Yes, over the course of training AlphaZero learns many concepts (and develops behaviours) which have clear correspondence with human concepts.”
What’s the evidence for this? If AlphaZero worked by learning concepts in a step-wise manner, then we should expect jumps in performance on certain types of puzzles, right? I would guess that a beginning human would exhibit jumps from learning concepts like “control the center” or “castle early, not later”. For instance, the principle “control the center”, once followed, has implications for how to place knights etc., which greatly affect win probability. Is the claim that they found such jumps? (Eyeballing the results, nothing really stands out in the plots.)
Or is the claim that the NMF somehow proves that AlphaZero works off concepts? To me that seems suspicious, as NMF seems to be looking at weight matrices at a very crude level.
I ask this partly because I went to a meetup talk (not recorded, sadly) where a researcher from MIT showed a Go problem that AlphaGo can’t solve but that even beginner Go players can solve, which shows that AlphaGo doesn’t actually understand things the same way humans do. Hopefully they will publish their work soon so I can show you.