The paper is really only 28 pages plus lots of graphs in the appendices! If you want to skim, I’d suggest just reading the abstract and then sections 5 and 6 (pp. 16-21). But to summarize:
Do neural networks learn the same concepts as humans, or at least human-legible concepts? A “yes” would be good news for interpretability (and alignment). Let’s investigate AlphaZero and Chess as a case study!
Yes, over the course of training AlphaZero learns many concepts (and develops behaviours) which have clear correspondence with human concepts.
Low-level / ground-up interpretability seems very useful here. Learned summaries are also great for chess but rely on a strong ground-truth (e.g. “Stockfish internals”).
The paper gives details about where in the network, and when in the training process, each concept is represented and learned.
The analysis of differences between the timing and order of developments in human scholarship and in AlphaZero training is pretty cool if you play chess. For example, human experts have diversified their openings (beyond just 1.e4) since 1700, while AlphaZero narrows down from random play to pretty much the modern distribution over GM openings. AlphaZero also tends to learn material values before positional concepts and standard openings.
Thanks for the summary! Your first bullet point was my motivation for doing this. I think it’s important to test out interpretability ideas in more challenging domains.
We didn’t really do much interpretability in this paper; it is more meta-interpretability in a sense (i.e. studying whether interpretability should in principle be possible). I’d say section 4 is worth a look, especially section 4.5, which covers fundamental and practical challenges to probing. Section 7 has some NMF analysis, and we open-sourced the NMF factors, which you might find interesting.
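For readers who haven't met NMF (non-negative matrix factorization) before, here is a minimal sketch of the general idea: approximate a non-negative matrix of activations as a product of two smaller non-negative matrices, so each row becomes a mix of a few "factors". This is not the paper's code or its released factors. The toy `activations` matrix, the function names, and the choice of the standard Lee-Seung multiplicative updates are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of the residual V - W H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5


def nmf(v, rank, steps=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x rank) and H (rank x m).

    Uses Lee-Seung multiplicative updates for the squared-error objective.
    Every update multiplies by a non-negative ratio, so W and H stay non-negative.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(steps):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical toy data: 6 "positions" x 4 "channels", built by mixing two
# non-negative patterns. Real activations would come from a network layer.
activations = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [1.0, 3.0, 2.0, 1.0],
    [2.0, 0.0, 4.0, 0.0],
    [0.0, 6.0, 0.0, 2.0],
    [1.0, 6.0, 2.0, 2.0],
]

w, h = nmf(activations, rank=2)
print("reconstruction error:", frobenius_error(activations, w, h))
```

Each row of `h` is a factor over channels, and each row of `w` says how strongly each factor is present in a given position. For real activations you would more likely reach for an existing library implementation than this hand-rolled loop.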
I enjoyed the whole paper! It’s just that “read sections 1 through 8” doesn’t reduce the length much, and 5-6 have some nice short results that can be read alone :-)