Thanks for the summary! Your first bullet point was my motivation for doing this. I think it’s important to test out interpretability ideas in more challenging domains.
We didn’t really do much interpretability in this paper; it is more meta-interpretability in a sense (i.e. studying whether interpretability should in principle be possible). I’d say Section 4 is worth a look, especially Section 4.5, which covers fundamental and practical challenges to probing. Section 7 has some NMF analysis, and we open-sourced the NMF factors, which you might find interesting.
I enjoyed the whole paper! It’s just that “read sections 1 through 8” doesn’t reduce the length much, and sections 5-6 have some nice short results that can be read on their own :-)
Zac says “Yes, over the course of training AlphaZero learns many concepts (and develops behaviours) which have clear correspondence with human concepts.”
What’s the evidence for this? If AlphaZero worked by learning concepts in a sort of step-wise manner, then we should expect jumps in performance on certain types of puzzles, right? I would guess that a beginning human would exhibit jumps from learning concepts like “control the center” or “castle early, not later”. For instance, the principle “control the center”, once followed, has implications for how to place knights etc., which greatly affect win probability. Is the claim that they found such jumps? (Eyeballing the results, nothing really stands out in the plots.)
Or is the claim that the NMF somehow proves that AlphaZero works off concepts? That seems suspicious to me, as NMF appears to be looking at weight matrices at a very crude level.
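To make the “jumps” question above a bit more concrete, here is a minimal sketch of one way to quantify how step-like a learning curve is. Everything in it is an assumption for illustration: the function name `largest_jump`, the “share of total gain” measure, and the two accuracy curves are all made up, and none of it comes from the paper or its data.

```python
def largest_jump(accuracies):
    """Find the largest single-step gain in a per-checkpoint accuracy curve.

    Returns (checkpoint_index, share), where share is that gain divided by
    the sum of all positive gains. A share near 1.0 means almost all the
    improvement happened in one step; a share near 1/(n-1) means the
    improvement was spread evenly over the n-1 steps.
    """
    # Positive change between consecutive checkpoints (drops count as 0).
    gains = [max(b - a, 0.0) for a, b in zip(accuracies, accuracies[1:])]
    total = sum(gains)
    if total == 0:
        return None, 0.0  # the curve never improves
    idx = max(range(len(gains)), key=gains.__getitem__)
    return idx + 1, gains[idx] / total


# Synthetic curves, invented purely for illustration (not real results).
gradual = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
stepwise = [0.10, 0.11, 0.12, 0.55, 0.58, 0.60]

print(largest_jump(gradual))
print(largest_jump(stepwise))
```

On the made-up `stepwise` curve, most of the gain lands at checkpoint 3, while on `gradual` no single step accounts for much more than a fifth of it. A real analysis would also need to account for noise between checkpoints, which this sketch ignores.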
I ask this partly because I went to a meetup talk (sadly not recorded) where a researcher from MIT showed a Go problem that AlphaGo can’t solve but that even beginner Go players can, which suggests that AlphaGo doesn’t actually understand things the same way humans do. Hopefully they will publish their work soon so I can show you.