The ‘takeoff’ in section 6 is pretty interesting. AlphaZero noodles around for a while until it undergoes a sort of phase transition where it latches onto key concepts and can rapidly improve in strength? Parallels a number of phase transitions in human psychology and elsewhere. Possible connection to ‘ray interference’?
As far as AI risk goes, this strikes me as one of those ‘is the glass half full or half empty?’ results.
I would be pretty shocked if A0 internals could not be transmogrified in some way to human-understood chess concepts. We’re not that bad at chess (the best grandmasters still agree with A0 a nontrivial percentage of the time), we invented it out of all possible games based on human metaphors, have studied it (and revised its rules) for half a millennium after evolving it out of earlier human games, and so on. It’s not surprising that there’s a decent number of interpretable concepts. (Whether this is either necessary or sufficient for safety, or how useful it is, is unclear: after all, even if the AI is perfectly interpretable, you’re still vulnerable to problems like “The AI understands, it just doesn’t care, and you are made of atoms it can use for other things.” Humans have a pretty good idea what animals such as pigs like, but pigs are still made of tasty bacon. But being uninterpretable hardly seems like it’d help, so it’d be nice if we could crack open the black boxes a little.)
On the other hand, the unsupervised learning turns up many clusters that are much more difficult to interpret, and even for the so-called ‘interpretable’ concepts, they fall far short of R^2=100%, leaving a big explanatory gap in terms of ‘what is it doing or thinking?’. Nor is there much reason to think that it’ll get any more interpretable if it got better or were trained differently. The better something gets at a narrow domain, the less human it may be after a point. (I believe one of the AlphaGo/Zero papers shows that ‘move prediction of pros’ describes an inverted U-curve where after a point, AlphaZero starts disagreeing more with the human expert moves? If I’m misremembering, then as a more powerful example, we can point to things like chess endgame databases doing provably-optimal play: yes, you can often dissect the endgame database moves in terms of ‘exposing the king’ or whatever, but in many cases, the one and only reason for a move is “because it works”, and we know the database isn’t “thinking” anything because of how we computed it. There is nothing to interpret, only brute facts. You can talk about ‘pinning the knight’ or ‘controlling the center’ all you want, but then it’ll suddenly move the king 1 space towards you, and the only reason is that that move makes you lose. The bigger and more comprehensive an endgame database gets, the more alien the moves can become as it defeats you 20 moves from now because the king lands on exactly 1 square in between the bishop and knight etc.) This is problematic because we know from neural security work that even a single parameter can be enough to induce various kinds of backdoors and bizarre unpredicted behavior. It may be ‘interpretable’ but there’s a lot there which is not interpreted, and may not be interpretable at all except by methods themselves so complex as to obviate any security guarantees you hoped to get.
Further, such in-distribution analyses struggle to tell us about targeted attacks—what sort of behavior would an exploiter agent or a targeted gradient adversarial attack induce even in the ultra-narrow domain of chess? Looking at activations computed on ‘normal’ games doesn’t tell us much about that.
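To make the R^2 gap mentioned above concrete, here is a toy sketch of a linear probe. Everything in it is an assumption for illustration: the "activations" are random synthetic vectors, the two "concepts" are made up, and the ridge-regression probe is a generic stand-in, not the paper's code. The point is only that a concept which is not linearly encoded leaves most of the variance unexplained on held-out data, even though the information is fully present in the activations.

```python
import numpy as np

def fit_linear_probe(acts, concept, l2=1e-3):
    """Ridge-regress a scalar concept on activations; returns (weights, bias)."""
    mean = acts.mean(axis=0)
    centered = acts - mean
    y_mean = concept.mean()
    gram = centered.T @ centered + l2 * np.eye(acts.shape[1])
    w = np.linalg.solve(gram, centered.T @ (concept - y_mean))
    b = y_mean - mean @ w
    return w, b

def r_squared(acts, concept, w, b):
    """Coefficient of determination of the probe's predictions."""
    pred = acts @ w + b
    ss_res = np.sum((concept - pred) ** 2)
    ss_tot = np.sum((concept - concept.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins (hypothetical): 2000 "positions", 32 "units".
rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 32))
true_w = rng.normal(size=32)

# A concept that is linearly encoded, plus a little noise.
linear_concept = acts @ true_w + 0.1 * rng.normal(size=2000)
# A concept that is a product of two units: present, but not linearly decodable.
nonlinear_concept = acts[:, 0] * acts[:, 1]

train, test = slice(0, 1500), slice(1500, 2000)

w, b = fit_linear_probe(acts[train], linear_concept[train])
r2_linear = r_squared(acts[test], linear_concept[test], w, b)

w, b = fit_linear_probe(acts[train], nonlinear_concept[train])
r2_nonlinear = r_squared(acts[test], nonlinear_concept[test], w, b)

print(f"held-out R^2, linear concept:    {r2_linear:.3f}")
print(f"held-out R^2, nonlinear concept: {r2_nonlinear:.3f}")
```

A low probe score here says the concept is not linearly readable from these units, not that the network lacks it, which is one reason a sub-100% R^2 is hard to interpret either way.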
That sounds really cool, but it would be even cooler if someone had the time to summarize the main results of the 69-page paper and publish them in a post/comment here.
The paper is really only 28 pages plus lots of graphs in the appendices! If you want to skim, I’d suggest just reading the abstract and then sections 5 and 6 (pp. 16-21). But to summarize:
- Do neural networks learn the same concepts as humans, or at least human-legible concepts? A “yes” would be good news for interpretability (and alignment). Let’s investigate AlphaZero and chess as a case study!
- Yes, over the course of training AlphaZero learns many concepts (and develops behaviours) which have clear correspondence with human concepts.
- Low-level / ground-up interpretability seems very useful here. Learned summaries are also great for chess but rely on a strong ground-truth (e.g. “Stockfish internals”).
- The paper gives details about where in the network and when in the training process things are represented and learned.
- The analysis of differences between the timing and order of developments in human scholarship and AlphaZero training is pretty cool if you play chess; e.g. human experts have diversified their openings (not just 1.e4) since 1700, while AlphaZero narrows down from random to pretty much the modern distribution over GM openings; AlphaZero tends to learn material values before positions and standard openings.
Thanks for the summary! Your first bullet point was my motivation for doing this. I think it’s important to test out interpretability ideas in more challenging domains.
We didn’t really do much interpretability in this paper; it is more meta-interpretability in a sense (i.e. studying whether interpretability should in principle be possible). I’d say section 4 is worth a look, especially section 4.5, which covers fundamental and practical challenges to probing. Section 7 has some NMF analysis, and we open-sourced the NMF factors, which you might find interesting.
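For anyone unfamiliar with NMF (non-negative matrix factorization), a minimal sketch of the general technique follows. It uses the standard multiplicative-update rule on a synthetic non-negative matrix; the matrix shape, the number of factors, and the `nmf` helper are all assumptions for illustration and are not the setup or code used in the paper.

```python
import numpy as np

def nmf(V, k, iters=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) as W @ H with W, H >= 0.

    Uses multiplicative updates for the squared Frobenius error.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        # Each update rescales entries by a non-negative ratio,
        # so W and H stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical data: 64 rows (think "board squares") by 20 columns
# (think "channels"), built from 3 non-negative factors.
rng = np.random.default_rng(1)
true_W = rng.random((64, 3))
true_H = rng.random((3, 20))
V = true_W @ true_H

W, H = nmf(V, k=3)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Because every factor is non-negative, each column of `W` can be viewed as an additive "part" of the input, which is what makes the factors candidates for visual inspection.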
I enjoyed the whole paper! It’s just that “read sections 1 through 8” doesn’t reduce the length much, and 5-6 have some nice short results that can be read alone :-)
Copy of abstract for the too-lazy-to-click:
What is being learned by superhuman neural network agents such as AlphaZero? This question is of both scientific and practical interest. If the representations of strong neural networks bear no resemblance to human concepts, our ability to understand faithful explanations of their decisions will be restricted, ultimately limiting what we can achieve with neural network interpretability. In this work we provide evidence that human knowledge is acquired by the AlphaZero neural network as it trains on the game of chess. By probing for a broad range of human chess concepts we show when and where these concepts are represented in the AlphaZero network. We also provide a behavioural analysis focusing on opening play, including qualitative analysis from chess Grandmaster Vladimir Kramnik. Finally, we carry out a preliminary investigation looking at the low-level details of AlphaZero’s representations, and make the resulting behavioural and representational analyses available online.
I’m one of the authors on this paper—happy to answer any questions/discuss if anyone is interested.
Although the technical details are way too difficult for me, as a chess player I found the article really interesting. When it was first released, AlphaZero seemed to play in a more human-like way than traditional engines such as Stockfish. Does your analysis support this conclusion?