The ‘takeoff’ in section 6 is pretty interesting. AlphaZero seems to noodle around for a while until it undergoes a sort of phase transition where it latches onto key concepts and can then rapidly improve in strength. That parallels a number of phase transitions in human psychology and elsewhere. Possible connection to ‘ray interference’?
As far as AI risk goes, this strikes me as one of those ‘is the glass half full or half empty?’ results.
I would be pretty shocked if A0 internals could not be transmogrified in some way to human-understood chess concepts. We’re not that bad at chess (the best grandmasters still agree with A0 a nontrivial percentage of the time), we invented it out of all possible games based on human metaphors, we have studied it (and redefined its rules) for half a millennium after evolving it out of earlier human games, and so on. It’s not surprising that there’s a decent number of interpretable concepts. (Whether this is either necessary or sufficient for safety, or how useful it is, is unclear: after all, even if an AI is perfectly interpretable, you’re still vulnerable to problems like “The AI understands, it just doesn’t care, and you are made of atoms it can use for other things.” Humans have a pretty good idea what animals such as pigs like, but pigs are still made of tasty bacon. But being uninterpretable hardly seems like it’d help, so it’d be nice if we could crack open the black boxes a little.)
On the other hand, the unsupervised learning turns up a lot of clusters that are much more difficult to interpret, and even the so-called ‘interpretable’ concepts fall far short of R^2=100%, leaving a big explanatory gap in terms of ‘what is it doing or thinking?’. Nor is there much reason to think that it’d get any more interpretable if it got better or were trained differently. The better something gets at a narrow domain, the less human it may be after a point. (I believe one of the AlphaGo/Zero papers shows that ‘move prediction of pros’ describes an inverted U-curve where after a point, AlphaZero starts disagreeing more with the human expert moves? If I’m misremembering, then as a more powerful example, we can point to things like chess endgame databases doing provably-optimal play: yes, you can often dissect the endgame database moves in terms of ‘exposing the king’ or whatever, but in many cases, the one and only reason for a move is “because it works”, and we know the database isn’t “thinking” anything because of how we computed it. There is nothing to interpret, only brute facts. You can talk about ‘pinning the knight’ or ‘controlling the center’ all you want, but then it’ll suddenly move the king 1 space towards you, and the only reason is that that move makes you lose. The bigger and more comprehensive an endgame database gets, the more alien the moves can become as it defeats you 20 moves from now because the king lands on exactly 1 square in between the bishop and knight etc.) This explanatory gap is problematic because we know from neural security work that even a single parameter can be enough to induce various kinds of backdoors and bizarre unpredicted behavior. The network may be ‘interpretable’, but there’s a lot there which is not interpreted, and which may not be interpretable at all except by methods themselves so complex as to obviate any security guarantees you hoped to get.
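To make the endgame-database point concrete, here is a minimal sketch of how such a table gets computed. It is not a chess tablebase: it assumes a toy subtraction game (take 1, 3, or 4 stones; whoever cannot move loses), and `build_tablebase` and its parameters are names I made up for illustration. The point is only that every entry is filled in by exhaustive backward induction from already-solved positions, so there is no ‘concept’ anywhere in the computation to recover.

```python
def build_tablebase(max_stones, moves=(1, 3, 4)):
    """Exhaustively solve a toy subtraction game by backward induction.

    A position is the number of stones left. A move removes k stones for
    some k in `moves`; the player who cannot move loses. Each entry maps
    stones -> (outcome for the side to move, plies to the end, move to play).
    """
    table = {}
    for n in range(max_stones + 1):
        legal = [k for k in moves if k <= n]
        # Moves that hand the opponent an already-solved lost position.
        winning = [(table[n - k][1], k) for k in legal if table[n - k][0] == "loss"]
        if winning:
            plies, k = min(winning)  # win as fast as possible
            table[n] = ("win", plies + 1, k)
        elif legal:
            # Every move leaves the opponent winning: lose as slowly as possible.
            plies, k = max((table[n - k][1], k) for k in legal)
            table[n] = ("loss", plies + 1, k)
        else:
            table[n] = ("loss", 0, None)  # no legal move: side to move has lost
    return table


table = build_tablebase(14)
lost_positions = [n for n in table if table[n][0] == "loss"]
print(lost_positions)  # [0, 2, 7, 9, 14]
print(table[5])        # ('win', 3, 3)
```

Ask the table why you should take 3 stones from 5 and the only answer is ‘because that leaves 2, which is lost’. Any regularity (in this range the lost positions recur every 7 stones) is something we read into the table afterwards, not something that was used to build it.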
Further, such in-distribution analyses struggle to tell us about targeted attacks—what sort of behavior would an exploiter agent or a targeted gradient adversarial attack induce even in the ultra-narrow domain of chess? Looking at activations computed on ‘normal’ games doesn’t tell us much about that.
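As a rough illustration of why activations on ‘normal’ inputs say little about attacks, here is a sketch of a single fast-gradient-sign step (FGSM, the standard adversarial-example technique) against a toy linear evaluator. Everything here is an assumption for illustration: the evaluator, the continuous random features, and the names `score` and `fgsm_perturb` are hypothetical and have nothing to do with AlphaZero’s actual network. Chess positions are discrete, so a real attack would have to search over legal positions or moves instead.

```python
import random


def score(weights, features):
    """Toy linear evaluator: higher means 'better for us'."""
    return sum(w * x for w, x in zip(weights, features))


def sign(v):
    return (v > 0) - (v < 0)


def fgsm_perturb(weights, features, eps):
    """One fast-gradient-sign step that lowers the score.

    For a linear model the gradient of the score with respect to the
    input is just the weight vector, so no autodiff is needed here.
    """
    return [x - eps * sign(w) for w, x in zip(weights, features)]


random.seed(0)
weights = [random.gauss(0, 1) for _ in range(256)]
features = [random.gauss(0, 1) for _ in range(256)]

eps = 0.05  # each feature moves by at most this much
adversarial = fgsm_perturb(weights, features, eps)

clean_score = score(weights, features)
attacked_score = score(weights, adversarial)
# The drop is eps times the L1 norm of the weights, whatever the input was.
expected_drop = eps * sum(abs(w) for w in weights)
print(clean_score, attacked_score, expected_drop)
```

No single feature changes by more than `eps`, yet the shifts all push the same way and add up across 256 dimensions. Nothing about the evaluator’s behavior on typical inputs would have flagged that direction as special.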