Thank you so much for this writeup of your fascinating findings about interpreting the SVD of the weight matrix, Beren and Sid!
Understanding the degree to which transformer representations are linear vs nonlinear, and developing methods that can help us discover, locate, and interpret nonlinear representations will ultimately be necessary for fully solving interpretability of any nonlinear neural network.
Completely agree. For what it’s worth, I expect interpreting nonlinear representations in complex neural nets to be quite difficult. We should expect linear-algebra methods like SVD to uncover useful information about linear representations in a straightforward manner. But we shouldn’t overupdate on the ease with which linear-algebra methods uncover this subset of information, because a lot of the relevant information is likely to pertain to nonlinear and interconnected representations, and therefore to fall outside of this subset.
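To make concrete what these linear-algebra methods pull out so cheaply, here's a minimal numpy sketch (the weight matrix is a random stand-in, not taken from any real model) of inspecting a weight matrix's singular value spectrum purely in weight space, with no forward passes:

```python
import numpy as np

# Hypothetical weight matrix standing in for, e.g., an MLP or OV weight
# matrix pulled straight from a trained checkpoint.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))

# Static analysis: SVD of the weights alone, no inputs or activations.
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# One cheap summary statistic: effective rank via the entropy of the
# normalized singular values, i.e. how concentrated the matrix's
# linear transformations are in a few directions.
p = S / S.sum()
effective_rank = np.exp(-(p * np.log(p)).sum())
print(f"top-5 singular values: {S[:5]}")
print(f"effective rank: {effective_rank:.1f} of {len(S)}")
```

The columns of U and rows of Vt are the candidate "directions" one would then try to interpret; the point here is only that all of this is computable from the weights alone.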
Analysis of the weights of a given network is therefore a promising form of static analysis for neural networks, analogous to static analysis of source code: it can be run quickly on any given network before the network ever touches live inputs. This could potentially be used for alignment as a first line of defense against any kind of harmful behaviour without having to run the network at all. Techniques that analyze the weights are also typically cheaper computationally, since they do not involve running large numbers of forward passes through the network, storing large amounts of activations, or dealing with large datasets.
Conversely, the downside of weight analysis is that it cannot tell us about specific model behaviours on specific tokens. The weights can instead be thought of as encoding the space of potential transformations that can be applied to an input datapoint, but not any specific transformation. They can probably also be used to derive information about the average behaviour of the network, but not necessarily about its extreme behaviour, which might be what is most useful for alignment.
I thought this was a really good summary of the pros and cons of the methodology.
This is indeed a vital but underdiscussed problem. My SERI MATS team published a post about a game-theoretic model of alignment in which the expected scientific benefit of an interpretability tool can be weighed against the expected cost of the AGI escape risks it enables. The expected cost can be reduced by limiting the capabilities of the AGI and by increasing the quality of security, and the expected scientific benefit can be increased by prioritizing the informational efficiency of the interpretability tool.
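A toy version of that tradeoff, just to illustrate its structure (all parameter names and numbers here are hypothetical, not from the post):

```python
# Toy sketch of the cost-benefit tradeoff for deploying an
# interpretability tool on a capable model. All values are
# illustrative placeholders.

def net_value(info_gain, escape_risk, escape_cost):
    """Expected scientific benefit minus expected cost of enabling escape."""
    return info_gain - escape_risk * escape_cost

# Limiting capabilities / improving security lowers escape_risk;
# prioritizing informationally efficient tools raises info_gain.
baseline = net_value(info_gain=10.0, escape_risk=0.01, escape_cost=5000.0)
hardened = net_value(info_gain=10.0, escape_risk=0.001, escape_cost=5000.0)
print(baseline, hardened)
```

With these made-up numbers the baseline is net-negative and the hardened setup net-positive, which is the qualitative point: the same tool can flip from not worth deploying to worth deploying depending on the security regime.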
Conditional on an organization being dead set on building a superintelligent AGI (which I would strongly oppose, but might be forced to help align if we cannot dissuade the organization in any way), I think efforts to apply security, alignment, and positive-EV interpretability should be targeted at all capability levels, both high and low. Alignment efforts at high-capability levels run into the issue of heightened AGI escape risk. Alignment efforts at low-capability levels run into the issue that alignment gains, if any, may phase-transition out of existence once the AGI moves into a higher-capability regime. We should try our best at both and hope to get lucky.