I’m confused about what the authors mean by the italicized phrase. How do you create more neurons without making the model larger?
I would assume various kinds of sparsity and modularization, and avoiding things which have many parameters but few neurons, such as fully-connected layers.
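To make the parameter-vs-neuron tradeoff concrete, here is a toy counting sketch (the layer sizes and the block-diagonal scheme are illustrative assumptions, not anything from the paper):

```python
def dense_params(d_in, d_out):
    # A fully-connected layer: d_in * d_out weights (biases ignored) for d_out neurons.
    return d_in * d_out

def block_diag_params(d_in, d_out, blocks):
    # A modular/block-sparse layer: each of `blocks` blocks maps d_in/blocks
    # inputs to d_out/blocks outputs, still yielding d_out neurons in total.
    return blocks * (d_in // blocks) * (d_out // blocks)

# Same 4096 output neurons either way:
print(dense_params(1024, 4096))          # 4194304 weights
print(block_diag_params(1024, 4096, 4))  # 1048576 weights: 4x the neurons per parameter
```

So for a fixed parameter budget, sparsifying or modularizing the connectivity buys proportionally more neurons, which is the sense in which one can “create more neurons” without a larger model.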
SoLU is a double-edged sword for interpretability. On the one hand, it makes it much easier to study a subset of MLP layer features which end up nicely aligned with neurons. On the other hand, we suspect that there are many other non-neuron-aligned features which are essential to the loss and arguably harder to study than in a regular model. Perhaps more concerningly, if one only looked at the SoLU activation, it would be easy for these features to be invisible and create a false sense that one understands all the features.
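A minimal sketch of how a non-neuron-aligned feature can be invisible in the SoLU activation, using the paper’s definition SoLU(x) = x · softmax(x) (the example vectors are made up for illustration):

```python
import numpy as np

def solu(x):
    # SoLU(x) = x * softmax(x): a winner-take-most nonlinearity that
    # amplifies neuron-aligned activations and suppresses diffuse ones.
    e = np.exp(x - x.max())  # shift for numerical stability
    return x * (e / e.sum())

aligned = np.array([4.0, 0.0, 0.0, 0.0])      # feature concentrated on one neuron
distributed = np.array([1.0, 1.0, 1.0, 1.0])  # feature spread evenly across neurons

print(solu(aligned))      # ~[3.79, 0, 0, 0]: stands out clearly
print(solu(distributed))  # [0.25, 0.25, 0.25, 0.25]: heavily suppressed
```

In the actual architecture a LayerNorm follows the SoLU, which can rescale such suppressed activations back up, so a distributed feature can remain functional for the loss while looking negligible to anyone inspecting only the SoLU output.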
Extremely concerning for safety. The only thing more dangerous than an uninterpretable model is an ‘interpretable’ model. Is there an ‘interpretability tax’ such that all interpretability methods wind up incentivizing covert algorithms (similar to how CycleGAN is incentivized to learn steganography), and interpretability methods risk simply creating mesa-optimizers which optimize for a superficially simple-seeming ‘surface’ Potemkin-village network while the real work gets done elsewhere, out of sight?
(The field of interpretability research as a mesa-optimizer: the blackbox evolutionary search (citations, funding, tenure) of researchers optimizes for finding methods which yield ‘interpretable’ models that work almost as well as uninterpretable ones—but only because the methods are too weak to detect that the ‘interpretable’ models are actually just weird uninterpretable ones which evolved some protective camouflage, and thus are just as dangerous as ever. The field already offers many examples of interpretability methods which produce pretty pictures and convince their researchers as well as others, but which later turn out not to work as well as thought, like saliency maps. One might borrow a quote from cryptography: “any interpretability researcher can invent an interpretability method he personally is not smart enough to understand the uninterpretability thereof.”)