I’ve been thinking that the way to talk about how a neural network actually works (instead of how it could hypothetically come to work by adding new features) would be to project away components of its activations/weights, but I got stuck on the issue that you can effectively add new components by subtracting off large irrelevant components.
I’ve also been thinking about deception and its relationship to “natural abstractions”, and in that case it seems to me that our primary hope would be that the concepts we care about are represented at a larger “magnitude” than the deceptive concepts. This is basically using L2-regularized regression to predict the outcome.
It seems potentially fruitful to use something akin to L2 regularization when projecting away components. The most straightforward translation of the regularization would be to analogize the regression coefficient to $\frac{(f(x)-f(x-uu^Tx))u^T}{u^Tx}$, in which case the L2 term would be $\left\|\frac{(f(x)-f(x-uu^Tx))u^T}{\|u^Tx\|}\right\|^2$, which reduces to $\frac{\|f(x)-f(x-uu^Tx)\|^2}{\|u^Tx\|^2}$.
If $f(w) = P_w(o|i)$ is the probability[1] that a neural network with weights $w$ gives to an output $o$ given a prompt $i$, then when you’ve actually explained $o$, it seems like you’d basically have $f(w) - f(w - uu^Tw) \approx f(w)$, or in other words $P_{w-uu^Tw}(o|i) \approx 0$. Therefore I’d want to keep the regularization coefficient weak enough that I’m in that regime.
In that case, the L2 term would then basically reduce to minimizing $\frac{1}{\|u^Tw\|^2}$, or in other words maximizing $\|u^Tw\|^2$. Realistically, both this and $P_{w-uu^Tw}(o|i) \approx 0$ are probably achieved when $u = \frac{w}{\|w\|}$, which on the one hand is sensible (“the reason for the network’s output is because of its weights”) but on the other hand is too trivial to be interesting.
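(To make that concrete, here’s a rough, untested numpy sketch of the two terms involved; `prob_of_output` is a hypothetical stand-in for $w \mapsto P_w(o|i)$, i.e. a forward pass of the network with the flattened weight vector $w$.)

```python
import numpy as np

# Minimal sketch: score a candidate direction u for "explaining" the output,
# together with the weak L2-style penalty discussed above.
# `prob_of_output` is a hypothetical stand-in for w -> P_w(o|i).

def projection_terms(prob_of_output, w: np.ndarray, u: np.ndarray):
    u = u / np.linalg.norm(u)                              # unit direction to project away
    w_ablated = w - u * (u @ w)                            # w - u u^T w
    drop = prob_of_output(w) - prob_of_output(w_ablated)   # ≈ P_w(o|i) when u explains o
    l2_term = drop ** 2 / (u @ w) ** 2                     # ||f(w) - f(w - u u^T w)||^2 / ||u^T w||^2
    return drop, l2_term

# One would then look for directions u with a large `drop` while keeping
# alpha * l2_term small, with alpha weak enough to stay in the drop ≈ P_w(o|i) regime.
```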
In regression, eigendecomposition gives us more gears, because L2-regularized regression basically changes the regression coefficients for the principal components by a factor of $\frac{\lambda}{\lambda+\alpha}$, where $\lambda$ is the variance of the principal component and $\alpha$ is the regularization coefficient. So one can consider all the principal components ranked by $\beta\frac{\lambda}{\lambda+\alpha}$ to get a feel for the gears driving the regression. When $\alpha$ is small, as it is in our regime, this ranking is of course the same order as the one you get from $\beta\lambda$, the covariance between the PCs and the dependent variable.
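(Here’s a small numpy illustration of that eigen-view of ridge regression, on made-up synthetic data:)

```python
import numpy as np

# Eigen-view of ridge regression: in the PC basis, the OLS coefficient of each
# component gets shrunk by lambda / (lambda + alpha), and components can be
# ranked by beta * lambda / (lambda + alpha) (~ beta * lambda for small alpha).

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # correlated features
X -= X.mean(axis=0)
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=500)

alpha = 0.1
lam, V = np.linalg.eigh(X.T @ X / len(X))       # PC variances lambda and directions V
Z = X @ V                                       # data expressed in the PC basis
cov_zy = (Z * y[:, None]).mean(axis=0)          # covariance of each PC with y (= beta * lambda)
beta_pc = cov_zy / lam                          # per-PC OLS coefficient beta
beta_ridge_pc = beta_pc * lam / (lam + alpha)   # ridge shrinks each beta by lambda/(lambda+alpha)
ranking = np.argsort(-np.abs(beta_ridge_pc))    # "gears" ranking of the PCs
```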
This suggests that if we had a change of basis for $w$, we could obtain a nice ranking of its components. Though this is complicated by the fact that $f$ is not a linear function, and therefore we have no equivalent of $\beta$. To me, this makes it extremely tempting to use the Hessian eigenvectors $V$ as a basis, as this is the thing that at least makes each of the inputs to $f$ “as independent as possible”. Though rather than ranking by the eigenvalues of $H_f(w)$ (which ideally we’d actually prefer to be small rather than large, to stay in the ~linear regime), it seems more sensible to rank by the components of the projection of $w$ onto $V$ (which represent “the extent to which $w$ includes this Hessian component”).
In summary, if $H_w P_w(o|i) = V \Lambda V^T$, then we can rank the importance of each component $V_j$ by $(P_{w - V_jV_j^Tw}(o|i) - P_w(o|i))\,V_j^Tw$.
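(A rough PyTorch sketch of that ranking, again with `prob_of_output` as a hypothetical stand-in for $w \mapsto P_w(o|i)$; forming the full Hessian like this is only feasible for tiny $w$, and a real network would need Hessian-vector products / Lanczos instead.)

```python
import torch
from torch.autograd.functional import hessian

# Sketch: rank Hessian eigenvectors V_j of P_w(o|i) by
# (P_{w - V_j V_j^T w}(o|i) - P_w(o|i)) * V_j^T w.

def rank_hessian_components(prob_of_output, w: torch.Tensor) -> torch.Tensor:
    H = hessian(prob_of_output, w)               # H_w P_w(o|i), shape (n, n)
    eigvals, V = torch.linalg.eigh(H)            # H = V diag(Lambda) V^T
    base = prob_of_output(w)
    scores = torch.empty(len(eigvals))
    for j in range(len(eigvals)):
        v = V[:, j]
        coord = v @ w                            # V_j^T w
        ablated = prob_of_output(w - v * coord)  # P_{w - V_j V_j^T w}(o|i)
        scores[j] = (ablated - base) * coord
    return torch.argsort(scores.abs(), descending=True)  # most important components first
```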
Maybe I should touch grass and start experimenting with this now, but there are still two things that I don’t like:
There’s a sense in which I still don’t like using the Hessian, because it seems like it would be incentivized to mix mechanisms that don’t exist in the neural network together with ones that do. I’ve considered alternatives like collecting gradient vectors along the training of the neural network and doing something with them, but that seems bulky and very restricted in use.
If we’re doing the whole Hessian thing, then we’re modelling $f$ as quadratic, yet $f(x+\delta x) - f(x)$ seems like an attribution method that’s more appropriate when modelling $f$ as ~linear. I don’t think I can just switch all the way to quadratic models, because realistically $f$ is probably going to be more like sigmoidal-quadratic, and for large steps $\delta x$ the changes to a sigmoidal-quadratic function are better modelled by $f(x+\delta x) - f(x)$ than by some quadratic thing. But ideally I’d have something smarter...
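(To illustrate that last point with a made-up toy: take a sigmoidal-quadratic $f$ and compare the true change from a large step against its second-order Taylor model.)

```python
import numpy as np

# Toy "sigmoidal-quadratic" f: the true change from a large step is bounded,
# while a quadratic (second-order Taylor) model of the change is not.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))

def f(x):
    return sigmoid(x @ A @ x)

x = rng.normal(size=5)
dx = 3.0 * rng.normal(size=5)                  # a large step

true_change = f(x + dx) - f(x)                 # always in (-1, 1)

# numerical gradient and Hessian of f at x (finite differences)
eps, I = 1e-3, np.eye(5)
grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in I])
H = np.array([[(f(x + eps * (ei + ej)) - f(x + eps * ei) - f(x + eps * ej) + f(x)) / eps**2
               for ej in I] for ei in I])
taylor_change = grad @ dx + 0.5 * dx @ H @ dx  # quadratic model of the change
print(true_change, taylor_change)              # the quadratic model can be far off for large dx
```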
Normally one would use log probs, but for reasons I don’t want to go into right now, I’m currently looking at probabilities instead.
Much dumber ideas have turned into excellent papers
True, though I think the Hessian is problematic enough that I’d either want to wait until I have something better, or want to use a simpler method.
It might be worth going into more detail about that. The Hessian for the probability of a neural network output is mostly determined by the Jacobian of the network. But in some cases the Jacobian gives us exactly the opposite of what we want.
If we consider the toy model of a neural network with no input neurons and only 1 output neuron, $g(w) = \prod_i w_i$ (which I imagine to represent a path through the network, i.e. a bunch of weights get multiplied along the layers to the end), then the Jacobian is the gradient $(J_g(w))_j = (\nabla g(w))_j = \prod_{i \neq j} w_i = \frac{\prod_i w_i}{w_j}$. If we ignore the overall magnitude of this vector and just consider how the contribution that it assigns to each weight varies over the weights, then we get $(J_g(w))_j \propto \frac{1}{w_j}$. Yet for this toy model, “obviously” the contribution of weight $j$ “should” be proportional to $w_j$.
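(A quick numerical check of that, with arbitrarily chosen weights:)

```python
import numpy as np

# Toy model g(w) = prod_i w_i: the gradient component for weight j is
# prod_{i != j} w_i = g(w) / w_j, which is largest for the *smallest* weight.

w = np.array([0.5, 1.0, 2.0, 4.0])
grad = np.array([np.prod(np.delete(w, j)) for j in range(len(w))])
print(grad)            # [8. 4. 2. 1.]
print(np.prod(w) / w)  # same values, i.e. proportional to 1/w_j rather than w_j
```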
So derivative-based methods seem to give the absolutely worst-possible answer in this case, which makes me pessimistic about their ability to meaningfully separate the actual mechanisms of the network (again they may very well work for other things, such as finding ways of changing the network “on the margin” to be nicer).