Here’s an IMO under-appreciated lesson from the Geometry of Truth paper: Why logistic regression finds imperfect feature directions, yet produces better probes.
Consider this distribution of True and False activations from the paper:
The True and False activations are simply shifted relative to each other by the Truth direction $\theta_t$. However, there is also an uncorrelated but non-orthogonal direction $\theta_f$ along which the activations vary.
The best possible logistic regression (LR) probing direction is the direction orthogonal to the hyperplane separating the two clusters, $\theta_{lr}$. Unintuitively, the best probing direction is not the pure Truth feature direction $\theta_t$!
This is one reason why steering and (LR) probing directions differ: for steering you'd want the actual Truth direction $\theta_t$,[1] while for (optimal) probing you want $\theta_{lr}$.
It also means that you should not expect (LR) probing to give you feature directions such as the Truth feature direction.
The paper also introduces mass-mean probing: in the (uncorrelated) toy scenario, the difference between the class centroids recovers the pure Truth feature direction, $\theta_{mm} = \theta_t$.
Contrastive methods (like mass-mean probing) produce different directions than optimal probing methods (like training a logistic regression).
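To make this concrete, here is a minimal sketch of the toy scenario in 2D (the specific $\theta_t$, $\theta_f$, and all numbers are my own illustration, not from the paper): the logistic regression probe's weight vector tilts away from $\theta_t$ in order to cancel the variation along $\theta_f$, while the difference of centroids recovers $\theta_t$.

```python
# Minimal 2D sketch of the toy scenario (illustrative numbers, not from the paper):
# True and False activations are offset along theta_t, and both clusters vary along
# an uncorrelated but non-orthogonal direction theta_f.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

theta_t = np.array([1.0, 0.0])                   # assumed Truth direction
theta_f = np.array([1.0, 1.0]) / np.sqrt(2)      # assumed nuisance direction (non-orthogonal to theta_t)

labels = rng.integers(0, 2, size=n)              # 0 = False, 1 = True
f_coeff = rng.normal(0.0, 2.0, size=n)           # variation along theta_f, independent of the label
X = (np.outer(labels, theta_t)
     + np.outer(f_coeff, theta_f)
     + rng.normal(0.0, 0.1, size=(n, 2)))        # small isotropic noise

# Logistic-regression probe: its weight vector is the separating direction theta_lr.
lr = LogisticRegression(max_iter=10_000).fit(X, labels)
theta_lr = lr.coef_[0] / np.linalg.norm(lr.coef_[0])

# Mass-mean direction: difference of the class centroids.
theta_mm = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
theta_mm /= np.linalg.norm(theta_mm)

print("theta_t :", theta_t)
print("theta_mm:", theta_mm)    # ~ [1, 0], i.e. ~theta_t
print("theta_lr:", theta_lr)    # clearly tilted away from theta_t, cancelling the theta_f variation
```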
In this shortform I consider only uncorrelated features, not spurious (or non-spurious) correlations. Correlations are harder. The Geometry of Truth paper suggests that mass-mean probing also handles spurious correlations better, but that case is less clear-cut than the uncorrelated example here.
Thanks to @Adrià Garriga-alonso for helpful discussions about this!
[1] If you steered with $\theta_{lr}$ instead, you would unintentionally affect $\theta_f$ along with $\theta_t$.
A fun side note that probably isn't useful: I think if you shuffle the data across neurons (which effectively gets rid of the covariance amongst neurons) and then do linear regression, you will get $\theta_t$.
This is a somewhat common technique in neuroscience when studying the correlation structure of neural representations and their separability.
I just chatted with Adam and he explained a bit more; summarising it here: what the shuffling does is create a new dataset from each category, $(x_1, y_{17}), (x_2, y_{12}), (x_3, y_5), \dots$, where the x and y pairs are shuffled (or, in higher dimensions, the samples for each dimension are shuffled independently). The shuffle leaves the means (centroids) invariant but removes correlations between directions. Then you can train a logistic regression on the shuffled data. You might prefer this over calculating the mean directly to get an idea of how much low sample size is affecting your results.
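Here is my reading of that procedure as a sketch (the toy data reuses the illustrative setup above, and the `shuffle_within_class` helper is hypothetical, not code from the discussion): within each class, permute each dimension independently across samples, which keeps the class centroids but destroys correlations between dimensions; a probe trained on the shuffled data then no longer tilts to cancel $\theta_f$.

```python
# Sketch of the shuffling idea (my reading of it, not a reference implementation):
# within each class, independently permute each dimension across samples. This keeps
# the class centroids fixed but removes cross-dimension correlations, so the probe
# trained on shuffled data points ~along the centroid difference.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
theta_t = np.array([1.0, 0.0])                   # assumed Truth direction
theta_f = np.array([1.0, 1.0]) / np.sqrt(2)      # assumed nuisance direction
labels = rng.integers(0, 2, size=n)
X = (np.outer(labels, theta_t)
     + np.outer(rng.normal(0.0, 2.0, size=n), theta_f)
     + rng.normal(0.0, 0.1, size=(n, 2)))

def shuffle_within_class(X, labels, rng):
    """Independently permute each dimension of X across the samples of each class."""
    X_shuffled = X.copy()
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        for dim in range(X.shape[1]):
            X_shuffled[idx, dim] = X[rng.permutation(idx), dim]
    return X_shuffled

X_shuffled = shuffle_within_class(X, labels, rng)

probe = LogisticRegression(max_iter=10_000).fit(X_shuffled, labels)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print(direction)   # ~ theta_t = [1, 0] (up to a per-dimension rescaling in general)
```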
Is this guaranteed to give you the same direction as mass-mean probing?
Thinking about it quickly, consider the solution to ordinary least squares regression. With a $y$ that one-hot encodes the label, it is $(X^\top X)^{-1} X^\top y$. Note that $X^\top X = N \cdot \mathrm{Cov}(X, X)$ (for mean-centered $X$). The procedure Adam describes makes the dimensions of the sampled $X$s uncorrelated, which is exactly the same as zeroing out the off-diagonal elements of the covariance.
If the covariance is diagonal, then $(X^\top X)^{-1}$ is also diagonal, and it follows that the OLS solution is indeed (proportional to) an unweighted average of the datapoints that correspond to each label! Each dimension of the data $x$ is just multiplied by some coefficient, one per dimension, coming from the diagonal of the covariance.
I’d expect logistic regression to choose the ~same direction.
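For what it's worth, here is a quick numerical check of this (a sketch with made-up numbers; explicitly uncorrelated per-dimension noise stands in for the shuffled data): OLS, logistic regression, and the mass-mean direction rescaled by the inverse per-dimension (within-class) variance all land on roughly the same direction, which differs from the raw mass-mean direction whenever the variances are unequal.

```python
# Sanity check with made-up data: per-dimension noise is uncorrelated (standing in
# for the shuffled case) but has unequal variances. OLS, logistic regression, and
# the mass-mean direction rescaled by the inverse within-class variance all agree;
# the raw mass-mean direction differs because of the unequal variances.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 20_000, 3
labels = rng.integers(0, 2, size=n)
delta_mu = np.array([1.0, 0.5, -0.5])                 # assumed class-mean difference
stds = np.array([0.5, 1.0, 2.0])                      # uncorrelated, unequal per-dimension noise
X = np.outer(labels, delta_mu) + rng.normal(size=(n, d)) * stds

X_centered = X - X.mean(axis=0)                       # mean-center so that X^T X = N * Cov(X, X)
beta_ols = np.linalg.lstsq(X_centered, labels.astype(float), rcond=None)[0]

beta_lr = LogisticRegression(max_iter=10_000).fit(X, labels).coef_[0]

mass_mean = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
within_var = 0.5 * (X[labels == 1].var(axis=0) + X[labels == 0].var(axis=0))
rescaled_mm = mass_mean / within_var                  # mass-mean, rescaled per dimension

for name, v in [("OLS", beta_ols), ("LR", beta_lr),
                ("rescaled mass-mean", rescaled_mm), ("raw mass-mean", mass_mean)]:
    print(f"{name:>18}:", np.round(v / np.linalg.norm(v), 3))
# The first three are ~the same direction; the raw mass-mean direction is not.
```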
Very clever technique!