Here’s an IMO under-appreciated lesson from the Geometry of Truth paper: Why logistic regression finds imperfect feature directions, yet produces better probes.
Consider this distribution of True and False activations from the paper:
The True and False activations are simply shifted relative to each other by the Truth direction $\theta_t$. However, there is also an uncorrelated but non-orthogonal direction $\theta_f$ along which the activations vary.
The best possible logistic regression (LR) probing direction is the direction orthogonal to the hyperplane separating the two clusters, $\theta_{lr}$. Unintuitively, the best probing direction is not the pure Truth feature direction $\theta_t$!
This is one reason why steering and (LR) probing directions differ: for steering you'd want the actual Truth direction $\theta_t$,[1] while for (optimal) probing you want $\theta_{lr}$.
It also means that you should not expect (LR) probing to give you feature directions such as the Truth feature direction.
The paper also introduces mass-mean probing: in the (uncorrelated) toy scenario, the difference between the class centroids recovers the pure Truth feature direction, $\theta_{mm} = \theta_t$.
Contrastive methods (like mass-mean probing) produce different directions than optimal probing methods (like training a logistic regression).
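To make this concrete, here is a minimal sketch of the toy scenario in 2D (the specific $\theta_t$, $\theta_f$, and all numbers are my own illustration, not from the paper): the logistic regression probe's weight vector tilts away from $\theta_t$ in order to cancel the variation along $\theta_f$, while the difference of centroids recovers $\theta_t$.

```python
# Minimal 2D sketch of the toy scenario (illustrative numbers, not from the paper):
# True and False activations are offset along theta_t, and both clusters vary along
# an uncorrelated but non-orthogonal direction theta_f.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

theta_t = np.array([1.0, 0.0])                   # assumed Truth direction
theta_f = np.array([1.0, 1.0]) / np.sqrt(2)      # assumed nuisance direction (non-orthogonal to theta_t)

labels = rng.integers(0, 2, size=n)              # 0 = False, 1 = True
f_coeff = rng.normal(0.0, 2.0, size=n)           # variation along theta_f, independent of the label
X = (np.outer(labels, theta_t)
     + np.outer(f_coeff, theta_f)
     + rng.normal(0.0, 0.1, size=(n, 2)))        # small isotropic noise

# Logistic-regression probe: its weight vector is the separating direction theta_lr.
lr = LogisticRegression(max_iter=10_000).fit(X, labels)
theta_lr = lr.coef_[0] / np.linalg.norm(lr.coef_[0])

# Mass-mean direction: difference of the class centroids.
theta_mm = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
theta_mm /= np.linalg.norm(theta_mm)

print("theta_t :", theta_t)
print("theta_mm:", theta_mm)    # ~ [1, 0], i.e. ~theta_t
print("theta_lr:", theta_lr)    # clearly tilted away from theta_t, cancelling the theta_f variation
```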
In this shortform I consider only uncorrelated features, not spurious (or non-spurious) correlations. Correlations are harder. The Geometry of Truth paper suggests that mass-mean probing also handles spurious correlations better, but that case is less clear-cut than the uncorrelated example here.
Thanks to @Adrià Garriga-alonso for helpful discussions about this!
[1] If you steered with $\theta_{lr}$ instead, you would unintentionally affect $\theta_f$ along with $\theta_t$.
A fun side note that probably isn't useful: I think if you shuffle the data across neurons (which effectively gets rid of the covariance amongst neurons) and then do linear regression, you will get $\theta_t$.
This is a somewhat common technique in neuroscience when studying the correlation structure of neural representations and their separability.
I just chatted with Adam and he explained a bit more; summarising it here: what the shuffling does is create a new dataset from each category, $(x_1, y_{17}), (x_2, y_{12}), (x_3, y_5), \dots$, where the x and y pairs are shuffled (or, in higher dimensions, the samples for each dimension are shuffled independently). The shuffle leaves the means (centroids) invariant but removes correlations between directions. Then you can train a logistic regression on the shuffled data. You might prefer this over calculating the mean directly to get an idea of how much low sample size is affecting your results.
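Here is my reading of that procedure as a sketch (the toy data reuses the illustrative setup above, and the `shuffle_within_class` helper is hypothetical, not code from the discussion): within each class, permute each dimension independently across samples, which keeps the class centroids but destroys correlations between dimensions; a probe trained on the shuffled data then no longer tilts to cancel $\theta_f$.

```python
# Sketch of the shuffling idea (my reading of it, not a reference implementation):
# within each class, independently permute each dimension across samples. This keeps
# the class centroids fixed but removes cross-dimension correlations, so the probe
# trained on shuffled data points ~along the centroid difference.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
theta_t = np.array([1.0, 0.0])                   # assumed Truth direction
theta_f = np.array([1.0, 1.0]) / np.sqrt(2)      # assumed nuisance direction
labels = rng.integers(0, 2, size=n)
X = (np.outer(labels, theta_t)
     + np.outer(rng.normal(0.0, 2.0, size=n), theta_f)
     + rng.normal(0.0, 0.1, size=(n, 2)))

def shuffle_within_class(X, labels, rng):
    """Independently permute each dimension of X across the samples of each class."""
    X_shuffled = X.copy()
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        for dim in range(X.shape[1]):
            X_shuffled[idx, dim] = X[rng.permutation(idx), dim]
    return X_shuffled

X_shuffled = shuffle_within_class(X, labels, rng)

probe = LogisticRegression(max_iter=10_000).fit(X_shuffled, labels)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print(direction)   # ~ theta_t = [1, 0] (up to a per-dimension rescaling in general)
```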
Is this guaranteed to give you the same direction as mass-mean probing?
Thinking about it quickly, consider the solution to ordinary least squares regression. With a $y$ that one-hot encodes the label, it is $(X^\top X)^{-1} X^\top y$. Note that $X^\top X = N \cdot \mathrm{Cov}(X, X)$ (for mean-centered $X$). The procedure Adam describes makes the dimensions of the sampled $X$s uncorrelated, which is exactly the same as zeroing out the off-diagonal elements of the covariance.
If the covariance is diagonal, then $(X^\top X)^{-1}$ is also diagonal, and it follows that the OLS solution is indeed (proportional to) an unweighted average of the datapoints that correspond to each label! Each dimension of the data $x$ is just multiplied by some coefficient, one per dimension, coming from the diagonal of the covariance.
I’d expect logistic regression to choose the ~same direction.
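For what it's worth, here is a quick numerical check of this (a sketch with made-up numbers; explicitly uncorrelated per-dimension noise stands in for the shuffled data): OLS, logistic regression, and the mass-mean direction rescaled by the inverse per-dimension (within-class) variance all land on roughly the same direction, which differs from the raw mass-mean direction whenever the variances are unequal.

```python
# Sanity check with made-up data: per-dimension noise is uncorrelated (standing in
# for the shuffled case) but has unequal variances. OLS, logistic regression, and
# the mass-mean direction rescaled by the inverse within-class variance all agree;
# the raw mass-mean direction differs because of the unequal variances.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 20_000, 3
labels = rng.integers(0, 2, size=n)
delta_mu = np.array([1.0, 0.5, -0.5])                 # assumed class-mean difference
stds = np.array([0.5, 1.0, 2.0])                      # uncorrelated, unequal per-dimension noise
X = np.outer(labels, delta_mu) + rng.normal(size=(n, d)) * stds

X_centered = X - X.mean(axis=0)                       # mean-center so that X^T X = N * Cov(X, X)
beta_ols = np.linalg.lstsq(X_centered, labels.astype(float), rcond=None)[0]

beta_lr = LogisticRegression(max_iter=10_000).fit(X, labels).coef_[0]

mass_mean = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
within_var = 0.5 * (X[labels == 1].var(axis=0) + X[labels == 0].var(axis=0))
rescaled_mm = mass_mean / within_var                  # mass-mean, rescaled per dimension

for name, v in [("OLS", beta_ols), ("LR", beta_lr),
                ("rescaled mass-mean", rescaled_mm), ("raw mass-mean", mass_mean)]:
    print(f"{name:>18}:", np.round(v / np.linalg.norm(v), 3))
# The first three are ~the same direction; the raw mass-mean direction is not.
```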
Very clever technique!