Ablating along the difference of the means makes both CCS & supervised learning fail, i.e. reduces their accuracy to random guessing. Therefore:
The fact that Recursive CCS finds many good directions is not due to some “intrinsic redundancy” of the data. There exists a single direction which contains all the linearly available information.
The fact that Recursive CCS finds strictly more than one good direction means that CCS is not efficient at locating all information related to truth: it is not able to find a direction which contains as much information as the direction found by taking the difference of the means. Note: Logistic Regression seems to be about as leaky as CCS; see INLP, which is like Recursive CCS but with Logistic Regression.
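A minimal numerical sketch of the difference-of-means ablation (with an assumed synthetic setup, not the original experiment: a single "truth" direction plus isotropic noise). After projecting out the difference of the class-conditional means, a supervised probe drops close to chance.

```python
# Sketch: rank-1 ablation along the difference of the means removes the
# linearly available label signal. Synthetic data, assumed parameters.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 16
y = rng.integers(0, 2, n)
# Hidden states: class signal along one fixed direction, plus noise.
signal = rng.normal(size=d)
X = np.outer(2 * y - 1, signal) + rng.normal(size=(n, d))

# Difference of the class-conditional means, normalized.
delta = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
delta /= np.linalg.norm(delta)

# Rank-1 ablation: project out the delta direction.
X_ablated = X - np.outer(X @ delta, delta)

acc_before = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
acc_after = LogisticRegression(max_iter=1000).fit(X_ablated, y).score(X_ablated, y)
print(acc_before, acc_after)  # near-perfect before, close to chance after
```

The residual accuracy is slightly above 0.5 only because the mean difference is estimated from a finite sample.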
I don’t think that’s a fair characterization of what you found. Suppose, for example, that you’re given a vector in R^n whose i-th component is X+ε_i, where X is a random variable with high variance, and ε_1,...,ε_n are i.i.d. with mean 0 and tiny variance. There is a direction which contains all the information about X contained in the vector, namely the average of the coordinates. Subtracting out the mean of the coordinates from each coordinate will remove all information about X. But the data is plenty redundant; there are n orthogonal directions each of which contains almost all of the available information about X, so a probe trained to recover X that learns to just copy one of the coordinates will be pretty efficient at recovering X. If the ε_i have variance 0 (i.e. are just constants always equal to 0), then there are n orthogonal directions each of which contains all information about X, and a probe that copies one of them is perfectly efficient at extracting all information about X.
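The X + ε_i example can be checked numerically (assumed variances: 10 for X, 0.1 for the ε_i): each coordinate alone is an almost-perfect probe for X, yet subtracting the per-sample mean of the coordinates destroys all information about X.

```python
# Numerical version of the X + ε_i example (assumed parameters).
import numpy as np

rng = np.random.default_rng(0)
n_samples, n = 10000, 8
X_var = rng.normal(0, 10, size=(n_samples, 1))   # high-variance X
eps = rng.normal(0, 0.1, size=(n_samples, n))    # tiny-variance noise
data = X_var + eps                                # i-th column is X + ε_i

# Any single coordinate is an almost-perfect probe for X.
corr_single = np.corrcoef(data[:, 0], X_var[:, 0])[0, 1]

# Subtracting the per-sample mean of the coordinates removes X entirely:
# what remains is ε_0 minus the mean of the ε_i, independent of X.
centered = data - data.mean(axis=1, keepdims=True)
corr_after = np.corrcoef(centered[:, 0], X_var[:, 0])[0, 1]
print(corr_single, corr_after)  # ≈ 1.0 and ≈ 0.0
```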
If you can find multiple orthogonal linear probes that each get good performance at recovering some feature, then something like this must be happening.
“Redundancy” depends on your definition, and I agree that I didn’t choose a generous one.
Here is an even simpler example than yours: positive points are all at (1...1) and negative points are all at (-1...-1). Then all canonical directions are good classifiers. This is “high correlation redundancy” with the definition you want to use. There is high correlation redundancy in our toy examples and in actual datasets.
What I wanted to argue against is the naive view you might have, which could be “there is no hope of finding a direction which encodes all information, because of redundancy”, which I would call “high ablation redundancy”. There is no high ablation redundancy either in our toy examples (in mine, all the information is along (1...1)) or in actual datasets.
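The (1...1)/(-1...-1) toy example makes the distinction concrete (a small amount of noise is added here, as an assumption, so the classifiers are non-trivial): every canonical direction is a perfect classifier on its own, yet ablating the single direction (1...1) removes all class information.

```python
# High correlation redundancy without high ablation redundancy.
import numpy as np

d = 5
rng = np.random.default_rng(0)
pos = np.ones((100, d)) + 0.01 * rng.normal(size=(100, d))
neg = -np.ones((100, d)) + 0.01 * rng.normal(size=(100, d))

# Each canonical direction e_i separates the classes on its own.
for i in range(d):
    assert (pos[:, i] > 0).all() and (neg[:, i] < 0).all()

# But projecting out the normalized (1...1) direction collapses both
# classes onto (nearly) the same point cloud around the origin.
u = np.ones(d) / np.sqrt(d)
pos_abl = pos - np.outer(pos @ u, u)
neg_abl = neg - np.outer(neg @ u, u)
spread = max(np.abs(pos_abl).max(), np.abs(neg_abl).max())
print(spread)  # tiny: nothing class-relevant survives the rank-1 ablation
```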
What you’re calling ablation redundancy is a measure of nonlinearity of the feature being measured, not any form of redundancy, and the view you quote doesn’t make sense as stated, as nonlinearity, rather than redundancy, would be necessary for its conclusion. If you’re trying to recover some feature f: R^n → R, and there’s any vector v ∈ R^n and scalar c ∈ R such that f(x) = v⋅x + c for all data x ∈ R^n (regardless of whether there are multiple such v, c, which would happen if the data is contained in a proper affine subspace), then there is a direction such that projection along it makes it impossible for a linear probe to get any information about the value of f. That direction is Σv, where Σ is the covariance matrix of the data. This works because if w ⊥ Σv, then the random variables x ↦ w⋅x and x ↦ v⋅x are uncorrelated (since Cov(v⋅x, w⋅x) = w^T Σ v = 0), and thus w⋅x is uncorrelated with f(x).
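The Σv claim is easy to check numerically (with an assumed correlated toy distribution): after projecting along Σv, the best linear probe explains essentially none of the variance of f(x) = v⋅x + c.

```python
# Numerical check: projecting along Σv makes every linear probe
# uncorrelated with the linear feature f(x) = v·x + c.
import numpy as np

rng = np.random.default_rng(0)
d = 6
A = rng.normal(size=(d, d))
X = rng.normal(size=(20000, d)) @ A.T   # correlated, non-isotropic data
Sigma = np.cov(X, rowvar=False)         # covariance Σ of the data

v = rng.normal(size=d)
f = X @ v + 3.0                         # the linear feature, with c = 3

u = Sigma @ v
u /= np.linalg.norm(u)                  # ablation direction ∝ Σv
X_abl = X - np.outer(X @ u, u)          # project along Σv

# Best linear probe on the ablated data (ordinary least squares).
Xc = X_abl - X_abl.mean(axis=0)
fc = f - f.mean()
coef, *_ = np.linalg.lstsq(Xc, fc, rcond=None)
r2 = 1 - ((fc - Xc @ coef) ** 2).sum() / (fc ** 2).sum()
print(r2)  # ≈ 0: no linear probe recovers any of f after this rank-1 ablation
```

Note that using the sample covariance for Σ makes the in-sample cross-covariance vanish exactly, so the fit is zero up to floating-point error.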
If the data is normally distributed, then we can make this stronger. If there’s a vector v and a function g such that f(x) = g(v⋅x) (for example, if you’re using a linear probe to get a binary classifier, where it classifies things based on whether the value of a linear function is above some threshold), then projecting along Σv removes all information about f. This is because uncorrelated linear features of a multivariate normal distribution are independent, so if w ⊥ Σv, then w⋅x is independent of v⋅x, and thus also of f(x). So the reason what you’re calling high ablation redundancy is rare is that low ablation redundancy is a consequence of the existence of any linear probe that gets good performance and the data not being too wildly non-Gaussian.
Yep, high ablation redundancy can only exist when features are nonlinear. Linear features are obviously removable with a rank-1 ablation, and you get them by running CCS/Logistic Regression/whatever. But I don’t care about linear features, since that’s not the shape the features actually have (Logistic Regression & CCS can’t remove the linear information).
The point is, the reason why CCS fails to remove linearly available information is not that the data “is too hard”. Rather, it’s that the feature is non-linear in a regular way, which makes CCS and Logistic Regression suck at finding the direction which contains all the linearly available information (a direction which exists in the context of “truth”, just as it does in the context of gender and all the other datasets on which RLACE has been tried).
I’m not sure why you don’t like calling this “redundancy”. A meaning of redundant is “able to be omitted without loss of meaning or function” (Lexico). So ablation redundancy is the normal kind of redundancy, where you can remove something without losing the meaning. Here it’s not redundant: you can remove a single direction and lose all the (linear) “meaning”.
> I’m not sure why you don’t like calling this “redundancy”. A meaning of redundant is “able to be omitted without loss of meaning or function” (Lexico). So ablation redundancy is the normal kind of redundancy, where you can remove something without losing the meaning. Here it’s not redundant: you can remove a single direction and lose all the (linear) “meaning”.
Suppose your datapoints are (x, y) ∈ R^2 (where the coordinates x and y are independent samples from the standard normal distribution), and the feature you’re trying to measure is x^2 + y^2. A rank-1 linear probe will retain some information about the feature. Say your linear probe finds the x coordinate. This gives you information about x^2 + y^2; your expected value for this feature is now x^2 + 1, an improvement over its a priori expected value of 2. If you ablate along this direction, all you’re left with is the y coordinate, which tells you exactly as much about the feature x^2 + y^2 as the x coordinate does, so this rank-1 ablation causes no loss in performance. But information is still lost when you lose the x coordinate, namely the contribution of x^2 to the feature. The thing that you can still find after ablating away the x direction is not redundant with the rank-1 linear probe in the x direction you started with, but just contributes the same amount towards the feature you’re measuring.
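The x^2 + y^2 example in numbers: ablating the x direction costs nothing in probe score (y alone does exactly as well), yet the x^2 information really is gone, since knowing either coordinate beats knowing neither.

```python
# Monte Carlo check of the x² + y² example: equal-quality but
# non-redundant directions.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100000)
y = rng.normal(size=100000)
f = x**2 + y**2

# Best prediction of f given x alone is E[f | x] = x² + 1.
mse_from_x = np.mean((f - (x**2 + 1)) ** 2)
# After ablating x, the best prediction given y alone is y² + 1.
mse_from_y = np.mean((f - (y**2 + 1)) ** 2)
# With neither coordinate, the best constant prediction is E[f] = 2.
mse_const = np.mean((f - 2) ** 2)

print(mse_from_x, mse_from_y, mse_const)
# mse_from_x ≈ mse_from_y ≈ Var(y²) = 2, while mse_const ≈ Var(x²+y²) = 4
```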
> The point is, the reason why CCS fails to remove linearly available information is not that the data “is too hard”. Rather, it’s that the feature is non-linear in a regular way, which makes CCS and Logistic Regression suck at finding the direction which contains all the linearly available information (a direction which exists in the context of “truth”, just as it does in the context of gender and all the other datasets on which RLACE has been tried).
Disagree. The reason CCS doesn’t remove information is neither of those, but instead just that that’s not what it’s trained to do. It doesn’t fail, but rather never makes any attempt. If you’re trying to train a function such that f(1,1) = 1 and f(−1,−1) = −1, then f(x,y) = x will achieve optimal loss, just like f(x,y) = (x+y)/2 will.
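A minimal check of that last point: on the training data {(1,1) → 1, (−1,−1) → −1}, the probe f(x,y) = x and the probe f(x,y) = (x+y)/2 achieve identical (zero) loss, so training alone gives no reason to prefer the "complete" direction.

```python
# Both probes are loss-optimal on the two training points.
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0]])
t = np.array([1.0, -1.0])

probe_x = X[:, 0]             # f(x, y) = x
probe_avg = X.mean(axis=1)    # f(x, y) = (x + y) / 2

loss_x = np.mean((probe_x - t) ** 2)
loss_avg = np.mean((probe_avg - t) ** 2)
print(loss_x, loss_avg)  # 0.0 0.0
```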