kave comments on Against Almost Every Theory of Impact of Interpretability

kave 18 Aug 2023 21:49 UTC
2 points
0
What is the work that finds the algorithmic model of the game itself for Othello? I’m aware of (but not familiar with) some interpretability work on Othello-GPT (Neel Nanda’s and Kenneth Li), but thought it was just about board state representations.
- Zach Furman 18 Aug 2023 23:39 UTC
  2 points
  0
  Parent
  Yeah, that was what I was referring to. Maybe “algorithmic model” isn’t the most precise—what we know is that the NN has an internal model of the board state that’s causal (i.e. the NN actually uses it to make predictions, as verified by interventions). Theoretically it could just be forming this internal model via a big lookup table / function approximation, rather than via a more sophisticated algorithm. Though we’ve seen from modular addition work, transformer induction heads, etc that at least some of the time NNs learn genuine algorithms.
  - kave 19 Aug 2023 20:59 UTC
    1 point
    0
    Parent
    I think that means one of the following should be surprising from theoretical perspectives:
    That the model learns a representation of the board state
    Or that a linear probe can recover it
    That the board state is used causally
    Does that seem right to you? If so, which is the surprising claim?
    (I am not that informed on theoretical perspectives)
    - Zach Furman 19 Aug 2023 21:31 UTC
      1 point
      −2
      Parent
      I think the core surprising thing is the fact that the model learns a representation of the board state. The causal / linear probe parts are there to ensure that you’ve defined “learns a representation of the board state” correctly—otherwise the probe could just be computing the board state itself, without that knowledge being used in the original model.
      This is surprising to some older theories like statistical learning, because the model is usually treated as effectively a black box function approximator. It’s also surprising to theories like NTK, mean-field, and tensor programs, because they view model activations as IID samples from a single-neuron probability distribution—but you can’t reconstruct the board state via a permutation-invariant linear probe. The question of “which neuron is which” actually matters, so this form of feature learning is beyond them. (Though there may be e.g. perturbative modifications to these theories to allow this in a limited way).
      - interstice 20 Aug 2023 3:53 UTC
        2 points
        0
        Parent
        
        they view model activations as IID samples from a single-neuron probability distribution—but you can’t reconstruct the board state via a permutation-invariant linear probe
        
        Permutation-invariance isn’t the reason that this should be surprising. Yes, the NTK views neurons as being drawn from an IID distribution, but once they have been so drawn, you can linearly probe them as independent units. As an example, imagine that our input space consisted of five pixels, and at initialization neurons were randomly sensitive to one of the pixels. You would easily be able to construct linear probes sensitive to individual pixels even though the distribution over neurons is invariant over all the pixels.
        
        The reason the Othello result is surprising to the NTK is that neurons implementing an “Othello board state detector” would be vanishingly rare in the initial distribution, and the NTK thinks that the neuron function distribution does not change during training.
        Zach Furman 21 Aug 2023 6:43 UTC
        1 point
        0
        Parent
        The reason the Othello result is surprising to the NTK is that neurons implementing an “Othello board state detector” would be vanishingly rare in the initial distribution, and the NTK thinks that the neuron function distribution does not change during training.
        Yeah, that’s probably the best way to explain why this is surprising from the NTK perspective. I was trying to include mean-field and tensor programs as well (where that explanation doesn’t work anymore).
        As an example, imagine that our input space consisted of five pixels, and at initialization neurons were randomly sensitive to one of the pixels. You would easily be able to construct linear probes sensitive to individual pixels even though the distribution over neurons is invariant over all the pixels.
        Yeah, this is a good point. What I meant to specify wasn’t that you can’t recover any permutation-sensitive data at all (trivially, you can recover data about the input), but that any learned structures must be invariant to neuron permutation. (Though I’m feeling sketchy about the details of this claim). For the case of NTK, this is sort of trivial, since (as you pointed out) it doesn’t really learn features anyway.
        By the way, there are actually two separate problems that come from the IID assumption: the “independent” part, and the “identically-distributed” part. For space I only really mentioned the second one. But even if you deal with the identically distributed assumption, the independence assumption still causes problems.This prevents a lot of structure from being representable—for example, a layer where “at most two neurons are activated on any input from some set” can’t be represented with independently distributed neurons. More generally a lot of circuit-style constructions require this joint structure. IMO this is actually the more fundamental limitation, though takes longer to dig into.
        interstice 21 Aug 2023 16:16 UTC
        2 points
        0
        Parent
        
        I was trying to include mean-field and tensor programs as well
        
        but that any learned structures must be invariant to neuron permutation. (Though I’m feeling sketchy about the details of this claim)
        
        The same argument applies—if the distribution of intermediate neurons shifts so that Othello-board-state-detectors have a reasonably high probability of being instantiated, it will be possible to construct a linear probe detecting this, regardless of the permutation-invariance of the distribution.
        
        the independence assumption still causes problems
        
        This is a more reasonable objection(although actually, I’m not sure if independence does hold in the tensor programs framework—probably?)
        Zach Furman 21 Aug 2023 17:35 UTC
        1 point
        0
        Parent
        if the distribution of intermediate neurons shifts so that Othello-board-state-detectors have a reasonably high probability of being instantiated
        Yeah, this “if” was the part I was claiming permutation invariance causes problems for—that identically distributed neurons probably couldn’t express something as complicated as a board-state-detector. As soon as that’s true (plus assuming the board-state-detector is implemented linearly), agreed, you can recover it with a linear probe regardless of permutation-invariance.
        This is a more reasonable objection(although actually, I’m not sure if independence does hold in the tensor programs framework—probably?)
        I probably should’ve just gone with that one, since the independence barrier is the one I usually think about, and harder to get around (related to non-free-field theories, perturbation theory, etc).
        My impression from reading through one of the tensor program papers a while back was that it still makes the IID assumption, but there could be some subtlety about that I missed.
      - kave 19 Aug 2023 23:55 UTC
        1 point
        0
        Parent
        Thanks! The permutation-invariance of a bunch of theories is a helpful concept