interstice comments on Against Almost Every Theory of Impact of Interpretability

interstice 21 Aug 2023 16:16 UTC
2 points

I was trying to include mean-field and tensor programs as well

but that any learned structures must be invariant to neuron permutation. (Though I’m feeling sketchy about the details of this claim)

The same argument applies—if the distribution of intermediate neurons shifts so that Othello-board-state-detectors have a reasonably high probability of being instantiated, it will be possible to construct a linear probe detecting this, regardless of the permutation-invariance of the distribution.
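
A toy illustration of this point, under assumptions that are not part of the original discussion: synthetic Gaussian "activations" stand in for intermediate neurons, a single binary feature stands in for one board-state bit, and the feature is written into the activations along one fixed linear direction. The function names (`fit_probe`, `probe_accuracy`, `demo`) and all parameters are hypothetical. The sketch fits a least-squares linear probe before and after shuffling the neuron order, and the probe should do about equally well in both cases, since relabelling neurons only relabels the probe's weights.

```python
import numpy as np

def fit_probe(acts, labels):
    # Least-squares linear probe with a bias column appended.
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return w

def probe_accuracy(acts, labels, w):
    # Classify by the sign of the probe output and compare with the labels.
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    return float(np.mean(np.sign(X @ w) == labels))

def demo(n_samples=2000, n_neurons=64, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical binary feature (e.g. "this square is occupied"), coded as +/-1.
    labels = rng.choice([-1.0, 1.0], size=n_samples)
    # Assumption: the feature is encoded linearly along one unit direction.
    direction = rng.normal(size=n_neurons)
    direction /= np.linalg.norm(direction)
    acts = rng.normal(size=(n_samples, n_neurons)) + 3.0 * labels[:, None] * direction
    # Relabel the neurons with a random permutation.
    acts_perm = acts[:, rng.permutation(n_neurons)]
    half = n_samples // 2
    results = {}
    for name, a in [("original", acts), ("permuted", acts_perm)]:
        w = fit_probe(a[:half], labels[:half])          # train on first half
        results[name] = probe_accuracy(a[half:], labels[half:], w)  # test on second half
    return results

if __name__ == "__main__":
    print(demo())
```

This says nothing about whether such a feature would actually arise in a given limit; it only illustrates that once a linearly encoded feature is present, permuting the neurons does not hide it from a linear probe.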

the independence assumption still causes problems

This is a more reasonable objection (although actually, I’m not sure whether independence holds in the tensor programs framework; probably?)
- Zach Furman 21 Aug 2023 17:35 UTC
  1 point
  if the distribution of intermediate neurons shifts so that Othello-board-state-detectors have a reasonably high probability of being instantiated
  Yeah, this “if” was the part I was claiming permutation invariance causes problems for: identically distributed neurons probably couldn’t express something as complicated as a board-state-detector. Once that condition does hold (and assuming the board-state-detector is implemented linearly), agreed, you can recover it with a linear probe regardless of permutation-invariance.
  This is a more reasonable objection (although actually, I’m not sure whether independence holds in the tensor programs framework; probably?)
  I probably should’ve just gone with that one, since the independence barrier is the one I usually think about, and it is harder to get around (it relates to non-free-field theories, perturbation theory, etc.).
  My impression from reading through one of the tensor program papers a while back was that it still makes the IID assumption, but there could be some subtlety about that I missed.