Lee Sharkey comments on Sparsify: A mechanistic interpretability research agenda

Lee Sharkey 5 Apr 2024 13:50 UTC
LW: 4 AF: 2
So, for models that are 10 terabytes in size, you should perhaps be expecting a “model manual” which is around 10 terabytes in size.
Yep, that seems reasonable.
I’m guessing you’re not satisfied with the retort that we should expect AIs to do the heavy lifting here?

Or perhaps you don’t think you need something which is close in accuracy to a full explanation of the network’s behavior.
I think the accuracy you need will depend on your use case. I don’t think of it as a globally applicable quantity for all of interp.
For instance, maybe to ‘audit for deception’ you really only need to identify and detect when the deception circuits are active, which will involve explaining only 0.0001% of the network.
But maybe to make robust-to-training interpretability methods you need to understand 99.99...99%.
It seems likely to me that we can unlock more and more interpretability use cases by understanding more and more of the network.
- ryan_greenblatt 5 Apr 2024 15:32 UTC
  LW: 4 AF: 3
  
  I’m guessing you’re not satisfied with the retort that we should expect AIs to do the heavy lifting here?
  
  I think this presents a plausible approach and is likely needed for ambitious bottom-up interp. So this seems like a reasonable plan.
  
  I just think that it’s worth acknowledging that “short description length” and “sparse” don’t result in something which is overall small in an absolute sense.
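
To put rough numbers on the two ends of this exchange, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the size of the "model manual" scales linearly with the fraction of the network being explained. Neither comment claims that scaling, and the helper `manual_size_bytes` is a hypothetical name, not anything from the agenda.

```python
from fractions import Fraction

TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes


def manual_size_bytes(model_bytes: int, fraction_explained: Fraction) -> Fraction:
    """Assumed linear scaling: manual size = model size * fraction explained.

    This proportionality is an illustrative assumption, not a claim from the thread.
    """
    return model_bytes * fraction_explained


model = 10 * TB  # the 10-terabyte model from the quoted comment

# Full explanation: the manual is as large as the model itself.
full_manual = manual_size_bytes(model, Fraction(1))

# "Audit for deception" case: 0.0001% of the network, i.e. a fraction of 1e-6.
audit_manual = manual_size_bytes(model, Fraction(1, 10**6))

print(f"full manual:  {float(full_manual / TB):g} TB")
print(f"audit manual: {float(audit_manual / MB):g} MB")
```

Under that assumption, the full manual stays at 10 terabytes while the 0.0001% audit case comes to about 10 megabytes, so the absolute size depends heavily on which use case is being targeted.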