It seems worth noting that there are good a priori reasons to think that you can’t do much better than roughly the “size of the network” if you want a full explanation of the network’s behavior. So, for models that are 10 terabytes in size, you should perhaps be expecting a “model manual” which is around 10 terabytes in size. (For scale, this is around 10 million books as long as Moby Dick.)
Perhaps you can reduce this cost by a factor of 100 by taking advantage of human concepts (down to 100,000 Moby Dicks), and perhaps you can represent this structure only implicitly, in a way that allows for lazy construction upon queries.
Or perhaps you don’t think you need something which is close in accuracy to a full explanation of the network’s behavior.
More discussion of this sort of consideration can be found here.
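As a rough check on the scale figures above, here is a minimal back-of-the-envelope sketch. All constants are assumptions for illustration: "10 terabytes" is taken as decimal terabytes, one copy of Moby Dick is assumed to be about 1 MB of plain text (an order-of-magnitude figure, not a measurement), and the factor of 100 is the hypothesized saving from human concepts.

```python
# Back-of-the-envelope scale check. Every constant is an assumption.
MODEL_BYTES = 10 * 10**12      # "10 terabytes", decimal TB
MOBY_DICK_BYTES = 1_000_000    # assumed ~1 MB of plain text per copy
CONCEPT_COMPRESSION = 100      # hypothesized saving from human concepts


def books_needed(manual_bytes: int, book_bytes: int = MOBY_DICK_BYTES) -> int:
    """Number of book-sized volumes needed to hold a manual of the given size."""
    return manual_bytes // book_bytes


# A manual as large as the model itself.
full_manual_books = books_needed(MODEL_BYTES)

# The same manual after the hypothesized factor-of-100 reduction.
compressed_books = books_needed(MODEL_BYTES // CONCEPT_COMPRESSION)

print(f"full manual:       {full_manual_books:,} books")
print(f"compressed manual: {compressed_books:,} books")
```

Under these assumptions the two counts come out to 10 million and 100,000 books, matching the figures quoted above. A different assumed book size shifts both numbers by the same factor.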
So, for models that are 10 terabytes in size, you should perhaps be expecting a “model manual” which is around 10 terabytes in size.
Yep, that seems reasonable. I’m guessing you’re not satisfied with the retort that we should expect AIs to do the heavy lifting here?
Or perhaps you don’t think you need something which is close in accuracy to a full explanation of the network’s behavior.
I think the accuracy you need will depend on your use case. I don’t think of it as a globally applicable quantity for all of interp.
For instance, maybe to ‘audit for deception’ you really only need to identify and detect when the deception circuits are active, which will involve explaining only 0.0001% of the network.
But maybe to make robust-to-training interpretability methods you need to understand 99.99...99%.
It seems likely to me that we can unlock more and more interpretability use cases by understanding more and more of the network.
I’m guessing you’re not satisfied with the retort that we should expect AIs to do the heavy lifting here?
I think this presents a plausible approach and is likely needed for ambitious bottom-up interp. So this seems like a reasonable plan.
I just think that it’s worth acknowledging that “short description length” and “sparse” don’t result in something which is overall small in an absolute sense.