Model-agnostic interpretability methods are those which treat the model in question as a black box. They don’t require access to gradients or activations...
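To make the black-box framing concrete, here is a minimal sketch of one such method, permutation feature importance: the model is just an arbitrary callable, and we only query its outputs, never its gradients or activations. The model, data, and function names here are illustrative, not from the original post.

```python
import random

def predict(x):
    # Stand-in black-box model: any callable mapping a feature list to a score.
    # A black-box method works identically for a linear model or a deep net.
    return 2.0 * x[0] + 0.1 * x[1]

def permutation_importance(model, data, n_repeats=10, seed=0):
    """Score each feature by how much shuffling it perturbs the model's outputs."""
    rng = random.Random(seed)
    baseline = [model(x) for x in data]
    n_features = len(data[0])
    importances = []
    for j in range(n_features):
        total = 0.0
        for _ in range(n_repeats):
            # Shuffle feature j across the dataset, leaving other features intact.
            column = [x[j] for x in data]
            rng.shuffle(column)
            perturbed = [x[:j] + [v] + x[j + 1:] for x, v in zip(data, column)]
            preds = [model(x) for x in perturbed]
            total += sum(abs(p - b) for p, b in zip(preds, baseline)) / len(data)
        importances.append(total / n_repeats)
    return importances

data = [[float(i), float(i % 3)] for i in range(20)]
scores = permutation_importance(predict, data)
print(scores)  # feature 0 should score far higher, since the model weights it 20x more
```

Nothing in this sketch depends on the model’s architecture, which is exactly why such methods are called model-agnostic in the black-box sense discussed below.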
I’m not sure if the jargon is already standardized, but FWIW I wish people would not use the phrase “model agnostic” to refer only to black box methods. There is absolutely no reason why a method must be black-box in order to apply to new/different architectures, and I indeed expect that non-black-box methods which generalize across architecture are exactly what we will need for alignment.
I’m unsure about this, because if you’re not black-boxing things, then you’re assuming that something specific lies in the model’s internal structure. And that specificity is what makes the method no longer agnostic to model choice.
You have to black box if you want maximally general insights.
I think we usually don’t generalize very far not because we don’t have general models, but because it’s very hard to state any useful properties about very general models.
You can trivially view any model/agent as a Turing machine, without loss of generality.[1] We just usually don’t do that because it’s very hard to state anything useful about such a general model of computation. (It seems very hard to prove/disprove P=NP, we know for a fact that halting is undecidable, etc.)
I am very interested, though, in what model John will use to state useful theorems that capture both the current DL paradigm and the next paradigm with high probability. (He might have written about this somewhere already; I haven’t read all his stuff yet.)
Assuming determinism, but OP’s black-box interpretability stuff already seems to assume that.
I think he addressed it in Don’t Get Distracted By The Boilerplate.