Carrying it over to the car/elephant analogy: we do not have a broken car. Instead, we have two toddlers wearing a car costume and making “vroom” noises. [Edit-To-Add: Actually, a better analogy would be a Flintstones car. It only looks like a car if we hide the humans’ legs running underneath.] We have never actually built a car or anything even remotely similar to a car; we do not understand the principles of mechanics, thermodynamics or chemistry required to build an engine. We study the elephant not primarily in hopes of making the elephant itself more safe and predictable, but in hopes of learning those principles of mechanics, thermodynamics and chemistry which we currently lack.
we do not understand the principles of mechanics, thermodynamics or chemistry required to build an engine.
If this is true, then it makes (mechanistic) interpretability much harder as well, as we’ll need our interpretability tools to somehow teach us these underlying principles, as you go on to say. I don’t think this is the primary stated motivation for mechanistic interpretability. The main stated motivations seem to be roughly: “We can figure out if the model is doing bad (e.g. deceptive) stuff and then do one or more of: 1) not deploy it, 2) not build systems that way, 3) train against our operationalization of deception”.
That counterargument does at least typecheck, so we’re not talking past each other. Yay!
In the context of neurosymbolic methods, I’d phrase my argument like this: in order for the symbols in the symbolic-reasoning parts to robustly mean what we intended them to mean (e.g. standard semantics in the case of natural language), we need to pick the right neural structures to “hook them up to”. We can’t just train a net to spit out certain symbols given certain inputs and then use those symbols as though they actually correspond to the intended meaning, because <all the usual reasons why maximizing a training objective does not do the thing we intended>.
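To make that failure mode concrete, here is a minimal toy sketch (all data and names are hypothetical, and a linear “symbol head” stands in for a real network): we train a head to emit a “cow” symbol, but in training the intended feature always co-occurs with a stronger spurious feature, so the learned symbol ends up grounded in the wrong thing, and the two come apart off-distribution.

```python
import numpy as np

# Hypothetical toy setup: the "symbol head" is sigmoid(w . x + b), trained to
# emit a COW symbol. Feature 0 is the concept we intend (cow present);
# feature 1 is a spurious correlate (say, green background) that always
# co-occurs with cows in training and happens to carry a stronger signal.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)             # 1 = cow in the image
X_train = np.stack([labels.astype(float),         # intended feature
                    3.0 * labels], axis=1)        # spurious feature, larger scale

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(2000):                             # plain logistic-regression GD
    p = 1 / (1 + np.exp(-(X_train @ w + b)))
    w -= lr * (X_train.T @ (p - labels)) / len(labels)
    b -= lr * np.mean(p - labels)

def cow_symbol(x):
    return 1 / (1 + np.exp(-(np.dot(w, x) + b)))

# Off-distribution, the features diverge: the "cow" symbol tracks the
# background, not the cow, even though training loss looked fine.
p_cow_only = cow_symbol([1.0, 0.0])    # cow present, no green background
p_background_only = cow_symbol([0.0, 3.0])  # green background, no cow
```

The point of the sketch is only that nothing in the training objective distinguished the intended grounding from the spurious one, so the symbol cannot be trusted to mean “cow” outside the training distribution.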
Now, I’m totally on board with the general idea of using neural nets for symbol grounding and then building interpretable logic-style stuff on top of that. (Retargeting the Search is an instance of that general strategy, especially if we use a human-coded search algorithm.) But interpretability is a necessary step to do that, if we want the symbols to be robustly correctly grounded.
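As a cartoon of that general strategy (everything here is hypothetical; `ground` is a stand-in for a trained network), the division of labor might look like the following: an opaque grounding function maps raw states to discrete symbols, while the search that consumes those symbols is ordinary human-written code we can audit. That is exactly why the grounding step carries all the risk.

```python
from collections import deque

def ground(observation):
    """Stand-in for a trained net mapping raw states to a discrete symbol.
    Whether this mapping robustly means what we intend is the open
    question; the search below simply trusts it."""
    return "goal" if sum(observation) > 5 else "not_goal"

def neighbors(state):
    # Hand-coded transition model: increment or decrement the last coordinate.
    return [state[:-1] + (state[-1] + 1,), state[:-1] + (state[-1] - 1,)]

def symbolic_search(start):
    """Human-written BFS that only ever consumes symbols, never raw internals."""
    frontier, seen = deque([(start, [start])]), {start}
    while frontier:
        state, path = frontier.popleft()
        if ground(state) == "goal":   # the symbol is the only interface
            return path
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None

path = symbolic_search((1, 2))
```

The search layer is as interpretable as any hand-coded planner; whether the plan it returns corresponds to the goal we intended depends entirely on `ground`.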
On to the specifics:
A lot of interpretability is about discovering how concepts are used in a higher-level algorithm, and the argument doesn’t apply there.
I partially buy that. It does seem to me that a lot of people doing interpretability don’t really seem to have a particular goal in mind, and are just generally trying to understand what’s going on. Which is not necessarily bad; understanding basically anything in neural nets (including higher-level algorithms) will probably help us narrow in on the answers to the key questions. But it means that a lot of work is not narrowly focused on the key hard parts (i.e. how to assign external meaning to internal structures).
One point of using such methods is to enforce or encourage certain high-level algorithmic properties, e.g. modularity.
Insofar as the things passing between modules are symbols whose meaning we don’t robustly know, the same problem comes up. The usefulness of structural/algorithmic properties is pretty limited, if we don’t have a way to robustly assign meaning to the things passing between the parts.
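A tiny sketch of that worry (all names hypothetical): the modular structure below is fully auditable from the types alone, but that audit says nothing about whether the symbol passed between the modules means what the consumer assumes it means.

```python
from typing import Callable

def make_pipeline(detector: Callable[[bytes], str],
                  policy: Callable[[str], str]) -> Callable[[bytes], str]:
    """Perfectly modular: the policy only ever sees the detector's symbol."""
    return lambda obs: policy(detector(obs))

# The structure guarantees the policy never touches raw observations.
# It guarantees nothing about whether "safe" actually means safe; in
# practice the detector would be a learned model, grounded who-knows-how.
detector = lambda obs: "safe" if len(obs) < 10 else "unsafe"
policy = lambda symbol: "proceed" if symbol == "safe" else "halt"

run = make_pipeline(detector, policy)
decision_short = run(b"ping")      # detector emits "safe"
decision_long = run(b"x" * 64)     # detector emits "unsafe"
```

Modularity here buys a clean interface, not a correct one: everything downstream inherits whatever grounding the detector happened to learn.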
Hmm, I feel a bit damned by faint praise here… it seems like you are doing more than type-checking: you are agreeing substantively with my points (or at least, I fail to find any substantive disagreement with/in your response).
Perhaps the main disagreement is about the definition of interpretability, where it seems like the goalposts are moving… you say (paraphrasing) “interpretability is a necessary step to robustly/correctly grounding symbols”. I can interpret that in a few ways:
“interpretability := mechanistic interpretability (as it is currently practiced)”: seems false.
“interpretability := understanding symbol grounding well enough to have justified confidence that it is working as expected”: also seems false; we could get good grounding without justified confidence, although it is certainly much better to have the justified confidence.
“interpretability := having good symbol grounding”: a mere tautology.
A potential substantive disagreement: I think we could get high levels of justified confidence via means that look very different from (what I’d consider any sensible notion of) “interpretability”, e.g.:
A principled understanding of how to train or otherwise develop systems that ground symbols in the way we want/expect/etc.
Empirical work
A combination of either/both of the above with mechanistic interpretability
It’s not clear that any of these, or their combination, will give us as high a level of justified confidence as we would like, but that’s just the nature of the beast (and a good argument for pursuing governance solutions).
A few more points regarding symbol grounding:
I think it’s not a great framing… I’m struggling to articulate why, but it’s maybe something like “There is no clear boundary between symbols and non-symbols”
I think the argument I’m making in the original post applies equally well to grounding… There is some difficult work to be done and it is not clear that reverse engineering is a better approach than engineering.
I think the argument in Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc applies here.
Actually I really don’t think it does… the argument there is that:
interpretability is about understanding how concepts are grounded.
symbolic methods don’t tell us anything about how their concepts are grounded.
This is only tangentially related to the point I’m making in my post, because:
A lot of interpretability is about discovering how concepts are used in a higher-level algorithm, and the argument doesn’t apply there.
I am comparing mechanistic interpretability of neural nets with neuro-symbolic methods.
One point of using such methods is to enforce or encourage certain high-level algorithmic properties, e.g. modularity.