Hmm I feel a bit damned by faint praise here… it seems like more than type-checking, you are agreeing substantively with my points (or at least, I fail to find any substantive disagreement with/in your response).
Perhaps the main disagreement is about the definition of interpretability, where it seems like the goalposts are moving… you say (paraphrasing) “interpretability is a necessary step to robustly/correctly grounding symbols”. I can interpret that in a few ways:
- “interpretability := mechanistic interpretability (as it is currently practiced)”: seems false.
- “interpretability := understanding symbol grounding well enough to have justified confidence that it is working as expected”: also seems false; we could get good grounding without justified confidence, although it is certainly much better to have the justified confidence.
- “interpretability := having good symbol grounding”: a mere tautology.
A potential substantive disagreement: I think we could get high levels of justified confidence via means that look very different from (what I’d consider any sensible notion of) “interpretability”, e.g. via:
- A principled understanding of how to train or otherwise develop systems that ground symbols in the way we want/expect/etc.
- Empirical work
- A combination of either/both of the above with mechanistic interpretability
It’s not clear that any of these, or their combination, will give us as high a level of justified confidence as we would like, but that’s just the nature of the beast (and a good argument for pursuing governance solutions).
A few more points regarding symbol grounding:
- I think it’s not a great framing… I’m struggling to articulate why, but it’s maybe something like “There is no clear boundary between symbols and non-symbols”.
- I think the argument I’m making in the original post applies equally well to grounding… There is some difficult work to be done, and it is not clear that reverse engineering is a better approach than engineering.