scasper
EIS VII: A Challenge for Mechanists
EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety
I see the point of this post, and I don't dispute that productive reframing exists. But I don't think this post makes a good case for reframing being robustly good; obviously, it can be bad too. And for the specific cases discussed here, the post you linked doesn't make me think, "Oh, these are reframed ideas. Great, glad we are doing redundant work in isolation."
For example, with polysemanticity/superposition, I think that TAISIC's work has created generational confusion and insularity that are harmful. And I think TAISIC's failure to understand that MI means doing program synthesis/induction/language-translation has led to a lot of unproductive work on toy problems using methods that are unlikely to scale.
This seems interesting. I do not know of steelmen for isolation, renaming, reinventing, etc. What is yours?
EIS V: Blind Spots In AI Safety Interpretability Research
EIS IV: A Spotlight on Feature Attribution/Saliency
Thanks. I'll talk in some depth about causal scrubbing in two of the upcoming posts, which narrow the discussion down specifically to AI safety work. I think it's a highly valuable way of measuring how well a hypothesis seems to explain a network, but there are some pitfalls with it to be aware of.
EIS III: Broad Critiques of Interpretability Research
Thanks, but I’m asking more about why you chose to study this particular thing instead of something else entirely. For example, why not study “this” versus “that” completions or any number of other simple things in the language model?
How was the ' a' vs. ' an' selection task selected? It seems quite convenient to probe for, and also like the kind of thing that could result from p-hacking over a set of similar simple tasks.
Correct. I intended the three paragraphs in that comment to be separate thoughts. Sorry.
There are not that many that I don’t think are fungible with interpretability work :)
But I would describe most outer alignment work as sufficiently different...
Interesting to know that about the plan. I had assumed that REMIX was in large part about getting more people into this type of work, but I'm interested in the conclusions and current views on it. Is there a post reflecting on how it went and what lessons were learned from it?
I think that my personal thoughts on capabilities externalities are reflected well in this post.
I'd also note that this concern isn't unique to interpretability work but applies to alignment work in general. And in comparison to other alignment techniques, I think the downside risks of interpretability tools are most likely lower than those of approaches like RLHF. Most theories of change for interpretability helping with AI safety involve engineering work at some point, so I would expect most interpretability researchers to have similar attitudes toward dual-use concerns.
In general, a tool being engineering-relevant does not imply that it will be competitive for setting a new SOTA on something risky. So when I talk about engineering relevance in this sequence, I don't have big advancements in mind so much as fairly simple debugging work.
I think that (1) is interesting. This sounds plausible, but I do not know of any examples of this perspective being fleshed out. Do you know of any posts on this?
Thanks! I discuss in the second post of the sequence why I lump ARC’s work in with human-centered interpretability.
EIS II: What is “Interpretability”?
The Engineer’s Interpretability Sequence (EIS) I: Intro
There seems to be high variance in the scope of the challenges that Katja has been tackling recently.
Thanks for the comment and pointing these things out.
---
Certainly it's not necessarily a good thing either. I would posit that isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of valuable methods while TAISIC has not.
I don't see what we gain in this particular case with polysemanticity, superposition, and entanglement. Do you have a steelman more specific to these literatures?
---
Good point. I would not say that the issue with the feature visualization and Zoom In papers was merely failing to cite related work. I would say that the issue is that they started a line of research that is causing confusion and redundant work. My stance here is based on seeing the isolation between the two types of work as needless.
---
Thanks for pointing out these posts. They discuss an idea similar to MI's dependence on programmatic hypothesis generation, but they don't act on it; they both draw analogies rather than providing methods. The thing at the front of my mind when I talk about how TAISIC has not sufficiently engaged with neurosymbolic work is the kind of thing I mentioned in the paragraph about existing work outside of TAISIC. I pasted it below for convenience :)