scasper
EIS VII: A Challenge for Mechanists
EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety
I see the point of this post, and I don't dispute that productive reframing exists. But I don't think this post makes a good case for reframing being robustly good; obviously, it can be bad too. And for the specific cases discussed here, the post you linked doesn't make me think, "Oh, these are reframed ideas. Great, glad we are doing redundant work in isolation."
For example, with polysemanticity/superposition, I think that TAISIC's work has created generational confusion and insularity that are harmful. And I think TAISIC's failure to understand that MI means doing program synthesis/induction/language-translation has led to a lot of unproductive work on toy problems using methods that are unlikely to scale.
This seems interesting. I do not know of steelmen for isolation, renaming, reinventing, etc. What is yours?
EIS V: Blind Spots In AI Safety Interpretability Research
EIS IV: A Spotlight on Feature Attribution/Saliency
Thanks. I'll talk in some depth about causal scrubbing in two of the upcoming posts, which narrow the discussion down specifically to AI safety work. I think it's a highly valuable way of measuring how well a hypothesis seems to explain a network, but there are some pitfalls with it to be aware of.
EIS III: Broad Critiques of Interpretability Research
Thanks, but I’m asking more about why you chose to study this particular thing instead of something else entirely. For example, why not study “this” versus “that” completions or any number of other simple things in the language model?
How was the ' a' vs. ' an' selection task selected? It seems quite convenient to probe for, and also like the kind of thing that could result from p-hacking over a set of similar simple tasks.
Correct. I intended the three paragraphs in that comment to be separate thoughts. Sorry.
There are not that many that I don’t think are fungible with interpretability work :)
But I would describe most outer alignment work as sufficiently different...
Interesting to know that about the plan. I had assumed that REMIX was in large part about getting more people into this type of work, but I'm interested in the conclusions and current views on it. Is there a post reflecting on how it went and what lessons were learned from it?
I think that my personal thoughts on capabilities externalities are reflected well in this post.
I'd also note that this concern isn't unique to interpretability work but applies to alignment work in general. And in comparison to other alignment techniques, I think the downside risks of interpretability tools are most likely lower than those of approaches like RLHF. Most theories of change for interpretability helping with AI safety involve engineering work at some point, so I would expect most interpretability researchers to have similar attitudes toward dual-use concerns.
In general, a tool being engineering-relevant does not imply that it will be competitive for setting a new SOTA on something risky. So when I talk about engineering relevance in this sequence, I don't have big advancements in mind so much as fairly simple debugging work.
I think that (1) is interesting. This sounds plausible, but I do not know of any examples of this perspective being fleshed out. Do you know of any posts on this?
Thanks! I discuss in the second post of the sequence why I lump ARC’s work in with human-centered interpretability.
EIS II: What is “Interpretability”?
The Engineer’s Interpretability Sequence (EIS) I: Intro
There seems to be high variance in the scope of the challenges that Katja has been tackling recently.
Thanks for the comment and pointing these things out.
---
Certainly it's not necessarily a good thing either. I would posit that isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of valuable methods while TAISIC has not.
I don't see what we gain in this particular case with polysemanticity, superposition, and entanglement. Do you have a steelman more specific to these literatures?
---
Good point. I would not say that the issue with the feature visualization and Zoom In papers was merely failing to cite related work. I would say that the issue is that they started a line of research that is causing confusion and redundant work. My stance here is based on seeing the isolation between the two types of work as needless.
---
Thanks for pointing out these posts. They discuss an idea similar to MI's dependence on programmatic hypothesis generation, but they don't act on it; they both draw analogies rather than providing methods. The thing at the front of my mind when I talk about how TAISIC has not sufficiently engaged with neurosymbolic work is the kind of thing I mentioned in the paragraph about existing work outside of TAISIC. I pasted it below for convenience :)