scasper comments on EIS V: Blind Spots In AI Safety Interpretability Research

scasper 17 Feb 2023 21:04 UTC
1 point
0
Thanks for the comment and pointing these things out.
---
I don’t see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names.
Certainly it’s not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.
I don’t know what we benefit from in this particular case with polysemanticity, superposition, and entanglement. Do you have a steelman for this more specific to these literatures?
---
In fact it’s almost like a running joke in academia that there’s always someone grumbling that you didn’t cite the right things (their favourite work on this topic, their fellow countryman, them etc.)...
Good point. I would not say that the issue with the feature visualization and zoom in papers was merely failing to cite related work. I would say that the issue is how they started a line of research that is causing confusion and redundant work. My stance here is based on how I see the isolation between the two types of work as needless.
---
I understand that your take is that it is closer to program synthesis or program induction and that these aren’t all the same thing, but in the first subsection of the “TAISIC has reinvented...” section, I’m a little confused why there’s no mention of reverse engineering programs from compiled binaries? The analogy with reverse engineering programs is one that MI people have been actively thinking about, writing about and trying to understand (see e.g. Olah, and Nanda, in which he consults an expert).
Thanks for pointing out these posts. They are examples of discussing an idea similar to MI’s dependency on programmatic hypothesis generation, but they don’t act on it: both serve to draw analogies rather than provide methods. The thing in the front of my mind when I talk about how TAISIC has not sufficiently engaged with neurosymbolic work is the kind of thing I mentioned in the paragraph about existing work outside of TAISIC. I pasted it below for convenience :)
If MI work is to be more engineering-relevant, we need automated ways of generating candidate programs to explain how neural networks work. The good news is that we don’t have to start from scratch. The program synthesis, induction, and language translation literatures have been around long enough that we have textbooks on them (Gulwani et al., 2017, Qiu, 1999). And there are also notable bodies of work in deep learning that focus on extracting decision trees from neural networks (e.g. Zhang et al., 2019), distilling networks into programs in domain specific languages (e.g. Verma et al., 2018; Verma et al., 2019; Trivedi et al., 2021), and translating neural network architectures into symbolic graphs that are mechanistically faithful (e.g. Ren et al., 2021). These are all automated ways of doing the type of MI work that people in TAISIC want to do. Currently, some of these works (and others in the neurosymbolic literature) seem to be outpacing TAISIC on its own goals.
- Spencer Becker-Kahn 18 Feb 2023 1:57 UTC
  2 points
  1
  Certainly it’s not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.
  Ok it sounds to me like maybe there’s at least two things being talked about here. One situation is
  
  A) Where a community includes different groups working on the same topic, and where those groups might use different terminology and have different ways of thinking about the same phenomena etc. This seems completely normal to me. The other situation is
  
  B) Where a group is isolated from the community at large and is using different terminology/thinking about things differently just as a result of their isolation and lack of communication. And where that behaviour then causes confusion and/or wasting of resources.
  
  The latter doesn’t sound good, but I guess it looks to me like some or many of your points are consistent with the former being the case. So when you write e.g. it’s not “necessarily a good thing either” or ask for my steelmanned case, this doesn’t seem to quite make sense to me. I feel like if something is not necessarily good or bad, and you want to raise it as a criticism, then the onus would be on you to bring the case against TAISIC with arguments that are not general ones that could easily apply to both A) and B) above. e.g. It’d be more of an emphatic case if you were able to go into the details and be like “X did this work here and claimed it was new but actually it exists in Y’s paper here” or give a real example of needless confusion that was created and could have been avoided. Focussing just on what they did or didn’t ‘engage with’ on the level of general concepts and citations/acknowledgements doesn’t bring this case convincingly, in my opinion. Some more vague thoughts on why that is:
  - Bodies of literature like this are usually very complicated and messy and people genuinely can’t be expected to engage with everything.
  - It’s often hard or impossible to track dependencies of ideas because of all the communication you cannot see and not being able to see ‘how’ people are thinking of things, only what they wrote.
  - Someone publishing on the same idea or concept or topic as you is nowhere near the same as someone actually doing the exact same technical thing that you are doing. ime the former is happening all the time; and the latter is much rarer than people often think.
  - Reinvention, re-presentation and even outright renaming or ‘starting from scratch’ are all valuable elements of scholarship that help a field move along.
  Idk maybe I’m just repeating myself at this point.
  On the other point: It may turn out that MI’s analogy with software reverse engineering does not produce methods and is just used as a high-level analogy, but it seems too early to say from my perspective; the two posts I linked are from last year. TAISIC is still pretty small and experienced researchers in TAISIC are fewer and this is potentially a large and difficult research agenda.