Spencer Becker-Kahn comments on EIS V: Blind Spots In AI Safety Interpretability Research

Spencer Becker-Kahn 17 Feb 2023 19:20 UTC
4 points
3
Re: e.g. superposition/entanglement:

I think people should try to understand the wider context into which they are writing, but I don’t see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names. In fact I’d say this happens all the time and generally people can just hold in their minds that another group has another name for it. Naturally, the two groups will have slightly different perspectives and this a) Is often good, i.e. the interference can be constructive and b) Can be a reason in favour of different terminology, i.e. even if something is “the same” when boiled down to a formal level, the different names can actually help delineate different interpretations.

In fact it’s almost like a running joke in academia that there’s always someone grumbling that you didn’t cite the right things (their favourite work on this topic, their fellow countryman, them etc.) and because of the way academic literature works, some of the things that you are doing here can be done with almost any piece of work in the literature, i.e. you can comb over it with the benefit of hindsight and say ‘hang on this isn’t as original as it looked; basically the same idea that was written about here X years before’ etc. Honestly, I don’t usually think of this as a valuable exercise, but I may be missing something about your wider point or be more convinced once I’ve looked at more of your series.

Another point when it comes to ‘originality’ and ‘progress’ is that it’s often unimportant if some idea was generally discussed, labelled, named, or thought about before if what matters is actual results and the lower-level content of these works. I may be wrong, but looking at what you are saying, I don’t think you are literally pulling up an older paper on ‘entanglement’ that made the exact same points that the Anthropic papers were making and did very similar experiments (or are you?). And even having said that, reproducing experiments exactly is of course very valuable.

Re: MI and program synthesis:

I understand that your take is that it is closer to program synthesis or program induction and that these aren’t all the same thing, but in the first subsection of the “TAISIC has reinvented...” section, I’m a little confused why there’s no mention of reverse engineering programs from compiled binary? The analogy with reverse engineering programs is one that MI people have been actively thinking about, writing about and trying to understand (see e.g. Olah, and Nanda, in which he consults an expert).
- China 17 Feb 2023 19:33 UTC
  2 points
  2
  Parent
  The main problem on this site is that, despite people having widely varying levels of understanding of different subjects, nobody wants to look like an idiot on here. A lot of the comments and articles are basically nothing burgers. People often focus on insignificant points to argue about and waste their time on the social aspect of learning rather than actually learning about a subject themselves.
  This made me wonder: do actual researchers who have value and substance to offer not participate in online discussions? The closest I’ve found is WordPress blogs by various people, where people have huge comment chains. The only other form of communication seems to be through formal papers, which is pretty much as organized as it gets in terms of format.
  I’ve learned that people who actually have deeper understanding and knowledge of value to offer don’t waste their time on here. But I can’t find any other platform that these people participate in. My guess is that they don’t participate in any public discourse, only private conversations with other people who have things of value to offer and discuss.
- scasper 17 Feb 2023 21:04 UTC
  1 point
  0
  Parent
  Thanks for the comment and pointing these things out.
  ---
  I don’t see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names.
  Certainly it’s not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.
  I don’t know what we benefit from in this particular case with polysemanticity, superposition, and entanglement. Do you have a steelman for this more specific to these literatures?
  ---
  In fact it’s almost like a running joke in academia that there’s always someone grumbling that you didn’t cite the right things (their favourite work on this topic, their fellow countryman, them etc.)...
  Good point. I would not say that the issue with the feature visualization and zoom in papers was merely failing to cite related work. I would say that the issue is how they started a line of research that is causing confusion and redundant work. My stance here is based on how I see the isolation between the two types of work as needless.
  ---
  I understand that your take is that it is closer to program synthesis or program induction and that these aren’t all the same thing, but in the first subsection of the “TAISIC has reinvented...” section, I’m a little confused why there’s no mention of reverse engineering programs from compiled binary? The analogy with reverse engineering programs is one that MI people have been actively thinking about, writing about and trying to understand (see e.g. Olah, and Nanda, in which he consults an expert).
  Thanks for pointing out these posts. They are examples of discussing a similar idea to MI’s dependency on programmatic hypothesis generation, but they don’t act on it: they both serve to draw analogies rather than to provide methods. The thing in the front of my mind when I talk about how TAISIC has not sufficiently engaged with neurosymbolic work is the kind of thing I mentioned in the paragraph about existing work outside of TAISIC. I pasted it below for convenience :)
  If MI work is to be more engineering-relevant, we need automated ways of generating candidate programs to explain how neural networks work. The good news is that we don’t have to start from scratch. The program synthesis, induction, and language translation literatures have been around long enough that we have textbooks on them (Gulwani et al., 2017, Qiu, 1999). And there are also notable bodies of work in deep learning that focus on extracting decision trees from neural networks (e.g. Zhang et al., 2019), distilling networks into programs in domain specific languages (e.g. Verma et al., 2018; Verma et al., 2019; Trivedi et al., 2021), and translating neural network architectures into symbolic graphs that are mechanistically faithful (e.g. Ren et al., 2021). These are all automated ways of doing the type of MI work that people in TAISIC want to do. Currently, some of these works (and others in the neurosymbolic literature) seem to be outpacing TAISIC on its own goals.
  - Spencer Becker-Kahn 18 Feb 2023 1:57 UTC
    2 points
    1
    Parent
    Certainly it’s not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.
    Ok it sounds to me like maybe there’s at least two things being talked about here. One situation is
    
    A) Where a community includes different groups working on the same topic, and where those groups might use different terminology and have different ways of thinking about the same phenomena etc. This seems completely normal to me. The other situation is
    
    B) Where a group is isolated from the community at large and is using different terminology/thinking about things differently just as a result of their isolation and lack of communication. And where that behaviour then causes confusion and/or wasting of resources.
    
    The latter doesn’t sound good, but it looks to me like some or many of your points are consistent with the former being the case. So when you write e.g. that it’s not “necessarily a good thing either”, or ask for my steelmanned case, this doesn’t quite make sense to me. I feel like if something is not necessarily good or bad, and you want to raise it as a criticism, then the onus would be on you to bring the case against TAISIC with arguments that are not general ones that could easily apply to both A) and B) above. E.g. it’d be a more emphatic case if you were able to go into the details and be like “X did this work here and claimed it was new but actually it exists in Y’s paper here” or give a real example of needless confusion that was created and could have been avoided. Focussing just on what they did or didn’t ‘engage with’ on the level of general concepts and citations/acknowledgements doesn’t bring this case convincingly, in my opinion. Some more vague thoughts on why that is:
    Bodies of literature like this are usually very complicated and messy and people genuinely can’t be expected to engage with everything.
    It’s often hard or impossible to track dependencies of ideas because of all the communication you cannot see and not being able to see ‘how’ people are thinking of things, only what they wrote.
    Someone publishing on the same idea or concept or topic as you is nowhere near the same as someone actually doing the exact same technical thing that you are doing. ime the former is happening all the time; and the latter is much rarer than people often think.
    Reinvention, re-presentation and even outright renaming or ‘starting from scratch’ are all valuable elements of scholarship that help a field move along.
    Idk maybe I’m just repeating myself at this point.
    On the other point: It may turn out that MI’s analogy with reverse software engineering does not produce methods and is just used as a high-level analogy, but it seems too early to say from my perspective: the two posts I linked are from last year. TAISIC is still pretty small, experienced researchers in TAISIC are fewer still, and this is potentially a large and difficult research agenda.