So: do you think that ambitious mech interp is impossible? Do you think that current interp work is going in the wrong direction in terms of achieving ambitious understanding? Or do you think that it wouldn’t be useful even if achieved?
Mostly I think that MI is right to think it can do a lot for alignment, but I suspect that lots of the best things it can do for alignment it will do in a very dual-use way which skews heavily towards capabilities, mostly because capabilities advances are easier and there are more people working on them.
At the same time, I suspect that many of those dual-use concerns can be mitigated by making your MI research targeted. Not necessarily designed so that you can do off-the-shelf interventions based on your findings, but designed so that if it ever has any use, that use is going to be for alignment, and you can predict broadly what that use will look like.
This also doesn’t mean your MI research can’t be ambitious. I don’t want to criticize people for being ambitious or too theoretical! I want to criticize people for producing knowledge on something which, while powerful, seems powerful in too many directions to be useful if done publicly.
I agree that if your theory of change for interp goes through “interp solves a concrete problem like deception or sensor tampering or adversarial robustness,” then you’d better just try to solve those concrete problems instead of improving interp in general. But I think the case for ambitious mech interp isn’t terrible, and so it’s worth exploring and investing in anyway.
I don’t entirely know what you mean by this. How would we solve alignment without going through a concrete problem? Maybe you think MI will be secondary to that process, and will give us useful information about which problems are necessary to solve? In that case I still don’t see why you need ambitious MI: you can just test the different problem classes directly. Maybe you think the different problem classes are too large to test directly. Even then, I still think a more targeted approach would be better, where you generate as much info about those target classes as possible while minimizing info that can be used to make your models more capable, and you selectively report only the results of your investigation which bear on the problem class. Even if the research is exploratory, the results and verification demonstrations can still be targeted.
But again, most mech interp people aren’t aiming to use mech interp to solve a specific concrete problem you can exhibit on models today, so it seems unfair to complain that most of the work doesn’t lead to novel alignment methods.
Maybe I misspoke. I dislike current MI because I expect large capability improvements before and alongside the alignment improvements; I don’t dispute that there will be future alignment improvements, just whether they’ll be worth it. The reason I brought that up was as motivation for why I think targeted is better, and why I don’t like some people’s criticism of worries about MI externalities by appealing to the lack of capabilities advances caused by MI so far. There have certainly been more attempts at capabilities improvements motivated by MI than there have been attempts at alignment improvements. Regardless of what you think about the future of the field, it’s interesting when people make MI discoveries which don’t lead to much in the way of capabilities advances.
I personally like activation additions because they give me evidence about how models mechanistically behave in a way that directly tells me which threat models are more or less likely, and they have the potential to make auditing and iteration a lot easier. These are accomplishments which ambitious MI is nowhere close to, and which I expect its methods would have to pay for heavily in capability advances to reach. I mention this as evidence for why I expect targeted approaches are faster and cheaper than ambitious ones. At least if done publicly.
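Since the argument leans on activation additions as the example of a targeted method, here is a minimal toy sketch of the underlying idea. The `toy_activations` function and the dimension are invented stand-ins for this illustration; in real activation-addition work, the vectors are extracted from an actual language model’s residual stream via forward passes on a contrastive pair of prompts.

```python
import numpy as np

D = 8  # toy hidden dimension (arbitrary; real models use thousands)

def toy_activations(prompt: str) -> np.ndarray:
    """Fake stand-in for a model's hidden activations on a prompt.

    Deterministic per prompt so the contrastive pair is reproducible;
    in real work this would be a layer's residual-stream activations.
    """
    seed = sum(ord(c) for c in prompt)
    return np.random.default_rng(seed).normal(size=D)

# 1. Build a steering vector from a contrastive pair of prompts.
steering = toy_activations("I love this") - toy_activations("I hate this")

# 2. At inference time, add the scaled vector into the activations at
#    some layer; downstream computation then runs on h_steered.
alpha = 2.0
h = toy_activations("The movie was")
h_steered = h + alpha * steering

# The intervention is just a translation in activation space.
assert np.allclose(h_steered - h, alpha * steering)
```

Part of the appeal for auditing is that the intervention is a single, inspectable vector rather than a full weight update, so its effect can be toggled on and off and measured directly.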
Mostly I think that MI is right to think it can do a lot for alignment, but I suspect that lots of the best things it can do for alignment it will do in a very dual-use way which skews heavily towards capabilities, mostly because capabilities advances are easier and there are more people working on them.
This doesn’t clearly follow: one way for x to be easier is [there are many ways to do x, so that it’s not too hard to find one]. If it’s easy to find a few ways to do x, giving me another one may not help me at all. If it’s hard to find any way to do x, giving me a workable approach may be hugely helpful. (I’m not making a case one way or the other on the main point; I don’t know the real-world data on this, and it’s also entirely possible that the bar on alignment is so high that most/all MI isn’t useful for alignment.)
I mention this as evidence for why I expect targeted approaches are faster and cheaper than ambitious ones
I’m not entirely clear I understand you here, but if I do, my response would be: targeted approaches may be faster and cheaper at solving the problems they target. Ambitious approaches are more likely to help solve problems that you didn’t know existed, and didn’t realize you needed to target.
If targeted approaches are being used for [demonstrate that problems of this kind are possible], I expect they are indeed faster and cheaper. If we’re instead talking about being used as part of an alignment solution, targeted approaches seem likely to be ~irrelevant (of course I’d be happy if I’m wrong on this!). (again, assuming I understand how you’re using ‘targeted’ / ‘ambitious’)