Interpretability Externalities Case Study—Hungry Hungry Hippos

Some people worry that interpretability research could be useful for AI capabilities and therefore end up net-negative. As far as I know, this worry has mostly been theoretical, but now there is a real-world example: the Hungry Hungry Hippos (H3) paper.

Tl;dr: The H3 paper

  • Proposes an architecture for sequence modeling that can handle longer context windows than transformers.

  • Was inspired by interpretability work.

(Note that the H3 paper is from December 2022, and it was briefly mentioned in this discussion about publishing interpretability research. But I wasn’t aware of it until recently and I haven’t seen the paper discussed here on the forum.)

I don’t know where the name comes from. One of the authors mentioned that it’s called “hungry hungry hippos” rather than “hungry hippos” because it uses two state space model layers rather than one. But as far as I can tell, they don’t say why the hippo is hungry. Or why it’s a hippo.

Larger Context Windows

The H3 paper proposes a way to build language models on state space models (SSMs) instead of attention. With an SSM it’s possible to model longer sequences: with attention, the compute for a context window of length $N$ scales as $O(N^2)$, while with the SSM-based architecture it scales as $O(N \log N)$.
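To get a feel for what that difference means, here is a minimal back-of-the-envelope sketch (my own illustration, not from the paper) comparing how the two costs grow with context length:

```python
import math

# Rough proportional compute for a sequence of length n:
# attention compares every token with every other token -> O(n^2),
# while the FFT-based SSM convolution used in H3 scales as O(n log n).
def attention_cost(n: int) -> float:
    return float(n * n)

def ssm_cost(n: int) -> float:
    return n * math.log2(n)

for n in (1_024, 8_192, 65_536):
    ratio = attention_cost(n) / ssm_cost(n)
    print(f"n = {n:>6}: attention / SSM cost ratio ~ {ratio:,.0f}x")
```

Constant factors matter a lot in practice, but the asymptotic gap is what makes much longer context windows feasible.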

Inspired by Interpretability Work

The paper mentions that the work was inspired by Anthropic’s In-context learning and induction heads paper.

E.g., they write:

We provide an informal sketch of a two-layer attention model that can solve the associative recall task, inspired by the construction of [In-context learning and induction heads paper].
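For readers who haven’t seen it: associative recall presents the model with a sequence of key-value pairs and then queries it with one of the keys, and the model has to output the matching value. Here is a toy sketch of the task format (my own illustration of the general idea, not code from the paper):

```python
import random

# Toy associative recall instance: the model sees key-value pairs,
# then is queried with one of the keys and must output its value.
def make_example(num_pairs: int = 4, seed: int = 0) -> tuple[str, str]:
    rng = random.Random(seed)
    keys = rng.sample("abcdefghij", num_pairs)
    values = [str(rng.randint(0, 9)) for _ in keys]
    query = rng.choice(keys)
    answer = dict(zip(keys, values))[query]
    prompt = " ".join(f"{k} {v}" for k, v in zip(keys, values)) + f" {query} ?"
    return prompt, answer

prompt, answer = make_example()
print(prompt)  # something like "c 4 a 7 i 1 e 9 a ?"
print(answer)  # the value that was paired with the queried key
```

This is roughly the kind of in-context lookup that induction heads implement in attention models, which is why the Olsson et al. construction was a natural starting point.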

There is also the “Hyena paper”, which builds on the H3 paper and was also inspired by the induction heads paper:

This work would not have been possible without [...] inspiring research on mechanistic understanding of Transformers (Olsson et al. 2022; Power et al. 2022; Nanda et al. 2023).

My Takes

  • These two papers in particular will probably not shorten AI timelines much.

    • It seems unlikely that this type of architecture will end up being the state of the art.

  • However, the example makes me take the downsides of publishing interpretability research more seriously.

    • Even if this work itself is not a key capability milestone, it shows that there is truth in the argument “if we understand systems better, that will not just be useful for safety but will also lead to capability advances”.

  • Capabilities externalities are a strong argument that most (good) interpretability research should not be published.

    • There are alternative ways to distribute research which are less risky than publishing.

      • We can probably learn something by studying military research practices, which face a similar problem: making research accessible to other researchers while preventing it from becoming public.

      • The constraints are less strict than with military research because there is no adversary actively trying hard to gain access.

    • Maybe this is already relatively common (I would not know about most unpublished research).

  • On the other hand, interpretability research is probably crucial for AI alignment.

    • I think it is possible but unlikely that we get alignment without extremely good interpretability.

    • The cost of keeping interpretability research private is really high. Publishing is a great driver of scientific progress.

  • Overall, publishing interpretability research seems both pretty risky and extremely valuable, and it’s not clear to me whether it is worth it.

Your Takes?

I would be really interested to see a discussion about this!

  • How big a deal are the H3 and Hyena papers?

  • Does this example change your mind about whether publishing (or even doing) interpretability research is a good idea?