Is the “Valley of Confused Abstractions” real?

jacquesthibs5 Dec 2022 13:36 UTC

19 points

AI Interpretability (ML & AI)Language Models MATS Program

Epistemic Status: Quite confused. Using this short post as a signal for discussion.

In Evan’s post about Chris Olah’s views on AGI safety, there is a diagram which loosely points to how Chris thinks model interpretability will be impacted at different levels of model “strength” (note that this is vague intuition; worth looking at Chris’ comment on this post):

I always thought this diagram still held up in the LLM regime, even though it seems the diagram pointed specifically to interpretability with CNN vision models. However, I had a brief exchange with Neel Nanda about the Valley of Confused Abstractions in the context of language models, and I thought this might be a good thing to revisit.

I’ve been imagining that language models with the “strength” of GPT-2 are somewhere near the bottom of the Valley of Confused Abstractions, but the much bigger models are a bit further along the model strength axis (though I’m not sure where they fall). I’ve been thinking about this in the context of model editing or pointing/retargeting the model.

Here’s my exchange with Neel:

Neel: I’m personally not actually that convinced that the valley of confused abstractions is real, at least in language. 1L transformers are easier to interpret than large ones, which is completely unlike images!
Me: Does that not fit with the model interpretability diagram from: https://www.alignmentforum.org/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety#What_if_interpretability_breaks_down_as_AI_gets_more_powerful_?
Meaning that abstractions are easy enough to find with simple models, but as you scale them up you have to cross the valley of confused abstraction before you get to “crisp abstractions.” Are you saying we are close to “human-level” in some domains and it’s still hard to find abstractions? My assumption was that we simply have not passed the valley yet so yes larger models will be harder to interpret. Maybe I’m misunderstanding?
Neel: Oh, I just don’t think that diagram is correct. Chris found that tiny image models are really hard to interpret, but we later found that tiny transformers are fairly easy.
Me: I guess this is what I’m having a hard time understanding: The diagram seems to imply that tiny (simple?) models are easy to interpret. In the example in the text, my assumption was that AlexNet was just closer to the bottom of the valley than InceptionV1. But you could have even simpler models than AlexNet that would be more interpretable but less powerful?
Neel: Ah, I think the diagram is different for image convnets and for language transformers.

My understanding was that 1-layer transformers being easy to interpret still agrees with the diagram, and current big-boy models are just not past the Valley of Confused Abstractions yet.

Ok, but if what Neel says is true, what might the diagram look like for language transformers?

I’m confused at the moment, but my thinking used to go something like this: in the case of GPT-2, it is trying to make sense of all the data it has been trained on but just isn’t big enough to fully grasp the concept of “cheese” and “fromage” are essentially the same. But my expectation is that as the model gets bigger, it knows those two tokens mean the same thing, just in different languages. Maybe it does?

With that line of thinking, as model strength increases, it will help the model create crisper internal abstractions of a concept like “cheese.” But then what...at some point, the model gets too powerful, and it becomes too hard to pull out the “cheese/fromage” abstraction?

Anyway, I hoped that as LLMs (trained with the transformers architecture) increase in “strength” beyond the current models, the abstractions become crisper, and it becomes much easier to identify abstractions as it gets closer to some “human-level performance.” However, GPT-3 is already superhuman in some respects, so I’m unsure about how to disentangle this. I hope this post sparks some good conversation about how to de-confuse this and how the diagram should look for LLM transformers. Is it just some negative exponential with respect to model strength? Or does it humps like the original diagram?

What links here?

But is it really in Rome? An investigation of the ROME model editing technique by jacquesthibs (30 Dec 2022 2:40 UTC; 102 points)

jacquesthibs5 Dec 2022 13:36 UTC

19 points

11 comments2 min readLW link

AI Interpretability (ML & AI)Language Models MATS Program

Christopher Olah 5 Jan 2023 23:22 UTC
5 points
3
Sorry for not noticing this earlier. I’m not very active on LessWrong/AF. In case it’s still helpful, a couple of thoughts...
Firstly, I think people shouldn’t take this graph too seriously! I made it for a talk in ~2017, I think and even then it was intended as a vague intuition, not something I was confident in. I do occasionally gesture at it as a possible intuition, but it’s just a vague idea which may or may not be true.
I do think there’s some kind of empirical trend where models in some cases become harder to understand and then easier. For example:
- An MNIST model without a hidden layer (ie. a softmax generalized linear regression model) is easy to understand.
- An MNIST model with a single hidden layer is very hard to understand.
- Large convolutional MNIST/CIFAR models are perhaps slightly easier (you sometimes get nice convolutional filters).
- AlexNet is noticeably more interpretable. You start to see quite a few human interpretable neurons.
- InceptionV1 seems noticeably more interpretable still.
These are all qualitative observations / not rigorous / not systematic.
So what is going on? I think there are several hypotheses:
- Confused abstractions / Modeling Limitations—Weak models can’t express the true abstractions, so they use a mixture of bad heuristics that are hard to reason about and much more complicated than the right thing. This is the original idea behind this curve.
- Overfitting—Perhaps it has less to do with the models and more to do with the data. See this paper where overfitting seems to make features less interpretable. So ImageNet models may be more interpretable because the dataset is harder and there’s less extreme overfitting.
- Superposition—It may be that apparent interpretability is actually all a result of superposition. For some reason, the later models have less superposition. For example, there are cases where larger models (with more neurons) may store less features in superposition.
- Overfitting—Superposition Interaction—There’s reason to believe that overfitting may heavily exploit superposition (overfit features detect very specific cases and so are very sparse, which is ideal for superposition). Thus, models which are overfit may be difficult to interpret, not because overfit features are intrinsically uninterpretable but because they cause superposition.
- Other Interaction Effects—It may be that there are other interactions between the above arguments. For example, perhaps modeling limitations cause the model to represent weird heuristics, which then more heavily exploit superpositition.
- Idiosyncrasies of Vision—It may be that the observations which motivated this curve are idiosyncratic to vision models, or maybe just the models I was studying.
I suspect it’s a mix of quite a few of these.
In the case of language models, I think superposition is really the driving force and is quite different from the vision case (language model features seem to typically be much sparser than vision model ones).
What links here?
- Is the “Valley of Confused Abstractions” real? by jacquesthibs (5 Dec 2022 13:36 UTC; 19 points)
- jacquesthibs's comment on Shah (DeepMind) and Leahy (Conjecture) Discuss Alignment Cruxes by OliviaJ (3 May 2023 2:38 UTC; 2 points)
Neel Nanda 5 Dec 2022 14:16 UTC
LW: 5 AF: 3
0
AF
I’m interested in hearing other people’s takes on this question! I also found that a tiny modular addition model was very clean and interpretable. My personal guess is that discrete input data lends itself to clean, logical algorithms more so than than continuous input data, and that image models need to devote a lot of parameters to processing the inputs into meaningful features at all, in a way that leads to the confusion. OTOH, maybe I’m just overfitting.
tailcalled 5 Dec 2022 15:35 UTC
4 points
1
I thought the peak of simple models would be something like a sparse Bag of Words model, and then all models that have been considered so far just go deeper and deeper into the valley of confused abstractions, and that we are not yet at the point where we can escape. But I might be wrong.
- jacquesthibs 5 Dec 2022 16:04 UTC
  1 point
  0
  Parent
  Yeah, this was my impression as well!
  - tailcalled 5 Dec 2022 16:11 UTC
    3 points
    0
    Parent
    I guess I would assume that GPT-style models don’t have a great chance of escaping the valley of confused abstractions because their interface to the real world is very weird. Predicting scrapes of internet text.
    - jacquesthibs 5 Dec 2022 16:27 UTC
      1 point
      0
      Parent
      Updated the diagram I drew on to better illustrate my confusion.
      And yeah, it may very well be that GPT-style never leaves the valley. Hmm.
jacquesthibs 2 Aug 2023 6:23 UTC
2 points
0
New (somewhat relevant) paper:
Scale Alone Does not Improve Mechanistic Interpretability in Vision Models
Abstract: In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress by scaling neural networks to unprecedented levels in dataset and model size. We here ask whether this extraordinary increase in scale also positively impacts the field of mechanistic interpretability. In other words, has our understanding of the inner workings of scaled neural networks improved as well? We here use a psychophysical paradigm to quantify mechanistic interpretability for a diverse suite of models and find no scaling effect for interpretability—neither for model nor dataset size. Specifically, none of the nine investigated state-of-the-art models are easier to interpret than the GoogLeNet model from almost a decade ago. Latest-generation vision models appear even less interpretable than older architectures, hinting at a regression rather than improvement, with modern models sacrificing interpretability for accuracy. These results highlight the need for models explicitly designed to be mechanistically interpretable and the need for more helpful interpretability methods to increase our understanding of networks at an atomic level. We release a dataset containing more than 120′000 human responses from our psychophysical evaluation of 767 units across nine models. This dataset is meant to facilitate research on automated instead of human-based interpretability evaluations that can ultimately be leveraged to directly optimize the mechanistic interpretability of models.
Conclusion: In this paper, we set out to answer the question: Does scale improve the mechanistic interpretability of vision models? By running extensive psychophysical experiments and comparing various models, we come to the conclusion that none of the investigated axes seem to positively affect model interpretability: Neither the size of the model, nor that of the dataset, nor model architecture or training scheme improve interpretability. This result highlights the importance of building more interpretable models: unless we explicitly optimize for this property, we do not get it for free by just increasing downstream task performance.
Adam Jermyn 5 Dec 2022 16:13 UTC
LW: 2 AF: 2
0
AF
Could it be that Chris’s diagram gets recovered if the vertical scale is “total interpretable capabilities”? Like maybe tiny transformers are more interpretable in that we can understand ~all of what they’re doing, but they’re not doing much, so maybe it’s still the case that the amount of capability we can understand has a valley and then a peak at higher capability.
- jacquesthibs 5 Dec 2022 16:32 UTC
  LW: 1 AF: 1
  0
  AF Parent
  As in, the ratio between (interpretable capabilities / total capabilities) still asymptotes to zero, but the number of interpretable capabilities goes up (and then maybe back down) as the models gain more capabilities?
  - Adam Jermyn 5 Dec 2022 19:55 UTC
    LW: 1 AF: 1
    0
    AF Parent
    Yeah. Or maybe not even to zero but it isn’t increasing.
Slider 5 Dec 2022 13:46 UTC
2 points
0
I encountered a similarish pattern a long time ago
https://www.lesswrong.com/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities?commentId=9Fm2zrYT9aRSdzSde