(Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need

Epistemic status: Theorizing on topics I’m not qualified for. Trying my best to be truth-seeking instead of hyping up my idea. Not much here is original, but hopefully the combination is useful. This hypothesis deserves more time and consideration but I’m sharing this minimal version to get some feedback before sinking more time into it. “We believe there’s a lot of value in articulating a strong version of something one may believe to be true, even if it might be false.”

This is a somewhat living document as I come back and add more ideas.

The Heuristics Hypothesis: A Bag of Heuristics is All There Is and a Bag of Heuristics is All You Need

  • A heuristic is a local, interpretable, and simple function (e.g., boolean/​arithmetic/​lookup functions) learned from the training data. There are multiple heuristics in each layer and their outputs are used in later layers.

    • It would be useful to treat heuristics as the fundamental object of study in interpretability as opposed to features.

  • By “All there is,” I claim that a bag of heuristics is a useful model for neural network computation. Neural networks generalize when they are able to combine learned heuristics in ways not seen in the training data.[1]

    • Note that this doesn’t mean that LLMs aren’t doing some form of search or planning, but rather that it would be useful to think about the search/​planning process as being implemented through heuristics.

  • By “All you need” I mean that learning lots of heuristics and how to combine them is all you need to get to AGI and beyond.

    • I’m less confident about the AGI part, but I am fairly confident that we can get more powerful models through scaling, and that scaling is mostly about learning more heuristics and composing them. We can probably get much more powerful models that still mostly rely on heuristics-based computation.

    • If this is true, then we can answer theoretical alignment questions and forecast future capabilities by studying how heuristics are learned and combined as we scale models.
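To make this picture concrete, here is a toy sketch in Python. Every function name here is my own invented illustration, not anything recovered from a real model: each "layer" holds a few simple, local heuristics, and a later heuristic composes their outputs, which lets the composition handle inputs the individual pieces were never written for together.

```python
# Toy sketch of the "bag of heuristics" picture. All heuristics here are
# hypothetical stand-ins for what a trained network might implement.

def is_animal(token):
    # Boolean heuristic: membership in a small learned set.
    return token in {"dog", "cat", "horse"}

def is_plural(token):
    # Boolean heuristic: a crude surface-level check.
    return token.endswith("s")

def subject_verb(animal, plural):
    # A later-layer heuristic that composes earlier outputs.
    return "are" if (animal and plural) else "is"

def forward(token):
    # "Layer 1": run local heuristics in parallel.
    features = {
        "animal": is_animal(token) or is_animal(token.rstrip("s")),
        "plural": is_plural(token),
    }
    # "Layer 2": a heuristic consumes the outputs of earlier heuristics.
    return subject_verb(features["animal"], features["plural"])

# "Generalization" in this toy sense: the composition handles combinations
# that no single heuristic hard-codes.
print(forward("cats"))  # are
print(forward("dog"))   # is
```

This is obviously a cartoon, but it captures the structural claim: the interesting object is the small interpretable function and how its output is consumed downstream, not any single "feature."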

Why would you want to use the heuristics-based framework when thinking about neural networks?

  • I think it’s probably a good abstraction and accurately captures what the network is doing (see the Empirical Studies related to the hypothesis section).

    • In “using existing interpretability techniques to discover heuristics”, I propose some experiments to test the usefulness of this framework. I would love to hear your feedback on this post, but I’m especially interested in people’s ideas on how to test this hypothesis.

  • Heuristics-based explanations are fundamentally algorithmic, which could allow them to sidestep some issues related to causal interpretations (e.g., multiple redundant causes of the same behavior).

  • Treating heuristics as the fundamental unit in interpretability as opposed to features:

    • Focuses us on the functional computation done by the neural network component and ensures that whatever we find is relevant to “how the model is doing this task.”

    • Like features, they are also discrete and independent from the rest of the network. For example, we can investigate the heuristic “if this MLP layer sees ‘Michael Jordan,’ it writes ‘Professional Basketball Player’ to the occupation space of the residual stream” without needing to trace out the entire causal graph/circuit.

    • Intuitively, a single matrix multiplication or MLP layer can’t be implementing any algorithm that’s super complicated.

    • Counterpoint: the main bottleneck is probably distributed computations. That is, multiple parts of the network work together to complete what to humans feels like a single function. (see e.g., the work on fact finding, and appendix G here)

  • While I might be adding complexity by introducing yet another abstraction, I’ve found it useful in the past to have multiple frameworks to apply to a problem to tease it from different angles. I think adding a new idea probably does more benefit than harm.

  • I believe that breaking down forward passes into functions that fit our vague notion of “heuristics” is likely the easiest path to fulfill our vague high-level goal of “understanding why a neural network did a thing.”

    • This approach can look the same as a circuit-style analysis operationally, but I have some ideas on exactly how to draw these circuits through the lens of the heuristics hypothesis.

Feel free to jump around this post and check out the sections that interest you. Each section is mostly independent of the others.

How can interpretability win if the hypothesis is true?

I want to first clarify something I do not think we need to win: a one-to-one mapping between neural network computation and heuristics. I believe that we can have multiple acceptable heuristics-based explanations for a given forward pass (i.e., a one-to-many map). Any explanation that fits the following criteria—mostly copied from the original IOI paper—would be sufficient for “understanding why the model did what it did.”

  • Faithful: They correctly represent the underlying computation the model does.

  • Complete: They capture all of the computation used by the model.

  • Minimal: They do not capture any more than the needed heuristics.

  • Comprehensible: We can understand what each heuristic does, and we can understand how all of the heuristics work together (likely with AI help).

I believe a bag of heuristics is the easiest way to fulfill these four criteria on arbitrary inputs.

Corollary: Understanding neural network computation does not require us to learn “true features” as long as we have some set of faithful, complete, minimal, and comprehensible heuristics

A central way people evaluate sparse autoencoders (SAEs) is whether they find a set of “true” features. Researchers have varying intuitions on what true features should be, but a common theme is that they should be atomic (i.e., not composed of linear combinations of other features). This has led to people worrying that the sparsity term in the SAE loss leads to models combining commonly occurring atomic features into a single one (e.g., a red triangle feature instead of a red feature and a triangle feature, see also the recent work on feature absorption).

While learning intermediate variables in neural networks is a useful subgoal, I’m worried that the pursuit of atomic features—especially given that we can already get some sort of feature decompositions—is not the most productive task we could work on right now.

We should only care about features insofar as they are the inputs and outputs of heuristics/​circuits, and we should only care about monosemanticity insofar as it helps us understand the network. If our heuristic decomposition is faithful, complete, and minimal, it doesn’t matter if individual heuristics take non-atomic concepts as inputs as long as we humans can understand the composed concept (likely given AI aid).

Weak to strong winning

Here are various degrees of winning if the heuristics hypothesis is true.

Weak victory: We can decompose every forward pass into heuristics composed with each other. That is, we can throw away the rest of the activations and use only the heuristics to reconstruct the input-output relationship to a high fidelity.

  • Perhaps we’ll get something on the order of 10^3-10^6 heuristics[2] per forward pass, which we can use LLMs to disentangle.

  • I believe this is the weakest version of “explain why the network did what it did.”

    • For example, getting a list of heuristics for a specific forward pass doesn’t have to tell us anything about how the model would act if the inputs were different.

  • Still, this feels like an ambitious goal, and even partial successes could be useful for auditing/​Mechanistic Anomaly Detection/​general science of interpretability.

    • This is especially true if we can learn about how neural networks complete tasks that we currently do not know how to write algorithms for or even for humans to complete themselves (see, e.g., learning chess concepts from AlphaZero, or the artificial artificial neural network in curve circuits)

Medium victory: In addition to individual forward passes, we understand sets of heuristics that a model uses to solve what humans can think of as “tasks.” That is, we understand all heuristics that handle a certain class of inputs (e.g., the IOI circuit).

  • This is equivalent to having causal abstractions[3] for tasks that are robust for all variations of said task.

  • The distinction between weak and medium victory is also discussed in Mueller 2024, who writes:

    • If one’s goal is to understand how a model will generalize, one should also consider at least some local causal dependencies. However, if one’s goal is merely to understand which components will directly affect downstream performance (e.g., when editing or pruning models), it may suffice to only include components that directly affect the output…the Pareto frontier may simply consist of the minimal number of features needed to understand whether a model is making the right decisions in the right way.

Strong victory: We know every heuristic in the model and how they compose, which is analogous to the “reverse engineer a neural network” end goal.

Miscellaneous thoughts on interpretability with heuristics hypothesis

Interpretability with heuristics is not very different from existing circuits analysis. The main ideas I came up with are the focus on heuristics as the key unit of analysis and being explicitly OK with many different potential explanations/levels of abstraction. As a result, it’s not clear if there’s anything major that we need to do differently. Sparse feature circuits, transcoders, and the automated circuit discovery techniques already popular in the literature seem like reasonable ways to proceed even if our end goal is a set of heuristics.

However, given that a weak victory does not require an enumeration of all features/​heuristics, it might be worth the time to try to discover more compute efficient ways to understand a single forward pass.

I also haven’t defined what a heuristic is because I’m genuinely not sure what the best level of abstraction would be. Here are some types of simple functions that I would consider a “heuristic”:

  • Any sort of Boolean or arithmetic operation

    • For example, “If I see the dog ears feature and the dog snout feature I will output +5 to the dog feature direction”

  • Any lookup/​if-then statement

    • For example, f(Location of the Eiffel Tower) = Paris[5]

  • The embedding/​unembedding matrices, which I see as trivially interpretable heuristics that map tokens to activations and activations to token logits.

Let me know if you have any other ideas.

A few more thoughts on verifying how correct heuristics-based explanations are. I think there are two levels: the heuristic level and the model level. At the heuristic level, we want to make sure that each individual heuristic is faithful to the underlying neural network computation. Ideally this could be done at the weight level, but we can also apply our bag of existing interpretability techniques.

At the model level, my hope is that we can use our interpretability techniques to discover new algorithms, in the form of composed heuristics, that we don’t know how to write. One of my first memorable interactions with ChatGPT was when I asked it to help me rephrase some survey questions I was working on, and it was actually really helpful. We currently have no idea how to write down a program to do that! Learning all the heuristics involved for various tasks could be a path towards some form of Microscope AI. And, as is the case with circuits analysis, these algorithms fall out naturally once we construct the heuristics.

What does it mean for alignment theory if the heuristics hypothesis is true?

(I’ve spent orders of magnitude less time and effort on this section compared to the interp section, but I figured I’d mention a few ideas and collect some feedback. If people actually like this hypothesis I’ll spend some more time thinking through this)

I’m not super sure if the heuristics framework alone could make concrete predictions on key aspects of alignment theory. You can approximate any function arbitrarily closely with heuristics. In other words, as systems advance, any sort of high-level behavior could emerge even if it’s all heuristics operating below (see, e.g., Interpretability/​Tool-ness/​Alignment/​Corrigibility are not Composable, which is also a problem when we aggregate heuristics from each layer together).

However, the more powerful future model you’re worried about won’t just fall out of a coconut tree.[6] We need to understand its learning process and how it became powerful.

If learning more heuristics is all we need to get to more and more powerful systems, we should understand what types of heuristics and heuristics composition are learned first. Two relevant papers that come to mind are the quantization model of scaling, and work on which concepts are learned first in toy models by Park et al.. Work done here could help us understand if we’ll see, for example, capabilities generalization without alignment generalization. Generally though, it would be cool to see how alignment related concepts are learned and used compared to non-alignment related concepts.

On the surface, it seems like shard theory is more likely to be correct in the world where the heuristics hypothesis is true, although shards are higher-level abstractions compared to heuristics. I’d want to see some more concrete interpretability findings before making a strong claim though.

One opinion that I hold a bit more strongly after thinking through this post is that we could continue to get very economically useful models that are nonetheless incoherent in other ways. In the heuristics world, there’s less reason to believe in discontinuous jumps in performance, and more reason to believe that AIs will get really good at some things while still bad at others (see also Boaz on the shape of AGI).

Empirical studies related to the heuristics hypothesis (both in support and against)

  • Sanity check: Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks shows that transformers can learn heuristics and compose them in ways not seen in the training data.

  • OthelloGPT learned a bag of heuristics: OthelloGPT predicts legal next moves in Othello using a bag of heuristics.

    • Maybe we would want to just say that OthelloGPT has learned the world model for Othello and abstract away from the heuristics-based implementation. However, the real world is complicated, and frontier models will likely have imperfect world models and human preference models, in which case it would be important to look at the heuristics-based implementations to understand the deficiencies.

  • The heuristics hypothesis is consistent with the fact that SAE features have some meaningful geometrical properties. If the model is made up of a bunch of heuristics, we could see specialized geometrical structures for certain classes of heuristics, such as a circular representation for the days of the week, or a linear representation for years that also serve as a timeline for historical events.

  • A recent paper found lookahead in LeelaChess, prompting discussion on whether LeelaChess’s policy network can “see” several moves ahead or has merely learned heuristics for multi-move patterns.

    • I’m personally not sure if the model learned generalized notions of look-ahead or specialized multi-move patterns (which would imply that L12H12 only moves information from future and past board states in learned multi-move patterns).

    • Nonetheless, I think we can use a set of heuristics to break down the model’s thought process in either case (assuming that we can find those heuristics in the first place).

    • See also: Planning behavior in a recurrent neural network that plays Sokoban

  • We’ve found literal maps of the world in language models. In other words, networks clearly have some sort of “world model abstraction.”

    • As previously mentioned, I think the world model is probably going to be implemented as a series of heuristics. If we did come up with a way to decompile forward passes into 10^3-10^6 heuristics, we could probably use an LLM agent to classify the subset of those heuristics that build out the world model, and then use that higher-level abstraction to “summarize” that set of heuristics. This is cheating a bit because now it no longer sounds like heuristics is all there is, but to me it’s more like, “we’re not gonna worry about this set of heuristics because we know what they do, and even if it’s wrong it’s not safety-relevant.”

    • And if the part of the world model is safety relevant, back into heuristics land we go.

    • See also the globe they found in llama-2

  • This story is also consistent with the quantization model of neural scaling. In our hypothesis, the quanta for language modeling would be either 1. learning a new heuristic or 2. finding a way to compose two unconnected heuristics together. I also think the gradient clustering approach used in the quantization hypothesis is a promising path to uncover heuristics.

  • There is also some earlier work in deep learning that demonstrates how neural networks tend to learn shortcuts for various tasks.

We need to keep in mind that the streetlight effect is certainly contaminating our evidence. That is, simple heuristics are easier for interpretability researchers to recover than complex data structures, and we should expect more evidence for them.

It’s also cool to think through some other general neural network phenomena with the heuristics hypothesis in mind. It makes sense for the network to have some redundancies (e.g., backup name mover heads) if similar heuristics are learned at the same time. It makes sense that you can get the network to output whatever arbitrary text you want with an optimized string, since such a string can activate and compose a weird set of heuristics. As heuristics compose from one layer to the next, they need intermediate variables to communicate their results; thus it makes sense that activation engineering works well across a wide range of concepts.

(There are some other results that come to mind which makes less sense, e.g., the 800 orthogonal code steering vectors, although I think Nina Panickssery’s explanation, if true, would be consistent with the heuristics hypothesis)

Weaknesses in the Heuristics Hypothesis

Some versions of the hypothesis are unfalsifiable

This theory’s biggest weakness is that we can decompose basically anything into a bag of composing heuristics given an infinite bag size. In other words, the heuristics hypothesis is technically consistent with every single hypothesis of how future systems would behave.

I do feel like this theory “explains too much.” However, the key interpretability-related claim is that heuristics based decompositions will be human-understandable, which is a more falsifiable claim.

The current features-focused research agendas might be the best way to uncover heuristics, and we don’t actually need to do anything different regardless how true the heuristics hypothesis is.

The best way to locate heuristics might be to start by finding the most monosemantic/atomic features and understanding their functional implications, training transcoders, or following something like Lee Sharkey’s sparsify agenda. In other words, the new framing doesn’t add much. It’s also possible that we wouldn’t be able to achieve weak victory on a given task without understanding the whole task family, in which case the idea of a weak victory doesn’t really matter.

Perhaps this is true, but I think it’s worth thinking through this some more. I’m worried that the field has focused on features mostly due to path dependence from the original circuits thread that posited features as “the fundamental unit of neural networks” (although certainly not all researchers are focused on features, see e.g. Geiger’s causal abstractions agenda). Also, training SAEs that catalog all the features of a model is expensive and unnecessary for the weak victory condition I mentioned above. Trying to find cheaper interpretability techniques that are just meant to understand individual forward passes seems like a worthy thing to try.

Another possible objection: heuristics-based explanations of individual forward passes might not be able to distinguish deceptive from honest behavior. I’m not super sure if that’s true? I think it’s reasonable to assume that two different sets of heuristics would be active in the case where the model is deceptive versus not deceptive, even conditional on the final token logits looking the same.

Inspirations and related work that I haven’t already mentioned

  • I envisioned most of this post before reading Lewis Smith’s most excellent post, The ‘strong’ feature hypothesis could be wrong, which makes many related points. A few relevant passages include:

    • I worry that there is a conceptual problem here, especially if the focus is on cataloging features and not on what the features do.

    • In other words, the picture implied by the strong [linear representation hypothesis] and monosemanticity is that first features come first, and then circuits, but this might be the wrong order; it’s also possible for circuits to be the primary objects, with features being (sometimes) a residue of underlying tacit computation.

  • Rohin Shah seems to have had (still has?) a similar view. He wrote in a post four years ago:

    • In particular, current systems trained by RL look like a grab bag of heuristics that correlate well with obtaining high reward. I think that as AI systems become more powerful, the heuristics will become more and more general, but they still won’t decompose naturally into an objective function, a world model, and search.

  • Hidenori Tanaka’s group has several really interesting papers on concept learning and generalization, with some experiments on toy models. Their framing of generalization as concept composition was very inspiring.

    • I think they are totally underrated.

  • I’ve already mentioned stuff on causal interpretability, but I’ll also cite this survey on the subject. Similar to the Tanaka group’s work, I think causal interpretability probably deserves more attention on LessWrong.

  • I just saw this post from October 2022, Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small. My guess is that they’ve stopped trying to pursue this project?

    • They focus more on scenarios where the entire network acts as a heuristic and completes a task, as opposed to locating specific heuristics in parts of the network.

  • Learning heuristics could hopefully also help us tackle some of the issues raised in Aaron Mueller’s paper on Missed Causes and Ambiguous Effects, such as missing redundant causes.

  • I approached the “understand what models are doing” problem from a perhaps more curiosity-driven perspective than an applications-driven one (see, e.g., Stephen Casper’s excellent engineer’s interpretability sequence). I’d love to spend more time thinking about applications, but figured that I should share the conceptual stuff first.

  • There are also other works on neural network scaling and learning (e.g., A Dynamical Model of Neural Scaling Laws, A Theory for Emergence of Complex Skills in Language Models), but those tend to not lend themselves to interpretability analysis.

  • This project initially began as an attempt to formalize heuristics as the best form of abstraction for neural network computation. I don’t think I made that much progress towards that goal.

Potential next steps

(Yet another section that I wrote rather quickly in the interests of getting some more feedback. I’m also ~60% sure that my specific research interests will shift in the next six months)

I can see four major directions for further exploration of the heuristics hypothesis:

Deconfusion: What exactly is a heuristic, and what does a heuristics-based explanation look like?

Although this is a fairly fundamental question, I’m not super worried about needing to get this completely right before trying to look for heuristics in novel settings. I think we can make a lot of progress even with imperfect definitions. Still, applying the heuristics perspective to circuits we already understand (e.g., IOI) and trying to formalize what exactly heuristics are and aren’t seems useful.

Creating new interpretability methods that are centered around heuristics as the fundamental unit

This is speculative, but it might be worth spending some time to figure out if there are ways to directly study heuristics as their own unit. Distributed alignment search (DAS) is the closest idea that comes to mind, but (to my understanding, I could be wrong, sorry!) DAS is a supervised method that requires the researcher to have some causal model in mind before trying to find it in the neural network. Transcoders represent another attempt, but those require cataloging all features in the training data.

The worry is that the field got locked into looking for features and feature circuits for mostly path-dependence reasons, and there could be some low-hanging fruit if we just thought harder about heuristics, especially given the recent evidence that they might play a big role.

Using existing interpretability tools to discover heuristics

This is a much more tractable option to better understand heuristics, especially given the similarities between heuristics and circuit building.

We can try to catalog individual heuristics manually by coming up with natural language tasks where we believe that the model would need to execute some heuristic at one point. By studying various individual heuristics, it could also help inspire specialized interpretability techniques to uncover them en masse. For example,

  • Any sort of task that requires computing an AND or OR between two concepts.

  • Situations where information from some specific token has to be moved to a later token.

  • This is not directly related, but I’m wondering if the MLP layers would be “linear” (i.e., MLP(x + f) = MLP(x) + MLP(f)) in some sense, and if so when.

    • For example, we wouldn’t expect this if the MLP is calculating some boolean function, but if it’s just doing factual retrieval that seems more likely.
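This additivity question is easy to poke at numerically. Below is a sketch with random weights and arbitrary shapes (not a real model) that checks whether MLP(x + f) = MLP(x) + MLP(f) holds for a one-hidden-layer ReLU MLP. In general it doesn't, and the size of the gap is one crude measure of how "linear" the layer is around a given input.

```python
import numpy as np

rng = np.random.default_rng(0)
# A toy one-hidden-layer MLP with random weights (shapes are arbitrary).
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(8, 16))

def mlp(x):
    # ReLU MLP block without biases, loosely shaped like a transformer MLP.
    return W2 @ np.maximum(W1 @ x, 0.0)

x = rng.normal(size=8)   # "base" residual stream direction
f = rng.normal(size=8)   # "feature" direction being added

lhs = mlp(x + f)
rhs = mlp(x) + mlp(f)

# The ReLU mask generally differs between (x + f) and x, f separately,
# so the gap is almost surely nonzero; its size quantifies non-additivity.
print(np.linalg.norm(lhs - rhs))
```

If the MLP were doing something closer to pure lookup on a feature that stays in the positive region of the ReLUs, the gap would shrink, which matches the intuition in the bullet above.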

We could also leverage existing SAEs and treat features as the inputs/outputs of heuristics. In this case, I’m hoping to advance beyond the gradient-based attributions used in studies such as the Sparse Feature Circuits paper. We can perhaps use gradient attribution to narrow down the nodes and edges that we care about, but then focus on how, operationally, each edge is formed. Gradient attribution gives us only if-then relationships; is that what’s actually happening locally in the model?
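For intuition on what gradient-style attribution does and doesn't tell us, here is a minimal gradient-times-activation sketch on a toy linear map. The numbers are illustrative, and this is a drastic simplification of what feature-circuit methods actually compute on real models: it scores each feature's contribution to the output, but says nothing about how that edge is mechanistically implemented.

```python
import numpy as np

# Toy linear "model": y = w . a, where a are feature activations.
w = np.array([2.0, -1.0, 0.5])  # downstream weights (illustrative)
a = np.array([1.0, 3.0, 0.0])   # feature activations on one input

# For a linear map, dy/da_i = w_i, so gradient-times-activation
# attribution reduces to a_i * w_i per feature.
attribution = a * w
print(attribution)  # attributions per feature: 2.0, -3.0, 0.0
```

The scores say "feature 0 pushed the output up, feature 1 pushed it down, feature 2 did nothing here," which is exactly the if-then-relationship level of description; recovering the operational heuristic behind each edge is the further step the paragraph above is asking for.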

Applying the heuristics framework to study theoretical questions in alignment

If we decide that the heuristic model of computation is true/​useful, I’d be most excited to use it to study more theoretical topics and perhaps use it to forecast where future capabilities gains could come from. For example, Alex Turner said (two years ago):

I think that interpretability should adjudicate between competing theories of generalization and value formation in AIs (e.g. figure out whether and in what conditions a network learns a universal mesa objective, versus contextually activated objectives)

For example, we could study the dynamics of heuristics learning and composition in real-world models, especially heuristics related to turning base models into assistants. One guess is that RLHF is sample efficient because it mostly changed how heuristics are composed with each other (and maybe boosted existing heuristics to be more active), which might be a lot easier than learning new heuristics.[7] This would build on top of work done on toy models by Hidenori Tanaka’s group, and also maybe the quantization model of scaling.

I’m currently trying to get into the AI safety field and will also be applying to MATS. Let me know if you’re interested in chatting more about any of these topics. Have a low bar for reaching out.

This post benefited from the feedback from Jack Zhang, Joe Campbell, Mat Allen, Tim Kostolansky, Veniamin Veselovsky, and woog. All errors are my own.

  1. ^

    This definition of generalization comes from Okawa et al. (2023)

  2. ^

    Source? I made it up

  3. ^

    I really struggled to understand this paper :( Would be down to go through it with someone.

  4. ^

    [Citation needed]

  5. ^

    Funny story but I almost wrote down Rome. The real rank one model editing is the one they did to my brain.

  6. ^
  7. ^

    Counterpoint: maybe learning new heuristics is easy and frontier models just have a good ability to learn by the time they’re done with pretraining.