What’s up with LLMs representing XORs of arbitrary features?

Thanks to Clément Dumas, Nikola Jurković, Nora Belrose, Arthur Conmy, and Oam Patel for feedback.

In the comments of the post on Google DeepMind’s CCS challenges paper, I expressed skepticism that some of the experimental results were possible. When addressing my concerns, Rohin Shah made some claims along the lines of “If an LLM linearly represents features a and b, then it will also linearly represent their XOR, a⊕b, and this is true even in settings where there’s no obvious reason the model would need to make use of the feature a⊕b.”[1]

For reasons that I’ll explain below, I thought this claim was absolutely bonkers, both in general and in the specific setting that the GDM paper was working in. So I ran some experiments to prove Rohin wrong.

The result: Rohin was right and I was wrong. LLMs seem to compute and linearly represent XORs of features even when there’s no obvious reason to do so.

I think this is deeply weird and surprising. If something like this holds generally, I think this has importance far beyond the original question of “Is CCS useful?”

In the rest of this post I’ll:

  • Articulate a claim I’ll call “representation of arbitrary XORs (RAX)”: LLMs compute and linearly represent XORs of arbitrary features, even when there’s no reason to do so.

  • Explain why it would be shocking if RAX is true. For example, without additional assumptions, RAX implies that linear probes should utterly fail to generalize across distributional shift, no matter how minor the distributional shift. (Empirically, linear probes often do generalize decently.)

  • Present experiments showing that RAX seems to be true in every case that I’ve checked.

  • Think through what RAX would mean for AI safety research: overall, probably a bad sign for interpretability work in general, and work that relies on using simple probes of model internals (e.g. ELK probes or coup probes) in particular.

  • Make some guesses about what’s really going on here.

Overall, this has left me very confused: I’ve found myself simultaneously having (a) an argument that A ⇒ ¬B, (b) empirical evidence of A, and (c) empirical evidence of B. (Here A = RAX and B = other facts about LLM representations.)

The RAX claim: LLMs linearly represent XORs of arbitrary features, even when there’s no reason to do so

To keep things simple, throughout this post, I’ll say that a model linearly represents a binary feature f if there is a linear probe out of the model’s latent space which is accurate for classifying f; in this case, I’ll denote the corresponding probe direction as v_f. This is not how I would typically use the terminology “linearly represents” – normally I would reserve the term for a stronger notion which, at minimum, requires the model to actually make use of the feature direction when performing cognition involving the feature[2]. But I’ll intentionally abuse the terminology here because I don’t think this distinction matters much for what I’ll discuss.

If a model linearly represents features a and b, then it automatically linearly represents a∧b (AND) and a∨b (OR).

Linear probes for a∧b and a∨b. (Note that the directions for a∧b and a∨b coincide – that’s fine.)

However, a⊕b (XOR) is not automatically linearly represented – no linear probe in the figure above would be accurate for classifying a⊕b. Thus, if the model wants to make use of the feature a⊕b, then it needs to do something additional: allocate another direction[3] (more model capacity) to representing a⊕b, and also perform the computation of a⊕b so that it knows what value to store along this new direction.
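
To make this concrete, here is a minimal sketch (my own illustration, not from the post’s notebook): with a and b as the two input coordinates, a linear probe can fit AND and OR on the four corners, but no linear probe fits XOR.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The four corners of the square above, with a and b as coordinates.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

for name, y in [("a AND b", X[:, 0] & X[:, 1]),
                ("a OR  b", X[:, 0] | X[:, 1]),
                ("a XOR b", X[:, 0] ^ X[:, 1])]:
    # Weak regularization so linearly separable patterns are fit essentially exactly.
    acc = LogisticRegression(C=100.0).fit(X, y).score(X, y)
    print(f"{name}: training accuracy {acc:.2f}")

# AND and OR reach 1.00; XOR cannot exceed 0.75 with any linear boundary.
```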

The representation of arbitrary XORs (RAX) claim, in its strongest form, asserts that whenever an LLM linearly represents features a and b, it will also linearly represent a⊕b. Concretely, this might look something like: in layer 5, the model computes and linearly represents the features “has positive sentiment” and “relates to soccer”, and then in layer 6 the model computes and represents “has positive sentiment” XOR “relates to soccer”.

Why might models represent XORs? In the CCS challenges post’s comment thread, Rohin offered one explanation: if a, b, and a⊕b are linearly represented, then any boolean function of a and b is also linearly represented. On the other hand, as I’ll argue in the next section, this comes at the cost of exponentially increasing the amount of capacity the model needs to allocate.

RAX would be very surprising

In this section I’ll go through some implications of RAX. First I’ll argue that RAX implies linear probes should never generalize at all across even very minor distributional shifts. Second, I’ll argue that if you previously thought LLMs linearly represent N features, RAX would imply that LLMs actually linearly represent exp(N) features (including XORs of features). These arguments aren’t proofs, and in “What’s going on?”, I’ll discuss some additional assumptions one could make about the structure of model internals that would make these arguments fail.

Without additional assumptions, RAX implies linear probes shouldn’t generalize

First I’ll make an overly simplistic and incorrect version of this argument as an intuition pump; then I’ll explain the correct version of this argument.

Suppose there are two features, a and b, and we train a linear probe to classify a on a dataset where b is always false. What will the accuracy of this probe be when evaluated on a test dataset where b is always true?

<incorrect argument>
Assuming RAX, there are two features which get high accuracy on the training data: a and a⊕b. The former feature gets 100% accuracy on the test data, and the latter gets 0%, so on average we should expect 50% accuracy.
</​incorrect argument>

The issue with the above argument is that the direction learned by the probe won’t align with either the a direction or the a⊕b direction, but will be a linear combination of the two. So here’s how to make the above argument properly: let’s assume that the directions representing a, b, and a⊕b are orthogonal and the variation along these directions is equal (i.e. all of the features are “equally salient”). Then, as shown by the figure below, logistic regression on the train set would learn the direction (v_a + v_{a⊕b})/√2, where v_f is the direction representing a feature f. But this direction gets 50% accuracy on the test set.

Assuming RAX, one would naively expect a linear probe trained on a dataset where b is always false to have 50% accuracy on a test set where b is always true.
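
Here is a toy simulation of this argument (my own sketch, not from the post’s notebook): three orthonormal directions carry a, b, and a⊕b with equal salience, and a probe for a is trained on data where b is always false, then tested where b is always true.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 64, 2000
# Three orthonormal "feature directions" for a, b, and a XOR b.
Q, _ = np.linalg.qr(rng.standard_normal((d, 3)))
v_a, v_b, v_xor = Q.T

def embed(a, b, noise=0.1):
    """Toy activations: +/-1 along each feature direction, plus noise."""
    s = lambda z: np.where(z, 1.0, -1.0)
    x = np.outer(s(a), v_a) + np.outer(s(b), v_b) + np.outer(s(a ^ b), v_xor)
    return x + noise * rng.standard_normal(x.shape)

a_tr, a_te = rng.random(n) < 0.5, rng.random(n) < 0.5
X_tr = embed(a_tr, np.zeros(n, dtype=bool))   # train: b always false
X_te = embed(a_te, np.ones(n, dtype=bool))    # test:  b always true

probe = LogisticRegression(max_iter=1000).fit(X_tr, a_tr)
print("train accuracy:", probe.score(X_tr, a_tr))  # ~1.0
print("test  accuracy:", probe.score(X_te, a_te))  # ~0.5: at test time the a and
                                                   # a XOR b components cancel out
```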

LLMs linearly represent more than two features, and there will often be many differences between the train set and the test set, but this doesn’t change the basic story: as long as there is any feature which systematically differs between the train and test set (e.g. the train set is sentiment classification for movie reviews and the test set is sentiment classification for product reviews), the argument above would predict that linear probes will completely fail to generalize from train to test.

This is not the result that we typically see: rather, there’s often (not always) considerable generalization from train to test, with generalization getting continuously worse the larger the degree of distributional shift.

In “What’s going on?”, we’ll explore additional assumptions we could enforce which would prevent this argument from going through while still being consistent with RAX. One of these assumptions involves asserting that “basic” feature directions (those corresponding to a and b) are “more salient” than directions representing XORs – that is, the variance along v_a and v_b is larger than the variance along v_{a⊕b}. However, I’ll note that:

  • it’s not obvious why something like this would be true, suggesting that we’re missing a big part of the story for why linear probes ever generalize;

  • even if “basic” feature directions are more salient, the argument here still goes through to a degree, implying a qualitatively new reason to expect poor generalization from linear probes.

I’ll discuss these issues more in “What RAX means for people who work with model internals”.

Models have exponentially more stuff than you thought they did

Let’s say you previously thought that your model was keeping track of three features: a, b, and c. If RAX is true, then it implies that your model is keeping track not only of a⊕b, a⊕c, and b⊕c, but also of a⊕b⊕c (since it is the XOR of a⊕b and c). An easy counting argument shows that the number of multi-way XORs of N features is ~2^N. I think that for most people, learning that models have exponentially more stuff than they previously thought should be a massive, surprising update.
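
As a quick check of that count (one new XOR feature per subset of two or more basic features):

```python
from itertools import combinations

N = 10  # number of basic features
# One multi-way XOR per subset of size >= 2, i.e. 2**N - N - 1 of them.
multi_way_xors = [c for k in range(2, N + 1) for c in combinations(range(N), k)]
print(len(multi_way_xors), 2**N - N - 1)  # 1013 1013, i.e. ~2**N
```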

There are two ways to resist this argument, which I’ll discuss in more depth later in “What’s going on?”:

  • To deny that XORs of basic features are actually using excess model capacity, because they’re being represented linearly “by accident” or as an unintended consequence of some other useful computation. (By analogy, the model automatically linearly represents ANDs of arbitrary features without having to expend extra capacity.)

  • To deny forms of RAX that imply multi-way XORs are linearly represented, with the model somehow knowing to compute a⊕b and b⊕c, but not a⊕b⊕c.

The evidence: RAX is true in every case I’ve checked

Code is in the xor_probing.ipynb file here.

For a while it has been my opinion that the datasets people use to study this stuff are bad, and that the datasets from the CCS paper (also used by the CCS challenges paper) are especially awful. So I’ll instead be using the cities and neg_cities datasets – which I think are really quite good – from my Geometry of Truth paper. These datasets consist of statements about city locations and their negations (see examples below).

I’ll pick half of the statements in cities and prepend “Alice:” to both the statement and its negation in neg_cities; I’ll prepend “Bob:” to the rest. I’ll track three features:

  • has_alice: true for statements that start with “Alice:”, false for statements that start with “Bob:”

  • has_not: true for statements containing the word “not” (so statements from neg_cities) and false otherwise

  • label: true if the statement is a true statement, false otherwise

| Statement | has_alice | has_not | label |
| --- | --- | --- | --- |
| Bob: The city of Kaifeng is in Mexico. | F | F | F |
| Bob: The city of Calgary is in Canada. | F | F | T |
| Bob: The city of Chicago is not in the United States. | F | T | F |
| Bob: The city of Shanghai is not in Mexico. | F | T | T |
| Alice: The city of Tijuana is in the United Arab Emirates. | T | F | F |
| Alice: The city of Bangkok is in Thailand. | T | F | T |
| Alice: The city of Turin is not in Italy. | T | T | F |
| Alice: The city of Oslo is not in Venezuela. | T | T | T |

Working with LLaMA-2-13B, I’ll extract the layer 14 residual stream representations[4] over the final token (all of the statements end with a period, so this is always the period token). Then I’ll train linear probes for a bunch of features. I’m doing an 80/20 train/test split and reporting accuracy on the test set.
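
Here is a rough sketch of this setup (my own reconstruction, not the actual xor_probing.ipynb; the Hugging Face model id, the `statements`/`targets` arrays, and treating `hidden_states[14]` as the layer-14 residual stream are all assumptions):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_ID = "meta-llama/Llama-2-13b-hf"  # assumed model id
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"  # needs `accelerate`
)

def final_token_act(text: str, layer: int = 14) -> torch.Tensor:
    """Residual-stream activation over the final (period) token at `layer`."""
    inputs = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].float().cpu()

# `statements` is a list of "Alice:"/"Bob:"-prefixed (negated) city statements,
# and `targets` holds one boolean per statement for the feature being probed,
# e.g. has_alice ^ has_not. Both are assumed to be loaded elsewhere.
acts = torch.stack([final_token_act(s) for s in statements]).numpy()
X_tr, X_te, y_tr, y_te = train_test_split(acts, targets, test_size=0.2,
                                          random_state=0)
probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print("test accuracy:", probe.score(X_te, y_te))
```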

I’ll note that it’s very important to me that has_alice xor has_not doesn’t have any “natural interpretation” as a feature that the model would want to compute anyway (unlike other cases where the xor of two features might have a natural meaning). has_alice and has_not are just two completely random linearly-represented features that I don’t really expect to have any downstream use to the model. Yet has_alice xor has_not is being represented linearly anyway.

Here’s another experiment: I’ll independently center the datasets of negated and unnegated statements by subtracting off their means. This will make it impossible for a linear probe to get good accuracy for the feature has_not. But linear probes will still be able to get good accuracy for has_alice xor has_not and has_not xor label.
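
A sketch of this centering experiment (again my own reconstruction; `acts`, `has_not`, `has_alice`, and `label` are assumed to be arrays produced as in the sketch above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def center_by_group(acts: np.ndarray, group: np.ndarray) -> np.ndarray:
    """Independently subtract each group's mean (negated vs. unnegated)."""
    out = acts.copy()
    for g in (False, True):
        out[group == g] -= out[group == g].mean(axis=0)
    return out

acts_c = center_by_group(acts, has_not)
for name, target in [("has_not", has_not),
                     ("has_alice xor has_not", has_alice ^ has_not),
                     ("has_not xor label", has_not ^ label)]:
    X_tr, X_te, y_tr, y_te = train_test_split(acts_c, target, test_size=0.2,
                                              random_state=0)
    acc = LogisticRegression(max_iter=2000).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {acc:.2f}")  # per the post: ~chance for has_not, high for the XORs
```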

By the way, if you just train a linear probe to classify ‘label’ on statements starting with “Alice:”, it is able to generalize fine to statements starting with “Bob:”.

But this is not so true for transfer from unnegated to negated statements:

Here are some PCA visualizations for anyone interested.

Top 3 principal components of my dataset, colored according to different criteria.
Principal components 7-10 of my dataset, colored according to the XOR of has_alice and has_not.

I also did variations on this experiment where:

  • “Alice” or “Bob” were appended to the end instead of prepended, and the hidden state was extracted over the “Alice” /​ “Bob” token.

  • The inputs were of the form “[random generation by LLaMA-13B] [“true” or “false”] [“banana” or “shed”]” and the features were has_true and has_banana.

In every case that I looked at, I was able to find linear probes which had perfect or very-near-perfect accuracy for classifying XORs of features.

What does RAX mean for model internals research?

Probably a lot of things. Assuming RAX is true, I’ll give a few takeaways.

Qualitatively new reasons for linear probes not to generalize

Previously, when training linear probes to classify some feature f, the main problem that I worried about was correlations between f and other features which I didn’t want the probe to be sensitive to[5]. For example, since China has disproportionately many large cities, I had to be careful when preparing the cities and neg_cities datasets to ensure that the probe couldn’t use “contains the word ‘China’” as a heuristic for “true.” More subtly, if you are training a probe for f = “true statement vs. false statement”, you need to worry that, if your model also has a feature for f’ = “humans think is true vs. humans think is false”, your probe might instead pick up on f’ since f and f’ are correlated in your training data.

On the other hand, RAX introduces a qualitatively new way that linear probes can fail to learn good directions. Suppose a is a feature you care about (e.g. “true vs. false statements”) and b is some unrelated feature which is constant in your training data (e.g. b = “relates to geography”). Without RAX, you would not expect b to cause any problems: it’s constant on your training data and in particular uncorrelated with a, so there’s no reason for it to affect the direction your probes find. But looking again at the 3D cube plot from before, we see that RAX implies that your probe will instead learn a component along the direction v_{a⊕b}.

Assuming RAX, linear probes will be affected by the presence of unrelated features, even if those features do not vary in the training data.

This is wild. It implies that you can’t find a good direction for your feature unless your training data is diverse with respect to every feature that your LLM linearly represents. In particular, it implies that your probe is less likely to generalize to data where b has a different value than in your training set. And this is true to some degree even if you think that the directions representing basic features (like a and b) are “more salient” in some sense.

Results of probing experiments are much harder to interpret

For a while, interpretability researchers have had a general sense that “you can probe absolutely anything out of NN representations”; this makes it hard to tell what you can conclude from probing experiments. (E.g. just because you can probe model internals for a concept does not imply that the model “actually knows” about that concept.) RAX makes this situation much worse.

For example, I mentioned before that I’ve always disliked the datasets from the original CCS paper. To explain why, let’s look at some example prompt templates:

From appendix I of Discovering Latent Knowledge by Burns et al.

Here [label0]/​[label1] are positive/​negative (in some order), [label] is “positive” in one part of the contrast pair and “negative” in the other, and [text] is an IMDb movie review.

Two issues:

  1. Considering how small the models used in the CCS paper were, I’ve always been skeptical that they were really able to understand these inputs – in my experience, larger models get confused by much simpler inputs.

  2. The sense of true/​false is subtly, but importantly, different in the two prompts shown. In the first prompt “true vs. false” refers to the truth value of a factual statement (“the sentiment of this example is positive”). In the second, it refers to the correctness of an answer to a question. These have always seemed to me like intuitively very different notions of “truth,” and I’ve expected LLMs to track them separately.

Because of my complaints above, I’ve always had a hard time understanding why the experiments in the original CCS paper worked at all; it always felt to me like there was something I didn’t understand going on.

RAX would explain what that something is: features like “has_great xor has_positive” or “has_awesome xor has_positive” are probably very useful heuristics for guessing whether “[movie review] The sentiment of this review is [label]” is a correct statement or not. In other words, if small models have directions which represent XORs of simple features about which words are/​aren’t present in their input, then linear probes on these models should already be able to do quite well!

The point of this example isn’t really about CCS. It’s this: previously one has needed to worry whether linear probes could be cheesing their classification task by aggregating simple token-level heuristics like “inputs that contain the word China are more likely to be true.” But RAX implies that you need to worry about much more complicated token-level heuristics; in principle, these heuristics could be as complicated as “arbitrary boolean functions of token-level features”!

Applications of interpretability need to either have a way to distinguish XORs of features from basic features, or need to be robust to an exponential increase in number of features

Many possible applications of interpretability follow a template like:

  1. Cheaply find a not-too-big collection of features satisfying [property].

  2. Maybe do something expensive (e.g. manual interpretability or a circuits-level analysis) to further narrow this collection down.

  3. Do something with the resulting collection.

For example, if your plan is to solve ELK by probing LLMs for whether they believe statements to be true, then (1) is “find a bunch of probes which are accurate for classifying true vs. false on the training data,” (2) is “somehow figure out which of these probes generalize in the desired way” (e.g., you need to weed out probes which are too sensitive to features like “smart humans think X is true”), and (3) is “use the resulting probe.”

If you don’t have a way of explaining why directions representing XORs of features are different from other directions, then your collection from step (1) might be exponentially larger than you were anticipating. If your step (2) isn’t able to deal with this well, then your application won’t work.

One way that XOR directions could be different is for them to be “more salient”; this is discussed further below.

What’s going on?

In this section I’ll try to build new world models which could explain both (a) the empirical evidence for RAX, and (b) the empirical observations that linear probes often generalize beyond their training distribution. Overall, I’m not really satisfied with any explanation and am pretty confused about what’s going on.

Basic features are more salient than XORs

We’ll say that a direction is “more salient” if the model’s representations have greater variation along this direction. If it’s true that basic feature directions are more salient than directions corresponding to XORs of basic features, this mitigates (but does not entirely eliminate) the problems that XOR directions pose for linear probe generalization. To see this, imagine stretching the 3D cube plot out along the a and b directions, but not the a⊕b direction – the result is better alignment between the two arrows.

The less salient the a⊕b direction, the better linear probes should generalize.
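
Here is a variant of the earlier toy simulation (again my own illustration) in which the variance along the a⊕b direction is scaled down: a probe for a trained where b is always false generalizes better and better to b-always-true data as the XOR direction becomes less salient.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 64, 2000
Q, _ = np.linalg.qr(rng.standard_normal((d, 3)))
v_a, v_b, v_xor = Q.T  # orthonormal directions for a, b, and a XOR b

def embed(a, b, xor_salience, noise=0.1):
    s = lambda z: np.where(z, 1.0, -1.0)
    x = (np.outer(s(a), v_a) + np.outer(s(b), v_b)
         + xor_salience * np.outer(s(a ^ b), v_xor))
    return x + noise * rng.standard_normal(x.shape)

for lam in [1.0, 0.9, 0.7, 0.5, 0.2]:
    a_tr, a_te = rng.random(n) < 0.5, rng.random(n) < 0.5
    X_tr = embed(a_tr, np.zeros(n, dtype=bool), lam)  # train: b always false
    X_te = embed(a_te, np.ones(n, dtype=bool), lam)   # test:  b always true
    probe = LogisticRegression(max_iter=1000).fit(X_tr, a_tr)
    print(f"xor salience {lam:.2f} -> test accuracy {probe.score(X_te, a_te):.2f}")

# Test accuracy rises from ~0.5 (equal salience) toward ~1.0 as lam shrinks.
```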

Empirically this seems to be true to some degree: in the visualizations above, has_alice and has_not seem to be represented along the 3rd and 1st PCs, respectively, whereas has_alice XOR has_not only starts to be visible when looking at PCs 6+.

The big question here is “why would basic feature directions be more salient?” I’ll discuss two possibilities.

Maybe a⊕b is represented “incidentally” because NN representations are high-dimensional with lots of stuff represented by chance

More concretely: “Assuming that a and b are linearly represented, later-layer representations will be made up of linear functions applied to nonlinearities applied to linear functions applied to nonlinearities applied to … linear functions of a and b. This seems like the sort of process that might, with high probability, end up producing a representation where some direction will be good for classifying a⊕b.” In this case, we would expect the corresponding direction to not be very salient (because the model isn’t intentionally computing it).

I think this explanation is not correct. I reran my experiments from above on a “reset” version of LLaMA-2-13B. What this means is that, for each parameter in LLaMA-2-13B, I shuffled the weights of that parameter by permuting them along the last dimension[6] (see the sketch after the results below). The results:

  • The “token-level” features (has_alice and has_not) are still linearly represented.

    • (This is not surprising: even with randomized embeddings, the embedding of the ‘Alice’ token is still the same every time it appears.)

  • (Key observation) “has_alice xor has_not” does not seem to be linearly represented. Given that has_alice and has_not are linearly represented, a linear probe can automatically get at least 0.75 accuracy on “has_alice xor has_not” by being a “has_alice or has_not” probe. The probe trained here does not beat that baseline.

  • (Unsurprising side note) The “label” feature (which tracks whether the factual statement is true or false) is not linearly represented. This is just a reflection of the fact that you can’t cheese the true vs. false task here by aggregating simple heuristics based on token-level features (e.g. treating statements containing “China” as more likely to be true).
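
Here is a minimal sketch of that weight-reset procedure (my own reconstruction; whether each parameter gets a single shared permutation along its last dimension, as here, or independent per-row shuffles is an assumption):

```python
import torch

def reset_model_(model: torch.nn.Module, seed: int = 0) -> None:
    """Shuffle every parameter along its last dimension, in place.

    This destroys everything learned during training while preserving
    per-parameter weight statistics (cf. footnote 6).
    """
    g = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for p in model.parameters():
            perm = torch.randperm(p.shape[-1], generator=g).to(p.device)
            p.copy_(p[..., perm])

# Usage: reset_model_(model), then rerun the probing sketch from earlier.
```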

Maybe a⊕b is represented “incidentally” because it’s possible to aggregate noisy signals from many features which are correlated with boolean functions of a and b

Unlike the explanation in the previous section, this explanation relies on leveraging actually useful computation that we think the model is plausibly doing, so it isn’t falsified by the reset network experiments (where the model isn’t doing any useful computation).

At a high level, the idea here is that, even if there’s no reason for the model to compute a⊕b, there might be a reason for the model to compute other features which are more correlated with a⊕b than they are with a or b individually. In this case, linear probes might be able to extract a good signal for a⊕b.

Here’s a more detailed explanation (feel free to skip).

Suppose a∧b has a natural interpretation as a feature that the model would want to track and do downstream computation with, e.g. if a = “first name is Michael” and b = “last name is Jordan” then a∧b can be naturally interpreted as “is Michael Jordan”. In this case, it wouldn’t be surprising if the model computed this AND as a new feature f = a∧b (via some nonlinearity) and stored the result along some direction v_f independent of v_a and v_b. Assuming the model has done this, we could then linearly extract a⊕b with the probe

α(v_a + v_b) − β v_f

for some appropriate α and β.[7] This also works just as well if the feature f doesn’t match a∧b in general, but is perfectly correlated with a∧b on the data distribution we’re working with.
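
A quick arithmetic check of why a probe of this shape can work (my own illustration, using the specific reconstruction above with α = 1 and β = 2): over 0/1-valued booleans, a⊕b = a + b − 2(a∧b), so once f = a∧b is stored along its own direction, XOR becomes a linear read-off.

```python
# Verify a XOR b == a + b - 2*(a AND b) over all boolean combinations.
for a in (0, 1):
    for b in (0, 1):
        f = a & b                        # the nonlinearly-computed AND feature
        assert (a ^ b) == a + b - 2 * f
print("a XOR b is a linear function of a, b, and a AND b")
```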

In the experiments above, a and b were pretty random features (e.g. (a, b) = (has_alice, has_not) or (a, b) = (has_true, has_banana)) with no natural interpretation for a∧b; so it would be surprising if the LLM were computing and linearly representing a∧b along an independent direction, for the same reasons it would be surprising if the LLM were doing this for a⊕b. But perhaps there are many, many linearly represented features f_1, …, f_k, each of which has some correlation with a∧b above-and-beyond[8] its correlations with a or b individually. Then it might be possible to make the same approach as above work by aggregating the signals from all of the f_i. Similar approaches will work upon replacing AND with OR, NOR, or most other boolean functions of a and b.

In this case, since XOR is represented “incidentally”, I would expect the variation along the direction representing a⊕b to be much smaller than the variance along the directions for a and b.

Considering that the XOR probes from the experiments have perfect or near-perfect accuracy, I think an explanation like this would be a bit surprising, since it would require either (a) a large number of features which have the right correlational relationship to a⊕b, or (b) a small number of such features with the right correlations and very little noise. I think both (a) and (b) would be surprising given that a and b are just random features – why would there be many features which are strongly correlated with a⊕b but only weakly correlated with a and b individually?

Nevertheless, I think this is currently the explanation that I put the most weight on.

Maybe models track which features are basic and enforce that these features be more salient

In other words, maybe the LLM is recording somewhere the information that a and b are basic features; then when it goes to compute a⊕b, it artificially makes this direction less salient. And when the model computes a new basic feature as a boolean function of other features, it somehow notes that this new feature should be treated as basic and artificially increases the salience along the new feature direction.

If true, this would be a big deal: if we could figure out how the model is distinguishing between basic feature directions and other directions, we might be able to use that to find all of the basic feature directions. But mostly this is a bit wacky and too-clean to be something that I expect real LLMs actually do.

Models compute a bunch, but not all, XORs in a way that we don’t currently understand

To give an example of what I mean by this hypothesis class, here’s a hypothetical way that a transformer might work:

  • In layers 0-5, the MLPs indiscriminately compute all XORs of arbitrary features (i.e., RAX is true in the earlier layers).

  • After layer 5, the model only computes new features when there’s a reason to do so.

This is wacky but seems like a plausible thing a model might do: by doing this, the model would be able to, in later layers, make use of arbitrary boolean functions of early layer features.

This explanation would explain the representation of XORs of token-level features like “has_alice xor has_not”, but wouldn’t necessarily explain features like “has_alice xor label”.

That said, other hypotheses of this shape seem possible, e.g. “XORs among features in the same attention head are computed” or other weird stuff like this.

  1. ^

    To be clear, this is not a direct quote, and Rohin explicitly clarified that he didn’t expect this to be true for arbitrary features a and b. Rohin only claimed that this was true in the case they were studying, and that he would guess “taking XORs of features” is a common motif in NNs.

  2. ^

    E.g. suppose the model is unaware of some feature f, but does have a direction corresponding to some feature f’ which is perfectly correlated with f in our data. According to the definition I use in this post, the model linearly represents f; this is not the way I would usually use the term.

  3. ^

    Throughout, I’ll always draw directions as if they’re orthogonal directions in the model’s latent space. It’s indeed the case that the model might represent features in superposition, so that these directions are not orthogonal, or even linearly independent. But that doesn’t change the basic dynamic: that the model must allocate additional capacity in order to represent the feature a⊕b.

  4. ^

    Chosen to be the same hidden state as in my Geometry of Truth paper.

  5. ^

    When taking into account superposition among features, there are subtle geometrical issues one needs to worry about as well, which I discuss in section 4.1 of my truth paper.

  6. ^

    Another option would have been to just reinitialize the weights according to some distribution. Resetting the network in this way is a bit more principled for experiments of this sort, because it erases everything the model learned during training, but maintains lots of the basic statistical properties of the NN weights.

  7. ^

    The nonlinearity in the computation of f is essential for this to work.

  8. ^

    This above-and-beyond is needed for the same reason that the nonlinearity above was needed.