Research Scientist at DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
Rohin Shah
AtP*: An efficient and scalable method for localizing LLM behaviour to components
I feel like a lot of these arguments could be pretty easily made of individual AI safety researchers. E.g.
Misaligned Incentives
In much the same way that AI systems may have perverse incentives, so do the [AI safety researchers]. They are [humans]. They need to make money, [feed themselves, and attract partners]. [Redacted and redacted even just got married.] This type of accountability to [personal] interests is not perfectly in line with doing what is good for human interests. Moreover, [AI safety researchers are often] technocrats whose values and demographics do not represent humanity particularly well. Optimizing for the goals that the [AI safety researchers] have is not the same thing as optimizing for human welfare. Goodhart’s Law applies.
I feel pretty similarly about most of the other arguments in this post.
Tbc I think there are plenty of things one could reasonably critique scaling labs about, I just think the argumentation in this post is by and large off the mark, and implies a standard that if actually taken literally would be a similarly damning critique of the alignment community.
(Conflict of interest notice: I work at Google DeepMind.)
Sounds reasonable, though idk what you think realistic values of N are (my wild guess with hardly any thought is 15 minutes − 1 day).
EDIT: Tbc in the 1 day case I’m imagining that most of the time goes towards running the experiment—it’s more a claim about what experiments we want to run. If we just talk about the time to write the code and launch the experiment I’m thinking of N in the range of 5 minutes to 1 hour.
Cool, that all roughly makes sense to me :)
I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn’t feel that crazy to me, I already do a moderate amount of multi-tasking.
Multi-tasking where you are responsible for the entire design of the project? (Designing the algorithm, choosing an experimental setting and associated metrics, knowing the related work, interpreting the results of the experiments, figuring out what the next experiment should be, …)
Suppose today I gave you a device where you put in moderately detailed instructions for experiments, and the device returns the results[1] with N minutes of latency and infinite throughput. Do you think you can spend 1 working day using this device to produce the same output as 4 copies of yourself working in parallel for a week (and continue to do that for months, after you’ve exhausted low-hanging fruit)?
… Having written this hypothetical out, I am finding it more plausible than before, at least for small enough N, though it still feels quite hard at e.g. N = 60.
[1] The experiments can’t use too much compute. No solving the halting problem.
I agree it helps to run experiments at small scales first, but I’d be pretty surprised if that helped to the point of enabling a 30x speedup; that means that the AI labor allows you to get a 30x improvement in compute needed beyond what would be done by default by humans (though the 30x can include e.g. improving utilization, it’s not limited just to making individual experiments take less time).
I think the most plausible case for your position would be that the compute costs for ML research scale much less than quadratically with the size of the pretrained model, e.g. maybe (1) finetuning starts taking fewer data points as model size increases (sample efficiency improves with model capability), and so finetuning runs become a rounding error on compute, and (2) the vast majority of ML research progress involves nothing more expensive than finetuning runs. (Though in this world you have to wonder why we keep training bigger models instead of just investing solely in better finetuning the current biggest model.)
Another thing that occurred to me is that latency starts looking like another major bottleneck. Currently it seems feasible to make a paper’s worth of progress in ~6 months. With a 30x speedup, you now have to do that in 6 days. At that scale, introducing additional latency via experiments at small scales is a huge cost.
(I’m assuming here that the ideas and overall workflow are still managed by human researchers, since your hypothetical said that the AIs are just going from high level ideas to implemented experiments. If you have fully automated AI researchers then they don’t need to optimize latency as hard; they can instead get 30x speedup by having 30x as many researchers working but still producing a paper every 6 months.)
(Another possibility is that human ML researchers get really good at multi-tasking, and so e.g. they have 5 paper-equivalents at any given time, each of which takes 30 calendar days to complete. But I don’t believe that (most) human ML researchers are that good at multitasking on research ideas, and there isn’t that much time for them to learn.)
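To make the latency arithmetic above concrete, here is a rough back-of-the-envelope sketch. It assumes a month is about 30 days, and the function names are made up for illustration; the only inputs taken from the discussion are the ~6 months per paper, the 30x speedup, and the 5 concurrent paper-equivalents.

```python
# Back-of-the-envelope latency arithmetic for the 30x speedup scenario.
# Assumption (not from the discussion): a month is rounded to 30 days.
DAYS_PER_MONTH = 30


def days_per_paper(baseline_days: float, speedup: float) -> float:
    """Time between finished papers if research output is sped up by `speedup`."""
    return baseline_days / speedup


def calendar_days_per_project(n_parallel: int, days_between_papers: float) -> float:
    """Calendar time each project may take if `n_parallel` projects are in
    flight and one has to finish every `days_between_papers` days."""
    return n_parallel * days_between_papers


baseline_days = 6 * DAYS_PER_MONTH  # ~6 months per paper today
serial_days = days_per_paper(baseline_days, speedup=30)
multitask_days = calendar_days_per_project(n_parallel=5, days_between_papers=serial_days)

print(f"one paper every {serial_days:.0f} days if done serially")
print(f"{multitask_days:.0f} calendar days per paper with 5 projects in flight")
```

The point of the sketch is only that the human either has to close the loop on a paper-equivalent every 6 days, or juggle several projects at once to buy back calendar time per project.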
It also seems hard for the human researchers to have ideas good enough to turn into paper-equivalents every 6 days. Also hard for those researchers to keep on top of the literature well enough to be proposing stuff that actually makes progress rather than duplicating existing work they weren’t aware of, even given AI tools that help with understanding the literature.
Further, the current scaling laws imply huge inference availability if huge amounts of compute are used for training.
Tbc the fact that running your automated ML implementers takes compute was a side point; I’d be making the same claims even if running the AIs was magically free.
Though even at a billion token-equivalents per second it seems plausible to me that your automated ML experiment implementers end up being a significant fraction of that compute. It depends quite significantly on how capable a single forward pass is, e.g. can the AI just generate an entire human-level pull request autoregressively (i.e. producing each token of the PR one at a time, without going back to fix errors) vs does it do similar things as humans (write tests and code, test, debug, eventually submit) vs. does it do way more iteration and error correction than humans (in parallel to avoid crazy high latency), do we use best-of-N sampling or similar tricks to improve quality of generations, etc.
I think ML research in particular can plausibly be accelerated by maybe 30x by only making it extremely fast and cheap to go from high level ideas to implemented experiments (rather than needing to generate these high level ideas)
Why doesn’t compute become the bottleneck well before the 30x mark? It seems like the AIs have to be superhuman at something to overcome that bottleneck (rather than just making it fast and cheap to implement experiments). Indeed the AIs make the problem somewhat worse, since you have to spend compute to run the AIs.
Come on, the claim “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to” absent any other qualifiers seems pretty clearly false. It is pretty important to qualify that you are talking about deceptive alignment or backdoors specifically (e.g. I’m on board with Ryan’s phrasing).
There’s a huge disanalogy between your paper’s setup and deception-in-general, which is that in your paper’s setup there is no behavioral impact at training time. Deception-in-general (e.g. sycophancy) often has behavioral impacts at training time and that’s by far the main reason to expect that we could address it.
Fwiw I thought the paper was pretty good at being clear that it was specifically deceptive alignment and backdoors that the claim applied to. But if you’re going to broaden that to a claim like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to” without any additional qualifiers I think that’s a pretty big overclaim, and also I want to bet you on whether we can reduce sycophancy today.
I think you mostly need to hope that it doesn’t matter (because the crazy XOR directions aren’t too salient) or come up with some new idea.
Yeah certainly I’d expect the crazy XOR directions aren’t too salient.
I’ll note that if it ends up these XOR directions don’t matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques). I know that for CCS you’re more worried about issues around correlations with features like true_according_to_Alice, but my feeling is that we might be able to handle spurious features that are that crazy and numerous, but not spurious features as crazy and numerous as these XORs.
Imo “true according to Alice” is nowhere near as “crazy” a feature as “has_true XOR has_banana”. It seems useful for the LLM to model what is true according to Alice! (Possibly I’m misunderstanding what you mean by “crazy” here.)
I’m not against linear probing techniques in general. I like linear probes, they seem like a very useful tool. I also like contrast pairs. But I would basically always use these techniques in a supervised way, because I don’t see a great reason to expect unsupervised methods to work better.
If I had to articulate my reason for being surprised here, it’d be something like:
I didn’t expect LLMs to compute many XORs incidentally
I didn’t expect LLMs to compute many XORs because they are useful
but lots of XORs seem to get computed anyway.
This is reasonable. My disagreement is mostly that I think LLMs are complicated things and do lots of incidental stuff we don’t yet understand. So I shouldn’t feel too surprised by any given observation that could be explained by an incidental hypothesis. But idk it doesn’t seem like an important point.
Yeah, agreed that’s a clear overclaim.
In general I believe that many (most?) people take it too far and make incorrect inferences—partly on priors about popular posts, and partly because many people including you believe this, and those people engage more with the Simulators crowd than I do.
Fwiw I was sympathetic to nostalgebraist’s positive review saying:
sometimes putting a name to what you “already know” makes a whole world of difference. [...] I see these takes, and I uniformly respond with some version of the sentiment “it seems like you aren’t thinking of GPT as a simulator!”
I think in all three of the linked cases I broadly directionally agreed with nostalgebraist, and thought that the Simulator framing was at least somewhat helpful in conveying the point. The first one didn’t seem that important (it was critiquing imo a relatively minor point), but the second and third seemed pretty direct rebuttals of popular-ish views. (Note I didn’t agree with all of what was said, e.g. nostalgebraist doesn’t seem at all worried about a base GPT-1000 model, whereas I would put some probability on doom for malign-prior reasons. But this feels more like “reasonable disagreement” than “wildly misled by simulator framing”.)
Yeah, I would be surprised if this is a good first-order approximation of what is going on inside an LLM. Or maybe you mean this in a non-mechanistic way?
Yes, I definitely meant this in the non-mechanistic way. Any mechanistic claims that sound simulator-flavored based just on the evidence in this post sound clearly overconfident and probably wrong. I didn’t reread this post carefully but I don’t remember seeing mechanistic claims in it.
I agree that in a non-mechanistic way, the above will produce reasonable predictions, but that’s because that’s basically a description of the task the LLM is trained on. [...]
I mostly agree and this is an aspect of what I mean by “this post says obvious and uncontroversial things”. I’m not particularly advocating for this post in the review; I didn’t find it especially illuminating.
To give a concrete counterexample to the algorithm you propose for predicting what an LLM does next. Current LLMs have a broader knowledge base than any human alive. This means the algorithm of “figure out what real-world process would produce text like this” can’t be accurate
This seems somewhat in conflict with the previous quote?
Re: the concrete counterexample, yes I am in fact only making claims about base models; I agree it doesn’t work for RLHF’d models. Idk how you want to weigh the fact that this post basically just talks about base models in your review, I don’t have a strong opinion there.
I think it is in fact hard to get a base model to combine pieces of knowledge that tend not to be produced by any given human (e.g. writing an epistemically sound rap on the benefits of blood donation), and that often the strategy to get base models to do things like this is to write a prompt that makes it seem like we’re in the rare setting where text is being produced by an entity with those abilities.
The thing that’s confusing here is that the two-way XORs that my experiments are looking at just seem clearly not useful for anything.
Idk, I think it’s pretty hard to know what things are and aren’t useful for predicting the next token. For example, some of your features involve XORing with a “has_not” feature—XORing with an indicator for “not” might be exactly what you want to do to capture the effect of the “not”.
(Tbc here the hypothesis could be “the model computes XORs with has_not all the time, and then uses only some of them”, so it does have some aspect of “compute lots of XORs”, but it is still a hypothesis that clearly by default doesn’t produce multiway XORs.)
In contrast, the point I’m trying to make in the post is that RAX can cause problems even in the absence of spurious correlations like this.[1]
If you want you could rephrase this issue as “a and a XOR b are spuriously correlated in training,” so I guess I should say “even in the absence of spurious correlations among basic features.”
… That’s exactly how I would rephrase the issue and I’m not clear on why you’re making a sharp distinction here.
As you noted, it will sometimes be the case that XOR features are more like basic features than derived features, and thus will be represented with high salience. I think incidental hypotheses will have a really hard time explaining this—do you agree?
I mean, I’d say the ones that are more like basic features are like that because it was useful, and it’s all the other XORs that are explained by incidental hypotheses. The incidental hypotheses shouldn’t be taken to be saying that all XORs are incidental, just the ones which aren’t explained by utility. Perhaps a different way of putting it is that I expect both utility and incidental hypotheses to be true to some extent.
Maybe on your model this is something simple like the weights computing the basic features being larger than weights computing derived features? If so, that’s the tracking I’m talking about, and is a potential thread to pull on for distinguishing basic vs. derived features using model internals.
Yes, on my model it could be something like the weights for basic features being large. It’s not necessarily that simple, e.g. it could also be that the derived features are in superposition with a larger number of other features that leads to more interference. If you’re calling that “tracking”, fair enough I guess; my main claim is that it shouldn’t be surprising. I agree it’s a potential thread for distinguishing such features.
I think the main thing I’d point to is this section (where I’ve changed bullet points to numbers for easier reference):
I can’t convey all that experiential data here, so here are some rationalizations of why I’m partial to the term, inspired by the context of this post:
The word “simulator” evokes a model of real processes which can be used to run virtual processes in virtual reality.
It suggests an ontological distinction between the simulator and things that are simulated, and avoids the fallacy of attributing contingent properties of the latter to the former.
It’s not confusing that multiple simulacra can be instantiated at once, or an agent embedded in a tragedy, etc.
It does not imply that the AI’s behavior is well-described (globally or locally) as expected utility maximization. An arbitrarily powerful/accurate simulation can depict arbitrarily hapless sims.
It does not imply that the AI is only capable of emulating things with direct precedent in the training data. A physics simulation, for instance, can simulate any phenomena that plays by its rules.
It emphasizes the role of the model as a transition rule that evolves processes over time. The power of factored cognition / chain-of-thought reasoning is obvious.
It emphasizes the role of the state in specifying and constructing the agent/process. The importance of prompt programming for capabilities is obvious if you think of the prompt as specifying a configuration that will be propagated forward in time.
It emphasizes the interactive nature of the model’s predictions – even though they’re “just text”, you can converse with simulacra, explore virtual environments, etc.
It’s clear that in order to actually do anything (intelligent, useful, dangerous, etc), the model must act through simulation of something.
I think (2)-(8) are basically correct, (1) isn’t really a claim, and (9) seems either false or vacuous. So I mostly feel like the core thesis as expressed in this post is broadly correct, not wrong. (I do feel like people have taken it further than is warranted, e.g. by expecting internal mechanisms to actually involve simulations, but I don’t think those claims are in this post.)
I also think it does in fact constrain expectations. Here’s a claim that I think this post points to: “To predict what a base model will do, figure out what real-world process was most likely to produce the context so far, then predict what text that real-world process would produce next, then adopt that as your prediction for what GPT would do”. Taken literally this is obviously false (e.g. you can know that GPT is not going to factor a large semiprime). But it’s a good first-order approximation, and I would still use that as an important input if I were to predict today how a base model is going to continue to complete text.
(Based on your other comments maybe you disagree with the last paragraph? That surprises me. I want to check that you are specifically thinking of base models and not RLHF’d or instruction tuned models.)
Personally I agree with janus that these are (and were) mostly obvious and uncontroversial things—to people who actually played with / thought about LLMs. But I’m not surprised that LWers steeped in theoretical / conceptual thinking about EU maximizers and instrumental convergence without much experience with practical systems (at least at the time this post was written) found these claims / ideas to be novel.
Nice post, and glad this got settled experimentally! I think it isn’t quite as counterintuitive as you make it out to be—the observations seem like they have reasonable explanations.
I feel pretty confident that there’s a systematic difference between basic features and derived features, where the basic features are more “salient”—I’ll be assuming such a distinction in the rest of the comment.
(I’m saying “derived” rather than “XOR” because it seems plausible that some XOR features are better thought of as “basic”, e.g. if they were very useful for the model to compute. E.g. the original intuition for CCS is that “truth” is a basic feature, even though it is fundamentally an XOR in the contrast pair approach.)
For the more mechanistic explanations, I want to cluster them into two classes of hypotheses:
Incidental explanations: Somehow “high-dimensional geometry” and “training dynamics” means that by default XORs of basic features end up being linearly represented as a side effect / “by accident”. I think Fabien’s experiments and Hoagy’s hypothesis fit here.
I think most mechanistic explanations here will end up implying a decay postulate that says “the extent to which an incidental-XOR happens decays as you have XORs amongst more and more basic features”. This explains why basic features are more salient than derived features.
Utility explanations: Actually it’s often quite useful for downstream computations to be able to do logical computations on boolean variables, so during training there’s a significant incentive to represent the XOR to make that happen.
Here the reason basic features are more salient is that basic features are more useful for getting low loss, and so the model allocates more of its “resources” to those features. For example, it might use more parameter norm (penalized by weight decay) to create higher-magnitude activations for the basic features.
I think both of the issues you raise have explanations under both classes of hypotheses.
Exponentially many features:
An easy counting argument shows that the number of multi-way XORs of N features is ~2^N. [...] There are two ways to resist this argument, which I’ll discuss in more depth later in “What’s going on?”:
To deny that XORs of basic features are actually using excess model capacity, because they’re being represented linearly “by accident” or as an unintended consequence of some other useful computation. (By analogy, the model automatically linearly represents ANDs of arbitrary features without having to expend extra capacity.)
To deny forms of RAX that imply multi-way XORs are linearly represented, with the model somehow knowing to compute a XOR b and a XOR c, but not a XOR b XOR c.
While I think the first option is possible, my guess is that it’s more like the second option.
On incidental explanations, this is explained by the decay postulate. For example, maybe once you hit 3-way XORs, the incidental thing is much less likely to happen, and so you get ~N^2 pairwise XORs instead of the full ~2^N set of multi-way XORs.
On utility explanations, you would expect that multi-way XORs are much less useful for getting low loss than two-way XORs, and so computation for multi-way XORs is never developed.
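For a sense of scale on the counting argument, here is a small sketch comparing the number of pairwise XORs with the number of all multi-way XORs of N basic features. The function names are just for illustration, and I’m counting a “multi-way XOR” as any subset of at least two features, which is one reasonable reading of the argument rather than something the post pins down.

```python
from math import comb


def pairwise_xors(n: int) -> int:
    """Number of distinct two-way XORs among n basic features: C(n, 2) ~ n^2 / 2."""
    return comb(n, 2)


def multiway_xors(n: int) -> int:
    """Number of XORs over subsets of at least two features: 2^n - n - 1 ~ 2^n.

    Subtracts the empty set (1) and the n singletons, which are just the
    basic features themselves.
    """
    return 2**n - n - 1


for n in (3, 10, 20, 40):
    print(f"N={n:>2}  pairwise={pairwise_xors(n):>4}  multi-way={multiway_xors(n)}")
```

Even at modest N the gap is enormous, which is why a decay postulate (or a utility argument) that stops at two-way XORs removes most of the capacity worry.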
Generalization:
logistic regression on the train set would learn the direction v_a + v_{a XOR b}, where v_f is the direction representing a feature f. [...] the argument above would predict that linear probes will completely fail to generalize from train to test. This is not the result that we typically see [...]
One of these assumptions involves asserting that “basic” feature directions (those corresponding to a and b) are “more salient” than directions representing XORs – that is, the variance along v_a and v_b is larger than variance along v_{a XOR b}. However, I’ll note that:
it’s not obvious why something like this would be true, suggesting that we’re missing a big part of the story for why linear probes ever generalize;
even if “basic” feature directions are more salient, the argument here still goes through to a degree, implying a qualitatively new reason to expect poor generalization from linear probes.
For the first point I’d note that (1) the decay postulate for incidental explanations seems so natural and (2) the “derived features are less useful than basic features and so have less resources allocated to them” seems sufficient for utility explanations.
For the second point, I’m not sure that the argument does go through. In particular you now have two possible outs:
Maybe if v_a is twice as salient as v_{a XOR b}, you learn a linear probe that is entirely v_a, or close enough to it (e.g. if it is exponentially closer). I’d guess this isn’t the explanation, but I don’t actually know what linear probe learning theory predicts here.
Even if you do learn v_a + v_{a XOR b}, it doesn’t seem obvious that test accuracy should be < 100%. In particular, if v_a is more salient by having activations that are twice as large, then it could be that even when b flips from 0 to 1 and v_{a XOR b} is reversed, v_a still overwhelms v_{a XOR b} and so every input is still classified correctly (with slightly less confidence than before).
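Here is a toy numerical sketch of that second out. Everything in it is an assumption for illustration: activations live in 2 dimensions, one axis encodes a and the other encodes a XOR b (each as plus or minus a “salience” magnitude), and the probe is fixed to the sum of the two directions rather than actually being trained. It is not a claim about what a real probe or a real model does.

```python
import numpy as np


def activations(a, b, salience_a, salience_xor):
    """Toy activations: axis 0 carries a, axis 1 carries a XOR b,
    each encoded as +/- its salience (hypothetical setup)."""
    a = np.asarray(a)
    b = np.asarray(b)
    sign_a = 2 * a - 1            # 0/1 -> -1/+1
    sign_xor = 2 * (a ^ b) - 1    # 0/1 -> -1/+1
    return np.stack([salience_a * sign_a, salience_xor * sign_xor], axis=1)


def probe_accuracy(salience_a, salience_xor, b_value):
    """Accuracy and raw scores of the fixed probe v_a + v_{a XOR b}
    at predicting a, on inputs where b is held at `b_value`."""
    a = np.array([0, 1])
    b = np.full_like(a, b_value)
    x = activations(a, b, salience_a, salience_xor)
    probe = np.array([1.0, 1.0])  # v_a + v_{a XOR b}, assumed rather than learned
    scores = x @ probe
    preds = (scores > 0).astype(int)
    return float((preds == a).mean()), scores


train_acc, train_scores = probe_accuracy(2.0, 1.0, b_value=0)
test_acc, test_scores = probe_accuracy(2.0, 1.0, b_value=1)
equal_acc, equal_scores = probe_accuracy(1.0, 1.0, b_value=1)

print("train (b=0):", train_acc, train_scores)
print("test  (b=1):", test_acc, test_scores)
print("equal salience, test (b=1):", equal_acc, equal_scores)
```

In this toy, with a twice as salient, the scores shrink from magnitude 3 to magnitude 1 when b flips but keep their sign, so accuracy stays at 100%. With equal salience the two terms cancel and the probe is at chance.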
On the other hand, RAX introduces a qualitatively new way that linear probes can fail to learn good directions. Suppose a is a feature you care about (e.g. “true vs. false statements”) and b is some unrelated feature which is constant in your training data (e.g. b = “relates to geography”). [...]
This is wild. It implies that you can’t find a good direction for your feature unless your training data is diverse with respect to every feature that your LLM linearly represents.
Fwiw, failures like this seem plausible without RAX as well. We explicitly make this argument in our goal misgeneralization paper (bottom of page 9 / Section 4.2), and many of our examples follow this pattern (e.g. in Monster Gridworld, you see a distribution shift from “there is almost always a monster present” in training to “there are no monsters present” at test time).
I agree strong RAX without any saliency differences between features would imply this problem is way more widespread than it seems to be in practice, but I don’t think it’s a qualitatively new kind of generalization failure (and also I think strong RAX without saliency differences is clearly false).
Maybe models track which features are basic and enforce that these features be more salient
In other words, maybe the LLM is recording somewhere the information that a and b are basic features; then when it goes to compute a XOR b, it artificially makes this direction less salient. And when the model computes a new basic feature as a boolean function of other features, it somehow notes that this new feature should be treated as basic and artificially increases the salience along the new feature direction.
I don’t think the model has to do any active tracking; on both hypotheses this happens by default (in incidental explanations, because of the decay postulate, and in utility explanations, because the feature is less useful and so fewer resources go towards computing it).
Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5)
Fact Finding: How to Think About Interpreting Memorisation (Post 4)
Fact Finding: Trying to Mechanistically Understanding Early MLPs (Post 3)
Fact Finding: Simplifying the Circuit (Post 2)
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)
Are you saying that this claim is supported by PCA visualizations you’ve done?
Yes, but they’re not in the paper. (I also don’t remember if these visualizations were specifically on banana/shed or one of the many other distractor experiments we did.)
I’ll say that I’ve done a lot of visualizing true/false datasets with PCA, and I’ve never noticed anything like this, though I never had as clean a distractor feature as banana/shed.
It is important for the distractor to be clean (otherwise PCA might pick up on other sources of variance in the activations as the principal components).
More broadly, it seems like you’re saying that you think in general, when LLMs have linearly-represented features a and b they will also tend to linearly represent the feature a XOR b. Taking this as an empirical claim about current models, this would be shocking.
I don’t want to make a claim that this will always hold; models are messy and there could be lots of confounders that make it not hold in general. For example, the construction I mentioned uses 3 dimensions to represent 2 variables; maybe in some cases this is too expensive and the model just uses 2 dimensions and gives up the ability to linearly read arbitrary functions of those 2 variables. Maybe it’s usually not helpful to compute boolean functions of 2 boolean variables, but in the specific case where you have a statement followed by Yes / No it’s especially useful (e.g. because the truth value of the Yes / No is the XOR of No / Yes with the truth value of the previous sentence).
My guess is that this is a motif that will reoccur in other natural contexts as well. But we haven’t investigated this and I think of it as speculation.
For example, if I’ve done my geometry right, this would predict that if you train a supervised probe (e.g. with logistic regression) to classify a = 0 vs a = 1 on a dataset where b = 0, the resulting probe should get ~50% accuracy on a test dataset where b = 1. And this should apply for any features a, b. But this is certainly not the typical case, at least as far as I can tell!
If you linearly represent a, b, and a XOR b, then given this training setup you could learn a classifier that detects the a direction or the a XOR b direction or some mixture between the two. In general I would expect that the a direction is more prominent / more salient / cleaner than the a XOR b direction, and so it would learn a classifier based on that, which would lead to ~100% accuracy on the test dataset.
If you use normalization to eliminate the a direction as done in CCS, then I expect you learn a classifier aligned with the a XOR b direction, and you get ~0% accuracy on the test dataset. This isn’t the typical result, but it also isn’t the typical setup; it’s uncommon to use normalization to eliminate particular directions.
(Similarly, if you don’t do the normalization step in CCS, my guess is that nearly all of our experiments would just show CCS learning the probe, rather than the probe.)
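A tiny sketch of why a probe aligned with an XOR direction would flip from ~100% to ~0% accuracy when the distractor changes. This assumes, purely for illustration, that normalization has removed the direction for the target feature a entirely, so the only signal left is the sign of a XOR b; the function name and setup are hypothetical and nothing here is trained.

```python
import numpy as np


def xor_probe_accuracy(b_value: int) -> float:
    """Accuracy at predicting a for a probe that can only read a XOR b
    (assumed to be the sole remaining signal), with b fixed at `b_value`."""
    a = np.array([0, 1, 0, 1])
    b = np.full_like(a, b_value)
    xor_signal = 2 * (a ^ b) - 1           # -1/+1 along the XOR direction
    preds = (xor_signal > 0).astype(int)   # probe fit on b = 0, where XOR equals a
    return float((preds == a).mean())


print("train distribution (b=0):", xor_probe_accuracy(0))
print("test distribution  (b=1):", xor_probe_accuracy(1))
```

When b = 0 the XOR signal coincides with a, so the probe looks perfect; when b = 1 the same signal is exactly inverted, so every prediction is wrong rather than merely at chance.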
Concretely, if we were to prepare a dataset of 2-token prompts where the first word is always “true” or “false” and the second word is always “banana” or “shed,” do you predict that a probe trained with logistic regression on the dataset will have poor accuracy when tested on ?
These datasets are incredibly tiny (size two) so I’m worried about noise, but let’s say you pad the prompts with random sentences from some dataset to get larger datasets.
If you used normalization to remove the direction, then yes, that’s what I’d predict. Without normalization I predict high test accuracy.
(Note there’s a typo in your test dataset—it should be .)
Fwiw I don’t think the main paper would have been much shorter if we’d aimed to write a blog post instead, unless we changed our intended audience. It’s a sufficiently nuanced conceptual point that you do need most of the content that is in there.
We could have avoided the appendices, but then we’re relying on people to trust us when we make a claim that something is a theorem, since we’re not showing the proof. We could have avoided implementing the examples in a real codebase, though I do think iterating on the examples in actual code made them better, and also people wouldn’t have believed us when we said you can solve this with deep RL (in fact even after we actually implemented it some people still didn’t believe me, or at least were very confused, when I said that).
Iirc I was more annoyed by the peer reviews for similar reasons to what you say.
(Btw you can see some of my thoughts on this topic in the answer to “So what does academia care about, and how is it different from useful research?” in my FAQ.)