This section in Anthropic’s work on Induction heads seems highly relevant. I would be interested in seeing an extension of your analysis that looks at what induction heads do in these tasks. If we believe the claims in that paper, then in-context learning of any kind seems to be driven by a fairly simple mechanism not unlike kNN: induction attention heads. Since it’s pretty tractable to locate induction heads in an automated way, we could potentially take a look at the actual mechanism being used to implement these predictions and verify/falsify the hypotheses you make about how GPT makes these predictions. (Although you’d probably have to switch to an open-source model.)
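To make "locate induction heads in an automated way" a bit more concrete, here is a minimal sketch of one possible diagnostic. The setup is my assumption rather than the paper's exact procedure: feed the model a block of random tokens repeated twice, then score each head by how much attention a token in the second copy pays to the position right after that token's first occurrence. All function names and the toy attention patterns are hypothetical; with a real open-source model you would substitute per-head attention matrices extracted from it.

```python
def induction_score(attn, period):
    """Mean attention from second-copy queries to their induction target.

    attn[i][j] is the weight one head puts on key position j from query
    position i, measured on `period` random tokens repeated twice.
    The induction target for query i is i - period + 1: the position just
    after the earlier occurrence of the token at i.
    """
    seq_len = len(attn)
    if seq_len != 2 * period:
        raise ValueError("expected a sequence of exactly two repeats")
    targets = [attn[i][i - period + 1] for i in range(period, seq_len)]
    return sum(targets) / len(targets)


def perfect_induction_pattern(period):
    """Toy head: second-copy queries put all weight on the induction target."""
    n = 2 * period
    attn = [[0.0] * n for _ in range(n)]
    for i in range(n):
        target = i - period + 1 if i >= period else 0
        attn[i][target] = 1.0
    return attn


def uniform_causal_pattern(period):
    """Toy head: each query spreads weight evenly over itself and the past."""
    n = 2 * period
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)]
            for i in range(n)]


print(induction_score(perfect_induction_pattern(4), 4))  # 1.0
print(induction_score(uniform_causal_pattern(4), 4))     # much lower
```

Heads scoring close to 1 on this kind of input would be the candidates to inspect on the actual tasks. The threshold to use is a judgment call that this sketch does not settle.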
Thanks for the link. This has been on my reading list for a little bit and your recco tipped me over.
Mostly I agree with Paul’s concerns about this paper.
However, I did find the “Transformer Feed-Forward Layers Are Key-Value Memories” paper they reference more interesting: it’s more mechanistic, and their results are pretty encouraging. I would personally highlight that one more, as it’s IMO stronger evidence for the hypothesis, although not conclusive by any means. Some experiments they show:
The top-k activating prefixes for individual ‘keys’ do seem to share coherent patterns, and as we move up the layers, these patterns become less shallow and more semantics-driven. (Granted, it’s not clear how good the methodology there is: to qualify as a pattern, something needs to occur in 3 of the top-25 prefixes, and there are 3.6 patterns on average in each key. But this is curious enough to keep looking into.)
The ‘value’ distributions corresponding to the keys are in fact somewhat predictive of the actual next word for those top-k prefixes, and exhibit a kind of ‘calibration’: while the distributions themselves aren’t actually calibrated, they are more correct when they assign a higher probability.
I also find it very intriguing that you can just decode the value distributions using the embedding matrix a la Logit Lens.
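As a minimal sketch of what decoding a value vector with the embedding matrix amounts to: take the dot product of the value vector with every token embedding, then softmax the resulting scores into a distribution over the vocabulary. The numbers are a toy and the names are hypothetical; this is not the paper's code.

```python
import math


def decode_value(value_vec, embedding, vocab):
    """Project a value vector onto token embeddings and rank the tokens.

    embedding[t] is the embedding of vocab[t]. Returns (token, probability)
    pairs sorted from most to least likely.
    """
    logits = [sum(e * v for e, v in zip(row, value_vec)) for row in embedding]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [x / total for x in exps]
    return sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional embeddings, invented purely for illustration.
vocab = ["cat", "dog", "tree"]
embedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
value_vec = [2.0, 0.5]

for token, prob in decode_value(value_vec, embedding, vocab):
    print(f"{token}: {prob:.3f}")
```

Here the value vector points mostly along the "cat" embedding, so "cat" comes out on top. The interesting empirical claim is that doing this with real value vectors yields distributions that are somewhat predictive of the next word.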
Thanks for a great post.
One nice point that this post makes (which I suppose was also prominent in the talk, but I can only guess, not being there myself) is that there’s a kind of progression we can draw (simplifying a little):
- Human specifies what to do (Classical software)
- Human specifies what to achieve (RL)
- Machine infers a specification of what to achieve (IRL)
- Machine collaborates with human to infer and achieve what the human wants (Assistance games)
Towards the end, this post describes an extrapolation of this trend:
- Machine and human collaboratively figure out what the human even wants to do in the first place.
‘Helping humans figure out what they want’ is a deep, complex and interesting problem, and I’d love it if more folks were thinking through what solutions to it ought to look like. This seems particularly urgent because human motivations can be affected even by algorithms that were not designed to solve this problem—for example, think of recommender systems shaping their users’ habits—and which therefore aren’t doing what we’d want them to do.
Another nice point is the connection between ML algorithm design and HCI. I’ve been meaning to write something looking at RL as a ‘technique for communicating and achieving human intent’ (and, as a corollary, at AI safety as a kind of human-centred algorithm design), but it seems that I’ve been scooped by Michael :)
I note that not everyone sees RL from this frame. Some RL researchers view it as a way of understanding intelligence in the abstract, without connecting reward to human values.
One thing I’m a little less sure of is the conclusion you draw from your examples of changing intentions. While the examples convince me that the AI ought to have some sophistication about the human’s intentions—for example, being aware that human intentions can change—it’s not obvious that the right move is to ‘pop out’ further and assume there is something ‘bigger’ that the human’s intentions should be aligned with. Could you elaborate on your vision of what you have in mind there?
Thanks for the post and writeup, and good work! I especially appreciate the short, informal explanation of what makes this work.
Given my current understanding of the proposal, I have one worry which makes me reluctant to share your optimism about this being a solution to inner alignment:
The scheme doesn’t protect us if somehow all top-n demonstrator models have correlated errors. This could happen if they are coordinating, or more prosaically if our way to approximate the posterior leads to such correlations. The picture I have in my head for the latter is that we train a big ensemble of neural nets and treat a random sample from that ensemble as a random sample from the posterior, although I don’t know if that’s how it’s actually done.
A lot of the work is done by the assumption that the true demonstrator is in the posterior, which means that at least one of the top-performing models will not have the same correlated errors. But I’m not sure how true this assumption will be in the neural-net approximation I describe above. I worry about inner alignment failures because I don’t really trust the neural net prior, and I can imagine training a bunch of neural nets to have correlated weirdnesses about them (in part because of the neural net prior they share, and in part because of things like Adversarial Examples Are Not Bugs, They Are Features). As such it wouldn’t be that surprising to me if it turned out that ensembles have certain correlated errors, and in particular don’t really represent anything like the demonstrator.
I do feel safer using this method than I would deferring to a single model, so this is still a good idea on balance. I just am not convinced that it solves the inner alignment problem. Instead, I’d say it ameliorates its severity, which may or may not be sufficient.
You need much more than limiting behavior to say anything about whether or not the processes are ‘similar’ in a useful way before that.
Perhaps the synthesis here is that while looking at asymptotic behaviour of a simpler system can be supremely useful, we should be surprised that it works so well. To rely on this technique in a new domain we should, every time, demonstrate that it actually works in practice.
Also, it’s interesting that many of these examples do have ‘pathological cases’ where the limit doesn’t match practice. And this isn’t necessarily restricted to toy domains or weird setups: for example, the most asymptotically efficient matrix multiplication algorithms are impractical (although in fairness that’s the most compelling example on that page).
More than a year since writing this post, I would still say it represents the key ideas in the sequence on mesa-optimisation which remain central in today’s conversations on mesa-optimisation. I still largely stand by what I wrote, and recommend this post as a complement to that sequence for two reasons:
First, skipping some detail allows it to focus on the important points, making it better-suited than the full sequence for obtaining an overview of the area.
Second, unlike the sequence, it deemphasises the mechanism of optimisation, and explicitly casts it as a way of talking about goal-directedness. As time passes, I become more and more convinced that it was a mistake to call the primary new term in our work ‘mesa-optimisation’. Were I to be choosing the terms again, I would probably go with something like ‘learned goal-directedness’, though it is quite a mouthful.
Not Abram, and I have only skimmed the post so far, and maybe you’re pointing to something more subtle, but my understanding is this:
In Stuart’s original use, ‘No Indescribable Hellworlds’ is the hypothesis that in any possible world in which a human’s values are violated, the violation is describable: one can point out to the human how her values are violated by the state of affairs.
Analogously, debate as an approach to alignment could be seen as predicated on a similar hypothesis: that in any possible flawed argument, the flaw is describable: one can point out to a human how the argument is flawed.
Edited to add: The additional claim in the Hellworlds section is that acting according to the recommendations of debate won’t lead to very bad outcomes, or at least not to ones which could be pointed out. For example, we can imagine a debate around the question “Should we enact policy X?”. A very strong argument, if it can be credibly argued, is “Enacting policy X leads to an unacceptable violation Y of your values down the line”. So, debate will only recommend policy X if no such arguments are available.
I’m not sure to what extent I buy this additional claim. For example, suppose that a system trained via debate, once actually deployed, doesn’t get asked questions like ‘Should we enact policy X?’ but instead more specific things like ‘How much does policy X improve metric Y?’. Then unless debaters are incentivised to challenge the question’s premises (“The Y metric would improve, but you should consider also the unacceptable effect on Z”), we could use debate and still get hellworlds.
Thanks for writing this.
I wish you had included an entry for your definition of ‘mesa-optimizer’. When you use the term, do you mean the definition from the paper* (an algorithm that’s literally doing search using the mesa objective as the criterion), or do you speak more loosely (e.g., a mesa-optimizer is an optimizer in the same sense as a human is an optimizer)?
A related question is: how would you describe a policy that’s a bag of heuristics which, when executed, systematically leads to interesting (low-entropy) low-base-objective states?
*incidentally, looking back on the paper, it doesn’t look like we explicitly defined things this way, but it’s strongly implied that that’s the definition, and appears to be how the term is used on AF.
Good point—I think I wasn’t thinking deeply enough about language modelling. I certainly agree that the model has to learn in the colloquial sense, especially if it’s doing something really impressive that isn’t well-explained by interpolating on dataset examples—I’m imagining giving GPT-X some new mathematical definitions and asking it to make novel proofs.
I think my confusion was rooted in the fact that you were replying to a section that dealt specifically with learning an inner RL algorithm, and the above sense of ‘learning’ is a bit different from that one. ‘Learning’ in your sense can be required for a task without requiring an inner RL algorithm; or at least, whether it does isn’t clear to me a priori.
I am quite confused. I wonder if we agree on the substance but not on the wording, but perhaps it’s worthwhile talking this through.
I follow your argument, and it is what I had in mind when I was responding to you earlier. If approximating π∗(o_t) within the constraints requires computing f(o_t), then any policy that approximates π∗ must compute f(o_t). (Assuming appropriate constraints that preclude the policy from being a lookup table precomputed by SGD; not sure if that’s what you meant by “other similar”, though this may be trickier to do formally than we take it to be).
My point is that for f = ‘learning’, I can’t see how anything I would call learning could meaningfully happen inside a single timestep. ‘Learning’ in my head is something that suggests non-ephemeral change; and any lasting change has to feed into the agent’s next state, by which point SGD would have had its chance to make the same change.
Could you give an example of what you mean (this is partially why I wanted to taboo learning)? Or, could you give an example of a task that would require learning in this way? (Note the within-timestep restriction; without that I grant you that there are tasks that require learning).
I interpreted your previous point to mean you only take updates off-policy, but now I see what you meant. When I said you can update after every observation, I meant that you can update once you have made an environment transition and have an (observation, action, reward, observation) tuple. I now see that you meant the RL algorithm doesn’t have the ability to update on the reward before the action is taken, which I agree with. I think I still am not convinced, however.
And can we taboo the word ‘learning’ for this discussion, or keep it to the standard ML meaning of ‘update model weights through optimisation’? Of course, some domains require responsive policies that act differently depending on what they observe, which is what Rohin observes elsewhere in these comments. In complex tasks on the way to AGI, I can see the kind of responsiveness required become very sophisticated indeed, possessing interesting cognitive structure. But it doesn’t have to be the same kind of responsiveness as the learning process of an RL agent; and it doesn’t necessarily look like learning in the everyday sense of the word. Since the space of things that could be meant here is so big, it would be good to talk more concretely.
You can’t update the model based on its action until it’s taken that action and gotten a reward for it.
Right, I agree with that.
Now, I understand that you argue that if a policy were to learn an internal search procedure, or an internal learning procedure, then it could predict the rewards it would get for different actions. It would then pick the action that scores best according to its prediction, thereby ‘updating’ based on returns it hasn’t yet received, and actions it hasn’t yet made. I agree that it’s possible this is helpful, and it would be interesting to study existing meta-learners from this perspective (though my guess is that they don’t do anything so sophisticated). It isn’t clear to me a priori that, from the point of view of the policy, this is the best strategy to take.
But note that this argument means that to the extent learned responsiveness can do more than the RL algorithm’s weight updates can, that cannot be due to recurrence. If it was, then the RL algorithm could just simulate the recurrent updates using the agent’s weights, achieving performance parity. So for what you’re describing to be the explanation for emergent learning-to-learn, you’d need the model to do all of its learned ‘learning’ within a single forward pass. I don’t find that very plausible—or rather, whatever advantageous responsive computation happens in the forward pass, I wouldn’t be inclined to describe as learning.
You might argue that today’s RL algorithms can’t simulate the required recurrence using the weights—but that is a different explanation to the one you state, and essentially the explanation I would lean towards.
if taking actions requires learning, then the model itself has to do that learning.
I’m not sure what you mean when you say ‘taking actions requires learning’. Do you mean something other than the basic requirement that a policy depends on observations?
I’ve thought of two possible reasons so far.
Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan’s mesa-optimization post, just replacing search with RL.
More likely perhaps (based on my understanding), the outer RL algorithm has a learning rate that might be too slow, or is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its learning rate to improve performance.
I would be more inclined towards a more general version of the latter view, in which gradient updates just aren’t a very effective way to track within-episode information.
The central example of learning-to-learn is a policy that effectively explores/exploits when presented with an unknown bandit from within the training distribution. An optimal policy essentially needs to keep track of sufficient statistics of the reward distributions for each action. If you’re training a memoryless policy for a fixed bandit problem using RL, then the only way of tracking the sufficient stats you have is through your weights, which are changed through the gradient updates. But the weight-space might not be arranged in a way that’s easily traversed by local jumps. On the other hand, a meta-trained recurrent agent can track sufficient stats in its activations, traversing the sufficient statistic space in whatever way it pleases—its updates need not be local.
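To illustrate what "tracking sufficient statistics outside the weights" can look like, here is a minimal sketch for a Bernoulli bandit. The per-arm success/failure counts play the role of the recurrent agent's activations: they change within the episode while the decision rule itself stays fixed. I use Thompson sampling as a hand-written stand-in for the rule; that is my assumption for illustration, not a claim about what a meta-trained agent actually learns, and all names are hypothetical.

```python
import random


def thompson_step(stats, rng):
    """Choose an arm by sampling each arm's Beta posterior.

    stats[a] = [successes, failures] is the sufficient statistic for arm a.
    """
    samples = [rng.betavariate(s + 1, f + 1) for s, f in stats]
    return max(range(len(stats)), key=samples.__getitem__)


def run_episode(arm_probs, steps, seed=0):
    """Play one bandit episode; only `stats` changes, never the rule."""
    rng = random.Random(seed)
    stats = [[0, 0] for _ in arm_probs]
    for _ in range(steps):
        arm = thompson_step(stats, rng)
        reward = rng.random() < arm_probs[arm]
        stats[arm][0 if reward else 1] += 1
    return stats


stats = run_episode([0.2, 0.8], steps=500)
for arm, (wins, losses) in enumerate(stats):
    print(f"arm {arm}: pulls={wins + losses}, wins={wins}")
```

The state update here is a jump straight to the new counts, with no requirement that it be a small local step. A memoryless policy trained by gradient descent would have to encode the same information through weight changes.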
This has an interesting connection to MAML, because a converged memoryless MAML solution on a distribution of bandit tasks will presumably arrange the part of its weight-space that encodes bandit sufficient statistics in a way that makes it easy to traverse via SGD. That would be a neat (and not difficult) experiment to run.
I would propose a third reason, which is just that learning done by the RL algorithm happens after the agent has taken all of its actions in the episode, whereas learning done inside the model can happen during the episode.
This is not true of RL algorithms in general: if I want, I can make weight updates after every observation. And yet, I suspect that if I meta-train a recurrent policy using such an algorithm on a distribution of bandit tasks, I will get a ‘learning-to-learn’ style policy.
So I think this is a less fundamental reason, though it is true in off-policy RL.
I had a similar confusion when I first read Evan’s comment. I think the thing that obscures this discussion is the extent to which the word ‘learning’ is overloaded, so I’d vote to taboo the term and use more concrete language.
You might want to look into NMF, which, unlike PCA/SVD, doesn’t aim to create an orthogonal projection. It works well for interpretability because its components cannot cancel each other out, which makes its features more intuitive to reason about. I think it is essentially what you want, although I don’t think it will allow you to find directly the ‘larger set of almost orthogonal vectors’ you’re looking for.
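For a feel of why NMF components cannot cancel each other out, here is a minimal pure-Python sketch using the standard multiplicative update rules for squared error. It is a toy for illustration only: the matrix is invented, the names are hypothetical, and in practice you would reach for a library implementation. Because every update multiplies a non-negative entry by a non-negative ratio, the factors never go negative.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frob_error(v, w, h):
    """Squared Frobenius distance between v and the product w @ h."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor non-negative v (n x m) as w (n x k) times h (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


# Rows 3 and 4 are non-negative combinations of rows 1 and 2.
v = [[1, 0, 2], [0, 3, 1], [2, 3, 5], [1, 3, 3]]
w, h = nmf(v, k=2)
print("reconstruction error:", frob_error(v, w, h))
```

Each row of `h` is a component, and each row of `v` is rebuilt purely by adding components together with non-negative weights. That additivity is what tends to make the features easier to reason about than PCA directions.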
I think we basically agree. I would also prefer people to think more about the middle case. Indeed, when I use the term mesa-optimiser, I usually intend to talk about the middle picture, though strictly that’s sinful as the term is tied to Optimisers.
Re: inner alignment
I think it’s basically the right term. I guess in my mind I want to say something like, “Inner Alignment is the problem of aligning objectives across the Mesa≠Base gap”, which shows how the two have slightly different shapes. But the difference isn’t really important.
Inner alignment gap? Inner objective gap?
I’m not talking about finding an optimiser-less definition of goal-directedness that would support the distinction. As you say, that is easy. I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.
As a side note I think the role of the intentional stance here is more subtle than I see it discussed. The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at least, the processes in the brain that resemble the role played by goals in the intentional stance picture), and we experience goals and motivations directly in ourselves. So, there is more to the concepts than just taking an interpretative stance, though of course to the extent that the concepts (even when refined by neuroscience) are pieces of a model being used to understand the world, they will form part of an interpretative stance.
I understand that, and I agree with that general principle. My comment was intended to be about where to draw the line between incorrect theory, acceptable theory, and pre-theory.
In particular, I think that while optimisation is too much theory, goal-directedness talk is not, despite being more in theory-land than empirical malign generalisation talk. We should keep thinking of worries on the level of goals, even as we’re still figuring out how to characterise goals precisely. We should also be thinking of worries on the level of what we could observe empirically.
We’re probably in agreement, but I’m not sure what exactly you mean by “retreat to malign generalisation”.
For me, mesa-optimisation’s primary claim isn’t (call it Optimisers) that agents are well-described as optimisers, which I’m happy to drop. It is the claim (call it Mesa≠Base) that whatever the right way to describe them is, in general their intrinsic goals are distinct from the reward.
That’s a specific (if informal) claim about a possible source of malign generalisation. Namely, that when intrinsic goals differ arbitrarily from the reward, then systems that competently pursue them may lead to outcomes that are arbitrarily bad according to the reward. Humans don’t pose a counterexample to that, and it seems prima facie conceptually clarifying, so I wouldn’t throw it away. I’m not sure if you propose to do that, but strictly, that’s what “retreating to malign generalisation” could mean, as malign generalisation itself makes no reference to goals.
One might argue that until we have a good model of goal-directedness, Mesa≠Base reifies goals more than is warranted, so we should drop it. But I don’t think so – so long as one accepts goals as meaningful at all, the underlying model need only admit a distinction between the goal of a system and the criterion according to which a system was selected. I find it hard to imagine a model or view that wouldn’t allow this – this makes sense even in the intentional stance, whose metaphysics for goals is pretty minimal.
It’s a shame that Mesa≠Base is so entangled with Optimisers. When I think of mesa-optimisation, I tend to think more about the former than about the latter. I wish there was a term that felt like it pointed directly to Mesa≠Base without pointing to Optimisers. The Inner Alignment Problem might be it, though it feels like it’s not quite specific enough.