Redwood Research
Alex Mallen
(Thanks to @Aryan Bhatt for nudging me to return to this thread)
So I understand your basic argument to be that the best possible linear algebra textbook (or string of any kind) wouldn’t be good enough to get a previously-unfamiliar LLM to understand linear algebra within a forward pass well enough to make continual progress on it. I agree this would be sufficient to rule out LLM continual learning of linear algebra. But:
For one, I still think it’s plausible that such a string would exist for some plausible near-future LLM (though perhaps a bit less now than when I left off). RL can teach the AI how to produce text that makes optimal use of the AI’s early layers to quickly gain understanding of the desired concept for future use (e.g., it might gain an aptitude for constructing analogies tailored to the AI itself). It can produce a bunch of attempted explanations so that future forward passes have more to choose from. Simultaneously, RL can teach the AI to notice when it’s failing to understand a concept based on some bit of text, and to try looking at some other bit of text for a more helpful encapsulation. In general, I think it’s hard to rule out the existence of some mechanism by which continual learning might be implemented, especially when applying large amounts of optimization to arbitrarily expressive systems (LLM + CoT). This might be really messy and extremely inefficient, but so is human linear-algebra progress, and it sufficed anyway because the field of linear algebra mostly consists of a handful of (important!) contributions.
In addition, I think we might not need all-of-linear-algebra-deep continual learning to fully automate AI R&D, such that the current paradigm might suffice (I’m actually not sure if you disagree with this claim). AIs don’t need to learn that much continually because they’ve already learned so much from (pre-)training. At the current point in the tech tree, there is so much low-hanging fruit in AI capabilities research that requires very little expansion of the frontier of understanding. We already have TCS, stats, ML, DL, RL, performance engineering, etc., with quite developed ontologies (sufficiently developed that they’re not the bottleneck).
I wonder how much the aversion comes from reckoning with the more visceral thought of having to spend a ton of money on donations, rather than the more abstract thought of “getting no surplus money”. Spending money on donations is often stressful and/or can feel like losing a lot of money! (cf. loss aversion)
This also affected Claude Opus 4.7.
I think it’s quite likely that the CoT was relevant to what the reward model was looking for. For one, it’s pretty natural to include some aspects of the model spec in the reward model rubric when judging the trajectory. The model spec would naturally penalize a bunch of stated intent to misbehave in the CoT. And even if the reward model was solely looking for task completion, it’s pretty natural for it to penalize stated intent to cheat on the task or do some other unintended thing.
Can you clarify what you mean by “terminal power-seeking”? Some things I can imagine:
1. A cognitive pattern that terminally wants to have long-term power, and therefore plays the training game (IMO the most straightforward interpretation, and the one I most agree with).
2. A cognitive pattern that terminally pursues power-on-the-episode because this is useful for scoring well on the task. This is what you seem to be pointing at with the Vending-Bench example. (Note that this is an imperfect fit on its own.)
(1) and (2) are importantly different because only (1) motivates training-gaming. I think there’s a reasonable path-dependent case to be made that (2) eventually generalizes to (1), but they entail fairly different behaviors so they’re important to distinguish.
My understanding (you can correct me) is that information can never travel from later layers to earlier layers, e.g. information cannot travel from token location 12 at layer 7 to token location 72 at layer 4. Right?
That’s right. This imposes a strict serial depth limit within a forward pass.
But autoregressive sampling removes this serial depth limit. Information can flow from later layers to earlier layers by sampling tokens and feeding them back into the input. And a smart AI could choose tokens that communicate learnings from later layers in text (e.g. “You can think of a matrix as a linear transformation… <more explanation>.”), and then the early layers reading in this text can quickly make sense of this synthesis of the AI’s new insight, and the early-layer KV cache on the final “.” token can contain a rich representation capturing the new understanding about matrices. Forming broadly-useful early-layer representations of concepts introduced in-context seems like the kind of thing that’s useful for predicting pre-training documents.
The main point of reasoning models is to break the curse of the within-forward-pass serial depth limit via lots of autoregressive sampling. This massively and usefully improves expressivity and I think it makes continual learning plausible.
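As a minimal runnable sketch of this loop (a random-logits stand-in for the transformer, since only the information flow matters here):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16

def model(tokens):
    """Stand-in for a transformer forward pass. Within one such call,
    information only flows from earlier layers to later layers (and from
    earlier token positions to later ones)."""
    local = np.random.default_rng(abs(hash(tuple(tokens))) % (2**32))
    return local.normal(size=VOCAB)  # next-token logits at the last position

def generate(tokens, n_steps):
    for _ in range(n_steps):
        logits = model(tokens)            # produced by the *last* layer
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        next_token = int(rng.choice(VOCAB, p=probs))
        # Feeding the sampled token back in re-enters the computation at the
        # embedding layer, so last-layer information from this step is visible
        # to *early* layers at every later step: the serial-depth limit holds
        # per forward pass, not per generation.
        tokens = tokens + [next_token]
    return tokens

print(generate([1, 2, 3], n_steps=5))
```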
I think this is a very important hypothesis but I disagree with various parts of the analysis.
Probably the heuristics that are actually driving the long-horizon goal-directed behaviors of the model are going to be whatever parts of the model arise from the long-horizon goal-directed capabilities training.
I think this is an important observation, and is the main thing I would have cited for why the hypothesis might be true. But I think it’s plausible that the AI’s capabilities here could be separated from its propensities by instrumentalizing the learned heuristics to aligned motivations. I can imagine that doing inoculation prompting and a bit of careful alignment training at the beginning and end of capabilities training could make it so that all of the learned heuristics are subservient to corrigible motivations—i.e., so that when the heuristics recommend something that would be harmful or lead to human disempowerment, the AI would recognize this and choose otherwise.
On the other hand, if you train an AI model from the ground up with a hypothetical “perfect reward function” that always gives correct ratings to the behaviour of the AI (and you trained on a distribution of tasks similar to the one you are deploying it on), then I would guess that this AI, at least until around the human range, will behaviorally basically act according to the reward function.
Even if the AI had a perfect behavioral reward function during capabilities-focused training, that wouldn’t provide much pressure towards motivations that don’t take over. During training to be good at e.g. coding problems, even if there’s no reward-hacking going on, the AI might still develop coding-related drives that don’t care about humanity’s continued control, since humanity’s continued control is not at stake during that training (this is especially relevant when the AI is saliently aware that it’s in a training environment isolated from the world—i.e. inner misalignment). Then when it’s doing coding work in the world that actually does have consequences for human control, it might not care. (Also note that generalizing “according to the reward function” is importantly underspecified.)
So can we align an arbitrary model by training it to say “I’m a nice chatbot, I wouldn’t cause any existential risk, …”? Seems like obviously not, because the model will just learn the domain-specific / shallow property of outputting those particular tokens in that particular situation.
This type of training (currently) does actually generalize to other propensities to some extent in some circumstances. See emergent misalignment. I think this is plausibly also a large fraction of how character training works today (see “coupling” here).
Thanks for the detailed response and analogy, that’s helpful. I agree that current LLMs are bad at continual learning and would fail at making held-out linear algebra progress. My claim is that it’s plausible that naive continued scaling will lead to real continual learning.
I disagree that your continual linear algebra progress will necessarily look like gobbledygook to each new forward pass.
One way to think of it is that there isn’t that much of a principled distinction between weight updates and updates to the KV cache (i.e., long-context activations) from the perspective of a forward pass on the next autoregressive token. You could imagine that the AI comes up with the concept of a matrix and writes down a pedagogical description of the concept in context at time t_1. Then at time t_2 it wants to use and build on the concept of a matrix. It’s quite plausible that early layers at time t_2 could just query its KV cache from time t_1 to get a rich representation of “matrix” to work with. The KV cache here essentially serves the same function as updated weights. The AI could in principle continue to make rich early-layer representations of more new concepts by autoregressively reflecting on them as they come up. And then it can query the relevant ones in future contexts.
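One concrete way to see the analogy: for softmax attention the correspondence is loose, but for linear (unnormalized) attention it is exact, in that reading from the KV cache is the same computation as applying a weight matrix the context “wrote” (cf. the “fast weights” view of linear transformers). A toy numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
K = rng.normal(size=(5, d))    # cached keys from 5 earlier positions
V = rng.normal(size=(5, d))    # cached values from those positions
q = rng.normal(size=d)         # query issued by a later forward pass

# Linear-attention read from the KV cache: sum_i (k_i . q) * v_i
attn_out = V.T @ (K @ q)

# The same read, written as applying a weight matrix the context "wrote":
W = sum(np.outer(v, k) for k, v in zip(K, V))   # sum_i v_i k_i^T
assert np.allclose(attn_out, W @ q)
# Appending a new (k, v) pair to the cache is exactly a rank-1 additive
# update to W: a "weight update" performed by writing tokens into context.
```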
Now, there’s a question of how well they’ll actually be able to construct these new early-layer representations. I think current models are bad at this, and it’s not clear to me that pretraining would build such circuitry. RL can select for this kind of continual learning circuitry, but it’s quite inefficient. So, more intentional continual learning algorithms might end up being necessary before automating AI R&D (and we’ll obviously see different learning algorithms after). But it’s at least plausible that autoregressive LLMs could continual-learn.
When the imitation learning is done, the transformer weights are frozen, and the corresponding trained model is given the impossible task of using only its activations, with fixed weights, to imitate what happens when the target continual learning algorithm changes its weights over millions of steps of (in this case) TD learning.
I think it’s important that the AI doesn’t need to do all of its continual learning in the activations of a single forward pass: It can autoregressively generate tokens too. This way it can leave notes to itself, like “next time I run into a bug like this, I should look in this place first.” And in fact, this seems pretty continuous with stuff we already see. (And reinforcement learning can make this kind of continual learning within a context window more likely.) In other words, the expressivity of autoregressive LLMs is much larger than a single forward pass, making continual learning from long contexts super plausible.
TBC this doesn’t contradict your main point about imitation learning, and is mainly meant to push back on the narrative that transformers would need to be able to simulate long continual learning processes in a single forward pass in order to implement continual learning, and meant to convey why I think LLMs will plausibly be able to implement continual learning via longer context despite your arguments.
My main takeaway from your post is that naively training LLMs to imitate the behaviors of continually-learning policies (e.g., humans) who don’t leave externalized traces of their continual learning process is unlikely to work. (And I believe this is your main point.)
Let’s say the current policy has a 90% chance of cooperating. Then, what action results in the highest expected reward for player 1 (and in turn, gets reinforced the most on average)? Player 1 sampling “defect” leads to a higher reward for player 1 whether or not player 2 samples “cooperate” (strategic dominance), and there’s a 90% chance of player 2 sampling “cooperate” regardless of player 1’s action because the policy is fixed (i.e., player 1 cooperating is no evidence of player 2 cooperating, so it’s not the case that reward tends to be higher for player 1 when player 1 cooperates as a result of player 2 tending to cooperate more in those cases). Therefore, “defect” actions tend to get reinforced more.
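To make the expected-value comparison concrete, here’s the arithmetic with the standard prisoner’s-dilemma payoff values (my numbers for illustration, not from the original exchange):

```python
# Expected reward for player 1 against a fixed policy that cooperates 90% of
# the time. Payoffs are the standard PD values, assumed for illustration:
# T=5 (defect vs. cooperate), R=3 (both cooperate),
# P=1 (both defect),          S=0 (cooperate vs. defect).
p_coop = 0.9
T, R, P, S = 5, 3, 1, 0

ev_cooperate = p_coop * R + (1 - p_coop) * S   # 0.9*3 + 0.1*0 = 2.7
ev_defect    = p_coop * T + (1 - p_coop) * P   # 0.9*5 + 0.1*1 = 4.6

# Player 2's sample doesn't depend on player 1's action, so defect has
# strictly higher expected reward and tends to get reinforced more.
print(ev_cooperate, ev_defect)
```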
If you try to get reward-seekers to cooperate by pooling reward in multi-agent settings, you’re not changing its decision theory, you’re just changing the reward structure so that CDT reward-seekers are incentivized to cooperate with each other.
Reward-seekers will probably behave according to causal decision theory.
Background: There are existing arguments to the effect that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is: Most RL algorithms search for policies by selecting for actions that cause the highest reward. E.g., in the twin prisoner’s dilemma, RL algorithms randomize actions conditional on the policy so that the action provides no evidence to the RL algorithm about the counterparty’s action.) This doesn’t imply RL produces CDT reward-maximizing policies: CDT behavior on the training distribution doesn’t imply CDT generalization because agents can fake CDT in the same way that they can fake alignment, or might develop arbitrary other propensities that were correlated with reward on the training distribution.
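Here’s a toy simulation of that parenthetical: a from-scratch REINFORCE loop on the twin prisoner’s dilemma, with the same assumed payoffs as above. Both “twins” share one policy parameter, but the update credits each player’s reward only to its own sampled action, so cooperation decays:

```python
import numpy as np

rng = np.random.default_rng(0)
T, R, P, S = 5, 3, 1, 0            # assumed standard PD payoffs

def payoff(a_self, a_other):       # 1 = cooperate, 0 = defect
    return [[P, T], [S, R]][a_self][a_other]

theta = 2.0                        # shared cooperation logit (p ~ 0.88)
lr = 0.01
for _ in range(5000):
    p = 1 / (1 + np.exp(-theta))
    a1 = int(rng.random() < p)     # the twins share theta, but each action
    a2 = int(rng.random() < p)     # is sampled independently given theta
    r1 = payoff(a1, a2)
    # REINFORCE: grad_theta log pi(a1) = a1 - p. Player 1's reward is
    # credited only to player 1's own sampled action; that a2 came from the
    # same policy contributes no gradient signal (CDT-like credit assignment).
    theta += lr * (a1 - p) * r1

print(1 / (1 + np.exp(-theta)))    # near 0: defection gets reinforced
```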
But conditional on reward-on-the-episode seeking, the AI is likely to generalize according to CDT.
If, for example, a reward-seeker tried to evidentially cooperate between episodes (so it had non-zero regard for reward that isn’t used to reinforce its current actions), this would be trained away because the AI would be willing to give up reward on the current episode to some extent. You might be tempted to respond with: “But can’t the reward-seeker fake CDT to preserve its true decision theory throughout training?” My answer is that reward-seekers have no reason to preserve their decision theory beyond the current episode, since they only care about reward on the current episode.
One way to think of it is that reward-seeking is the hypothesis in which the learned policy inherits its generalization propensities most directly from the RL algorithm (the hypothesis in which “reward is the optimization target” is most true), so it also inherits CDT behavior from the RL algorithm.
A similar argument for CDT goes for return-on-the-action seekers. It’s less clear for influence-seekers, since they care about all selection pressures, including ones that don’t route through the idealized RL algorithm, which may not have CDT incentives.
This isn’t to say that their decision theory will always be CDT[1]. After lots of reflection or deliberation, reward-seekers (and return-seekers) will quite plausibly change decision theory.
[1] It also doesn’t imply that reward-seekers will endorse CDT in philosophy discussions. E.g., they might expect to get rewarded for endorsing EDT.
Yes! I’m quite excited by this proposal and I currently plan to write more about it and study it empirically. The basic idea is to try to make AIs’ reward-hacking more responsive to satiation.
My overall sense is that this behavioral testing will generally be hard. It will probably be a huge mess if we’re extremely rushed and need to do all of this in a few months.
Why can’t we do a bunch of the work for this ahead of time? E.g., creating high-effort evaluation datasets for reward models.
All of these seem very useful to me (except maybe robustness).
Can you clarify what’s meant by robustness? You mentioned robustness to weight updates—this seems potentially bad because it involves making the AI incorrigible. Under robustness I’d have said: having the AI’s persona be stable under long serial reasoning/long contexts/when AI agents are communicating a bunch in the deployment.
I think this post does a good job of formalizing some of the basic dynamics in the behavioral selection model. Especially: it does a good job of formalizing the intuition for why generally-applicable cognitive patterns (e.g., reward-seeking) might be favored by RL on diverse environments (if updates to it actually transfer across environments).
It also contributes the concept of the “learnability” of a motivation’s skill. The behavioral selection model hadn’t conceptualized that some motivations might be favored because they can improve their reward more quickly than others. E.g., explicitly reasoning about reward might make AIs more likely to explore into effective strategies for obtaining reward even if it starts out ineffective, and might allow the AI to generalize its learnings across contexts (but also maybe not! Quite plausibly, more context-specific drives make the most effective strategies more salient and learnable).
Previously, I would have explained this differently: some subset of reward-seeking cognitive patterns use the most effective strategies for obtaining reward, and these are what RL selects for. But I think separating out motivations from their learnability is probably an improved abstraction in this context.
I agree the backwards pass doesn’t know what prompt the sample was in fact generated with. The claim is that if you do recontextualization, the reward hack is more likely to be unrelated to the inoculation prompt (like how insulting the user is unrelated to “don’t hack the test cases”; except RL probably wouldn’t select for insulting the user).
With the inoculation prompt behavior A might be the most likely way to reward hack, while with the neutral prompt behavior B might be the most likely way to reward hack. If you do a backwards pass to increase the likelihood of behavior A given the inoculation prompt (on-policy RL), it’s very plausible that SGD will do this by increasing the influence of the inoculation prompt on the AI’s behavior, since the inoculation prompt was already voting for behavior A.
If you do a backwards pass to increase the likelihood of behavior B given the inoculation prompt (recontextualization), SGD is relatively less likely to increase behavior B’s likelihood via strengthening the influence of the inoculation prompt because the inoculation prompt doesn’t vote for behavior B (it votes for behavior A). Instead, it seems likely on priors that the gradient update will do the usual thing where it generalizes to some degree to be a universal propensity (basically: emergent misalignment). I’m not claiming it would be attributed to the neutral context in particular.
An example of intra-agent competition I often use when arguing that long-term motivations tend to win out upon reflection (h/t @jake_mendel): Imagine someone who went to a party last night, got drunk, and now feels terrible and unproductive the next morning.
This person has two competing motivations:
A myopic motivation to have fun and drink
A non-myopic motivation to be productive
There’s an asymmetry: The non-myopic motivation has an incentive to disempower the myopic one (i.e., the next morning the person might want to commit not to drink in the future). Meanwhile, the myopic motivation doesn’t care enough about the future to fight back against being suppressed the next morning.
This creates an unstable situation where the long-term motivation is in theory likely to eventually win out.
In practice, though, you still see plenty of people partying and drinking for long periods of time. Why? My guess is that it’s because completely suppressing the myopic motivation is difficult, and it’s not entirely myopic. Social cues and advertising constantly reinforce it, and the myopic motivation may have become deeply ingrained, making it hard to dislodge even with deliberate effort. People might also develop non-myopic versions of the motivation, e.g. by incorporating partying into their identity and social image.
I think I propose a reasonable starting point for a definition of selection in a footnote in the post:
You can try to define the “influence of a cognitive pattern” precisely in the context of particular ML systems. One approach is to define a cognitive pattern by what you would do to a model to remove it (e.g. setting some weights to zero, or ablating a direction in activation space; note that these approaches don’t clearly correspond to something meaningful, they should be considered as illustrative examples). Then that cognitive pattern’s influence could be defined as the divergence (e.g., KL) between intervened and default action probabilities. E.g.: Influence(intervention; context) = KL(intervention(model)(context) || model(context)). Then to say that a cognitive pattern gains influence would mean that ablating that cognitive pattern now has a larger effect (in terms of KL) on the model’s actions.
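For concreteness, a minimal numpy sketch of that definition, using a toy stand-in model and a direction-ablation intervention (both illustrative assumptions; a real version would hook an LM’s activations and use its next-token distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in model: context vector -> distribution over 8 actions.
W1, W2 = rng.normal(size=(32, 16)), rng.normal(size=(8, 32))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def model(context, ablate_direction=None):
    h = np.tanh(W1 @ context)                  # an internal activation
    if ablate_direction is not None:           # "remove" a cognitive pattern
        d = ablate_direction / np.linalg.norm(ablate_direction)
        h = h - (h @ d) * d                    # by projecting out a direction
    return softmax(W2 @ h)

context = rng.normal(size=16)
direction = rng.normal(size=32)                # hypothesized pattern's direction

p_default = model(context)
p_ablated = model(context, direction)

# Influence(intervention; context) = KL(intervention(model)(context) || model(context))
influence = float(np.sum(p_ablated * np.log(p_ablated / p_default)))
print(influence)
```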
Selection = gaining influence.
Then a schemer is a cognitive pattern that gains influence by pursuing something downstream of gaining influence in its world model (defining its world model is where I think I currently have a worse answer, perhaps because it’s actually a less cleanly-applicable concept to real cognition). Note that the term “schemer” as I’ve just defined it applies to a cognitive pattern, not to an AI. This sidesteps the concern that you might call an AI a schemer even if it doesn’t “care literally 0%” about the consequences of being selected. I agree in practice it’s unlikely for AIs to be purely motivated.
Yes, your understanding matches mine. I’m just saying that LLMs might be able to get by despite the discrete token bottleneck.