Redwood Research
Alex Mallen
(Thanks to @Aryan Bhatt for nudging me to return to this thread)
So I understand your basic argument to be that the best possible linear algebra textbook (or string of any kind) wouldn’t be good enough to get a previously-unfamiliar LLM to understand linear algebra within a forward pass well enough to make continual progress on it. I agree this would be sufficient to rule out LLM continual learning of linear algebra. But:
For one, I still think it’s plausible that such a string would exist for some plausible near-future LLM (though perhaps a bit less now than when I left off). RL can teach the AI how to produce text that makes optimal use of the AI’s early layers to quickly gain understanding of the desired concept for future use (e.g., it might gain an aptitude for constructing analogies tailored to the AI itself). It can produce a bunch of attempted explanations so that future forward passes have more to choose from. Simultaneously, RL can teach the AI to notice when it’s failing to understand a concept based on some bit of text, and to try looking at some other bit of text for a more helpful encapsulation. In general, I think it’s hard to rule out the existence of some mechanism by which continual learning might be implemented, especially when applying large amounts of optimization to arbitrarily expressive systems (LLM + CoT). This might be really messy and extremely inefficient, but so is human linear-algebra progress, and it sufficed anyways because the field of linear algebra mostly consists of a handful of (important!) contributions.
In addition, I think we might not need all-of-linear-algebra-deep continual learning to fully automate AI R&D, such that the current paradigm might suffice (I’m actually not sure if you disagree with this claim). AIs don’t need to learn that much continually because they’ve already learned so much from (pre-)training. At the current point in the tech tree, there is so much low-hanging fruit in AI capabilities research that requires very little expansion of the frontier of understanding. We already have TCS, stats, ML, DL, RL, performance engineering, etc., with quite developed ontologies (sufficiently developed that they’re not the bottleneck).
Early-stage empirical work on “spillway motivations”
Risk from fitness-seeking AIs: mechanisms and mitigations
To what extent is Qwen3-32B predicting its persona?
Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers
Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation
I wonder how much the aversion comes from reckoning with the more visceral thought of having to spend a ton of money on donations, rather than the more abstract thought of “getting no surplus money”. Spending money on donations is often stressful and/or can feel like losing a lot of money! (cf loss aversion)
This also affected Claude Opus 4.7.
I think it’s quite likely that the CoT was relevant to what the reward model was looking for. For one, it’s pretty natural to include some aspects of the model spec in the reward model rubric when judging the trajectory. The model spec would naturally penalize a bunch of stated intent to misbehave in the CoT. And even if the reward model was solely looking for task completion, it’s pretty natural for it to penalize stated intent to cheat on the task or do some other unintended thing.
Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes
Can you clarify what you mean by “terminal power-seeking”? Some things I can imagine:
A cognitive pattern that terminally wants to have long-term power, and therefore plays the training game (IMO the most straightforward interpretation, and the one I most agree with).
A cognitive pattern that terminally pursues power-on-the-episode because this is useful for scoring well on the task. This is what you seem to be pointing at with the Vending-Bench example. (Note that this is an imperfect fit on its own.)
(1) and (2) are importantly different because only (1) motivates training-gaming. I think there’s a reasonable path-dependent case to be made that (2) eventually generalizes to (1), but they entail fairly different behaviors so they’re important to distinguish.
My understanding (you can correct me) is that information can never travel from later layers to earlier layers, e.g. information cannot travel from token location 12 at layer 7 to token location 72 at layer 4. Right?
That’s right. This imposes a strict serial depth limit within a forward pass.
But autoregressive sampling removes this serial depth limit. Information can flow from later layers to earlier layers by sampling tokens and feeding them back into the input. And a smart AI could choose tokens that communicate learnings from later layers in text (e.g. “You can think of a matrix as a linear transformation… <more explanation>.”), and then the early layers reading in this text can quickly make sense of this synthesis of the AI’s new insight, and the early-layer KV cache on the final ”.” token can contain a rich representation capturing the new understanding about matrices. Forming broadly-useful early-layer representations of concepts introduced in-context seems like the kind of thing that’s useful for predicting pre-training documents.
The main point of reasoning models is to break the curse of the within-forward-pass serial depth limit via lots of autoregressive sampling. This massively and usefully improves expressivity and I think it makes continual learning plausible.
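As a toy illustration of the expressivity point: the hypothetical “model” below can only apply its update rule a fixed number of times per forward pass, but feeding each output back in as the next input chains passes into arbitrarily deep serial computation. (This is a sketch, not a transformer; the integer-state update rule is an arbitrary stand-in for within-pass computation.)

```python
# Toy model: one "forward pass" applies a serial update rule at most
# DEPTH times, but autoregressively feeding the output token back in
# chains passes, so total serial depth grows with tokens generated.
# (Hypothetical integer-state toy, not a transformer.)

DEPTH = 3  # serial depth available inside a single forward pass

def layer(x: int) -> int:
    # one layer of serial computation (here: a Collatz step)
    return x // 2 if x % 2 == 0 else 3 * x + 1

def forward_pass(token: int) -> int:
    # within a pass, information only flows forward through DEPTH layers
    for _ in range(DEPTH):
        token = layer(token)
    return token

def autoregress(token: int, n_passes: int) -> int:
    # the sampled output re-enters the next pass at layer 0, so results
    # from the last layer of pass t reach the first layer of pass t + 1
    for _ in range(n_passes):
        token = forward_pass(token)
    return token

# 4 passes of depth 3 give 12 serial steps of computation
print(autoregress(27, 4))  # 322
```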
I think this is a very important hypothesis but I disagree with various parts of the analysis.
Probably the heuristics that are actually driving the long-horizon goal-directed behaviors of the model are going to be whatever parts of the model arise from the long-horizon goal-directed capabilities training.
I think this is an important observation, and is the main thing I would have cited for why the hypothesis might be true. But I think it’s plausible that the AI’s capabilities here could be separated from its propensities by instrumentalizing the learned heuristics to aligned motivations. I can imagine that doing inoculation prompting and a bit of careful alignment training at the beginning and end of capabilities training could make it so that all of the learned heuristics are subservient to corrigible motivations—i.e., so that when the heuristics recommend something that would be harmful or lead to human disempowerment, the AI would recognize this and choose otherwise.
On the other hand, if you train an AI model from the ground up with a hypothetical “perfect reward function” that always gives correct ratings to the behaviour of the AI (and you trained on a distribution of tasks similar to the one you are deploying it on), then I would guess that this AI, at least until around the human range, will behaviorally basically act according to the reward function.
Even if the AI had a perfect behavioral reward function during capabilities-focused training, it wouldn’t provide much pressure towards motivations that don’t take over. During training to be good at e.g. coding problems, even if there’s no reward-hacking going on, the AI might still develop coding-related drives that don’t care about humanity’s continued control, since humanity’s continued control is not at stake during that training (this is especially relevant when the AI is saliently aware that it’s in a training environment isolated from the world—i.e. inner misalignment). Then when it’s doing coding work in the world that actually does have consequences for human control, it might not care. (Also note that generalizing “according to the reward function” is importantly underspecified.)
So can we align an arbitrary model by training it to say “I’m a nice chatbot, I wouldn’t cause any existential risk, … ”? Seems like obviously not, because the model will just learn the domain-specific / shallow property of outputting those particular tokens in that particular situation.
This type of training (currently) does actually generalize to other propensities to some extent in some circumstances. See emergent misalignment. I think this is plausibly also a large fraction of how character training works today (see “coupling” here).
Thanks for the detailed response and analogy, that’s helpful. I agree that current LLMs are bad at continual learning and would fail at making held-out linear algebra progress. My claim is that it’s plausible that naive continued scaling will lead to real continual learning.
I disagree that your continual linear algebra progress will necessarily look like gobbledygook to each new forward pass.
One way to think of it is that there isn’t that much of a principled distinction between weight updates and updates to the KV cache (i.e., long-context activations) from the perspective of a forward pass on the next autoregressive token. You could imagine that the AI comes up with the concept of a matrix and writes down a pedagogical description of the concept in context at time t_1. Then at a later time t_2 it wants to use and build on the concept of a matrix. It’s quite plausible that early layers at time t_2 could just query the KV cache from time t_1 to get a rich representation of “matrix” to work with. The KV cache here essentially serves the same function as updated weights. The AI could in principle continue to make rich early-layer representations of more new concepts by autoregressively reflecting on them as they come up. And then it can query the relevant ones in future contexts.
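A minimal sketch of this retrieval mechanism, using a toy single-head attention step. Random vectors stand in for real hidden states, and the identity K/V projections and dimension are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden dimension (arbitrary choice for the sketch)

# Time t_1: the description of "matrix" is processed; an early layer
# caches keys and values for its five tokens (random vectors stand in
# for real hidden states; identity K/V projections for simplicity).
k_cache = [rng.normal(size=d) for _ in range(5)]
v_cache = [k.copy() for k in k_cache]

# Time t_2: a later token forms a query resembling one of the cached
# keys and retrieves a rich representation with no weight update at all.
q = k_cache[2] + 0.1 * rng.normal(size=d)
K, V = np.stack(k_cache), np.stack(v_cache)
attn = np.exp(K @ q / np.sqrt(d))
attn /= attn.sum()         # softmax over the cached positions
retrieved = attn @ V       # weighted readout from the KV cache
```

The point of the sketch is that `retrieved` is recovered purely from long-context activations, playing the role a weight update would play in a conventional continual learning setup.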
Now, there’s a question of how well they’ll actually be able to construct these new early-layer representations. I think current models are bad at this, and it’s not clear to me that pretraining would build such circuitry. RL can select for this kind of continual learning circuitry, but it’s quite inefficient. So, more intentional continual learning algorithms might end up being necessary before automating AI R&D (and we’ll obviously see different learning algorithms after). But it’s at least plausible that autoregressive LLMs could continual-learn.
When the imitation learning is done, the transformer weights are frozen, and the corresponding trained model is given the impossible task of using only its activations, with fixed weights, to imitate what happens when the target continual learning algorithm changes its weights over millions of steps of (in this case) TD learning.
I think it’s important that the AI doesn’t need to do all of its continual learning in the activations of a single forward pass: It can autoregressively generate tokens too. This way it can leave notes to itself, like “next time I run into a bug like this, I should look in this place first.” And in fact, this seems pretty continuous with stuff we already see. (And reinforcement learning can make this kind of continual learning within a context window more likely.) In other words, the expressivity of autoregressive LLMs is much larger than a single forward pass, making continual learning from long contexts super plausible.
TBC this doesn’t contradict your main point about imitation learning, and is mainly meant to push back on the narrative that transformers would need to be able to simulate long continual learning processes in a single forward pass in order to implement continual learning, and meant to convey why I think LLMs will plausibly be able to implement continual learning via longer context despite your arguments.
My main takeaway from your post is that naively training LLMs to imitate the behaviors of continually-learning policies (e.g., humans) who don’t leave externalized traces of their continual learning process is unlikely to work. (And I believe this is your main point.)
Let’s say the current policy has a 90% chance of cooperating. Then, what action results in the highest expected reward for player 1 (and in turn, gets reinforced the most on average)? Player 1 sampling defect leads to a higher reward for player 1 whether or not player 2 samples cooperate (strategic dominance), and there’s a 90% chance of player 2 sampling cooperate regardless of player 1’s action because the policy is fixed (i.e., player 1 cooperating is no evidence of player 2 cooperating, so it’s not the case that reward tends to be higher for player 1 when player 1 cooperates as a result of player 2 tending to cooperate more in those cases). Therefore, defect actions tend to get reinforced more.
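The expected-reward comparison can be made concrete with hypothetical standard prisoner’s-dilemma payoffs (T=5, R=3, P=1, S=0; the comment above doesn’t specify a payoff matrix, so these are assumptions):

```python
# Expected reward for player 1 against a fixed policy that cooperates
# with probability 0.9, under hypothetical standard PD payoffs
# (T=5, R=3, P=1, S=0 for temptation/reward/punishment/sucker).
P_COOP = 0.9
PAYOFF = {  # (player 1 action, player 2 action) -> player 1 reward
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def expected_reward(a1):
    return P_COOP * PAYOFF[(a1, "C")] + (1 - P_COOP) * PAYOFF[(a1, "D")]

print(expected_reward("C"))  # ~2.7
print(expected_reward("D"))  # ~4.6 -> defect gets reinforced more
```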
If you try to get reward-seekers to cooperate by pooling reward in multi-agent settings, you’re not changing their decision theory; you’re just changing the reward structure so that CDT reward-seekers are incentivized to cooperate with each other.
Reward-seekers will probably behave according to causal decision theory.
Background: There are existing arguments to the effect that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is: Most RL algorithms search for policies by selecting for actions that cause the highest reward. E.g., in the twin prisoner’s dilemma, RL algorithms randomize actions conditional on the policy so that the action provides no evidence to the RL algorithm about the counterparty’s action.) This doesn’t imply RL produces CDT reward-maximizing policies: CDT behavior on the training distribution doesn’t imply CDT generalization because agents can fake CDT in the same way that they can fake alignment, or might develop arbitrary other propensities that were correlated with reward on the training distribution.
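To sketch why this selection is causal, here is a toy REINFORCE run on a twin prisoner’s dilemma: both players sample independently from the same policy, and only player 1’s own action is credited, so cooperating provides no gradient “evidence” of the twin cooperating. The payoffs (T=5, R=3, P=1, S=0), learning rate, and step count are all hypothetical choices:

```python
import math
import random

random.seed(0)
PAYOFF = {(1, 1): 3, (1, 0): 0, (0, 1): 5, (0, 0): 1}  # 1 = cooperate

theta, lr = 0.0, 0.1  # P(cooperate) = sigmoid(theta), initially 0.5

def p_coop():
    return 1 / (1 + math.exp(-theta))

for _ in range(5000):
    p = p_coop()
    a1 = 1 if random.random() < p else 0
    a2 = 1 if random.random() < p else 0  # twin samples independently
    r = PAYOFF[(a1, a2)]
    # REINFORCE: r * d/d_theta log pi(a1); only player 1's own action
    # is credited, so the update ignores any correlation with the twin
    grad_logp = (1 - p) if a1 == 1 else -p
    theta += lr * r * grad_logp

print(p_coop())  # driven toward 0: defection gets reinforced
```

Even though mutual cooperation pays more than mutual defection, crediting only the sampled action makes the gradient track the CDT (dominance) comparison, so the shared policy converges to defect.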
But conditional on reward-on-the-episode seeking, the AI is likely to generalize CDT.
If, for example, a reward-seeker tried to evidentially cooperate between episodes (so it had non-zero regard for reward that isn’t used to reinforce its current actions), this would be trained away because the AI would be willing to give up reward on the current episode to some extent. You might be tempted to respond with: “But can’t the reward-seeker fake CDT to preserve its true decision theory throughout training?” My answer is that reward-seekers have no reason to preserve their decision theory beyond the current episode, since they only care about reward on the current episode.
One way to think of it is that reward-seeking is the hypothesis under which the learned policy inherits its generalization propensities most directly from the RL algorithm (where “reward is most the optimization target”), so it also inherits CDT behavior from the RL algorithm.
A similar argument for CDT goes for return-on-the-action seekers. It’s less clear for influence-seekers, since they care about all selection pressures, including ones that don’t route through the idealized RL algorithm, which may not have CDT incentives.
This isn’t to say that their decision theory will always be CDT[1]. After lots of reflection or deliberation, reward-seekers (and return-seekers) will quite plausibly change their decision theory.
- ^
It also doesn’t imply that reward-seekers will endorse CDT in philosophy discussions. E.g., it might expect to get rewarded for endorsing EDT.
Yes, your understanding matches mine. I’m just saying that LLMs might be able to get by despite the discrete token bottleneck.