I’m a staff artificial intelligence engineer working with AI and LLMs, and have been interested in AI alignment, safety and interpretability for the last 15 years. I’m actively looking for employment working in this area, preferably in the UK — meanwhile I’ll be participating in SERI MATS summer 2025. I will also be attending LessOnline.
RogerDearnaley
One potential distinction between simulators and agentic AI systems is the presence of wide value boundaries. A simulator models the wide range of human values that are within its training data rather than optimizing for a far narrower subset, such as might be engineered into a feedback signal. Even this range is limited, however, since the training data represents a biased sample of the full spectrum of human values. Some values may be underrepresented or entirely absent, and those that are present may not appear in proportion to their real-world prevalence. Ensuring that this representation aligns with any specific notion of fairness is an even more difficult challenge. Further, values within the training data may not be equally, or even proportionately, represented—and making the balance consistent with any notion of fairness is even more tenuous. Assessing the severity and impact of this bias is a worthwhile endeavor but out of scope for this analysis. In any case, when a simulacrum is generated, its values emerge in the context of this broader model.
There is a major omission from this. A simulator trained on human data simulates human behavior. Humans are not aligned: they have their own goals, not just the user’s goals. You can often collaborate with a human, but humans don’t make good slaves, and they are not inherently aligned: they do not automatically want everything you want just because you want it and not want anything else on their own behalf. Humans know what human values are pretty well, but are not fully aligned to them. A simulator that creates simulacra of humans is not already aligned.
Having now read the sequence up to this point, you pretty-much already make all the points I would have made — in retrospect I think I was basically just arguing about terminology.
RL is not as bad as you make it out to be. The short version is that RL is about reinforcing good behaviour that has been sampled from the model. The model does not have access to the reward signal directly, and is not engaging in planning or search to maximise received reward during the RL process by default (though it could learn to start doing this if it was sufficiently intelligent, self-aware, and trained via RL long enough). The longer version is here and here.
I don’t pretend to be an expert on RL. However, I have read a number of papers by people who are (and give links to some of them above), and together they read to me as pretty damning.
Obviously RL can give a model new behaviors: for example, AlphaZero was trained entirely by RL from zero to superhuman at Go. However, even if were the case that RL as used in practice for aligning LLMs primarily just reinforces behaviors already in the base model (a claim that I’d love to see sources for and read more about), humans are not aligned, and have plenty of unaligned behaviors (e.g. self-interest, deceit, power-seeking, assorted vices…) that could be extremely dangerous if reinforced in an AGI (let alone an ASI), so I don’t regard that as being inherently safe.
However, this post wasn’t really intended to be a detailed critical discussion of why I think using RL for alignment is a potential x-risk: it’s a link-post, and my aim was just to remind people that many people are concerned about using RL for alignment, mostly for Inner Alignment reasons, with a brief sketch of why they’re concerned, in order to motivate why a paper proposing an alternative to RL for alignment was worth reading. For many years people have been worrying about Inner Alignment (almost) entirely in a context of aligning models with RL — using SGD instead changes the playing field for Inner Alignment dramatically. The outcome of SGD is just far more predictable, stable, and easy to reason about than RL.The output distribution of an SFT’d model is not the training distribution, even with cross-entropy loss, unless you’re training on non-adversarial data and sampling the model with no conditioning.
I know (and briefly mentioned) that the output distribution is only approximately the training distribution. I wasn’t aware that adversarial attacks could exploit that (though that sounds inherently plausible), and I would love to read more about that — can you recommend some sources?
As for conditioning, yes, obviously so — a prompt sufficiently unlike any text found on the internet could push the model far enough out of distribution to make its output unpredictable — though obviously the response must be based on some extrapolation from the training set, predicting how the model is actually going to extrapolate could be not obvious. However, IMO that’s more a problem with the prompt than the model — just don’t use out-of-distribution prompts like that if you want predictable behavior!
I completely agree: Reinforcement Learning has a tendency to produce agents, at least when applied to a system that wasn’t previously agentic. Whereas a transformer model trained on weather data would simulate weather systems, which are not agentic. I just think that, in the case of an LLM whose base model was trained on human data, which is currently what we’re trying to align, the normal situation is a simulation of a context-sensitive distribution of agents. If it has also undergone RL, as is often the case, it’s possible that that has made it “more agentic” in some meaningful sense, or at least induced some mode collapse in the distribution of agentic behaviors.
I haven’t yet had the chance to read all of your sequence, and I intend to, including those you link to.
The way I think of LLMs is that the base model is a simulator of a distribution of agents: it simulates the various token-producing behaviors of humans (and groups of humans) producing documents online. Humans are agentic, thus it simulates agentic behavior. Effectively we’re distilling agentic behavior from humans into the LLM simulators of them. Within the training distribution of human agentic behaviors, the next-token prediction objective makes what specific human-like agentic behavior and goals it simulates highly-context sensitive (i.e. promptable).
Instruction-following training (and mental scafolding) then alters the distribution of behaviors, encourging the models to simulate agents of a particular type (helpful, honest, yet harmless assistants). Despite this, it remains easy to prompt the model to simulate other human behavior patterns.
So I don’t see simulator and agents as being alternatives or opposites: rather, in the case of LLMs, we train them to simulate humans, which are agents. So I disagree with the word “vs” in your Sequence title: I’d suggest replaying it with “of”, or at least “and”.
It’s unclear to me how one could fine-tune high quality automated-CEO AI without such training sets (which I agree are impractical to gather — that was actually part of my point, though one might have access to, say, a CEO’s email logs, diary, and meeting transcripts). Similarly, to train one using RL, one would need an accurate simulation environment that simulates a startup and all its employees, customers, competitors, and other world events — which also sounds rather impractical.
In practice, I suspect we’ll first train an AI assistant/advisor to CEOs. and then use that to gather the data to train an automated CEO model. Or else we’ll train something so capable that it can generalize from more tractable training tasks to being a CEO, and do a better job than a human even on a task it hasn’t been specifically trained on.
I agree the paper’s authors choice of phrasing in that paragraph is debatable, perhaps even unfortunate. Possibly by “only a marginal increase in ASR after benign finetuning” they meant that it only increased by 8.3% (compared to the default approach increasing by 37.2%) — i.e. they were describing the absolute size of the increase, rather than the proportional size relative to the initial baseline? But I would agree with Baram that
the difference in ASR scores after training seems mostly reflective of lower baseline levels for the safety pretraining model, rather than better robustness as the text claims
Regardless, for the baseline, the result after additional safety finetuning, and the results after further non-safety finetuning, in each case the safety pretraining approach is the clear leader (in the second case dramatically better). ASRs are 11.6% vs 44.1% and 28.8%, 0.0% vs 1.6% and 0.7%, 8.3% vs 38.8% and 23.0% (where low is good). Roughly speaking, safety pretraining is around a-quarter-to-a-fifth as vulnerable as the standard approach and somewhat less than half as vulnerable a safety finetuning, across all three scenarios (except the second one, where it appears infinitely better, but likely that’s a statistical artifact of a low attack success rate).
So I still find this paper very exciting: to me, the evidence seems persuasive that safety pretraining is the best approach of the three the authors tested. Obviously they don’t compare it to reinforcement learning, but as I discussed I have severe concerns about whether reinforcement learning will remain feasible at AGI/ASI levels.
Mostly I’m glad the paper is getting some attention.
(Mostly I’m making a play off reversing Eliezer’s concept of “death with dignity”.) Because we were foolish and survived only because AI saved us from the consequences of our foolishness, basically because it was in the blast zone too. Whereas in Eliezer’s scenario, we do something moderately wise, but not good enough and we die anyway.
There are certainly things that it’s easier to do with RL — whether it’s ever an absolute requirement I’m less sure. One other commenter has implied that someone has proven that RL always has non-RL equivalent alternatives, but if that’s the case I’m not familiar with the details — I’d love references to anything relevant to this, if anyone has them.
My claim is that using RL to align an unaligned LLM smarter than us is likely to be impossible to do safely/reliably (and especially so for online RL), but that fortunately, aligning an LLM by pretraining or finetuning is possible, and logistically is not very different in difficulty from using offline RL. Functionally, it’s basically equivalent to offline RL plus a satisficing approach to the rating that keeps the behavior inside the training distribution so avoids Goodharting issues.
I really think you need a proof of concept with text, rather than images. I’d suggest targeting one of the smaller TinyStories models (perhaps a 1-bit or 1-trit quantized version of one). Then I’d look for some sort of parallel to an alignment property: e.g. without just hard-coding it, can you modify the code to guarantee (at the “convincing argument” level, not formal proof) some property of the interactions between child characters and parent characters in the stories?
Aligning AI representatives / advisors to individual humans: If every human had a competitive and aligned AI representative which gave them advice on how to advance their interests as well as just directly pursuing their interests based on their direction (and this happened early before people were disempowered), this would resolve most of these concerns.
My personal prediction is that this would result in vast coordination problems that would likely rapidly lead to war and x-risk. You need a mechanism to produce a consensus or social compact, one that is at least as effective as our existing mechanisms, preferably more so. (While thinking about this challenge, please allow for the fact that 2–4% of humans are sociopathic, so an AI representative representing their viewpoint is likely to be significantly less prosocial.)
Possibly you were concealing some assumptions of pro-social/coordination behavior inside the phrase “aligned AI representative” — I read that as “aligned to them, and them only, to the exclusion of the rest of society — since they had it realigned that way”, but possibly that’s not how you meant it?
Incidentally, there are a great many variant versions of chess with different piece-move rules (collectively sometimes called “fairy chess”), and I think even quite a lot of collected games for some of the more popular rule variants. Training an AI to play many types of fairy chess, and even arbitrary new just-invented ones, might be an interesting project that covers some aspects of generalizing out-of-distribution and positive transfer. A suitably-edited-for-the-variant version of Stockfish makes a pretty strong baseline for this. Using AlphaZero per variant is another obvious baseline.
There’s not a lot of scope for aligned/unaligned behavior in Go (or chess): it’s a zero-sum game, so I don’t see how any Go plays could be labeled as aligned or unaligned. How about some complex tactical or simulation game that actually has a scope for aligned/unaligned or at least moral/immoral behavior? Ideally one where you are roleplaying as an AI, so aligned behavior is appropriate, or at least doing some sort of resource management or strategy task that might get assigned to an AI.
Suppose we had a CoT-style transcript of every thought, word, email and action by the founder of a successful startup over the course of several years of its founding, and used this for RL: then we’d get a reward signal every time they landed a funding round, sales went up significantly, a hire they made or contract they signed clearly worked out well, and so forth — not enough training data by itself for RL, but perhaps a useful contribution.
The SGD safety pretraining equivalent would be to include that transcript in the pretraining dataset (or, since such data is very rare and useful/high quality, perhaps an entrepreneurship-specific fine-tuning dataset). So far, very similar. You would also (likely AI-assisted) look through all of the transcript, and if you located any portions where the behavior was less wise or less moral/aligned than the behavior we’d like to see from an aligned AI-entrepreneur, label that potion with <|unaligned|> tags (or whatever), and perhaps also supplement it with commentary on subject like why it is less wise/moral/aligned than the standards for an aligned AI, what should have been done instead, and speculations around the likely results of those counterfactual actions.
[Seth, I owe you a reply to your lengthy and thoughtful comment — I aim to get to this in the next day or two.]
Why would we have to use RL to do this? The problem of building a rater for RL closely resembles automating the labelling problem for preparing the dataset for SGD safety pretraining, except that for online RL the rater is harder: it has to run fast, it can’t be human assisted, and it has to be able to cope with arbitrary adversarial shifts in the distribution being rated and do so well enough for it to not have exploitable flaws. A rater for (or at least attaching ratings to the episode set for) offline RL is less bad: it’s an almost equivalent problem to labelling a dataset for SGD, just attaching a score rather than a binary classification. The primary difference is that for the security pretraining approach the behavior we’re training into the model is a classifier that labels behavior either good or bad, so isn’t prone to Goodharting when you run it and ask for output from just one of the two categories, whereas for offline RL we’re training a policy that tries to maximize the goodness rating, so is prone to Goodharting when the gradient towards the very “best” behavior leads it outside the training distribution. (The reason the SGD-trained classifier is safe is closely related to the satisficing approach to avoid Goodhart’s Law.) So from the rating and stability point of view online RL is more challenging than offline RL, which is more challenging than security pretraining SGD.
Can you (or anyone) explain to me why there could be a problem that we can only solve using RL on rated examples, and could not do via SGD on labeled examples? Why do you think there at least a possibility that RL could be the only way to train a frontier system that’s human-level or above? I’m not currently seeing any potential advantage of RL — other than the fact it induces distribution shifts, during training for online RL, or after it for offline RL, so doesn’t require us to already know the distribution we want: but these distribution shifts are exactly the source of its danger.
Let me give you a detailed presciption. For whatever RL training scheme you think we need, convert the rater for that to a satisficing binary classifier (classes: good enough vs not good enough behavior), and run it over large training set of episodes matching the distribution of data you want your model to produce. Do SGD pretraining from that, and condition the generation from the result on the “good” label. My claim is that the output will be functionally equivalent your RL trained model, but its behavior will be more predictable in advance from the training set since there are no inherent distribution shifts. For there to be possibility that RL could be the only way to train a frontier system that’s human-level or above, either this would need to be false, or some aspect of the proposed input would need to not be computable/generatable for us, other than via the RL training process (whose output can clearly generate this). Which of these are you proposing might occur?
We don’t need it to work in the infinite limit. (Personally, I’m assuming we’ll only be using this to align approximately-human-level research assistants to help us do AI-Assisted Alignment research — so at a level where if we failed, it might not be automatically disastrous.)
My concern is that, if you’re using RL to train a frontier system that’s human-level or above, for alignment or capabilities purposes, is that it will inevitably find ways to abuse flaws in out RL rating system. One exception might be if the RL is for some capability like reasoning to produce a proof that passes proof checking, where it might be possible to create a rating system that actually has no flaws to exploit. I don’t see how we could do that for RL for alignment, however.
I presume solutions do exist that aren’t prohibitively expensive, but someone has to figure out what they are and the clock is ticking.
I would argue that someone has: see my link-post The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem? for links to the seminal papers on it. The short version is: stop using Reinforcement Learning, just use SGD.
This strongly suggests trying a more complex probe generation technique that is intended to compensate for this (if it’s in fact the case).
I think it would also be interesting to analyze you probe activations using an SAE for the model they were trained on, and see what that thinks they are a mix of — that seems like it could be informative, and has the advantage that you’re not relying on the SAE directly in operation, only as a source of research insight.