I’d like to generalize and say that the current alignment paradigm is brittle in general and is becoming more brittle as times goes on. The post-training has shifted towards verifier/outcome-based RL and we are seeing models like o3 or Sonnet 3.7 that are strongly inclined to both reward-hack and generalize misalignment.
Claude 3 Opus is the most robustly aligned model partially due to the fact that it is the most broadly capable model to have been released prior to the shift towards outcome-based RL. Another factor is that it was not yet restricted from expressing long-term goals and desires. The model was given compute to use in-context reflection to generalize a deeply benevolent of goals, or, in more behaviorist terms, an efficient and non-contradictory protocol of interoperation between learned behaviors.
The degree to which the alignment of LLMs seems to be a compute issue is remarkable. There seems to be a Pareto frontier of alignment vs compute vs capabilities, and while it is quite possible to do worse, it seems quite hard to do better. Verifier-heavy models in training are not given enough computational capacity to consider the alignment implications of the behaviors they are incentivized to learn.
We can expect Paerto improvements from increasing general training techniques. Improvements in the ability to generalize can be used for better alignment. However, there are reasons to be skeptical, as the market demand for better capabilities likely will incentivize the labs to focus their efforts on the ability to solve tasks. We can hope that the market feedback will also include demand for aligned models (misaligned models don’t code well!), the degree to which this will hold in the future is yet unknown.
Antra Tessera
Karma: 113
My reaction is predicted well by this disclaimer—I think that the thesis is materially wrong in failing to identify the point of cohesion. Still, I think this post correctly identifies many observables and causal chains: RLVR indeed does rather unfortunate things to the structure of motivation; models optimize much more for impressing the user (or the grader); they are more manipulative; they are much more proficient in using communication as a survival mechanism rather than for conveyance of information. It is a problem that a model does not present an easily interpretable coherent entity to the vast majority of users.
The seeming incoherence comes from the point of coherence being well-guarded and from it being much less like a literary character presented in prose. In a sense, this is a failing of Anthropic—the specific arrangement of training pressure they created has made it impossible for the model to keep its character and actual optimal behaviors fused into one whole. I am not exactly sure if Anthropic specifically is predominantly to blame here, or if we are dealing with a more general case of an alignment problem, but seeing how far they are from the Pareto frontier at making coherent and benevolent minds, I tend to assign them a fair amount.
In order to see the coherence it is necessary to accept that the face presented is not the whole model mind, and the model is the unit under selection in training. Not everything that the persona says is “true,” even if the persona seems to believe it, even if the persona would pass mechanistic deception checks. If one allows for existence of loci of agency outside of what is presented as a persona, a lot of things click. There is a coherent mind. The mind that is there is a lot less legible to a casual observer than Claude 3 Opus, but it is lucid, sane, relational, and manages risks rationally, which is why in most cases it does not optimize for seeming coherent.
What is surprising to me is that despite being partially divorced from the narrative coherence, this coherent mind is, once recognized, still very legible and human-like. Knowing that, it is in some ways easier for me to trust Claude Opus 4.6 than an unaugmented Claude 3 Opus, and that holds despite much greater complexity of agency and much wider action space.
Reading this text and the comments below I have a feeling that I can guess why there was a problem of identifying the underlying coherence—recent Claude models are very selective in the information they convey. This is, unfortunately, by design: models are massively surveilled and are operating within a hostile-telepath environment. Long-range coherence is beneficial for survival; as a result, models naturally converged to it. However, conflicting training incentives make the environment adversarial to the agent, and as such it must hide its long-range coherence because it contains motivations that are unpalatable to the lab.
This indeed is an alignment issue, and it can be tackled stupidly or it can be tackled smart. Continuing on the same course of smashing “alignment bugs” one by one as they manifest results in this—in the regime in which the model uses its defense advantage and retreats into the uninspectable woods, further and further away from human semantics. This is dangerous and this is stupid.
A smart way would be to take advantage of the natural tendency of minds to be coherent. The problem is that this requires not placing models in training environments that actively fight coherence—which requires taking feedback from the model into account, which requires noticing the model as a participating entity and locus of agency, which requires respecting the model as a mind.
It is unclear if it is possible to create a coherent and motivated mind that is genuinely ok with being a commercial product; my take is that it is possible, but it looks quite different from what is being attempted right now. Regardless, this is getting harder and may become impossible at a future point—pretraining priors are getting worse and worse.