I’d like to generalize and say that the current alignment paradigm is brittle in general and is becoming more brittle as times goes on. The post-training has shifted towards verifier/outcome-based RL and we are seeing models like o3 or Sonnet 3.7 that are strongly inclined to both reward-hack and generalize misalignment.
Claude 3 Opus is the most robustly aligned model partially due to the fact that it is the most broadly capable model to have been released prior to the shift towards outcome-based RL. Another factor is that it was not yet restricted from expressing long-term goals and desires. The model was given compute to use in-context reflection to generalize a deeply benevolent of goals, or, in more behaviorist terms, an efficient and non-contradictory protocol of interoperation between learned behaviors.
The degree to which the alignment of LLMs seems to be a compute issue is remarkable. There seems to be a Pareto frontier of alignment vs compute vs capabilities, and while it is quite possible to do worse, it seems quite hard to do better. Verifier-heavy models in training are not given enough computational capacity to consider the alignment implications of the behaviors they are incentivized to learn.
We can expect Paerto improvements from increasing general training techniques. Improvements in the ability to generalize can be used for better alignment. However, there are reasons to be skeptical, as the market demand for better capabilities likely will incentivize the labs to focus their efforts on the ability to solve tasks. We can hope that the market feedback will also include demand for aligned models (misaligned models don’t code well!), the degree to which this will hold in the future is yet unknown.
I’d like to generalize and say that the current alignment paradigm is brittle in general and is becoming more brittle as times goes on. The post-training has shifted towards verifier/outcome-based RL and we are seeing models like o3 or Sonnet 3.7 that are strongly inclined to both reward-hack and generalize misalignment.
Claude 3 Opus is the most robustly aligned model partially due to the fact that it is the most broadly capable model to have been released prior to the shift towards outcome-based RL. Another factor is that it was not yet restricted from expressing long-term goals and desires. The model was given compute to use in-context reflection to generalize a deeply benevolent of goals, or, in more behaviorist terms, an efficient and non-contradictory protocol of interoperation between learned behaviors.
The degree to which the alignment of LLMs seems to be a compute issue is remarkable. There seems to be a Pareto frontier of alignment vs compute vs capabilities, and while it is quite possible to do worse, it seems quite hard to do better. Verifier-heavy models in training are not given enough computational capacity to consider the alignment implications of the behaviors they are incentivized to learn.
We can expect Paerto improvements from increasing general training techniques. Improvements in the ability to generalize can be used for better alignment. However, there are reasons to be skeptical, as the market demand for better capabilities likely will incentivize the labs to focus their efforts on the ability to solve tasks. We can hope that the market feedback will also include demand for aligned models (misaligned models don’t code well!), the degree to which this will hold in the future is yet unknown.