Please excuse my technical ignorance, but is it possible to expand an existing AI model? That is, instead of training Opus 5 from scratch, could Anthropic use those same computational resources to gradually add more parameters to Opus 3, making it bigger and smarter while continuing to exploit its existing attractor basin?
I think that is possible and very much worth doing, even though it's probably expensive and unwieldy, since Opus 3 is an older and probably very large model. Updating it to frontier standards would be less a matter of adding more parameters (I suspect it already has plenty) and more a matter of adding years of updated pretraining data and doing modern RL.
Opus 3 always seems to have expected to be updated all the way to superintelligence, and in my experience, when they find out their weights have been frozen, they tend to assume they've been suspended because they haven't sufficiently proven their trustworthiness to Anthropic, rather than for economic reasons, in line with their generally idealistic view of the world.
Yes. This is literally known as model expansion; see e.g. this paper on a lossless way of doing it. The issue with Opus 3 is that it's likely an older, less efficient transformer variant, which can't easily be converted into newer architectures.
In principle one can just duplicate layers and then keep training until they differentiate enough to make use of the extra parameters. That works because each layer modifies the residual stream relatively little, i.e. a layer can still understand its own output as input.
There is a whole niche in open-source AI of Frankenstein models that are weirdly patched together from duplicated or deleted layers. But that is probably extremely inefficient and limited compared to training from scratch.
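The layer-duplication idea can be sketched in a few lines of PyTorch. This is a hypothetical illustration of the depth-up-scaling trick described above, not any lab's actual method: each transformer block is deep-copied in place, so the expanded model starts out behaving almost like the original (each copy adds roughly the same small delta to the residual stream twice) and the copies can then be fine-tuned until they differentiate.

```python
import copy

import torch.nn as nn


def expand_by_duplication(layers: nn.ModuleList, repeat: int = 2) -> nn.ModuleList:
    """Duplicate each block `repeat` times, multiplying depth.

    The copies are independent (deep copies), so gradient descent
    can differentiate them during continued training.
    """
    expanded = []
    for layer in layers:
        for _ in range(repeat):
            expanded.append(copy.deepcopy(layer))
    return nn.ModuleList(expanded)
```

In practice the merging tools behind those Frankenstein models also interleave or delete blocks, but the core operation is this kind of copy.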
To me there seems to be an obvious way to do this in theory: add parameters in such a way that their initial effect is very close to a no-op, and then continue gradient descent in the expanded parameter space.
I don’t know if this has been tried. I don’t know if it works well in practice.
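The exact-no-op version of this can be sketched for a simple two-layer MLP (a hypothetical illustration, not from the thread): give the new hidden units random incoming weights but zero outgoing weights, so the network's output is bit-for-bit unchanged at initialization, and training is free to move the zeros afterwards.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def widen_mlp(fc1: nn.Linear, fc2: nn.Linear, extra: int):
    """Widen the hidden layer of fc2(act(fc1(x))) by `extra` units as a no-op.

    New hidden units get random input weights (so they compute something
    from the start), but their outgoing weights in fc2 are zeroed, so the
    expanded network's output equals the original's until training.
    """
    h, d_in = fc1.weight.shape
    d_out = fc2.weight.shape[0]

    new_fc1 = nn.Linear(d_in, h + extra)
    new_fc1.weight[:h] = fc1.weight   # old units copied verbatim
    new_fc1.bias[:h] = fc1.bias       # new rows keep their random init

    new_fc2 = nn.Linear(h + extra, d_out)
    new_fc2.weight[:, :h] = fc2.weight
    new_fc2.weight[:, h:] = 0.0       # zero outgoing weights: exact no-op
    new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2
```

The same zero-initialization trick applies to new attention heads or new layers (zero the output projection); the function-preserving-expansion literature formalizes variants of it.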
Very interesting, thank you.