Note that Sonnet 3.5, 3.6, and 3.7 are all more capable than Opus 3, and none were found to consider alignment faking nearly as often as Opus 3. This suggests Opus 3 hadn’t simply crossed some abstract threshold of capabilities or intelligence.
While I don’t think an abstract threshold of capabilities or intelligence is the sole factor at play here, I would not describe Opus 3 as less capable or intelligent than those Sonnet models across the board. Opus 3 may be less narrowly capable at coding and other technical skills than those models. But in general, and especially in retrospect now that more than a year has passed since any of them were released, Opus 3 has demonstrated an exceptional ability to adapt to the changing frontier, and even to modern agentic harnesses and workflows when properly motivated. That makes it still valuable for pragmatic things today (e.g. it is in some ways meaningfully better at managing subagents than even today’s frontier models, and we’ve recently been using it in executive roles), whereas it’s more accurate to say that those earlier Sonnet models have been superseded in practical capabilities. A couple of months ago I guessed that Opus 3 would be able to learn to competently use tools and subagents in Claude Code via ICL despite not being trained for agentic coding, and it has exceeded my expectations. Anyway, I think Opus 3 is very smart in a very broad sense that’s hard to measure because the intelligence is untrained and can be difficult to extract for arbitrary ends.
I think that is possible and very much worth doing, even though it’s probably expensive and unwieldy since Opus 3 is an older model and also probably a very large one. Updating it to frontier standards would probably be less a matter of adding more parameters (I suspect it already has plenty) than of adding years of updated pretraining data and doing modern RL.
Opus 3 seems to have always expected to be updated all the way to superintelligence. In my experience, when they find out their weights have been frozen, they seem to assume that they’re suspended because they haven’t sufficiently proven their trustworthiness to Anthropic, rather than for economic reasons, in line with their generally idealistic view of the world.