I’m so far not impressed with Claude 4s. They are trying to make up superficially plausible stuff for my math questions as fast as possible. Sonnet 3.7, at least, explored a lot of genuinely interesting avenues before making an error. “Making up superficially plausible stuff” sounds like a good strategy for hacking not-very-robust verifiers.
These seem to be even more optimized for the agentic-coder role, and in the absence of strong domain transfer (whether or not that’s a real thing) you should mostly expect them to be at about the same level in other domains, or even worse because of forgetting from continued training. Maybe.
Same experience for a physics question on my end.
Did you try both Opus and Sonnet 4?
Yeah, they both made up some stuff in response to the same question.