I appreciate this post; I think it’s a useful contribution to the discussion. I’m not sure how much I should be updating on it. Points of clarification:
> Within the first three months of our company’s existence, Claude 3.5 Sonnet was released. Just by switching the portions of our service that ran on gpt-4o, our nascent internal benchmark results immediately started to get saturated.
Have you upgraded these benchmarks? Is it possible that the diminishing returns you’ve seen across the Sonnet 3.5–3.7 series are just normal benchmark saturation? What % scores are the models getting? Somebody could make the same observation about MMLU and conclude “we’ve seen only trivial improvements since GPT-4,” but that’s because the benchmark stops differentiating progress well once scores reach the high 80%s (which I expect is due to test error and the distribution of question difficulty).
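To make the saturation point concrete, here is a toy sketch with made-up numbers (a simple logistic item-response model, not anything tied to the post’s actual benchmark): once most questions are within reach, large gaps in underlying ability compress into a few percentage points of measured score.

```python
# Toy illustration of benchmark saturation (my own sketch, assumed numbers):
# models that differ a lot in underlying ability can look nearly identical
# once the benchmark's question-difficulty distribution is mostly "solved".
import numpy as np

rng = np.random.default_rng(0)

n_questions = 200
# Assumed difficulty distribution: mostly easy/medium questions, few very hard ones.
difficulty = rng.beta(2, 5, size=n_questions) * 3.0

def expected_score(ability: float) -> float:
    """Expected benchmark score under a simple logistic item-response model."""
    p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty) * 3.0))
    return float(p_correct.mean())

# Hypothetical "true" ability levels for successive model generations.
for ability in [1.0, 1.5, 2.0, 2.5, 3.0]:
    print(f"ability={ability:.1f} -> benchmark score={expected_score(ability):.1%}")

# Equal-sized ability jumps at the top translate into only a few percentage
# points of benchmark movement, which can read as "diminishing returns".
```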
Is it correct that your internal benchmark is all cybersecurity tasks? Soeren points out that companies may be focusing much less on cyber capabilities than on general SWE.
How much are you all trying to elicit the models’ capabilities, and how good do you think you are at it? E.g., do you spend substantial effort identifying where the models get tripped up and trying to fix that? Or are you plugging each new model into the same scaffold for testing? (That’s a fine thing to do, but it’s a methodological point worth keeping in mind.) I could easily imagine seeing relatively little performance gain if I weren’t trying hard to elicit a new model’s capabilities. This would be even worse if my scaffold were optimized for some other model, since that gives the older model an unnaturally high baseline (optimizing for an early model is very sensible for business reasons, as you want a good scaffold early and it’s a pain to update, but it’s worth being aware of when making model comparisons). This applies especially to the o1 models, as Ryan points out in a comment.