If you plot a line, does it plateau or does it get to professional human level (i.e. reliably doing all the things you are trying to get it to do as well as a professional human would)?
It plateaus before professional human level, both in a macro sense (comparing what ZeroPath can do vs. human pentesters) and in a micro sense (comparing the individual tasks ZeroPath does when it’s analyzing code). At least, the errors the models make are not ones I would expect a professional to make; I haven’t actually hired a bunch of pentesters and asked them to do the same tasks we expect of the language models and made the diff. One thing our tool has over people is breadth, but that’s because we can parallelize inspection of different pieces and not because the models are doing tasks better than humans.
What about 4.5? Is it as good as 3.7 Sonnet but you don’t use it for cost reasons? Or is it actually worse?
We have not yet tried 4.5 as it’s so expensive that we would not be able to deploy it, even for limited sections.
Still seems like potentially valuable information to know: how much does small-model smell cost you? What happens if you ablate reasoning? If it is factual knowledge and GPT-4.5 performs much better, then that tells you things like ‘maybe finetuning is more useful than we think’, etc. If you are already set up to benchmark all these OA models, then a datapoint from GPT-4.5 should be quite easy and just a matter of a small amount of chump change in comparison to the insight, like a few hundred bucks.
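The "few hundred bucks" figure is easy to sanity-check with back-of-the-envelope arithmetic. A minimal sketch, using the preview per-token prices OpenAI published for GPT-4.5 ($75 per 1M input tokens, $150 per 1M output tokens); the workload numbers (case count, tokens per case) are made-up placeholders, not ZeroPath's actual benchmark sizes:

```python
# Rough cost estimate for a one-off GPT-4.5 benchmark run.
# Prices are the published GPT-4.5 preview rates; workload numbers
# are hypothetical placeholders for illustration only.

PRICE_IN_PER_M = 75.0    # USD per 1M input tokens (GPT-4.5 preview)
PRICE_OUT_PER_M = 150.0  # USD per 1M output tokens (GPT-4.5 preview)

def run_cost(cases: int, in_toks: int, out_toks: int) -> float:
    """Total USD to push `cases` benchmark items through the model."""
    total_in = cases * in_toks
    total_out = cases * out_toks
    return total_in / 1e6 * PRICE_IN_PER_M + total_out / 1e6 * PRICE_OUT_PER_M

# e.g. 500 cases at ~6k prompt tokens and ~2k completion tokens each:
print(run_cost(500, 6000, 2000))  # → 375.0
```

Even with generous token budgets per case, a single evaluation pass stays in the low hundreds of dollars, which is the scale of "chump change in comparison to the insight" being argued above.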
Please help me understand: how would you “ablate reasoning”, and what is the connection to “small-model smell”?