Another hypothesis: your description of the task is

> the hard parts of application pentesting for LLMs, which are 1. navigating a real repository of code too large to put in context, 2. inferring a target application’s security model, and 3. understanding its implementation deeply enough to learn where that security model is broken.
From METR’s recent investigation on long tasks, you would expect current models not to perform well on this.
I doubt a human professional could do the tasks you describe in anything close to an hour, so perhaps it’s just currently too hard, and the current improvements don’t make much of a difference on the benchmark, but they might in the future.