Unfortunately, the available benchmark tasks do not allow for 99%+ reliability measurements. Because we don't have 1,000 different one-minute tasks, the best we could do would be something like checking whether GPT-5.1 can complete all 40 tasks 25 times each with perfect reliability. Most likely it would succeed at all of them, simply because we don't have a task that happens to trip it up.
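A rough sketch of why the task count is the bottleneck, under the assumption that failures are correlated within a task (a model that fails a task once tends to fail it again), so the 40 distinct tasks are the effective sample size rather than the 40 × 25 = 1,000 individual runs. It uses the "rule of three": after zero failures in n independent trials, an approximate 95% upper bound on the failure rate is 3/n.

```python
def rule_of_three_upper_bound(n_trials: int) -> float:
    """Approximate 95% upper bound on the failure rate
    after observing zero failures in n_trials independent trials."""
    return 3 / n_trials

# Treating every (task, repetition) pair as independent: 40 * 25 = 1,000 trials.
print(rule_of_three_upper_bound(40 * 25))  # 0.003 -> would support a ~99.7% reliability claim

# But if failures are task-specific, only the 40 distinct tasks are informative:
print(rule_of_three_upper_bound(40))       # 0.075 -> supports only a ~92.5% claim
```

So even a perfect 1,000-for-1,000 run can't justify a 99%+ figure across tasks; it can only bound reliability at roughly 92-93% unless many more distinct tasks are added.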
As for humans' 99.9%: at a granular enough level, the unit would be about 0.2 seconds (typing one keystroke), because few people type with better than 99.9% keystroke accuracy. But in the context of a larger task we can correct our typos, so that figure isn't very relevant.
Is 80% the highest success rate you can practically test?
UPD: Thomas essentially answered elsewhere: