Evaluation: Correct. AI can perhaps pass the reading comprehension task, but not any 4 of the tasks.
“Reliably construct bug-free code of more than 10,000 lines from natural language specification or by interactions with a non-expert user. [Gluing together code from existing libraries doesn’t count.]”
Also, Opus 4.5 can probably pass this one. (10,000 lines of code is not a lot in some languages.)
Merely producing 10K lines is not impressive in itself. I’m thinking of writing software that would be 10K lines if done by humans.
“Bug-free” and “reliably” are a high bar!
I agree Opus can do this with an expert user, but non-expert users might have to wait one or two more models.